Model Evaluation Metrics

I’m curious to understand what the best evaluation metrics are for regression models on the Numerai dataset and problem. The potential ones I’m aware of are (by eras):

  • Spearman Rank correlation coefficient (Monotonic)
  • Pearson correlation coefficient (Linear)
  • Sharpe (mean correlation over StdDev of correlation)
  • Feature Exposure (how does one calculate this?)
  • MAE?
  • MSE/RMSE?
  • R^2?
  • Max Drawdown length & depth? (not sure how to do this)

Any other key ones I’m missing? Do some make more/less sense than others? What about for a multi-class classification model?

Basically I’m using the XGBoost example model (@integration_test) as my baseline, and I’d like to know which metrics to use to compare my model vs the baseline on training & validation. Any/all feedback appreciated!

I would say to keep in mind that you want to optimize your payment. You are evaluated on the Spearman rank correlation between your predictions and the live data. So train your models with whatever metric you want, but in the end, check how well they do on Spearman :-).
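
A minimal sketch of that per-era check, assuming a pandas DataFrame with the hypothetical column names `era`, `prediction`, and `target` (adjust to your own data). Spearman is computed here as the Pearson correlation of the ranks, and the Sharpe figure is the mean of the per-era correlations divided by their standard deviation.

```python
import numpy as np
import pandas as pd


def spearman(pred: pd.Series, target: pd.Series) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return float(np.corrcoef(pred.rank(), target.rank())[0, 1])


def per_era_spearman(df: pd.DataFrame) -> pd.Series:
    """One Spearman value per era (column names are assumptions of this sketch)."""
    return pd.Series(
        {era: spearman(g["prediction"], g["target"]) for era, g in df.groupby("era")}
    )


def sharpe(era_corrs: pd.Series) -> float:
    """Mean per-era correlation over its sample standard deviation."""
    return float(era_corrs.mean() / era_corrs.std(ddof=1))


# Tiny made-up example with two eras.
df = pd.DataFrame({
    "era": ["era1"] * 4 + ["era2"] * 4,
    "prediction": [0.1, 0.4, 0.35, 0.8, 0.9, 0.2, 0.5, 0.3],
    "target": [0.0, 0.5, 0.25, 1.0, 0.25, 0.5, 0.75, 0.0],
})
era_corrs = per_era_spearman(df)
print(era_corrs)
print("mean:", era_corrs.mean(), "sharpe:", sharpe(era_corrs))
```

Running the same function on both your model and the baseline gives directly comparable per-era numbers.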

For calculating feature exposure:
Calculate the Pearson correlation between feature 1 and your predictions, between feature 2 and your predictions, and so on, up to feature 310 and your predictions.

Take the standard deviation over that list of Pearson correlations and you have your feature exposure.
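
The two steps above can be sketched with NumPy as follows. The function names, and the layout of `features` as a 2-D array with one column per feature, are assumptions for illustration. The sketch also assumes no feature column is constant, since that would make its correlation undefined.

```python
import numpy as np


def feature_correlations(features: np.ndarray, preds: np.ndarray) -> np.ndarray:
    """Pearson correlation between each feature column and the predictions."""
    f = features - features.mean(axis=0)  # center every feature column
    p = preds - preds.mean()              # center the predictions
    return (f * p[:, None]).sum(axis=0) / np.sqrt((f ** 2).sum(axis=0) * (p ** 2).sum())


def feature_exposure(features: np.ndarray, preds: np.ndarray) -> float:
    """Standard deviation (population, ddof=0) of the per-feature correlations."""
    return float(np.std(feature_correlations(features, preds)))


# Made-up data: predictions that lean heavily on feature 0.
rng = np.random.default_rng(0)
features = rng.random((1000, 5))
preds = features[:, 0] + 0.1 * rng.random(1000)

corrs = feature_correlations(features, preds)
print(corrs.round(3))
print("feature exposure:", feature_exposure(features, preds))
```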

Mean Absolute Percentage Error (MAPE) could be added to the list, although I have never found it useful. Another one I have heard of, which wouldn’t work for this competition, is the number of days on which your algorithm reached a new all-time high (as opposed to minimizing drawdown). A proxy for this could be a cumprod() of validation era correlation values, clipped at 0.2 and normalized as percentages, to give a rough estimate of portfolio growth.
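
As a rough sketch of the drawdown side of this, the snippet below compounds per-era correlations as if they were returns and reports the maximum drawdown depth and length. Treating a correlation as a return, starting the curve at 1.0, and the function names are all assumptions made for illustration; the clipping and percentage normalization mentioned above are left out.

```python
import numpy as np


def growth_curve(era_corrs) -> np.ndarray:
    """Compound per-era correlations like returns, starting from 1.0."""
    era_corrs = np.asarray(era_corrs, dtype=float)
    return np.concatenate([[1.0], np.cumprod(1.0 + era_corrs)])


def max_drawdown(era_corrs):
    """Return (depth, length).

    depth  = largest fractional drop from a running peak
    length = longest run of consecutive eras spent below a previous peak
    """
    curve = growth_curve(era_corrs)
    peak = np.maximum.accumulate(curve)  # highest value seen so far
    drawdown = curve / peak - 1.0        # 0 at a peak, negative below it
    longest = current = 0
    for under in drawdown < 0:
        current = current + 1 if under else 0
        longest = max(longest, current)
    return float(-drawdown.min()), longest


corrs = [0.1, -0.5, 0.25, 1.0, 0.0]  # exaggerated values, for readability only
depth, length = max_drawdown(corrs)
print("depth:", depth, "length:", length)
```

With these made-up values the curve falls from 1.1 to 0.55 and takes two eras to recover, so the depth is about 0.5 and the length is 2.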

This is anecdotal and based on my own workflow, but any time I deviate from MSE I end up returning to it. Right now my neural network is trained by gradient descent on MSE, and I then eyeball the Sharpe value of the model while doing a lot of babysitting to find a model that I like (https://numer.ai/kaleidoscopekloud). I need to be more procedural about the process.

Thanks bor for explaining the feature exposure!

"Take the standard deviation over that list of pearson correlations and you have your feature exposure."

I think it is better to calculate the norm of the whole correlation vector, sqrt(a^2 + b^2 + c^2 + ...), instead of the standard deviation.
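
One way to see the difference between the two summaries, as a small sketch with made-up correlation values: the standard deviation ignores the mean of the correlations, while the norm does not. With the population standard deviation over n features, norm^2 = n * (std^2 + mean^2). The function name is hypothetical.

```python
import numpy as np


def exposure_std_and_norm(corrs):
    """Return (standard deviation, Euclidean norm) of the feature correlations."""
    corrs = np.asarray(corrs, dtype=float)
    return float(np.std(corrs)), float(np.linalg.norm(corrs))


# Predictions equally correlated with every feature: std sees nothing, norm does.
uniform = [0.05, 0.05, 0.05, 0.05]
print(exposure_std_and_norm(uniform))

# Correlations that cancel out on average: both measures pick them up.
mixed = [0.05, -0.05, 0.05, -0.05]
print(exposure_std_and_norm(mixed))
```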

I also use logcosh as a loss function for training. Nice overview of loss functions here: https://medium.com/@phuctrt/loss-functions-why-what-where-or-when-189815343d3f

For optimization I also use the smart Sharpe ratio and the smart Sortino ratio (the Sharpe/Sortino ratio with an autocorrelation penalty). These were discussed in another thread of this forum.
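
A sketch of one formulation of these two ratios, applied to a series of per-era correlations. The penalty used here is based on the lag-1 autocorrelation; whether it matches the other thread exactly is an assumption, and the function names are illustrative.

```python
import numpy as np


def ar1(x: np.ndarray) -> float:
    """Lag-1 autocorrelation of the series."""
    return float(np.corrcoef(x[:-1], x[1:])[0, 1])


def autocorr_penalty(x) -> float:
    """Greater than 1 when the series is positively autocorrelated."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    p = ar1(x)
    lags = np.arange(1, n)
    return float(np.sqrt(1.0 + 2.0 * np.sum((n - lags) / n * p ** lags)))


def sharpe(x) -> float:
    x = np.asarray(x, dtype=float)
    return float(np.mean(x) / np.std(x, ddof=1))


def smart_sharpe(x) -> float:
    """Sharpe ratio with the volatility inflated by the autocorrelation penalty."""
    x = np.asarray(x, dtype=float)
    return float(np.mean(x) / (np.std(x, ddof=1) * autocorr_penalty(x)))


def smart_sortino(x, target: float = 0.0) -> float:
    """Like smart_sharpe, but only deviations below `target` count as risk.

    Assumes at least one value falls below `target`; otherwise the
    downside deviation is zero and the ratio is undefined.
    """
    x = np.asarray(x, dtype=float)
    xt = x - target
    downside = np.sqrt(np.sum(np.minimum(xt, 0.0) ** 2) / (len(xt) - 1))
    return float(np.mean(xt) / (downside * autocorr_penalty(x)))


# Made-up, smoothly varying per-era correlations (positively autocorrelated).
era_corrs = np.array([-0.01, 0.01, 0.02, 0.03, 0.04, 0.05, 0.04, 0.03, 0.02, 0.01])
print("sharpe:", sharpe(era_corrs))
print("smart sharpe:", smart_sharpe(era_corrs))
print("smart sortino:", smart_sortino(era_corrs))
```

Because the example series trends smoothly, the penalty is above 1 and the smart Sharpe comes out lower than the plain Sharpe.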

Awesome, thanks guys! Think I’ll go with these ones, for now anyway:

  • Spearman rank
  • Smart Sharpe
  • Smart Sortino
  • Feature Exposure
  • MSE

Found the post that discusses Smart Sharpe, Sortino etc.

Hey Joakim, a prediction problem in machine learning is usually reduced to an optimization problem: we minimize or maximize some function. For regression problems, RMSE, MAE, and R^2 are very popular choices. It helps to separate the objective function from the metric. An objective function needs a first derivative (gradient) and, for some algorithms, a second derivative (Hessian), whereas a metric does not need to be differentiable at all. We need the objective to be differentiable so that algorithms like gradient boosting (hence the name!), neural nets (for backpropagation), or even simple linear regression can be trained. In your list of metrics, MSE and RMSE are the ones that work directly as objectives; MAE needs a smooth proxy function, and the rest are normally used only as metrics.

In this tournament the metric is predefined (Spearman rank correlation), so we are effectively solving a ranking problem. Which objective function our algorithms should minimize or maximize is still an open question. To pick a proper objective function, we first need a validation scheme that we trust, such as k-fold cross-validation, time-split validation, or adversarial validation. This is a simple but essential step; without appropriate validation, all our efforts are useless! After that, we can try a list of objective functions and see whether any of them improves our validation Spearman rank correlation compared to the rest.
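
As one illustration of a time-split scheme, the sketch below keeps eras whole and always validates on eras that come after the training eras. It assumes era labels sort chronologically (for example integers); the function name and the expanding-window layout are choices made for this example, not a recommendation from the thread.

```python
import numpy as np


def era_time_splits(eras, n_splits: int = 3):
    """Yield (train_idx, valid_idx) pairs with an expanding training window.

    Eras are never split across train and validation, and every validation
    era comes after every training era in the same split.
    """
    eras = np.asarray(eras)
    unique_eras = np.unique(eras)  # sorted; assumed to be chronological
    blocks = np.array_split(unique_eras, n_splits + 1)
    for k in range(1, n_splits + 1):
        train_eras = np.concatenate(blocks[:k])
        valid_eras = blocks[k]
        train_idx = np.flatnonzero(np.isin(eras, train_eras))
        valid_idx = np.flatnonzero(np.isin(eras, valid_eras))
        yield train_idx, valid_idx


# Made-up data: 8 eras with 3 rows each.
eras = np.repeat(np.arange(1, 9), 3)
for train_idx, valid_idx in era_time_splits(eras, n_splits=3):
    print("train eras:", np.unique(eras[train_idx]), "valid eras:", np.unique(eras[valid_idx]))
```

Each candidate objective can then be scored by its mean per-era Spearman correlation over the validation folds.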

As for trying multiclass classification, after we set up our validation we can validate our ideas regarding multiclass classification (for example, ordinal multiclass classification with logistic regression).

Note: We can define our custom objective function in XGBoost or LightGBM easily (we can set our proxy Gradient and Hessian and feed them to the algorithm).
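
A minimal sketch of such a custom objective, using log-cosh as the example loss. The gradient and Hessian are derived per row from log(cosh(pred - label)). The function names are illustrative, and the only assumption about the second argument is that it exposes `get_label()`, as an XGBoost `DMatrix` does.

```python
import numpy as np


def logcosh_grad_hess(preds: np.ndarray, labels: np.ndarray):
    """Per-row gradient and Hessian of log(cosh(pred - label)) w.r.t. pred."""
    residual = preds - labels
    grad = np.tanh(residual)  # d/dr log(cosh(r)) = tanh(r)
    hess = 1.0 - grad ** 2    # d/dr tanh(r) = 1 - tanh(r)^2
    return grad, hess


def logcosh_objective(preds: np.ndarray, dtrain):
    """Custom objective in the (preds, dtrain) -> (grad, hess) form."""
    return logcosh_grad_hess(preds, dtrain.get_label())


# With XGBoost installed, the native training API accepts it like this
# (params and dtrain are whatever you already use):
#   import xgboost as xgb
#   booster = xgb.train(params, dtrain, num_boost_round=100, obj=logcosh_objective)
```

The Hessian shrinks toward zero for large residuals, which can matter for some targets; with targets bounded in a small range this is less of a concern.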

Thanks @Kainsama, I find this SUPER helpful!!

I must add that it is possible to build differentiable versions or near equivalents of ranking functions (Spearman’s rank correlation coefficient, Pearson correlation coefficient etc) and let the optimization algorithm directly optimize for it.
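
As a sketch of the simplest case, the Pearson correlation is already differentiable with respect to the predictions, so its negative can be used as a loss with an analytic gradient. Spearman itself would additionally need a smooth approximation of the ranking step, which is not shown here. The function name is illustrative.

```python
import numpy as np


def neg_pearson_loss_and_grad(preds: np.ndarray, target: np.ndarray):
    """Return (-r, gradient of -r w.r.t. preds), where r is the Pearson correlation."""
    pc = preds - preds.mean()    # centered predictions
    tc = target - target.mean()  # centered targets
    pn = np.sqrt((pc ** 2).sum())
    tn = np.sqrt((tc ** 2).sum())
    r = (pc @ tc) / (pn * tn)
    # dr/dpreds = tc / (|pc| |tc|) - r * pc / |pc|^2
    grad_r = tc / (pn * tn) - r * pc / pn ** 2
    return float(-r), -grad_r


# Made-up data; a few plain gradient steps should reduce the loss.
rng = np.random.default_rng(1)
target = rng.random(20)
preds = rng.random(20)
for step in range(3):
    loss, grad = neg_pearson_loss_and_grad(preds, target)
    print(step, round(loss, 4))
    preds = preds - 0.5 * grad
```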

Yep, you are right. I’ve personally never seen many people do that, but it is certainly a possibility. Here’s a discussion about approximating MAE with a differentiable proxy function, if anyone is interested: https://www.kaggle.com/c/allstate-claims-severity/discussion/24520

Thank you for this information!