Which Model is Better?

I'm curious what others think. I have two models: one optimized for correlation, the other optimized for risk. Based on the validation numbers, which one would you prefer to stake? I didn't do cross-validation, so I don't have that info on hand, though I see it seems to be popular on this forum.

I like how the max drawdown over two weeks is not that large in the second one. On the other hand, I am only putting a small percentage of my assets into these models, so perhaps I should go for the higher-risk one?

Thanks for any feedback!

I’d stake 'em both. Then I’d keep generating more, hopefully with significantly lower corr to examples.


Sorry, I don’t understand. What do you mean by “significantly lower corr to examples”?

Thanks for the feedback

Meta model corr (example-preds corr here) – you've got 0.71 and 0.66. In other words, I'd be trying to make some models that are both good and more original. That's going to be more valuable in the future.

Yes, I have been reading a bit about some of these things, but didn't really understand them. I wanted to get something on the board first.

Then perhaps I will want to try working on Signals and MMC. FNC might also be coming as something to stake on?


What type of model is it? Tree, NN, something else?

Standard tree model, like everything else. Not too different from the examples they gave, honestly. I want to try some more advanced stuff later, but it will probably be a while until I get to it.

How long have the models been running? What's the corrmmc for these models?

As mentioned by most experienced participants, don't put too much focus on the validation set. A cross-validation score is a much better indicator of future performance. You can start with the advanced example script.


Sorry, I don't know what corrmmc is for the model. I just started.

For the cross-validation, do you put the training and validation data together and cross-validate with, say, k=5 folds? I was thinking along those lines, but I often find (in other applications) that a single train/test split is just as good as cross-validation and doesn't take as long to run, mostly because the folds' models are correlated, since they share a lot of the same data.

One thing I do not like is using the same validation set all the time. I usually avoid this by changing the random seed every now and then (and if I really want a CV-like result, doing 100 random seeds), but it's not as simple here. I need to worry about overlapping eras, and I don't really understand the gap between the train and validation eras. I feel I should maintain that gap as it was given. /shrug
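The reseeding idea looks something like this sketch (toy data standing in for the real features/targets; in practice you'd split by era rather than by row, to respect the overlap):

```python
import numpy as np

rng_global = np.random.default_rng(0)
# Toy stand-ins for features/targets; real data would be era-indexed.
X = rng_global.normal(size=(1000, 5))
y = X @ rng_global.normal(size=5) + rng_global.normal(size=1000)

def val_corr(seed, val_frac=0.2):
    """Fit a least-squares model on one random split; return validation corr."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_val = int(len(X) * val_frac)
    val, train = idx[:n_val], idx[n_val:]
    w, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    preds = X[val] @ w
    return float(np.corrcoef(preds, y[val])[0, 1])

# Many reseeded splits give a CV-like distribution instead of one number.
scores = [val_corr(seed) for seed in range(20)]
print(np.mean(scores), np.std(scores))
```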

Well, are there models on this forum that show the CV result, the validation result, and the actual live results? I also remember seeing more experienced participants say such things, but I did not see the evidence.

You are right that not many people show their CV results. Maybe it is because there are multiple ways to do CV, and the validation period is not the same for everyone. But it is fairly common sense that the CV score matters more than the validation-set score: in CV you test your model on multiple held-out folds, and depending on how you select your folds, the canonical validation set can itself be one of them, so your single validation score is just one component of your CV scores.


Thanks maxchu, yes, I believe those were the posts I saw. It's crazy how poorly they performed yet still earned some medals in live rounds. I also want to start including the validation data in my training examples and just trust a model fit on 20% more data.

I know CV sounds pretty cool. All the DS people I know prefer it to a train/test split: you get to use every data point twice, once for training and once for validation. But I don't feel it's worth the k-times training cost of k-fold CV. Seeing an analysis showing that a train/test split is biased while CV is not would change my mind, or some other argument besides using all the data twice.

And if I want to understand the volatility of my metrics, or feel I am abusing the validation set too much, or feel the random draw of the validation set is off, then I'll just find another validation set or bootstrap some things instead. That is a different idea altogether, though.

I come from a deep learning research background, and CV is not common practice there due to the nature of the problems where deep learning does well: lots of data requiring lots of compute, which makes CV infeasible. But the Numerai data is completely different from the usual deep learning dataset. The signal-to-noise ratio is very low, and the dataset is non-IID. I would say that in a problem like Numerai, CV is the most important thing. All I can say is that you will learn your lesson in the future if you don't use CV. I actually think Numerai should stop labeling a canonical validation set and simply tell users to do their own validation split, ideally with CV.


Also, it takes a long time only if you are training a deep NN. But from my own experience, and from experience shared by other users, you don't need a very deep NN to get good results (check out the autoencoder post about the Jane Street Kaggle competition); I actually think it will hurt if your model is too deep. If you train a shallow autoencoder-like network, it is actually much faster than tree models, as it can easily make full use of multiple GPUs, and you can train multiple targets at once. I also like NNs because they are more flexible and you can model very complex ideas easily (for example, global era-wise features, multiple output heads, etc.).
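For what it's worth, here is a minimal numpy sketch of the shape of that kind of network (forward pass only, random weights; the layer sizes and the two-head split are my own illustration, not the actual architecture from that post): a shallow bottleneck with a reconstruction head and separate linear heads per target.

```python
import numpy as np

rng = np.random.default_rng(42)

n_features, n_hidden, n_targets = 310, 64, 2  # sizes are illustrative
X = rng.normal(size=(128, n_features))        # one toy batch

# Shallow encoder: features -> bottleneck.
W_enc = rng.normal(scale=0.05, size=(n_features, n_hidden))
# Decoder head reconstructs the inputs (the autoencoder part).
W_dec = rng.normal(scale=0.05, size=(n_hidden, n_features))
# Separate linear heads, one per training target.
W_heads = [rng.normal(scale=0.05, size=(n_hidden, 1)) for _ in range(n_targets)]

hidden = np.tanh(X @ W_enc)                   # shared representation
reconstruction = hidden @ W_dec               # trained with an MSE-to-input loss
target_preds = [hidden @ W for W in W_heads]  # trained with per-target losses

print(hidden.shape, reconstruction.shape, target_preds[0].shape)
# (128, 64) (128, 310) (128, 1)
```

In a real framework you would sum the reconstruction loss and the per-head losses and backprop through the shared encoder; this only shows the data flow.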

Thanks for your feedback. I have taken 4 PhD-level Stats classes, and they all more or less gloss over this point. When asked about it, no proof is given, just the general heuristic that CV uses data more efficiently because every point is used for both training and testing. I have tried looking it up in papers too. I guess most people accept it as true, so I can't find what I'm looking for.

My own not-rigorous reasoning is this: let k be 4. One train/test split is just one fold of the k=4 CV. Have you ever seen one fold come out drastically worse than the average of 4 folds most of the time? And if a model did accidentally generalize well on that one fold, I would expect a second model, fit on training data that is 66% the same with the other 33% held out, to also generalize well by chance to the newly held-out (previously training) data. If the dataset is small, say 4 observations, I can see how looking at only one fold would be too volatile by outlier chance. Not so much for millions, though.

Do people see that in their CV results? If you had accidentally used only one of your folds, would you have made a mistake according to what all your folds, aggregated in some way, tell you? I think that would be interesting to see and would definitely convince me to use CV more. I will probably try it myself out of curiosity one day, but in all my other projects I have not seen it except in the very-small-sample case. As such, I don't find the extra model fitting very worthwhile.

I totally agree, though, that I've been using this one validation set too much. And there is a question of whether there is some bias in the validation set, for reasons like you mention: a very strong non-IID correlation that persists through the 100 validation eras. I do think it's time to train/test on a different slice of the data. But I also don't tune my hyperparameters very precisely, only to general heuristics after seeing how the model/data behave.

I do have two questions for those who use CV. First, I was thinking of block-splitting the eras because they overlap (not IID), so eras 1-4 would go to the same fold. But I don't know whether era 1 coincides with week 1 of a month; should I instead block eras 2-5 if era 1 was the 4th week of a different month? Second, how do you deal with multiple observations of the same firm appearing across all the folds? Surely the data are not IID in that way either. I assume the given validation period is far enough from the training period to avoid this issue (but I have no idea whether the gap in the eras actually means anything). I also found that validation data taken from the training period was overly optimistic compared to the given validation data; I assumed that was why.
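Concretely, the block split I was imagining looks something like this (the block size and fold count are guesses on my part, since I don't know what the era overlap actually is):

```python
import numpy as np

def era_block_folds(eras, block_size=4, n_folds=5):
    """Assign each era to a fold so that blocks of `block_size`
    consecutive eras always land in the same fold."""
    unique_eras = np.unique(eras)
    blocks = (unique_eras - unique_eras.min()) // block_size
    unique_blocks = np.unique(blocks)
    # Round-robin blocks into folds; contiguous chunks would also work.
    block_to_fold = {b: i % n_folds for i, b in enumerate(unique_blocks)}
    era_to_fold = dict(zip(unique_eras, (block_to_fold[b] for b in blocks)))
    return np.array([era_to_fold[e] for e in eras])

eras = np.repeat(np.arange(1, 121), 10)  # 120 toy eras, 10 rows each
folds = era_block_folds(eras)
# Eras 1-4 share a fold, eras 5-8 the next, and so on.
print(folds[:50])
```

This only handles the grouping; a real split would presumably also want a purge/embargo gap between train and test eras so the boundary overlap doesn't leak.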

I'll answer your first question first, as I found it a little confusing. When you do CV, your k validation folds should cover most of your original train+val set. In the simplest case, you can just use the average corr across all folds as your CV score. Say you have N models (different methods or just different hyperparameters); each model gets its own CV score, and you pick the one with the highest. The risk of overfitting is much lower than with just the canonical validation set provided by Numerai, because your selected model performs best across all validation folds rather than just one (depending on how you select your folds, the canonical validation set can itself be part of one fold, so the canonical validation score is just one part of your CV scores). Does that make sense?
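As a toy sketch of that selection step (the numbers are made up; in reality each entry would be one model's era-wise corr on one held-out fold):

```python
import numpy as np

# Rows: candidate models; columns: per-fold validation corrs.
fold_corrs = np.array([
    [0.030, 0.025, 0.041, 0.018, 0.033],   # model A
    [0.052, 0.001, 0.060, -0.010, 0.055],  # model B (good but erratic)
    [0.034, 0.031, 0.038, 0.029, 0.036],   # model C (steady)
])

cv_scores = fold_corrs.mean(axis=1)             # simplest CV score: mean corr
cv_sharpe = cv_scores / fold_corrs.std(axis=1)  # optional: penalize volatility
best = int(np.argmax(cv_scores))
print(cv_scores.round(4), "best by mean corr:", best)
```

Whether you rank by the mean or by a sharpe-like ratio is itself a choice; the point is that every candidate is judged on all folds, not on one lucky validation set.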

Yes, I understand what you're describing. But the risk is in expectation: for CV to make a material difference versus a train/test split, the sample sizes would need to be small.

I would think a sufficiently large train/test split would be approximately close enough. But I will test it myself: I'll bootstrap the metrics over various 80/20 splits and look at their volatility. Perhaps, for the reasons you state (low signal-to-noise ratio, crazy volatility, and observations that are non-IID in both time and asset), even millions of observations are not sufficient.

It is possible, but my prior belief is that it is unlikely in this circumstance. Regardless, thank you for this discussion. I can report back with the bootstrap results later this week.
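Concretely, the bootstrap I have in mind is something like this (toy predictions/targets standing in for real ones; resampling at the era level, since the rows within an era aren't independent):

```python
import numpy as np

rng = np.random.default_rng(7)

n_eras, rows_per_era = 100, 50
eras = np.repeat(np.arange(n_eras), rows_per_era)
target = rng.normal(size=n_eras * rows_per_era)
preds = 0.05 * target + rng.normal(size=target.size)  # weak toy signal

def era_corr(era_subset):
    """Mean per-era corr of preds vs target over the chosen eras."""
    cs = []
    for e in era_subset:
        m = eras == e
        cs.append(np.corrcoef(preds[m], target[m])[0, 1])
    return float(np.mean(cs))

# Bootstrap: resample eras with replacement, track the metric's spread.
boot = [era_corr(rng.choice(n_eras, size=n_eras, replace=True))
        for _ in range(200)]
print(f"mean={np.mean(boot):.4f}  std={np.std(boot):.4f}")
```

If the bootstrap std is tiny relative to the differences between my candidate models, a single split is probably fine; if not, that would be the evidence for CV I've been asking for.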

So, correct me if I am wrong: you are saying that the sample size for Numerai is so large that CV is not worth it, since it wastes a lot of compute?