Neural Nets are all you need. Really?

correlator’s question on Neural nets in RocketChat

On the Numerai data, tree-based models are easier to develop than neural nets, as the latter require more fine-tuning. I tried a live NN model for a few weeks and then gave up, as it was consistently below 0 corr. Assuming that most of the tournament models are tree-based (I am pretty sure they are), it will help Numerai if people build other kinds of models, such as neural nets.
It would help if someone could give pointers on building a first-cut NN model that performs on par with the example model. @mdo @jrb @surajp

I thought, why not answer it on the forum, where we can discuss this more broadly.

I think there are a lot of Neural Nets in the tournament.

The first thing to consider when trying out neural nets for modelling is that they are called “universal function approximators”. You can perform all sorts of fancy experiments with them. Keep in mind, though, that a sufficiently parameterized model will eventually overfit the training data.

You can approximate (or even distill) your best tree-based models with a sufficiently parameterized neural net, which implies that NNs are capable of learning patterns similar to the ones trees learn. You just need to find an appropriate architecture for the data.
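A minimal sketch of that distillation idea, assuming NumPy only. The "teacher" here is a hypothetical step function standing in for a fitted tree model's predictions, and `distill` with its hyperparameters is an illustrative name, not anything from Numerai's tooling. The student is a tiny one-hidden-layer net trained by plain gradient descent to match the teacher's outputs:

```python
import numpy as np

def distill(X, teacher_pred, hidden=16, lr=0.05, epochs=300, seed=0):
    """Fit a one-hidden-layer tanh MLP to a teacher's predictions (MSE loss)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 1.0 / np.sqrt(d), (d, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 1.0 / np.sqrt(hidden), (hidden, 1))
    b2 = np.zeros(1)
    y = np.asarray(teacher_pred, dtype=float).reshape(-1, 1)
    losses = []
    for _ in range(epochs):
        # Forward pass
        h = np.tanh(X @ W1 + b1)
        err = h @ W2 + b2 - y
        losses.append(float(np.mean(err ** 2)))
        # Backward pass (gradients of the mean squared error)
        g_out = 2.0 * err / n
        g_h = (g_out @ W2.T) * (1.0 - h ** 2)
        W2 -= lr * (h.T @ g_out)
        b2 -= lr * g_out.sum(axis=0)
        W1 -= lr * (X.T @ g_h)
        b1 -= lr * g_h.sum(axis=0)

    def student(X_new):
        return (np.tanh(X_new @ W1 + b1) @ W2 + b2).ravel()

    return student, losses

# Hypothetical stand-in for a tree model: piecewise-constant predictions.
rng = np.random.default_rng(42)
X = rng.uniform(0.0, 1.0, size=(500, 5))
teacher_pred = 0.5 + 0.1 * (X[:, 0] > 0.5) - 0.1 * (X[:, 1] > 0.5)

student, losses = distill(X, teacher_pred)
print(f"MSE to teacher: first epoch {losses[0]:.4f}, last epoch {losses[-1]:.4f}")
```

In practice you would swap the stand-in for your real tree model's predictions on the training eras and use a proper deep learning framework; the point is only that the student's target is the teacher's output rather than the raw label.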

The next step is to incorporate correlated variety (the closest word I could think of) if you are considering ensembling. With that, you have a whole new palette of choices. You can ensemble over different architectures, initializations, training on different subsets of eras, and what not!
You need a combination of models that generalizes well when combined. Or, instead of ensembling, you can learn another model that is uncorrelated with all of these (this also applies to learning a model orthogonal to your best tree-based model) (“Beating the wisdom of the crowds is harder than recognizing faces or driving cars”). You need to give a Boost to your models :deciduous_tree:.
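One simple way to picture "orthogonal to your best tree-based model" is to project out the part of one prediction vector that is linearly explained by another. The sketch below assumes NumPy; `orthogonalize` is a hypothetical helper name and the data is synthetic. It only removes linear dependence, and in a tournament setting you would probably want to apply it era by era:

```python
import numpy as np

def orthogonalize(preds, reference):
    """Remove from `preds` the component linearly explained by `reference`."""
    p = preds - preds.mean()
    r = reference - reference.mean()
    beta = (p @ r) / (r @ r)  # least-squares slope of preds on reference
    return p - beta * r

# Synthetic example: NN predictions that partly overlap with a tree model's.
rng = np.random.default_rng(0)
tree_preds = rng.normal(size=1000)
nn_preds = 0.7 * tree_preds + rng.normal(size=1000)

residual = orthogonalize(nn_preds, tree_preds)
print(np.corrcoef(residual, tree_preds)[0, 1])  # numerically close to zero
```

This is post-hoc orthogonalization of predictions; training a model to be uncorrelated from the start would need the same idea built into the loss.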

  • I wasn’t a big fan of ensembles of big models in production because of resource constraints, but it turns out we can reduce the size and inference time through pruning and distillation without sacrificing much of the original model’s performance. This may also reduce overfitting somewhat.

  • Deep Ensembles: A Loss Landscape Perspective. This paper changed my perspective on ensembling NNs! (You might get some ideas from it too.)
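However the ensemble members are produced (different seeds, architectures, or era subsets), they need to be combined somehow. A rough sketch of one common choice, averaging per-model ranks so that members with different output scales contribute equally. It assumes NumPy; `rank_pct` and `rank_average` are illustrative names, and ties are broken by position rather than averaged:

```python
import numpy as np

def rank_pct(x):
    """Map values to ranks in (0, 1]; ties are broken by position."""
    x = np.asarray(x)
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    return ranks / len(x)

def rank_average(pred_list):
    """Average the rank-transformed predictions of several models."""
    return np.mean([rank_pct(p) for p in pred_list], axis=0)

# Two hypothetical ensemble members on very different output scales.
member_a = np.array([0.1, 0.3, 0.2])
member_b = np.array([10.0, 30.0, 20.0])
ensemble = rank_average([member_a, member_b])
print(ensemble)
```

Since both members order the rows identically here, the ensemble simply reproduces that shared ranking; the interesting cases are the ones where members disagree.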

The most important thing is choosing a loss function! There is a lot of discussion of loss functions on the forum. This is where NNs shine (pretraining => fine-tuning). Remember, predictions are scored on correlation. You should develop your own loss function to do better on MMC (that’s your secret sauce)!
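As a starting point for a correlation-aware loss, here is a hedged sketch of negative Pearson correlation with its analytic gradient, assuming NumPy. `neg_pearson_loss` is a hypothetical name. It is only a smooth surrogate: if the tournament metric ranks predictions before correlating, this will not match it exactly, and an MMC-oriented loss would need further terms that are not shown here:

```python
import numpy as np

def neg_pearson_loss(preds, target):
    """Return (-Pearson correlation, gradient of that loss w.r.t. preds)."""
    p = preds - preds.mean()
    t = target - target.mean()
    p_norm = np.sqrt(p @ p)
    t_norm = np.sqrt(t @ t)
    r = (p @ t) / (p_norm * t_norm)
    # d(r)/d(preds) = t / (|p||t|) - r * p / |p|^2 ; the loss is -r
    grad = -(t / (p_norm * t_norm) - r * p / p_norm ** 2)
    return -r, grad

target = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
preds = np.array([0.3, 0.1, 0.9, 0.4, 0.6])
loss, grad = neg_pearson_loss(preds, target)
print(loss, grad)
```

In a framework with automatic differentiation you would write only the forward part and let the framework supply the gradient; the explicit gradient is here so the sketch stays dependency-free.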


  • NNs have a factor of luck with initialization, so you should develop some kind of quick evaluation framework/functions. That way you can experiment faster.
  • I haven’t considered any kind of pre-processing of the data.
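A quick evaluation helper along these lines might compute a correlation per era and then summarize it, so that runs with different seeds can be compared in seconds. This is a sketch assuming NumPy; `per_era_corr` and `summarize` are hypothetical names, the data is a toy example, and the exact scoring formula used by the tournament may differ:

```python
import numpy as np

def per_era_corr(preds, target, eras):
    """Correlation between ranked predictions and the target, per era."""
    preds, target, eras = map(np.asarray, (preds, target, eras))
    scores = {}
    for era in np.unique(eras).tolist():
        mask = eras == era
        ranks = np.argsort(np.argsort(preds[mask], kind="stable")).astype(float)
        scores[era] = float(np.corrcoef(ranks, target[mask])[0, 1])
    return scores

def summarize(scores):
    """Mean, standard deviation, and mean/std ratio of per-era scores."""
    v = np.array(list(scores.values()))
    std = v.std(ddof=1) if len(v) > 1 else float("nan")
    sharpe = v.mean() / std if std > 0 else float("nan")
    return {"mean": float(v.mean()), "std": float(std), "sharpe": float(sharpe)}

# Toy data: predictions are perfect in era "a" and reversed in era "b".
target = np.array([0.0, 0.25, 0.5, 0.75, 1.0] * 2)
eras = np.array(["a"] * 5 + ["b"] * 5)
preds = np.concatenate([np.linspace(0.1, 0.9, 5), np.linspace(0.9, 0.1, 5)])

scores = per_era_corr(preds, target, eras)
print(scores, summarize(scores))
```

Running the same helper over several initializations gives a rough sense of how much of a score is the architecture and how much is the luck of the seed.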

Instead of neural-only model(s), you can combine a good tree-based model (for CORR) with a flexible NN trained for originality of predictions (CORR + MMC).

With this, I have (almost) opened up all of my core ideas around NNs for the tournament. I haven’t done anything new in particular; it’s just an accumulation of interesting RocketChat and forum posts. I guess I have previously discussed some specific things in both places too, and there are some direct references in this post that simply indicate what my models are! :innocent:

To answer,
Yes, it’s possible to beat the example predictions with NNs. :smiley:

The above points are good enough to get you started with a really good basic model that you can later improve. Also, there is a lot of room for pre- and post-processing! is Neural ensemble from 232 (meaning intelligence/understanding) is a single NN model from 232.

So (here), Neural Nets are (almost) all you need :wink:! Hope this helps
All the best :+1:


I don’t quite understand. What do model pruning and model distillation have to do with ensembling? Unless, of course, you’re talking about using model distillation to train a smaller student model with an ensemble of larger models as the teacher.

Maybe I was unclear. I was referring to my experience of using ensembles of big models like BERT (as seen in so many NLP competitions) in production. Distilling them and making them more efficient made it practical to run the ensemble in production.

This also suggests there is a space for some architectural improvements.
