Numerai Self-Supervised Learning & Data Augmentation Projects

original Tweet thread on this topic

Project 1

Project 2

Discussion on Methods

In our first Twitter Spaces discussion today, @jrb said that Contrastive Self-Supervised Learning worked well for him on this project, that is, for creating new features.

Principal Components Analysis would also be a very basic way to generate unsupervised features like this. The goal is to make these new features maximally helpful for some later model to train on. I don’t think PCA works especially well for that, but I haven’t checked in a while on the current data.
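As a point of reference, here is a minimal sketch of what PCA features could look like, using only NumPy. The random matrix `X` is a placeholder for one block of binned features, and `n_components=5` is an arbitrary choice; neither comes from the discussion.

```python
import numpy as np

def pca_features(X, n_components):
    """Project the rows of X onto the top principal components."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = (s[:n_components] ** 2) / (s ** 2).sum()
    return Xc @ Vt[:n_components].T, explained

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(500, 20)) / 4.0   # placeholder features in [0, 1]
pca_feats, explained = pca_features(X, n_components=5)

# Append the new columns to the originals so a later model can use both.
X_extended = np.hstack([X, pca_feats])
```

The `explained` vector gives the share of variance captured by each component, which is a quick way to see how much structure PCA finds in the data.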

Another method discussed was Diffusion Models. These models would take in a very noisy version of an era matrix and output another matrix which looks more like a real era. These models have had excellent results generating realistic images from noise.
Diffusion Models Paper
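To make the "very noisy version of an era matrix" concrete, here is a minimal sketch of the forward (noising) half of a DDPM-style diffusion process in NumPy. The linear beta schedule, the 1000 steps, and the random placeholder era are all assumptions for illustration. The learned denoising network, which is the hard part, is not shown.

```python
import numpy as np

def forward_noise(x0, t, betas, rng):
    """Sample x_t given x_0: sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    alpha_bar = np.cumprod(1.0 - betas)[t]
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return x_t, eps

rng = np.random.default_rng(0)
era = rng.integers(0, 5, size=(100, 20)) / 4.0 - 0.5   # placeholder era matrix, centered
betas = np.linspace(1e-4, 0.02, 1000)                  # assumed linear schedule

x_early, _ = forward_noise(era, t=10, betas=betas, rng=rng)    # still close to the era
x_late, _ = forward_noise(era, t=999, betas=betas, rng=rng)    # almost pure noise
```

A denoiser would be trained to predict `eps` from `x_t` and `t`, and new eras would then be generated by running that denoiser backwards from pure noise.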

We also discussed how the solution to the Jane Street Competition on Kaggle involved using an autoencoder to create new features. See also the excellent thread on this with @jrai on Numerai’s forum.

Discussion where this fits with Numerai
If Numerai could create excellent new features or eras from these methods we could potentially make them available to everyone as features in our Data API. We could also potentially learn the new synthetic features or eras on our raw data which could be even more useful. Ultimately, giving out more features and more data to Numerai users using these methods could improve everyone’s models significantly.

If you want to work on these projects, then describe how you plan to solve them in a reply to this blog post. Code snippets will be useful. If you have gotten to the point of demonstrating the success of your method, I would be happy to get on a 1-1 call with you to take a look. But first try to convince me here on the forum that it’s good and ready for me to criticize. We do want to share this research publicly so anyone can use it. It might even make it into an example script some day. We use PyTorch a lot, so I think it would be best if you could use that if you know it, but we’re not super strict.

I am hoping to make rapid progress on this during March. If you have something good, Numerai will get you flights and hotel to come present it to everyone at NumerCon on April 1 in SF, and will also give a large retro-active bounty if it’s especially good. Our chief scientist Michael Oliver would also want to interview you for a full-time job in research at Numerai.


Hi Richard,

UMAP generates useful features from the dataset while reducing dimensionality. Unlike PCA, UMAP works with the Numerai dataset.
I got the idea from Marcos Lopez de Prado’s lecture:

I’ve been using it for a while.

See here:
Numerai (a newer but apparently improved version)
Numerai (since round 275)

It won’t hit the #1 spot on the leaderboard based on CORR, but the created model has ~10% correlation with the metamodel and the MMC/CORR ratio is good. It will be interesting to see its TC score.

The best part is that it’s very simple:
import umap  # provided by the umap-learn package
# data: 2-D array of features, one row per sample
fit = umap.UMAP(n_components=100, min_dist=0)
transformed_data = fit.fit_transform(data)

Because it learns the embedding without the labels, I can use the test AND live data for the UMAP model as well.
Once the dataset is transformed you can train any model on it.


Hi Richard, Interesting initiative.
I see three problems here:

  • If we have a potentially good new idea in mind, it is not possible to develop it and test it on new rounds before 1st April.
  • If the idea really works, I can’t imagine a ‘large retro-active bounty’ that could compensate a team that has several thousand NMR staked.
  • You are asking us to write our ideas in an open forum, so if we write one and you decide it is not interesting, we will lose all the advantage this idea gives us over the rest of the community.

To back up my claims in my previous reply I created a sample script for applying umap to the dataset.

You can find it here with all related validation data:

Note: this is meant to be fast and simple.
It uses only the “medium” featureset and there are tons of ways to optimize the code and improve the results.

Results are the following:

Extending the dataset with the UMAP features improves basically all validation metrics. And it’s worth noting again that these improvements can be further magnified by better model and hyperparameter selection. Still, its performance is highly correlated with the baseline model.

I included the last diagnostics to show that while its CORR is low, it has the highest MMC mean, and I bet it will have the highest TC as well.

It’s not alien performance, but it’s a very simple transformation that can be used by anyone.


Hi Richard, thanks for the interesting projects (and of course for creating Numerai)!

I have been interested in both self-supervised learning and data augmentation approaches, but haven’t really worked on them. So this is a good time to push myself a bit:)

I started with Project 2: data augmentation.

My approach is very simple: cutout, where randomly selected columns per era are set to 0.5. Those ‘new’ rows are concatenated to the original train data (so there are more rows).

In this way I expect that whatever model we train is forced to learn from a variety of features, making its predictions robust.
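A minimal sketch of the cutout idea described above, written with NumPy only. The fraction of columns cut per era (`frac`) and the function name are hypothetical; the post only says that randomly selected columns per era are set to 0.5 and the new rows are appended.

```python
import numpy as np

def cutout_augment(X, y, eras, frac=0.1, fill=0.5, seed=0):
    """Append a copy of the data where, per era, random columns are set to `fill`."""
    rng = np.random.default_rng(seed)
    X_aug = X.copy()
    n_cols = X.shape[1]
    n_cut = max(1, int(frac * n_cols))        # number of columns blanked per era
    for era in np.unique(eras):
        rows = np.flatnonzero(eras == era)
        cols = rng.choice(n_cols, size=n_cut, replace=False)
        X_aug[np.ix_(rows, cols)] = fill      # a different column subset for each era
    # Targets and eras are unchanged for the augmented rows.
    return np.vstack([X, X_aug]), np.concatenate([y, y]), np.concatenate([eras, eras])

rng = np.random.default_rng(1)
X = rng.integers(0, 5, size=(300, 20)) / 4.0   # placeholder features
y = rng.integers(0, 5, size=300) / 4.0         # placeholder targets
eras = np.repeat(np.arange(3), 100)
X_big, y_big, eras_big = cutout_augment(X, y, eras, frac=0.2)
```

The value 0.5 is the midpoint of the feature range, so a cut column carries no information within that era and the model has to rely on the remaining columns.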

Here is my experimental setup:

Data: [Numerai] train & validation with kazutsugi & nomi

This is an old dataset from you with both target_nomi and target_kazutsugi available. What’s good about this data is that it is small enough for me to try many experiments. Also, the validation data is fixed, so the validation score does not change every week.

Model: XGBoost

I compared validation scores from the baseline XGBoost and XGBoost with the cutout. Of course the only difference is whether there is a cutout or not.

The entire code is available:

The validation score from the baseline is this:

Screenshot 2022-03-03 20.22.02

This is from the one with the cutout:

Screenshot 2022-03-03 20.22.53

The validation period is split into two: era < 150 (val1) and era > 150 (val2). val2 is harder to predict, as you probably know.

We can see good improvements in Corr Sharpe and Max Feature Exposure in both validation periods!
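For readers unfamiliar with the metric, here is a rough sketch of how a Corr Sharpe per validation period could be computed: rank the predictions within each era, correlate them with the target, then divide the mean of the per-era scores by their standard deviation. The helper names, the `ddof=0` convention, and the random placeholder data are assumptions, not taken from the linked code.

```python
import numpy as np

def rank_pct(a):
    """Rank values into (0, 1]; ties are broken by order of appearance."""
    order = a.argsort(kind="stable")
    ranks = np.empty(len(a), dtype=float)
    ranks[order] = np.arange(1, len(a) + 1)
    return ranks / len(a)

def era_scores(pred, target, eras):
    """Correlation between ranked predictions and targets, one value per era."""
    return np.array([
        np.corrcoef(rank_pct(pred[eras == e]), target[eras == e])[0, 1]
        for e in np.unique(eras)
    ])

def corr_sharpe(scores):
    return scores.mean() / scores.std(ddof=0)

rng = np.random.default_rng(0)
eras = np.repeat(np.arange(1, 201), 50)            # placeholder eras 1..200
target = rng.integers(0, 5, size=eras.size) / 4.0
pred = target + rng.standard_normal(eras.size)     # noisy placeholder predictions

scores = era_scores(pred, target, eras)
era_ids = np.unique(eras)
val1 = scores[era_ids < 150]                       # same split as in the post
val2 = scores[era_ids > 150]
sharpe_val1, sharpe_val2 = corr_sharpe(val1), corr_sharpe(val2)
```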

I have to say this is still a WIP, but to me it looks promising.


Just got the TC scores on the above mentioned models.

Numerai is in 116th place on TC score
Numerai, which is a neutralized version of the same model, is #78

Wen staking on TC???


Can only contribute with some inspo atm…

And (from 47:15) Yann LeCun explaining it (AI folk at Meta)

Very nice. It’s good to watch for a jump in FNC as FNC is the most correlated with True Contribution.
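For context, FNC is the correlation of a model’s predictions with the target after the predictions have been neutralized against the features. Below is a simplified sketch of that idea with NumPy, assuming plain linear neutralization per era with an intercept column; the official scoring applies extra preprocessing to the predictions that is not reproduced here, and the data is a random placeholder.

```python
import numpy as np

def neutralize(pred, X, proportion=1.0):
    """Remove the part of `pred` that is linearly explained by the columns of X."""
    exposures = np.hstack([X, np.ones((len(X), 1))])      # add an intercept column
    fitted = exposures @ (np.linalg.pinv(exposures) @ pred)
    neutral = pred - proportion * fitted
    return neutral / neutral.std()

def feature_neutral_scores(pred, target, X, eras):
    """Per-era correlation between neutralized predictions and the target."""
    return np.array([
        np.corrcoef(neutralize(pred[eras == e], X[eras == e]), target[eras == e])[0, 1]
        for e in np.unique(eras)
    ])

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(400, 10)) / 4.0              # placeholder features
eras = np.repeat(np.arange(4), 100)
target = rng.integers(0, 5, size=400) / 4.0
pred = 0.5 * X[:, 0] + 0.1 * rng.standard_normal(400)     # heavily exposed to feature 0

neutral = neutralize(pred[eras == 0], X[eras == 0])
fnc = feature_neutral_scores(pred, target, X, eras)
```

After full neutralization the predictions have no linear exposure to any single feature, so whatever correlation with the target remains cannot come from a simple linear bet on the features.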

Totally get it. This is just for people who are willing to publish and share their ideas.

I think ppl can just DM you with their ideas. If it turns out to be a great idea, you offer them the prize, and then they need to share the idea.

I also tried UMAP before. I used it on all the features and it failed for lack of RAM. Even applied to the medium feature set, it requires a lot of RAM, right?

Yes @sunkay, that would be the whole purpose of this initiative and that is why I didn’t hesitate much to opensource it!

The idea works for sure! In fact, it’s proven to be valuable. My best Umap model ranks #60 on TC at the time of writing, but it’s based on the LEGACY dataset.

I’ve got 64GB RAM with a 24GB GPU and I can’t crack the full new dataset with this hardware.
The medium featureset is barely doable with some tricks to save RAM.

However, if @richai jumps in, he can make these features available for everyone. Including me :)
That’s the goal of these projects, right? Find valuable features and make them available for everybody for free.

If the legacy dataset brings a model to #60, the featureset on the full dataset can go even closer to the top.
And while not everybody has the resources to calculate this featureset, it’s not going to break the bank at Numerai for sure. They have heavier workloads than this one.


Hi Richard, let me share my approach to Project 1: Self-supervised learning.

My approach is again very simple: factor analysis. It is more of a feature engineering technique like PCA than SSL, but I guess it can fit in this project category.

I expect the factor analysis to find ‘common factors’ which generate the Numerai features. If we could find such ‘common factors’, they would be less noisy than the original Numerai features while still effectively capturing what constitutes them.
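A minimal sketch of this kind of factor extraction, using scikit-learn’s `FactorAnalysis`. The synthetic data (20 observed features generated from 3 hidden factors plus noise), the train/validation split, and `n_components=3` are assumptions for illustration and are not taken from the linked code.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
# Placeholder data: 20 observed features driven by 3 hidden common factors.
factors = rng.standard_normal((1000, 3))
loadings = rng.standard_normal((3, 20))
X = factors @ loadings + 0.5 * rng.standard_normal((1000, 20))
X_train, X_valid = X[:800], X[800:]

# Fit on training rows only; no targets are needed.
fa = FactorAnalysis(n_components=3, random_state=0)
fa.fit(X_train)
F_train = fa.transform(X_train)
F_valid = fa.transform(X_valid)

# Append the estimated factors to the original features.
X_train_ext = np.hstack([X_train, F_train])
X_valid_ext = np.hstack([X_valid, F_valid])
```

Unlike PCA, factor analysis models a separate noise variance for each observed feature, which is why the estimated factors can be less noisy than the features themselves.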

My experimental setup is the same as in my approach to Project 2.

Data: [Numerai] train & validation with kazutsugi & nomi

The entire code is again available:

The validation score from the baseline is this (same as in Project 2):


This is from the one with the factor analysis:


The improvement in Corr Sharpe can be seen again in all validation periods!

In this post I wanted to show that even a simple unsupervised learning technique can contribute to improving scores, so a fancier SSL method could improve them a lot!


Based on my observations, the new dataset is much noisier than the legacy dataset. I think features generated by UMAP would be better if we do feature selection first and then apply UMAP to the selected features.


I am tackling the problem of creating synthetic features using a deep autoencoder.

The autoencoder takes two inputs: the features for a single row, and the era number.
The autoencoder does two kinds of augmentation to the inputs:

  • 1.) Maps them through a randomly initialized, frozen deep network. This is called extreme learning.
  • 2.) Concatenates the original features with 0.3 dropout to the “extreme” features.
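The two augmentation steps above could be sketched as follows in NumPy. This is an illustration under stated assumptions, not the author’s model: the layer sizes, the tanh activation, the weight scaling, and the inverted-dropout rescaling are all choices made here, and the era input is left out for brevity.

```python
import numpy as np

def make_frozen_network(n_in, hidden_sizes, seed=0):
    """Random weights that are created once and never trained."""
    rng = np.random.default_rng(seed)
    sizes = [n_in, *hidden_sizes]
    return [
        (rng.standard_normal((a, b)) / np.sqrt(a), 0.1 * rng.standard_normal(b))
        for a, b in zip(sizes[:-1], sizes[1:])
    ]

def extreme_features(X, layers):
    """Pass the features through the frozen random network."""
    h = X
    for W, b in layers:
        h = np.tanh(h @ W + b)
    return h

def augment_inputs(X, layers, p_drop=0.3, rng=None, train=True):
    """Concatenate (dropped-out) original features with the 'extreme' features."""
    ext = extreme_features(X, layers)
    if train:
        keep = rng.random(X.shape) >= p_drop
        X = X * keep / (1.0 - p_drop)      # inverted dropout, applied in training only
    return np.hstack([X, ext])

rng = np.random.default_rng(1)
X = rng.random((200, 20)) + 0.1            # placeholder features, strictly positive
layers = make_frozen_network(n_in=20, hidden_sizes=[32, 16])
train_inputs = augment_inputs(X, layers, p_drop=0.3, rng=rng)
eval_inputs = augment_inputs(X, layers, train=False)
```

In a PyTorch version the same effect would come from setting `requires_grad` to `False` on the random layers and applying `nn.Dropout(0.3)` to the raw features before concatenation.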

The model encodes these inputs to a 12-dimensional latent space. Then it decodes the latent back to the full original feature space and is scored with mean squared error.
I train only on train-data eras.
I found it improved generalization to linearly interpolate eras from [0, ~550] down to [0, 12]. For example, era 200 will become era 5. During validation, era is set to the max value seen during training.
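The era rescaling could look something like the sketch below. The function name and the training maximum of 550 are hypothetical (the post only gives "~550"), and the post does not say whether the rescaled value is rounded afterwards, so the sketch leaves it continuous. Clipping to the training maximum reproduces the validation rule under the assumption that validation eras come after all training eras.

```python
import numpy as np

def rescale_era(era, train_max_era, new_max=12.0):
    """Map era numbers linearly from [0, train_max_era] to [0, new_max].

    Eras beyond the training range are clipped to the training maximum.
    """
    era = np.minimum(np.asarray(era, dtype=float), train_max_era)
    return era / train_max_era * new_max

train_eras = np.arange(0, 551)                    # hypothetical training eras
scaled_train = rescale_era(train_eras, train_max_era=550)
scaled_valid = rescale_era(np.array([600, 700, 850]), train_max_era=550)
```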

Once this model is trained, I create new features in two ways:

  • 1.) Use the 12 dimensional latent space as new features.
  • 2.) Use the argmax of the 12 dimensional latent space as a feature.

Measuring the New Features

  • Baseline: example model w/ LGBMRegressor, n_estimators = 2000
  • Baseline+Era: baseline + era feature
  • Synthetic Feature Raw: baseline + 12 synthetic features from autoencoder latent
  • Synthetic Feature Argmax: baseline + 1 synthetic feature, argmax of latent
  • Synthetic Feature Argmax+Era: above + era feature

Here are the validation scores:

The code is available on my github here:


Going Deeper

Below are plots of the out of sample improvement over the baseline LGBMRegressor correlation.

The raw latent representation helps at first and quickly decays with time:

Taking argmax helps:

Is the latent-argmax just secretly passing the era number to the model?
No! Once we add in the era as well, it gets even better. This implies the learned representation is helping.

Baseline with era feature for comparison with above.


Wow some great work here already. @nyuton thanks for the tip about UMAP, I was pondering the use of manifold learning methods for this project so I know what I’m going to try next!

I just wanted to post here that I tried a couple of baseline ideas for adding 2 types of noise to the data. One is adding Gaussian noise and the other is creating new data by averaging data points. Both of these are simple ideas for creating more rows of data.

For each case slight improvements can be seen in cross validation metrics when combining the raw data with the noisy data, but not just by using the noisy data. This seems consistent with some other conclusions above, where best performance is seen when attaching, or appending the new data to the original data, and not just replacing it outright.

For the Gaussian noise case, a slight increase in mean is accompanied by higher variance, bringing the Sharpe down.
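A rough sketch of the two augmentations in NumPy, for readers who want to try them. Several details are assumptions made here and not confirmed by the post or the linked repo: the noise scale `sigma`, clipping the noisy features back to [0, 1], pairing rows only within the same era, and averaging the targets along with the features.

```python
import numpy as np

def gaussian_noise_rows(X, sigma=0.05, rng=None):
    """Copy of X with Gaussian noise added, clipped to the feature range."""
    if rng is None:
        rng = np.random.default_rng()
    return np.clip(X + sigma * rng.standard_normal(X.shape), 0.0, 1.0)

def averaged_rows(X, y, eras, rng=None):
    """Average each row (and its target) with a random row from the same era."""
    if rng is None:
        rng = np.random.default_rng()
    partner = np.arange(len(X))
    for e in np.unique(eras):
        idx = np.flatnonzero(eras == e)
        partner[idx] = rng.permutation(idx)
    return (X + X[partner]) / 2.0, (y + y[partner]) / 2.0

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(300, 10)) / 4.0   # placeholder features
y = rng.integers(0, 5, size=300) / 4.0         # placeholder targets
eras = np.repeat(np.arange(3), 100)

X_noisy = gaussian_noise_rows(X, sigma=0.05, rng=rng)
X_avg, y_avg = averaged_rows(X, y, eras, rng=rng)

# Append the synthetic rows to the raw data instead of replacing it.
X_all = np.vstack([X, X_noisy, X_avg])
y_all = np.concatenate([y, y, y_avg])
```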

I just wanted to share for the sake of documentation. It doesn’t seem to make too much of a difference, to use either of these simple methods, especially when looking at the validation data.

Code available here: GitHub - jefferythewind/numerai-sandbox: A Repo to Share Scripts for Numerai

I feel like you’ve really cracked a big part of the TC puzzle here. I’ve just tried some UMAP features per your instructions on the whole dataset. I definitely see the same kind of performance in the diagnostics, albeit without the MMC performance that yours showed. I would find it hard to be so sure that it would perform well on TC. Looking at your model scores, TC sticks out by far as the best metric, top 100, while the others, even MMC, are still pretty mediocre. Really interesting. A question I have is why the correlation performance is so bad for the transformed data?

Fun and simple idea for generating fake data: make a generative model of the features to target relationship. 1) fit a ridge regression model for each era 2) fit a Gaussian mixture model on all the learned regression weights 3) to generate new data take an era, sample the GMM to get beta weights, use features and beta weights to create fake raw return data, rank and bin to create new targets 4) repeat and train to infinity 5) profit! (South Park reference, not investment advice)
Here is a prototype for anyone interested
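Independently of the linked prototype, steps 1 to 3 could be sketched roughly as follows with scikit-learn. The ridge penalty, the number of mixture components, the diagonal covariance, the equal-sized target bins, and the random placeholder data are all assumptions made for this illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.mixture import GaussianMixture

def fit_era_betas(X, y, eras, alpha=1.0):
    """Step 1: one ridge regression per era; returns an (n_eras, n_features) array."""
    return np.vstack([
        Ridge(alpha=alpha).fit(X[eras == e], y[eras == e]).coef_
        for e in np.unique(eras)
    ])

def rank_and_bin(raw, n_bins=5):
    """Turn raw scores into n_bins equal-sized bins with values in [0, 1]."""
    ranks = raw.argsort().argsort()            # 0 .. n-1
    bins = ranks * n_bins // len(raw)          # 0 .. n_bins-1
    return bins / (n_bins - 1)

def fake_targets(X_era, gmm, n_bins=5):
    """Step 3: sample beta weights, build fake raw returns, rank and bin them."""
    beta, _ = gmm.sample(1)
    return rank_and_bin(X_era @ beta[0], n_bins)

rng = np.random.default_rng(0)
n_eras, n_rows, n_feats = 60, 100, 10
X = rng.integers(0, 5, size=(n_eras * n_rows, n_feats)) / 4.0   # placeholder features
eras = np.repeat(np.arange(n_eras), n_rows)
true_betas = rng.standard_normal((n_eras, n_feats))
y = np.einsum("ij,ij->i", X, true_betas[eras]) + rng.standard_normal(len(X))

betas = fit_era_betas(X, y, eras)
# Step 2: model the distribution of the per-era regression weights.
gmm = GaussianMixture(n_components=2, covariance_type="diag", random_state=0).fit(betas)
y_fake = fake_targets(X[eras == 0], gmm)
```

Calling `fake_targets` repeatedly on the same era yields a different synthetic target each time, which is what makes the "repeat and train to infinity" step possible.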


your correlation mean is very high, is it a CV score?