Super Massive Data: Sunshine
We believe that if we give the best data to the best community of data scientists, we will create the best hedge fund. And evidence is building that we may be right.
As recently as 2019, Numerai had only 40 features data scientists could use for their machine learning models. Last year, we expanded the number of features to 1050 with the Super Massive Data Release.
Today, we’re releasing the largest set of features since Super Massive Data.
The new dataset contains 405 new features. It also contains all of the features from v4.0, except the 10 dangerous features described in the dangerous features post, for a total of 1586 features. It’s called Super Massive Data: Sunshine, also known as Data Version 4.1.
Download Sunshine now from numer.ai/data. Step into the light.
Overview
Here’s a quick rundown of the updates talked about in this post:
- 405 new features, for a total of 1586 features
- Meta Model historical predictions released
- New example script using new features, new targets, and better modeling
- Each user can now submit 20 more models, for a total of 70 models
- We are increasing the staking threshold by 20%: from 300k → 360k on Numerai, and from 150k → 180k on Signals
- Benchmark models are finally in development and coming soon
- V3 dataset is being deprecated (but not breaking any automatic pipelines)
New Features
In tests with two otherwise identical LightGBM models, adding only the new Sunshine features increased average correlation against target_nomi from 0.033 to 0.035. We expect the Numerai community to be able to make significantly larger improvements to their performance.
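To make that kind of comparison concrete, here is a minimal sketch of measuring per-era correlation against a target. It uses Spearman rank correlation as a stand-in for the official scoring, and the prediction column names are placeholders, not real columns.

import pandas as pd
from scipy.stats import spearmanr

def per_era_corr(df: pd.DataFrame, pred_col: str, target_col: str) -> pd.Series:
    """Rank correlation between predictions and the target, computed era by era."""
    return df.groupby("era").apply(
        lambda era: spearmanr(era[pred_col], era[target_col])[0]
    )

# Compare the average per-era correlation of a model trained without and with
# the new Sunshine features (column names here are illustrative):
# per_era_corr(validation, "pred_v4_only", "target_nomi_v4_20").mean()
# per_era_corr(validation, "pred_with_sunshine", "target_nomi_v4_20").mean()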
Information on how to download the data can be found at numer.ai/data.
One important difference between v4.0 and this new dataset is that some of the new features are not available for all of history. You will see eras in which some features are entirely missing. If you choose to use these features, make sure your code can properly handle NaNs or use some method to impute the missing values.
Furthermore, we plan to add features to this dataset in the coming months, so make sure your code is robust to additional columns by always choosing the exact features and targets you want by name.
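To make both points concrete, here is a rough sketch of downloading the new data, loading only explicitly named columns, and imputing missing values. The file names, feature-set key, and column names below are assumptions, so check numer.ai/data for the exact ones.

import json
import pandas as pd
from numerapi import NumerAPI

napi = NumerAPI()
# File names are assumptions; see numer.ai/data for the exact paths.
napi.download_dataset("v4.1/train.parquet")
napi.download_dataset("v4.1/features.json")

# Choose the exact features and targets you want by name, so that columns
# added to the dataset later cannot silently change your model.
with open("v4.1/features.json") as f:
    feature_metadata = json.load(f)
features = feature_metadata["feature_sets"]["medium"]  # assumed feature-set key
columns = ["era"] + features + ["target_nomi_v4_20"]

train = pd.read_parquet("v4.1/train.parquet", columns=columns)

# Some Sunshine features are entirely missing in early eras. One simple option
# is to impute them with the per-era median; another is to leave the NaNs in
# and let LightGBM handle them natively.
train[features] = train.groupby("era")[features].transform(
    lambda col: col.fillna(col.median())
)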
You can continue to use all previous versions of the data without your scripts breaking, but we believe large gains in performance and True Contribution are possible with Sunshine.
Meta Model Release
In addition to new features, the historical Numerai Meta Model is now also available for download.
With True Contribution, you are being rewarded for making positive alterations to the Meta Model.
Now you can test exactly how your model would have altered the Meta Model historically, and even train directly towards improving the Meta Model. For example, you could build a new target which is the difference between target_nomi and the Meta Model. In essence, training specifically to correct the mistakes the Meta Model makes.
The latest Meta Model predictions will be released 4 weeks after each weekend round, at the same time that the first 20D targets come out for that era.
We plan on adding these to diagnostics in the future to give better estimates of MMC than we are able to give today.
Download it at numer.ai/data or through the API with
from numerapi import NumerAPI
napi = NumerAPI()
# Download the full history of Meta Model predictions
napi.download_dataset("v4.1/meta_model.parquet")
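As a loose sketch of the corrective-target idea described above, here is one way it could look using the validation data. The file and column names are assumptions; inspect the downloaded files to confirm the real schema.

import pandas as pd

# Assumed file and column names; inspect the downloaded files to confirm.
meta_model = pd.read_parquet("v4.1/meta_model.parquet")  # id-indexed Meta Model predictions
validation = pd.read_parquet(
    "v4.1/validation.parquet", columns=["era", "target_nomi_v4_20"]
)

joined = validation.join(meta_model["numerai_meta_model"], how="inner")

# A "corrective" target: the part of target_nomi the Meta Model did not capture.
# A model trained on this column learns to predict the Meta Model's mistakes
# rather than the raw target.
joined["target_minus_meta_model"] = (
    joined["target_nomi_v4_20"] - joined["numerai_meta_model"]
)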
New Example Script
There is a new example script, called example_model_sunshine.py, which uses the new data, new targets, and the new feature set. Below is the performance difference between the previous example model and the new one.
This Sunshine example script builds models on our 6 favorite targets, ensembles them, and does some partial feature neutralization, resulting in a model with much higher Correlation and Numerai Sharpe than the previous example script.
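A compressed sketch of that approach is below, assuming a train DataFrame containing the features and several target columns, a validation DataFrame with the features and an era column, and a features list. The specific targets and hyperparameters are illustrative, not the exact ones the script uses.

import lightgbm as lgb
import pandas as pd

# Illustrative subset of targets; the real script ensembles its six favorites.
targets = ["target_nomi_v4_20", "target_victor_v4_20", "target_jerome_v4_20"]

models = {}
for t in targets:
    rows = train.dropna(subset=[t])  # some targets are missing for some rows
    models[t] = lgb.LGBMRegressor(
        n_estimators=2000, learning_rate=0.01, max_depth=5, colsample_bytree=0.1
    ).fit(rows[features], rows[t])

# Rank each model's predictions within era, then average the ranks into one ensemble.
preds = pd.DataFrame(
    {t: m.predict(validation[features]) for t, m in models.items()},
    index=validation.index,
)
preds["era"] = validation["era"].values
ensemble = preds.groupby("era")[targets].rank(pct=True).mean(axis=1)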
Research Suggestions
With the recent target release, the new data release, and the release of the Meta Model, there are many new directions of research to pursue.
To facilitate and encourage new research, we’ve increased the number of models that each user can submit by 20. This means you can now submit up to 70 models per week.
We’ve also raised the staking threshold by 20% for the Numerai Tournament, from 300k to 360k.
The easiest way to get started is to copy the new example script, which is one of our best internal models to date.
When you are ready to conduct new experiments, here are some of our ideas for the next most valuable directions to take your research.
- Feature selection is more important than ever. This data set is much larger, with almost 1600 features. For one, there is a lot of redundancy in the features, and you can decrease your compute needs greatly by selecting a subset of features. Aside from saving compute though, using a subset of features for some of your models can increase the variety of your models and decorrelate them.
- Build a new target by neutralizing an existing target to the Meta Model. A model trained on this target would then be trained specifically to correct the mistakes it expects the Meta Model to make.
- Find a tournament round where your model gets an unexpected True Contribution result, for example when you have high Corr and FNC but low TC. For which targets do the Meta Model’s predictions outperform your model in that round? Does it do better on its top and bottom 200 predictions than your model does? The answers may be clues to how you can alter your model to improve its True Contribution scores.
- What’s the best way to deal with missing data? Should you fill missing features with the median? Is it better to leave them as NaN and let LGBM deal with it? Should you build a separate model that includes the features which have NaNs, and blend it with a model which has none of the NaN features? Can imputing the missing values with state-of-the-art imputation methods improve models?
- Feature Neutralization - The new example script finds that 50% feature neutralization is a good balance between correlation and consistency (Numerai Sharpe). Is it better to neutralize only to a subset of features? Should some features be neutralized more aggressively than others? (See the sketch after this list.)
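For reference, here is a minimal per-era neutralization sketch in the spirit of what the example script does; the 0.5 proportion and the choice of neutralizer columns are exactly the knobs the questions above are asking about.

import numpy as np
import pandas as pd

def neutralize(df: pd.DataFrame, pred_col: str, neutralizers: list, proportion: float = 0.5) -> pd.Series:
    """Remove `proportion` of the predictions' linear exposure to the neutralizer columns, era by era."""
    out = []
    for _, era in df.groupby("era"):
        preds = era[pred_col].to_numpy(dtype=np.float64)
        exposures = era[neutralizers].to_numpy(dtype=np.float64)
        # The part of the predictions explained linearly by the neutralizers.
        explained = exposures @ (np.linalg.pinv(exposures) @ preds)
        neutral = preds - proportion * explained
        out.append(pd.Series(neutral / neutral.std(), index=era.index))
    return pd.concat(out)

# e.g. neutralize(validation, "prediction", features, proportion=0.5),
# or neutralize against only a subset of features, or against the Meta Model column.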
Feel free to join us in Numerai Quant Club to discuss your findings on these topics or other ones.
Benchmark Models First Look
Here is the first wave of benchmark models that we’re working on.
Take the top model, LG_LGBM_V4_VICTOR60: it uses a large LGBM model (with the same parameters as the new example script), trains on the V4 dataset, and uses target_victor_v4_60 as its target.
The idea is for these models to be constantly testing various ways to build predictions, and showing their TC, in order to illuminate what the Meta Model needs more or less of.
It’s only been a few weeks, but so far all 12 of these models have positive TC, suggesting that more people should try building models in a similar fashion.
The next steps for us are:
- To fill in these scores historically so that you can get a more complete picture of how each of these models has performed over time.
- To add the code to the example script repo showing how to build these models, so that you can build them yourself and improve on them directly.
The Sunshine example model has been added now, so you can track how it performs going forward.
Deprecating V3 Dataset
The V3 dataset has various problems that we solved in V4. These issues make models trained on V3 worse, and they make the dataset much harder to maintain.
That is why, on April 1, 2023, we will be doing a soft deprecation of the V3 dataset.
We recommend that you update any models which are still relying on the V3 dataset by retraining them on the V4.1 dataset, or replacing those models with new ideas entirely, based on experiments with Sunshine.
However, we will keep the V3 dataset running in order to avoid any completely broken pipelines.
There are a couple of changes happening.
First, we will remove the V3 dataset from the website, although we will continue serving the V3 data files through the API so that no automated pipelines break.
Second, the V3 features will be slightly altered.
Recall that the V4.0 data has one-to-one analogs of all 1050 features in the V3 dataset, albeit slightly improved. So at the time of deprecation, the features in the V3 dataset will be changed to the V4 versions of those features (while retaining their V3 names, of course). This will have little to no effect for the vast majority of models, as the corresponding V4 features are very highly correlated with their V3 versions. Even so, it is safer to have already migrated to the V4 dataset by then.
Even though this is a very soft deprecation, we don’t take it lightly, and we don’t have any plans to deprecate any of the other datasets.
The V2 dataset will continue to be available as always, and we plan on continuing support for the V4 and V4.1 datasets for years to come.