Super Massive Data: Sunshine

We believe that if we give the best data to the best community of data scientists, we will create the best hedge fund. And evidence is building that we may be right.

As recently as 2019, Numerai had only 40 features data scientists could use for their machine learning models. Last year, we expanded the number of features to 1050 with the Super Massive Data Release.

Today, we’re releasing the largest set of features since Super Massive Data.

The new dataset contains 405 new features. It also contains all of the features from v4.0, except the 10 dangerous features described in the dangerous features post, for a total of 1586 features. It’s called Super Massive Data: Sunshine, also known as Data Version 4.1.

Download Sunshine now from numer.ai/data. Step into the light.

Overview
Here’s a quick rundown of the updates talked about in this post:

  • 405 new features, for a total of 1586 features
  • Meta Model historical predictions released
  • New example script using new features, new targets, and better modeling
  • You can now submit 20 more models per user, for a total of 70 models
  • We are increasing the staking threshold by 20%, from 300k to 360k on Numerai and from 150k to 180k on Signals
  • Benchmark models are finally in development and coming soon
  • V3 dataset is being deprecated (but not breaking any automatic pipelines)

New Features
In a test with two otherwise identical LightGBM models, adding only the new Sunshine features increased average correlation against target_nomi from 0.033 to 0.035. We expect the Numerai community to be able to make significantly larger improvements to their performance.
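
For reference, the “average correlation” here is a per-era score averaged across eras. Numerai’s official correlation metric has extra details, so treat the following as a rough sketch only; the column names era, prediction, and target_nomi_v4_20 are assumptions:

import pandas as pd

def per_era_correlation(df: pd.DataFrame, pred_col: str, target_col: str) -> pd.Series:
    # Spearman correlation between predictions and the target, computed within each era.
    return df.groupby("era").apply(
        lambda era_df: era_df[pred_col].corr(era_df[target_col], method="spearman")
    )

# Hypothetical usage, where `validation` holds an era column, predictions, and the target:
# scores = per_era_correlation(validation, "prediction", "target_nomi_v4_20")
# print(scores.mean())  # e.g. ~0.033 without the new features vs ~0.035 with them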

Information on how to download the data can be found at numer.ai/data.

One important difference between v4.0 and this new dataset is that some of the new features are not available for all of history. You will see some eras which contain missing values everywhere for some features. If you choose to use these features, make sure your code can properly handle NAs or use some method to impute the missing values.

Furthermore, we plan to add features to this dataset in the coming months, so make sure your code is robust to additional columns by always choosing the exact features and targets you want by name.
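
As a minimal sketch of both points (the file names, the features.json metadata layout, and the target column name are assumptions based on the v4.0 release):

import json

import pandas as pd
from numerapi import NumerAPI

napi = NumerAPI()
napi.download_dataset("v4.1/train.parquet")
napi.download_dataset("v4.1/features.json")  # feature metadata, assumed to exist as in v4.0

# Pin the exact columns you rely on so that newly added features cannot change your pipeline.
with open("v4.1/features.json") as f:
    feature_metadata = json.load(f)
features = feature_metadata["feature_sets"]["medium"]  # or a hand-picked list of feature names
target = "target_nomi_v4_20"  # assumed target column name

train = pd.read_parquet("v4.1/train.parquet", columns=["era", target] + features)

# Some new features are missing for early eras: either impute (a simple global median here)
# or leave the NaNs in place for a model such as LightGBM that handles them natively.
train[features] = train[features].fillna(train[features].median())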

You can continue to use all previous versions of the data without your scripts breaking, but we believe large gains in performance and True Contribution are possible with Sunshine.

Meta Model Release

In addition to new features, the historical Numerai Meta Model is now also available for download.

With True Contribution, you are being rewarded for making positive alterations to the Meta Model.

Now you can test exactly how your model would have altered the Meta Model historically, and even train directly towards improving the Meta Model. For example, you could build a new target which is the difference between target_nomi and the Meta Model. In essence, training specifically to correct the mistakes the Meta Model makes.
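
A hedged sketch of that idea, assuming a DataFrame with the Meta Model’s historical predictions already merged in (the column name numerai_meta_model and the target name are assumptions):

import lightgbm as lgb
import pandas as pd

def residual_target(df: pd.DataFrame, target_col: str = "target_nomi_v4_20",
                    meta_col: str = "numerai_meta_model") -> pd.Series:
    # Difference between the target and the Meta Model's era-wise percentile rank,
    # so a model trained on it focuses on where the Meta Model is most wrong.
    ranked_meta = df.groupby("era")[meta_col].rank(pct=True)
    return df[target_col] - ranked_meta

# df["target_residual"] = residual_target(df)
# model = lgb.LGBMRegressor()  # hyperparameters omitted; see the example script
# model.fit(df[feature_cols], df["target_residual"])  # feature_cols: your chosen features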

The latest Meta Model predictions will be released 4 weeks after each weekend round, at the same time that the first 20D targets come out for that era.

We plan on adding these to diagnostics in the future to give better estimates of MMC than we are able to give today.

Download it at numer.ai/data or through the API with

from numerapi import NumerAPI
napi = NumerAPI()
napi.download_dataset("v4.1/meta_model.parquet")

New Example Script

There is a new example script, called example_model_sunshine.py, which uses the new data, new targets, and the new feature set. Below is the performance difference between the previous example model and the new one.

This Sunshine example script builds models on our 6 favorite targets, ensembles them, and does some partial feature neutralization, resulting in a model with much higher Correlation and Numerai Sharpe than the previous example script.
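
Partial feature neutralization removes a fraction of the component of your predictions that is linearly explained by the features, era by era. The example script has its own implementation; the following is only a minimal sketch of the idea:

import numpy as np
import pandas as pd

def neutralize(df: pd.DataFrame, pred_col: str, feature_cols: list, proportion: float = 0.5) -> pd.Series:
    # Subtract `proportion` of the predictions' linear exposure to the features within each era.
    def _neutralize_era(era_df: pd.DataFrame) -> pd.Series:
        preds = era_df[pred_col].to_numpy(dtype=float)
        exposures = era_df[feature_cols].to_numpy(dtype=float)
        projection = exposures @ np.linalg.pinv(exposures) @ preds
        return pd.Series(preds - proportion * projection, index=era_df.index)

    neutral = df.groupby("era", group_keys=False).apply(_neutralize_era)
    return neutral / neutral.std()  # simple global rescale; the exact post-processing may differ

# df["prediction_neutral_50"] = neutralize(df, "prediction", feature_cols, proportion=0.5)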

Research Suggestions

With the recent target release, the new data release, and the release of the Meta Model, there are many new directions of research to pursue.

To facilitate and encourage new research, we’ve increased the number of models that each user can submit by 20. This means you can now submit up to 70 models per week.

We’ve also raised the staking threshold by 20% for the Numerai Tournament, from 300k to 360k.

The easiest way to get started is to copy the new example script, which is one of our best internal models to date.

When you are ready to conduct new experiments, here are some of our ideas for the next most valuable directions to take your research.

  • Feature selection is more important than ever. This dataset is much larger, with almost 1600 features. There is a lot of redundancy in the features, so you can greatly decrease your compute needs by selecting a subset of them. Beyond saving compute, using a different subset of features for some of your models can increase the variety of your models and decorrelate them (see the sketch after this list).
  • Build a new target by neutralizing an existing target to the Meta Model. A model trained on this target would then be trained specifically to correct the mistakes it expects the Meta Model to make.
  • Find a tournament round where your model gets an unexpected result on True Contribution, for example when you have high Corr and FNC but your TC is low. For which targets do the Meta Model’s predictions outperform your model in that round? Does it do better on its top and bottom 200 predictions than your model? The answers may be clues to how you can alter your model to improve its True Contribution scores.
  • What’s the best way to deal with missing data? Should you fill missing features with the median? Is it better to leave them as NaN and let LGBM deal with it? Should you build a separate model that includes the features which have NaNs, and blend it with a model which has none of the NaN features? Can imputing the missing values with state-of-the-art imputation methods improve models?
  • Feature Neutralization - The new example script finds that 50% feature neutralization is a good balance between correlation and consistency (Numerai Sharpe). Is it better to neutralize only to a subset of features? Should some features be neutralized more aggressively than others?
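
On the feature-selection point, one simple approach is a correlation-based filter that drops near-duplicate features; this is only a hypothetical illustration, not how Numerai constructs its feature sets:

import pandas as pd

def drop_redundant_features(df: pd.DataFrame, feature_cols: list, threshold: float = 0.95) -> list:
    # Greedily keep a feature only if its absolute correlation with every kept feature
    # stays below the threshold. Consider running this on a sample of rows for speed.
    corr = df[feature_cols].corr().abs()
    kept = []
    for col in feature_cols:
        if all(corr.loc[col, k] < threshold for k in kept):
            kept.append(col)
    return kept

# selected = drop_redundant_features(train, features, threshold=0.95)
# Varying the threshold (or using different random subsets) also yields decorrelated model variants.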

Feel free to join us in Numerai Quant Club to discuss your findings on these topics or other ones.

Benchmark Models First Look

Here is the first wave of benchmark models that we’re working on.

Take the top model, LG_LGBM_V4_VICTOR60: a large LGBM model (same parameters as the new example script) trained on the V4 dataset using target_victor_v4_60.
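
For illustration, training a model in that style might look like the sketch below. The hyperparameters are placeholders rather than the actual example-script settings, and train, live, and feature_cols are assumed to be set up as in the data-loading sketch above:

import lightgbm as lgb

# Placeholder hyperparameters for a "large" LGBM; the real example-script parameters differ.
model = lgb.LGBMRegressor(
    n_estimators=2000,
    learning_rate=0.01,
    max_depth=5,
    num_leaves=2 ** 5,
    colsample_bytree=0.1,
)
# model.fit(train[feature_cols], train["target_victor_v4_60"])  # train/feature_cols as above
# predictions = model.predict(live[feature_cols])               # live data for submissions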

The idea is for these models to be constantly testing various ways to build predictions, and showing their TC, in order to illuminate what the Meta Model needs more or less of.

It’s only been a few weeks, but so far all 12 of these models have positive TC, suggesting that more people should try building models in a similar fashion.

The next steps for us are:

  1. To fill in these scores historically so that you can get a more complete picture of how each of these models has performed over time.
  2. To add the code to the example script repo showing how to build these models, so that you can build them yourself and improve on them directly.

The Sunshine example model has been added now, so you can track how it performs going forward.

Deprecating V3 Dataset

The V3 dataset has various problems that we solved in V4. These issues make models trained on V3 worse, and they make the dataset much harder to maintain.

That is why, on April 1, 2023, we will be doing a soft deprecation of the V3 dataset.

We recommend that you update any models which are still relying on the V3 dataset by retraining them on the V4.1 dataset, or replacing those models with new ideas entirely, based on experiments with Sunshine.

However, we will keep the V3 dataset running in order to avoid any completely broken pipelines.

There are a couple of changes happening.

First, we will remove the V3 dataset from the website, although we will continue serving the V3 data files through the API so as not to break any automated pipelines.

Second, the V3 features will be slightly altered.

Recall that the V4.0 data has one-to-one analogs of all 1050 features in the V3 dataset, albeit slightly improved. At the time of deprecation, the features in the V3 dataset will therefore be changed to the V4 versions of those features (while retaining their V3 names, of course). This will have little to no effect for the vast majority of models, as the corresponding V4 features are very highly correlated with their V3 versions. Even so, it is safer to have migrated to the V4 dataset before then.

Even though this is a very soft deprecation, we don’t take it lightly, and we don’t have any plans to deprecate any of the other datasets.

The V2 dataset will continue to be available as always, and we plan on continuing support for the V4 and V4.1 datasets for years to come.

29 Likes

I’ve been curious about how strong the meta-model was for quite a while, so thank you for providing the historical predictions.

However, I must say that I’m a bit disappointed after seeing them. I decided to benchmark the meta-model predictions against a plain LightGBM model trained on V4 data from prior (up to 724) eras. The LightGBM model got a spearman score of 0.0268 while the meta-model got a score of 0.0275 (both scored on eras 888-1028). Am I missing something here?

2 Likes

A couple of things that you might be missing:

  1. The Meta Model typically excels in consistency more than it does in pure mean. You should pay attention to statistics like drawdown as well when doing comparisons of model performances.

  2. V3 Data didn’t even exist yet at the start time of the Meta Model! So you’re comparing your V4 model against a Meta Model which only had access to V2 data at the beginning, V3 data about halfway through, and V4 data only towards the very end.

5 Likes

And the Meta Model didn’t even have the nomi target (which I assume you are scoring on?) until 40-something eras in. It would be interesting to see how it compares to something trained on v2 data with a uniform target.

Hi,

Please consider not altering the v3 features, even a little.

Thank you.

1 Like

The historical Numerai Meta Model data is such a great addition, well done. Also the 20 additional models per user are very welcome as well, especially now with all the new possibilities to test. If only you could deploy the account level staking feature too, then I would be happy to start trying new experiments again.

2 Likes

Randomly missing data is a really bad practice. There is no good way of dealing with it.

Randomly missing data is the rule rather than the exception with real-world data. There are certainly ways of dealing with it. And it is missing for everybody here so you’re at no disadvantage.

5 Likes

I meant there is no satisfactory way of dealing with it conceptually, except ad-hoc hackery.
On the technical side, if I am converting .parquet to .csv, how are these missing items going to show up? Will there be space between two commas , , in its place, or will all the following features in a row be shifted up and then missing at the end? Would you consider at least introducing a marker, such as ,2.0, (= missing item) , by which they can be recognized?

Ad-hoc hackery is underrated. And although you can’t really deal with questions of why something is missing in an obfuscated dataset, you can deal with it in a conceptually coherent way: treat it as a thing you must handle (that everybody else also has to handle) and think about which approach will work best for your methods.

They’d show up as NaNs or NAs, I believe, though there may be settings somewhere to control that (in your import/parquet function). We already have rows with some of the targets missing in v4, so I would expect these to behave the same.
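
To make the CSV question concrete: with pandas defaults a missing value is written as an empty field between two commas, and na_rep lets you choose an explicit marker instead. A quick sketch, assuming the v4.1 file name:

import pandas as pd

df = pd.read_parquet("v4.1/train.parquet")
df.to_csv("train.csv")                       # missing values appear as empty fields: ...,0.25,,0.75,...
df.to_csv("train_marked.csv", na_rep="NA")   # or write an explicit marker in their place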

1 Like

Could you please explain how you arrived at 1586 features?
Under 4.0 we had 1191 + 405 new ones = 1596

There were 10 “dangerous” features removed from the data set.

2 Likes

Hi master_key,

Could you please elaborate on the meaning of:

We’ve also raised the staking threshold by 20% for the Numerai Tournament, from 300k to 360k.

Is that when the exponential decay for the payout factor starts? Or what is the new threshold for?

Thanks!

1 Like

Does the validation dataset now contain the daily eras? If yes, how can we identify them? Round numbers and era numbers don’t match. It would be misleading to train or validate on a dataset that has a weekly cadence for most of it and a daily cadence for the last part; this would skew the model towards the very end of it.

Love the addition of the meta_model prediction results! Regarding these: currently the meta_model.parquet consists of eras 0888-1038. Assuming Numerai continues to update this parquet file with new eras as they become available, is there a plan to backfill eras earlier than 0888? Is that even possible with nomi?

The Diagnostics eras start at 0857. I think it would at least make some sense to include all the eras from the diagnostics set, if possible

No, all of the data is still only weekly.

Not really possible to include older Meta Models, just due to system changes prior to the first era chosen.

You can see some more detail here: Numerai Tournament Overview - Numerai Tournament

The gist is that once there is more than 300k staked across all users, everyone’s payouts are multiplied by 300k / total_stake.

But that 300k is being changed to 360k now.
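
In other words, the payout factor behaves roughly like this (a sketch of the description above, not the official payout code):

def payout_factor(total_stake: float, threshold: float = 360_000) -> float:
    # Payouts are scaled down once the total stake across all users exceeds the threshold.
    return min(1.0, threshold / total_stake)

# payout_factor(450_000) == 0.8, so everyone's payouts would be multiplied by 0.8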

It’s great that the historical Meta Model predictions have been released. I’m wondering, though, what the correct way is to calculate TC for my models based on this. Any pointers?

Just in case, should we expect NAs to appear in tournament data as well, or are they only in the historical data? Thank you!

2 Likes