Which is the current dataset?

There seem to be at least four different versions of datasets now and new ones are being added without warning. I may have been using an old version to my cost.
Anyway, I have only just updated to v3 and the diagnostics have improved.
What about v4? Is there any notice anywhere that would say at all times which is currently the latest data to use and explain its properties?

Or can someone at least tell me here? Should I have gone straight to v4?

Did you miss this old announcement? https://forum.numer.ai/t/super-massive-data-release-deep-dive/

And this might help too: Numerai

The above posted link to Numerai has tabs for all the dataset versions.

Also see this post: Removing Dangerous Features

Thanks. I guess I should be using v4 then, but I cannot read .parquet files.
Are .int8.csv versions still available somewhere, as they were for the previous versions?

What I mean is that I cannot convert .parquet because most converters stupidly try to read the entire file all at once, and I do not have enough memory for that.
I need to be able to read this data line by line, removing the nonsensical ‘feature names’ and converting to plain .csv.

I had the same memory issues, so I wrote a script that reads the parquet file once and splits it into single-era csv files. With csv files you can explicitly force the column data types at read time to be float16, which I really wish were available for parquet as well. However, there is the "v4/train_int8.parquet" file; it should be more manageable, but I haven’t tried it.

"v4/train_int8.parquet" - yes, those are the files I am trying to use but they are still too big for the parquet readers.

I did not miss the original supermassive announcement and I did reprogram everything to work with it. That data is now called v3, I believe. I was automatically downloading the _int8 files. I did miss whatever announcements there may have been later about the labelling and v4. Since the _int8 files were for some reason discontinued in v4, the numerapi downloader just kept on getting v3. I was totally unaware that anything was amiss, apart from the fact that my model started performing badly and I lost money. Not an edifying experience.

As a result, my opinion of numerai is currently very low.

The int8 files were not discontinued; it was the csv files that were dropped for the newer dataset.

In any case, none of the data updates have anything to do with your model performance going down – that’s just the market vs your model. Plenty of people are still running models based on both v3 & v2 data.

But why were the csv files terminated? There is no good reason for it. See my more recent post entitled ‘Data Availability and Compression Methods’.

Changing data formats like this is not at all user friendly. Some of us have our own processing implementations which rely on the original format.

Too big, I guess. What was supposed to happen with the parquet files and/or api was the ability to pick and choose which eras (and even which columns) you wanted to download (and apparently parquet files are capable of that if you set it up properly). But that was never implemented. (Unless there is some magic there that actually does work that nobody knows about?)

The trouble is really only with the val set because it is a big file and changes every week. For submissions, we can download the live set only, and we only need to submit the live era now as well so that’s become much nicer and a lot faster.

The training file is huge but you only ever have to process it once, since it doesn’t change. I know there are those who download the training file every week, but frankly that is a very poor way to do things, so I’m going to put the blame on the user if they do that, because doing that is silly – just save the data.
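
"Just save the data" can be as simple as a download-if-missing wrapper. A minimal sketch follows; `ensure_local` is a hypothetical helper, and `download` stands for whatever function you already use to fetch a file (it is passed in by the caller and is not a real library call here).

```python
import os


def ensure_local(filename, local_dir, download):
    """Return a local path for `filename`, downloading only if it is missing.

    `download(filename, dest_path)` is a caller-supplied function
    (hypothetical) that fetches the file and writes it to `dest_path`.
    """
    dest = os.path.join(local_dir, os.path.basename(filename))
    if not os.path.exists(dest):
        os.makedirs(local_dir, exist_ok=True)
        tmp = dest + ".part"
        download(filename, tmp)
        os.replace(tmp, dest)  # only a complete download gets the final name
    return dest
```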

With the validation set, the existing feature data never changes, but new eras and targets are added as they become available. It is somewhat smaller than the training file (talking v4 here) but of course growing all the time. So that’s the one that has to be dealt with if you want those updates regularly, and they really should break it up or fix the api so we can just download the new stuff.

That is all true, except that .csv is not “too big” but rather the opposite, when you apply some standard no-nonsense compression to it. I did an experiment on v3 data in the other thread, which proves that it is, in fact, less than half the size of your chosen format:

227501648 Nov 1 12:05 numerai_validation_data.parquet
107301737 Nov 1 12:17 numerai_validation_data_int8.csv.lzma
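
For anyone who wants to repeat the comparison, here is a small sketch using only the Python standard library. The two byte counts are copied from the listing above; `lzma_compress` is a hypothetical helper name, and the default LZMA settings it uses may differ from those used in the original experiment.

```python
import lzma
import os
import shutil


def lzma_compress(src_path, dst_path):
    """Compress a file with LZMA in a streaming fashion; return the size ratio."""
    with open(src_path, "rb") as src, lzma.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return os.path.getsize(dst_path) / os.path.getsize(src_path)


# Sizes in bytes, taken from the listing above.
parquet_bytes = 227501648
csv_lzma_bytes = 107301737
ratio = csv_lzma_bytes / parquet_bytes  # a little under one half
```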

PS. I understand that the training data does not change. However, that is the problem. Given that it was collected in some remote past, when the market conditions were very different, its validity is questionable at best. I think numerai is falling here into the trap of making the same comfortable assumptions as most ‘investment advisors’. They all subsist on the assumption that markets are always going up and up and are happy to cash in on it. Frankly, as long as that assumption holds, it takes zero skill anyway. When the markets inevitably turn sour, then it is: “sorry buddy, not my fault”.

Man, does anyone else feel the quality of the numerai operations has gone downhill the past few months? It just seems I have been noticing more bugs and weird things. I don’t know if I was just not paying attention before or something. Nothing major, but it makes me feel a bit anxious about the stability of the system.


So…then use the newest data instead? Or do Signals. Nobody is forcing anything on you.

btw, there is an int8 parquet val file which is currently about 1GB

I’m quite worried too. Don’t think we’re quite at the ‘99.99% sure everything’s correct’ level yet that @slyfox spoke about in the last FSC. Difficult to tell from the outside but there appears to be plenty of opportunity to strengthen controls around change and incident management.


Yeah, I recently had a pretty large drop, suspiciously around the same time that the daily uploading started. It definitely makes me a bit anxious about staking on dailies, as I’m now uncertain whether the drop is related to a bug in the daily pipeline or is just coincidence.

Roughly a 20% drop in 1 day on my model: Numerai

By far the largest drop since the inception of my model. I might have to lower my TC stake until I can feel more confident that things are working as intended.


The non changing training data was actually one of the reasons I quit numerai some time ago. I came back once I saw that numerai now uploads an updated file containing the latest eras with targets. This means you can choose your own train/test split up until the newest eras. I guess the reason why the “vanilla” train/test split exists is to protect newcomers from being overconfident in their models.

Regarding your issues with parquet files: I am a little confused about where your memory bottleneck is. Is it your disk space or your RAM that is constraining you? To me it sounds like you are trying to train your model in a very disk- and RAM-constrained environment.
Because parquet files are column-oriented, as opposed to row-oriented csv files, you can read only a few columns of a parquet file, pick out the specific rows you need, clear the parquet data from RAM, and then read the next columns. Obviously this is very slow, but you can use this technique to at least split the data into single eras, which are handier.

Another tip: you don’t have to read all columns. Columns 210-1049 are almost identical copies of columns 0-209 (see correlation matrix here in cell 12), so you can save memory by reading just columns 0-209 and columns 1050-1190, leaving you with 351 features that have almost the same information content.
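
As a small illustration of the selection, assuming `feature_names` is the list of feature columns in file order with non-feature columns such as era and targets already removed (an assumption), and counting columns from 0; `reduced_feature_names` is a hypothetical helper.

```python
def reduced_feature_names(feature_names):
    """Keep features 0-209 and 1050 onward, skipping the near-duplicate block."""
    keep = list(range(0, 210)) + list(range(1050, len(feature_names)))
    return [feature_names[i] for i in keep]
```

With 1191 features this keeps 210 + 141 = 351 names, which can be passed as the `columns` argument of a parquet reader.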


I am now totally confused. I am trying to change to the v4 dataset but the tournament file is missing from it. However, the data description on the web says that it is the only one that changes weekly and must be used to generate the predictions. But the tournament files before v4 have a different number of features and thus are incompatible with v4.
I can generate live predictions based on v4 training and v4 live data files but where do I get the test predictions from?

You only need to submit the live era. (Now true no matter which dataset you use – just submit live, it is way faster.) In v4 the validation set grows each week – new eras with targets added as they become available. (You can use these for training as well, or even exclusively.) There are still a few “test” eras at the end of the validation file – these are the most recent eras that don’t have targets yet.
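
To separate the validation rows that already have targets from the most recent ones that do not, something like the following could work. It is a sketch assuming pandas is installed, that the file has been loaded into a DataFrame, and that the target column is named `target` with missing values where no target exists yet (both assumptions); `split_by_target` is a hypothetical name.

```python
import pandas as pd


def split_by_target(validation, target_column="target"):
    """Split validation rows into those with a target and those still without."""
    has_target = validation[target_column].notna()
    resolved = validation[has_target]   # usable for training or evaluation
    pending = validation[~has_target]   # recent "test" eras, no targets yet
    return resolved, pending
```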

Thanks! I wish this kind of crucial information would be posted somewhere prominently instead of misleading outdated instructions.
I notice that the v4 validation file is all clean now, with the new features.
However, the v4 training set is useless and conflicting, as it still contains the old feature set.