V4 data release - questions

Hi,

  • I downloaded the new v4 data and the last era (era 1004) has its target filled with NaNs.
    Does era 1004 correspond to round 310? In other words, is era 1004 the current era?

  • Also, are there any missing weeks? Is it always true that era(X) and era(X+1) correspond to consecutive weeks?

  • Another thing I found is that, compared to V3, some eras have a different number of instances. Is that OK? (I checked only the validation data.)

era  V3   V4
0871 4910 4911
0872 4918 4919
0873 4932 4933
0875 5051 5052
0902 4997 5002
0912 5174 5182
0913 5001 5009
0914 5203 5212
0915 5185 5193
0916 5183 5191
0936 5192 5191
  • How does the mapping of features in features.json work? I wanted to reuse my previous research on features so I wouldn’t have to start over; however, I found only an array of feature names inside the json file: json.load(open('features.json'))['feature_sets']['v3_equivalent_features']. I assumed it should match the ordering of features from the V3 dataset, but when I run a correlation test on each pair it is not an exact match (the per-feature correlations, and a sketch of the check, follow):
0.9930286498066472
0.9973728390894694
0.9618079870303051
0.9936341791676602
0.9831262504964846
0.9934326084172069
0.9990904802940008
0.9916135182068833
1.0
1.0
1.0
0.9996968267646669
0.998989490641361
0.9987875113360812
0.9989895927800676
0.9989895927800676
1.0
0.9937375180105527
0.9704001245800853
0.9956573427977774
0.9165522644319917
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
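
For reference, the check was roughly this (a sketch: the parquet paths are placeholders, and it assumes the i-th V3 feature column pairs with the i-th name in v3_equivalent_features):

import json
import pandas as pd

# Pull the v3-equivalent feature names out of the v4 metadata.
with open("features.json", "r") as f:
    feature_metadata = json.load(f)
v3_equivalent = feature_metadata["feature_sets"]["v3_equivalent_features"]

# Placeholder paths - point these at the downloaded parquet files.
v3 = pd.read_parquet("v3/numerai_validation_data.parquet")
v4 = pd.read_parquet("v4/validation.parquet")

# If the files are not already indexed by id, uncomment:
# v3 = v3.set_index("id"); v4 = v4.set_index("id")

v3_features = [c for c in v3.columns if c.startswith("feature")]

# Pair columns by position and correlate them over the ids present in both files.
for old_name, new_name in zip(v3_features, v3_equivalent):
    joined = pd.concat(
        [v3[old_name].rename("v3"), v4[new_name].rename("v4")],
        axis=1, join="inner",
    )
    print(new_name, joined["v3"].corr(joined["v4"]))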

Thank you!

Sneaky

3 Likes

Another question: Will there be int8 versions for V4?

@sneaky GitHub - miciasto/numerai

1 Like

It is there, I managed to download it via the API. I am AFK, but I think you just need to add the suffix _int8.
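
Something like this, for example (a sketch; the exact file names are my assumption, so list the available datasets first to confirm):

from numerapi import NumerAPI

napi = NumerAPI()

# See every file the data API exposes and keep the v4 int8 variants.
int8_files = [f for f in napi.list_datasets() if f.startswith("v4/") and "_int8" in f]
print(int8_files)

# Assumed name based on the _int8 suffix mentioned above; check that it
# appears in the listing before downloading.
napi.download_dataset("v4/train_int8.parquet", "train_int8.parquet")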

How can you specify in the API to download the v4 data instead of v3?

@mundan see https://numer.ai/data/v4

1 Like

This link should definitely be added to Numerai Tournament Overview - Numerai Tournament.

thanks! I ended up there

Here are the shapes of the v3 and v4 data, for v3/train, v3/val, v3/tour, v4/train, and v4/val respectively:

 (2412105, 1073)  # v3/train
  (539658, 1073)  # v3/val
 (1412927, 1073)  # v3/tour
 (2420521, 1214)  # v4/train
 (2203644, 1214)  # v4/val

v4/val has some test entries, but most of them are validation entries (2176973 of them have targets)
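
If anyone wants to reproduce that count, a quick sketch (assuming the file has the usual data_type and target columns):

import pandas as pd

# Read only the two columns needed for the count to keep memory down.
val = pd.read_parquet("v4/validation.parquet", columns=["data_type", "target"])

# How many rows are labelled validation vs. test...
print(val["data_type"].value_counts())

# ...and how many of them actually have a target.
print(val["target"].notna().sum())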

No more unlabeled data then!

1 Like

so just to settle this question, because I was wondering the same:

  • I downloaded the new v4 data and the last era (era 1004) has its target filled with NaNs.
    Does era 1004 correspond to round 310? In other words, is era 1004 the current era?

which means:

  • the last era that is shipped in validation data corresponds to what was shipped as live data the week before
  • in other words: era = round + 695
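  • for example: the NaN-target era 1004 lines up with round 309 (309 + 695 = 1004), i.e. the live data from the week before, while round 310 maps to era 1005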
2 Likes

Is it possible that the json file “v4/features.json” contains invalid json?

I am downloading the file via NumerAPI
napi.download_dataset("v4/features.json", "features.json")

When trying to parse the file using:

with open(".../<path to feature file>/v4/features.json", "r") as f:
    feature_metadata = json.load(f)
features = feature_metadata["feature_sets"]["v3_equivalent_features"]

I get the following error:

JSONDecodeError("Extra data", s, end)
JSONDecodeError: Extra data

Also the Chrome extension “{JSON} Editor” tells me that the json is invalid.

Am I missing something here?
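
One way to narrow this down is to ask the decoder where the first complete JSON value ends, since that is what “Extra data” refers to (a sketch):

import json

# Read the raw text and report where the first JSON value stops;
# "Extra data" means the parser found more content after that point.
with open("features.json", "r") as f:
    raw = f.read().lstrip()

value, end = json.JSONDecoder().raw_decode(raw)
print(f"first JSON value ends at character {end} of {len(raw)}")
print("what follows:", repr(raw[end:end + 80]))

If the trailing content looks like a duplicated or truncated fragment, re-downloading the file is probably the easiest fix.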

All, I am just starting to work with the V4 data. I have a question: in one of the older example scripts I have, the data is sampled at every 4th era to prevent “overlap,” but looking at the example scripts now, that bit of code has been removed. Is sampling every 4th era still required to get a non-overlapping set of data?

thanks

Mike

Yes. That doesn’t mean you can’t train on adjacent eras, but you shouldn’t train on era X and then test/validate on era X+1 (or X+2 or X+3), because they overlap in the 20-day (4-week) targets. If you use the 60-day (12-week) targets, then your gap should be 12 eras to cover that. So it’s all about avoiding “false validation”, which may or may not come into your training/validation setup. For instance, I just train on all 574 training eras (at least – with v4 even more is available) and only use the later eras for validation/testing. Training on every 4th era is obviously a lot quicker and less resource-intensive too, since it uses only 25% of the data. You are not losing as much as it sounds like this way, because the overlapped eras do tend to be very similar.
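
As a concrete illustration, splitting by era with a gap looks roughly like this (a sketch, assuming the era column in the v4 training parquet and the 20-day target):

import pandas as pd

# Placeholder path; point this at the downloaded v4 training data.
train = pd.read_parquet("v4/train.parquet")
eras = sorted(train["era"].unique())

# Option 1: sample every 4th era, so no two kept eras overlap in the 20-day target.
every_4th = train[train["era"].isin(eras[::4])]

# Option 2: train on everything up to some era, then leave a gap before validating.
# With the 20-day (4-week) target a 4-era gap is enough; use 12 for the 60-day target.
gap = 4
split = int(len(eras) * 0.8)          # illustrative 80/20 split by era
fit_df = train[train["era"].isin(eras[:split])]
val_df = train[train["era"].isin(eras[split + gap:])]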

2 Likes

Is this also the case with the validation data?

There is no difference between the train and validation data other than those labels. Train+val as a block is simply consecutive eras (weeks) 1 through 1000-something (whatever it is up to now).