I downloaded the new v4 data and the last era (era 1004) has target filled with NaNs.
Does era 1004 correspond to round 310? In other words, is era 1004 the current era?
Also, are there any missing weeks? Is it always true that era X and era X+1 correspond to consecutive weeks?
Another thing I found is that, compared to V3, some eras have a different number of rows. Is that OK? (I checked only the validation data.)
How does the mapping of features in features.json work? I wanted to reuse my previous research on features so I wouldn't have to start over; however, I found only an array of feature names inside the JSON file: json.load(open('features.json'))['feature_sets']['v3_equivalent_features']. I assumed it would match the ordering of features from the V3 dataset, but when I run a correlation test on each pair it is not an exact match.
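Since v3_equivalent_features is a list of feature-name strings rather than positions, matching by column name rather than by index is the safer approach. A minimal sketch of that top-level shape, using a toy features.json with made-up feature names (the real file is larger, but the "feature_sets" layout is the same):

```python
import json
import os
import tempfile

# Illustrative only: a toy features.json with the same top-level shape
# ("feature_sets" maps set names to lists of feature-name strings).
toy = {
    "feature_sets": {
        "v3_equivalent_features": [
            "feature_toy_alpha",   # made-up names for illustration
            "feature_toy_beta",
        ]
    }
}
path = os.path.join(tempfile.mkdtemp(), "features.json")
with open(path, "w") as f:
    json.dump(toy, f)

# json.load takes a file object, not a filename string
with open(path, "r") as f:
    feature_metadata = json.load(f)

v3_like = feature_metadata["feature_sets"]["v3_equivalent_features"]

# The entries are column names, so select dataframe columns by name,
# not by position -- the v4 column ordering need not match v3's.
assert all(name.startswith("feature_") for name in v3_like)
```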
So, just to settle this question, because I was wondering the same:
I downloaded the new v4 data and the last era (era 1004) has target filled with NaNs.
Does era 1004 correspond to round 310? In other words, is era 1004 the current era?
Is it possible that the JSON file "v4/features.json" is invalid?
I am downloading the file via NumerAPI: napi.download_dataset("v4/features.json", "features.json")
When trying to parse the file using:
with open(".../<path to feature file>/v4/features.json", "r") as f:
    feature_metadata = json.load(f)
features = feature_metadata["feature_sets"]["v3_equivalent_features"]
I get the following error:
    raise JSONDecodeError("Extra data", s, end)
JSONDecodeError: Extra data
The Chrome extension "{JSON} Editor" also tells me that the JSON is invalid.
All, I am just starting to work with the V4 data. I have a question: in one of the older example scripts I have, the data is sampled at every 4th era to prevent "overlap," but in the current example scripts that bit of code is removed. Is sampling every 4th era still required to get a non-overlapping set of data?
Yes. That doesn't mean you can't train on adjacent eras, but you shouldn't train on era X and then test/validate on era X+1 (or X+2 or X+3), because they overlap in the 20-day (4-week) targets. If you use the 60-day (12-week) targets, your gap should be 12 eras to cover that. So it's all about avoiding "false validation," which may or may not come into your training/validation setup. For instance, I just train on all 574 training eras (at least; with v4 even more is available) and only use the later eras for validation/testing. Training on every 4th era is obviously a lot quicker and less resource-intensive too, since it uses only 25% of the data. You are not losing as much as it sounds like this way, because the overlapped eras do tend to be very similar.
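The era arithmetic above can be sketched in a few lines. The era numbers and split point here are illustrative, not the actual v4 boundaries:

```python
# With weekly eras and 20-day (4-week) targets, eras X..X+3 share part
# of the same target window, so every-4th-era sampling gives a set of
# eras whose targets do not overlap.
train_eras = list(range(1, 575))      # eras 0001..0574 as integers
non_overlapping = train_eras[::4]     # eras 1, 5, 9, ... (25% of the data)

# For a train/validation split, leave a gap of at least 4 eras between
# the last training era and the first validation era (12 eras if you
# use the 60-day targets). The split point 400 is arbitrary here.
gap = 4
train = [e for e in train_eras if e <= 400]
valid = [e for e in train_eras if e > 400 + gap]

# No validation era falls inside the overlap window of any training era.
assert max(train) + gap < min(valid)
```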
There is no difference between the train and validation data other than those labels. train+val as a block is simply consecutive eras (weeks) from 1 up to 1000-something (whatever it is up to now).