So, when reading the first 1000 columns of the v4.1/train_int8.parquet and v4/train_int8.parquet files, there are massive differences in memory usage even though the data are exactly the same:
data = pd.read_parquet('training_data_v41.parquet', columns=feature_cols[:1000])
data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 2420521 entries, n003bba8a98662e4 to nfff2bd38e397265
Columns: 1000 entries, feature_honoured_observational_balaamite to
feature_intime_impassible_ferrule
dtypes: Int8(1000)
memory usage: 4.5+ GB
and
data = pd.read_parquet('training_data_v4.parquet', columns=feature_cols[:1000])
data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 2420521 entries, n003bba8a98662e4 to nfff2bd38e397265
Columns: 1000 entries, feature_honoured_observational_balaamite to
feature_intime_impassible_ferrule
dtypes: int8(1000)
memory usage: 2.3+ GB
The 4.5 GB vs. 2.3 GB causes a 10.3 times slowdown in my pipeline. I figured the memory difference is probably because the capitalized 'Int8' dtype is pandas' nullable extension type, which carries a separate mask for missing values, while the lowercase 'int8' is the plain NumPy dtype and does not (I believe). But most columns don't have missing values, so in that case the organizers should store these as regular 'int8' types.
Does anyone have a nice workaround for this? I guess you could cast it to int8, but I haven't really tested it yet (and it seems to require a fair bit of data copying).
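A minimal sketch of that cast, using a toy frame in place of the real parquet read (column names here are made up). The key point is that NAs must be filled before downcasting, because plain int8 has no NA representation, and yes, this copies the data once:

```python
import pandas as pd

# Toy stand-in for the parquet read: nullable Int8 columns store a
# value byte plus a mask byte per row, roughly doubling memory.
data = pd.DataFrame({
    "feature_a": pd.array([0, 1, pd.NA, 4], dtype="Int8"),
    "feature_b": pd.array([2, 2, 3, 3], dtype="Int8"),
})

# Fill NAs (here with 0) before downcasting; plain int8 cannot
# represent missing values. astype copies the data once.
compact = data.fillna(0).astype("int8")
```

On columns that have no NAs at all, `astype("int8")` alone is enough.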
I never use the 'raw' data. I download it once a week, fill in the NAs, convert to int8, cache locally, and then just use my cache. Rinse and repeat when the historical data is updated.
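Roughly, that cache step can be sketched like this (file names and the fill value are placeholders; pickle is used for the local cache here because it round-trips pandas dtypes exactly):

```python
import pandas as pd
from pathlib import Path

CACHE = Path("train_int8_cache.pkl")  # hypothetical local cache path

def load_training_data(remote_path: str) -> pd.DataFrame:
    """Return the cached int8 frame, building the cache on first use."""
    if CACHE.exists():
        return pd.read_pickle(CACHE)
    raw = pd.read_parquet(remote_path)    # the weekly download
    clean = raw.fillna(0).astype("int8")  # fill NAs, drop the nullable dtype
    clean.to_pickle(CACHE)                # rinse and repeat on the next update
    return clean
```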
Is the historical data updated every week, or how can we tell when it has actually been updated? I do not want to duplicate this massive processing, even if it is just once every week.
Nothing changes; it is only added to. Except this particular week, because MikeP said he found some things that weren't quite right in v4 & v4.1. So the validation data has changed slightly, and the training data has changed very, very slightly, apparently. But normally nothing changes.
You can detect an update because there is more data than last time. Once in a while it is not updated for a week and then doubles the next week, but possibly they've addressed that so it is always adding an era a week (with new targets). Basically, if you want to keep up-to-date, you need to download the validation data each week, compare the eras available (and the targets available, if you're interested in those) to what you've already got, and add the new stuff. But you don't have to worry about the old stuff changing.
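Under those assumptions, the weekly check can be sketched as a set difference on the era column (the toy frames and era labels here stand in for the cached and freshly downloaded validation data):

```python
import pandas as pd

# Toy stand-ins: what is cached locally vs. this week's download.
cached = pd.DataFrame({"era": ["0571", "0571", "0572"], "target": [1, 0, 1]})
latest = pd.DataFrame({"era": ["0571", "0572", "0573"], "target": [1, 1, 0]})

# Eras present in the new file but not yet cached.
new_eras = set(latest["era"]) - set(cached["era"])

# Append only the new rows; the old rows never change.
updated = pd.concat(
    [cached, latest[latest["era"].isin(new_eras)]], ignore_index=True
)
```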
Let me clarify some more: a new era is added each week (usually), and targets are added to existing eras as those targets become available (20d & 60d). So there will be some brand-new era added with no targets, some previously existing era will have its 20d targets added, and farther back some era will get its 60d targets added. So there are actually changes to different eras, target-wise. But the feature data never changes (except in special cases like this week, as per above), and once filled in, the targets don't change.
I only care about eras with all targets filled in (20d & 60d), so I'm just looking for that one new fully finished era each week and I ignore all the newer stuff.
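A sketch of that filter, keeping only eras where both target columns are fully present (the column names are illustrative, not necessarily the exact ones in the file):

```python
import pandas as pd

df = pd.DataFrame({
    "era": ["0571", "0571", "0572", "0573"],
    "target_20d": [0.50, 0.25, 0.75, None],  # 20d targets arrive first
    "target_60d": [0.75, 0.25, None, None],  # 60d targets arrive later
})

# A row is complete if both targets are present; an era is "finished"
# only if every one of its rows is complete.
row_ok = df[["target_20d", "target_60d"]].notna().all(axis=1)
era_ok = row_ok.groupby(df["era"]).transform("all")
complete = df[era_ok]
```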