So, when reading the first 1000 columns of the v4.1/train_int8.parquet and v4/train_int8.parquet files, there are massive differences in memory usage even though the data are exactly the same:
data = pd.read_parquet('training_data_v41.parquet', columns=feature_cols[:1000])
data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 2420521 entries, n003bba8a98662e4 to nfff2bd38e397265
Columns: 1000 entries, feature_honoured_observational_balaamite to
feature_intime_impassible_ferrule
dtypes: Int8(1000)
memory usage: 4.5+ GB
and
data = pd.read_parquet('training_data_v4.parquet', columns=feature_cols[:1000])
data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 2420521 entries, n003bba8a98662e4 to nfff2bd38e397265
Columns: 1000 entries, feature_honoured_observational_balaamite to
feature_intime_impassible_ferrule
dtypes: int8(1000)
memory usage: 2.3+ GB
The 4.5 GB vs. 2.3 GB causes a 10.3 times slowdown in my pipeline. I figured the memory difference is probably because the capitalized 'Int8' dtype is pandas' nullable extension type, which carries a separate mask for missing values, while the lowercase 'int8' is the plain NumPy dtype and does not (I believe). But most columns don't have missing values, so in that case the organizers should store these as regular 'int8' types.
Does anyone have a nice workaround for this? I guess you could cast it to int8, but I haven't really tested it yet (and it seems to require a fair bit of data copying).
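A minimal sketch of that cast, using a toy frame in place of the real parquet read (column names here are made up). The key point is that NAs must be filled before downcasting, because plain int8 has no NA representation, and yes, this copies the data once:

```python
import pandas as pd

# Toy stand-in for the parquet read: nullable Int8 columns store a
# value byte plus a mask byte per row, roughly doubling memory.
data = pd.DataFrame({
    "feature_a": pd.array([0, 1, pd.NA, 4], dtype="Int8"),
    "feature_b": pd.array([2, 2, 3, 3], dtype="Int8"),
})

# Fill NAs (here with 0) before downcasting; plain int8 cannot
# represent missing values. astype copies the data once.
compact = data.fillna(0).astype("int8")
```

On columns that have no NAs at all, `astype("int8")` alone is enough.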
I never use the 'raw' data. I download it once a week, fill in the NAs, convert to int8, cache locally, and then just use my cache. Rinse and repeat when the historical data is updated.
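Roughly, that cache step can be sketched like this (file names and the fill value are placeholders; pickle is used for the local cache here because it round-trips pandas dtypes exactly):

```python
import pandas as pd
from pathlib import Path

CACHE = Path("train_int8_cache.pkl")  # hypothetical local cache path

def load_training_data(remote_path: str) -> pd.DataFrame:
    """Return the cached int8 frame, building the cache on first use."""
    if CACHE.exists():
        return pd.read_pickle(CACHE)
    raw = pd.read_parquet(remote_path)    # the weekly download
    clean = raw.fillna(0).astype("int8")  # fill NAs, drop the nullable dtype
    clean.to_pickle(CACHE)                # rinse and repeat on the next update
    return clean
```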
Is the historical data updated every week, or how can we tell when it has actually been updated? I do not want to duplicate this massive processing, even if it is just once every week.
Nothing changes; it is only added to. Except this particular week, because MikeP said he found some things that weren't quite right in v4 & v4.1. So the validation data has changed slightly, and the training data has changed very, very slightly, apparently. But normally nothing changes.
You can detect an update because there is more data than last time. Once in a while it is not updated for a week and then doubles the next week, but possibly they've addressed that so it is always adding an era a week (with new targets). Basically, if you want to keep up-to-date, you need to download the validation data each week, compare the eras available (and the targets available, if you're interested in those) to what you've already got, and add the new stuff. But you don't have to worry about the old stuff changing.
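Under those assumptions, the weekly check can be sketched as a set difference on the era column (the toy frames and era labels here stand in for the cached and freshly downloaded validation data):

```python
import pandas as pd

# Toy stand-ins: what is cached locally vs. this week's download.
cached = pd.DataFrame({"era": ["0571", "0571", "0572"], "target": [1, 0, 1]})
latest = pd.DataFrame({"era": ["0571", "0572", "0573"], "target": [1, 1, 0]})

# Eras present in the new file but not yet cached.
new_eras = set(latest["era"]) - set(cached["era"])

# Append only the new rows; the old rows never change.
updated = pd.concat(
    [cached, latest[latest["era"].isin(new_eras)]], ignore_index=True
)
```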
Let me clarify some more: a new era is added each week (usually), and targets are added to existing eras as those targets become available (20d & 60d). So there will be some brand-new era added with no targets, some previously existing era will have its 20d targets added, and farther back some era will get its 60d targets added. So there are actually changes to different eras, target-wise. But the feature data never changes (except in special cases like this week, as per above), and once filled in, the targets don't change.
I only care about eras with all targets filled in (20d & 60d), so I'm just looking for that one new fully finished era each week and I ignore all the newer stuff.
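A sketch of that filter, keeping only eras where both target columns are fully present (the column names are illustrative, not necessarily the exact ones in the file):

```python
import pandas as pd

df = pd.DataFrame({
    "era": ["0571", "0571", "0572", "0573"],
    "target_20d": [0.50, 0.25, 0.75, None],  # 20d targets arrive first
    "target_60d": [0.75, 0.25, None, None],  # 60d targets arrive later
})

# A row is complete if both targets are present; an era is "finished"
# only if every one of its rows is complete.
row_ok = df[["target_20d", "target_60d"]].notna().all(axis=1)
era_ok = row_ok.groupby(df["era"]).transform("all")
complete = df[era_ok]
```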