Data Availability and Compression Methods

I am having difficulties using the v4 data because the .csv files are missing from that set. Why? The .parquet readers waste unreasonable amounts of memory because they insist on reading all of the data at once instead of reading it in the time-honoured way, row by row.

Is transmission bandwidth the problem?

In that case, may I ask for the _int8.csv files to be compressed with some standard method, please? In that form they will be even smaller than the .parquet versions and, crucially, much easier to use.
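
To illustrate the "easier to use" point: a compressed CSV can still be streamed. Here is a minimal sketch, assuming a hypothetical file name numerai_training_data_int8.csv.lzma; pandas' read_csv accepts a file object opened with the standard lzma module plus a chunksize, so only one batch of rows is in memory at a time.

import lzma
import pandas as pd

CSV_LZMA = 'numerai_training_data_int8.csv.lzma'  # hypothetical file name

row_count = 0
with lzma.open(CSV_LZMA, mode='rt') as fh:
    # chunksize makes read_csv yield DataFrames of 100,000 rows at a time
    for chunk in pd.read_csv(fh, chunksize=100_000):
        row_count += len(chunk)
print(f'{row_count} rows read, never more than one chunk held in memory')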

To back this assertion up, I performed a comparison on the v3 data (the last version available in non-.parquet form), using standard lzma compression with default settings.

227501648 Nov 1 12:05 numerai_validation_data.parquet
107301737 Nov 1 12:17 numerai_validation_data_int8.csv.lzma

As you can see, the _int8.csv.lzma version is less than half the size.
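
For reference, the compression step is easy to reproduce with Python's standard lzma module; a minimal sketch, assuming the uncompressed v3 file is named numerai_validation_data_int8.csv (FORMAT_ALONE selects the legacy .lzma container to match the listing above):

import lzma
import shutil

SRC = 'numerai_validation_data_int8.csv'  # assumed name of the uncompressed v3 file
DST = SRC + '.lzma'

# FORMAT_ALONE writes the legacy .lzma container; the preset stays at its default.
# Streaming through copyfileobj keeps memory use flat regardless of file size.
with open(SRC, 'rb') as fin, lzma.open(DST, 'wb', format=lzma.FORMAT_ALONE) as fout:
    shutil.copyfileobj(fin, fout)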

You can use Dask dataframes instead of pandas to read the parquet files without loading them entirely into memory.

Example: Read parquet, only keep every 4th era, and compute to pandas

import dask.dataframe as dd

ERA_COL = 'era'  # name of the era column in the Numerai dataset

training_data = dd.read_parquet('train.parquet')
# pare down the number of eras to every 4th era
unique_eras = training_data[ERA_COL].unique().compute()
every_4th_era = unique_eras[::4]
training_data = training_data[training_data[ERA_COL].isin(every_4th_era)].compute()
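
If memory is still tight, dd.read_parquet also accepts a columns argument, so you can restrict the read to just the feature and target columns you actually need before calling compute().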