I am having difficulties using the v4 data because the .csv data is missing from that set. Why? The .parquet readers waste unreasonable amounts of memory because they (stupidly) insist on reading all of the data at once instead of reading it in the time-honoured way, row by row.
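For context, this is the usage pattern I am complaining about, sketched with the usual pandas reader (an assumption on my part about how most people load the file): the whole table is materialised in memory before a single row can be inspected, and unlike read_csv there is no chunksize or row-by-row option.

```python
import pandas as pd

# read_parquet materialises the entire table in memory at once;
# there is no streaming/chunked mode comparable to read_csv(chunksize=...).
df = pd.read_parquet("numerai_validation_data.parquet")
print(df.shape)
```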
Is transmission bandwidth the problem? In that case, may I ask for the _int8.csv files to be compressed with some standard method, please? In that form they will be even smaller than the .parquet versions and, crucially, much easier to use.
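To illustrate what I mean by easier to use, here is a minimal sketch of the row-by-row reading I have in mind, using only the Python standard library; the file name matches the compressed validation file from the comparison below, and the per-row processing is just a placeholder.

```python
import csv
import lzma

n_rows = 0
# lzma.open streams and decompresses the file on the fly, so only the
# current row is ever held in memory.
with lzma.open("numerai_validation_data_int8.csv.lzma", mode="rt", newline="") as fh:
    reader = csv.DictReader(fh)
    for row in reader:
        n_rows += 1  # replace with whatever per-row processing you need
print(n_rows, "rows read without loading the whole file")
```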
To back this up, I ran a comparison on the v3 data (the last version available in non-parquet form), using standard lzma compression with default settings. The sizes below are in bytes:
227501648 Nov 1 12:05 numerai_validation_data.parquet
107301737 Nov 1 12:17 numerai_validation_data_int8.csv.lzma
As you can see, the _int8.csv.lzma version is less than half the size.
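For anyone who wants to check this, here is a minimal sketch of one way to reproduce the comparison, assuming the v3 files are in the working directory; the legacy .lzma container is chosen only so the output matches the file name listed above, and the exact figures will depend on the files you download.

```python
import lzma
import os
import shutil

# Compress the v3 int8 csv with default settings, writing the legacy .lzma
# container so the output matches the file name listed above.
with open("numerai_validation_data_int8.csv", "rb") as src, \
        lzma.open("numerai_validation_data_int8.csv.lzma", "wb",
                  format=lzma.FORMAT_ALONE) as dst:
    shutil.copyfileobj(src, dst)

# Compare on-disk sizes against the parquet version.
for name in ("numerai_validation_data.parquet",
             "numerai_validation_data_int8.csv.lzma"):
    print(name, os.path.getsize(name), "bytes")
```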