I am having trouble using the v4 data because the
.csv files are missing from that release. Why were they dropped? The .parquet readers waste unreasonable amounts of memory because they insist on loading the entire dataset at once instead of reading it in the time-honoured way, row by row.
Is download bandwidth the concern?
If so, may I ask that the
_int8.csv files be provided compressed with some standard method? In that form they would be even smaller than the
.parquet versions and, crucially, much easier to use.
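To illustrate the "easier to use" point: a compressed CSV can still be consumed one row at a time with constant memory, because lzma decompresses as a stream. A minimal sketch, using only the standard library (the file name and columns here are made up stand-ins, not the real Numerai schema):

```python
import csv
import lzma

# Illustrative only: write a tiny int8-style CSV compressed with
# default LZMA settings, standing in for a hypothetical
# numerai_validation_data_int8.csv.lzma download.
rows = [["id", "feature_a", "feature_b", "target"],
        ["n0001", "0", "2", "1"],
        ["n0002", "3", "1", "0"]]
with lzma.open("sample_int8.csv.lzma", "wt", newline="") as f:
    csv.writer(f).writerows(rows)

# The point: lzma.open yields a streaming text handle, so the CSV is
# read row by row; the whole file is never held in memory at once.
with lzma.open("sample_int8.csv.lzma", "rt", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)
    for row in reader:
        record = dict(zip(header, row))  # process one row at a time
```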
To back this up, I ran a comparison on the v3 data (the last release available in non-.parquet form), using standard lzma compression with default settings:
227501648 Nov 1 12:05 numerai_validation_data.parquet
107301737 Nov 1 12:17 numerai_validation_data_int8.csv.lzma
As you can see, the
_int8.csv.lzma version is less than half the size of the .parquet file.
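For reference, the compression step above can be reproduced in Python with the standard library; lzma's defaults correspond to the classic .lzma container (FORMAT_ALONE). The small stand-in file here replaces the real v3 CSV, which is too large to embed:

```python
import lzma
import shutil

# Stand-in for numerai_validation_data_int8.csv; the real file would
# be used in place of this generated sample.
src = "demo_int8.csv"
with open(src, "w") as f:
    f.write("id,feature_a,target\n" + "n0001,2,1\n" * 1000)

# Compress with default settings into the classic .lzma container,
# streaming through copyfileobj so memory use stays constant.
with open(src, "rb") as fin, \
        lzma.open(src + ".lzma", "wb", format=lzma.FORMAT_ALONE) as fout:
    shutil.copyfileobj(fin, fout)
```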