About the new dataset and RAM usage

Here’s how I’m handling “super massive”: I’m collapsing it down to approximately “legacy” size.

The motivation for the radical surgery that follows can be seen quite plainly in the graph of inter-feature correlations, displayed in cell “Out[7]” of analysis_and_tips.ipynb. Take a look at it! It shows 25 visually indistinguishable squares – not a speck or squiggle out of place in any of them! Props to Numerai for putting this redundancy front and center; they could very easily have obfuscated it.

For each “longRow” of 1,050 int8 features (each 0, 1, 2, 3, or 4), I do:

shortRow = [sum(longRow[j::210]) for j in range(210)]

Now I have 210 features, each with a value from 0 to 20 – which still fits in an int8.
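
If it helps, here’s a minimal vectorized sketch of that collapse. The array name X_long and the toy random data are mine – substitute however you actually load the super-massive feature matrix. The reshape puts columns j, j+210, …, j+840 on one axis, so summing over that axis matches the per-row comprehension above.

    import numpy as np

    # Toy stand-in for the real feature matrix: int8 values in {0,...,4},
    # 1,050 columns (replace with the actual super-massive features).
    rng = np.random.default_rng(0)
    X_long = rng.integers(0, 5, size=(1_000, 1_050), dtype=np.int8)

    # After reshaping, element [k, j] of each row is longRow[k*210 + j],
    # so summing over axis 1 adds up longRow[j::210] -- the same 5 values
    # as the list comprehension above.
    X_short = X_long.reshape(-1, 5, 210).sum(axis=1, dtype=np.int8)

    assert X_short.shape == (1_000, 210)
    assert X_short.max() <= 20   # five values of at most 4 still fit in an int8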

Due to the high correlation, I wasn’t much surprised to find almost half the cells (in the 4.4M by 210 feature array) occupied by a whole multiple of 5 (i.e. 0, 5, 10, 15, or 20: what you’d get by adding 5 identical feature values). What did surprise me was that 0 and 20, which can only be 0+0+0+0+0 or 4+4+4+4+4, each occupy 1/8 of the cells!
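
For what it’s worth, those proportions are easy to check on the collapsed array (X_short is the hypothetical name from the sketch above; the fractions in the comments are what I see on the real data, not on the toy random array). If the five summed features were independent and roughly uniform on 0–4, a cell value of 0 or 20 would show up in only about (1/5)^5 ≈ 0.03% of cells, which is why 1/8 apiece really does say the five groups are near-duplicates.

    import numpy as np

    # Assumes X_short: the collapsed (rows x 210) int8 array from the sketch above.
    n_cells = X_short.size

    frac_mult5  = np.count_nonzero(X_short % 5 == 0) / n_cells   # ~1/2 on the real data
    frac_zero   = np.count_nonzero(X_short == 0)  / n_cells      # ~1/8 on the real data
    frac_twenty = np.count_nonzero(X_short == 20) / n_cells      # ~1/8 on the real data

    print(f"multiples of 5: {frac_mult5:.4f}")
    print(f"zeros:          {frac_zero:.4f}")
    print(f"twenties:       {frac_twenty:.4f}")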

So if (like me) you have just a little more than enough RAM for the legacy tourney, you can probably squeeze in this 210-feature version of the super-massive. You won’t get much advantage over legacy feature-wise, but you will be in a significantly more “target-rich” environment (a greater proportion of the rows are labeled with target values).

If you have abundant RAM, like 64 GB and up, plus a fleet of cores and GPUs: maybe a 210-feature model could contribute to your grand ensemble.
