About the new dataset and RAM usage

Here’s how I’m handling “super massive”: I’m collapsing it down to approximately “legacy” size.

The motivation for the radical surgery that follows can be seen quite plainly in the graph of inter-feature correlations, displayed in cell “Out[7]” of analysis_and_tips.ipynb. Take a look at it! It shows 25 visually indistinguishable squares – not a speck or squiggle out of place in any of them! Props to Numerai for putting this redundancy front and center; they could very easily have obfuscated it.

For each “longRow” of 1,050 int8 features (each 0, 1, 2, 3, or 4), I do:

shortRow = [sum(longRow[j::210]) for j in range(210)]

Now I have 210 features, each with a value from 0 to 20 – which still fits in an int8.
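
If it helps, here’s a minimal vectorized sketch of that collapse. The array name X_long and the toy random data are mine – substitute however you actually load the super-massive feature matrix. The reshape puts columns j, j+210, …, j+840 on one axis, so summing over that axis matches the per-row comprehension above.

    import numpy as np

    # Toy stand-in for the real feature matrix: int8 values in {0,...,4},
    # 1,050 columns (replace with the actual super-massive features).
    rng = np.random.default_rng(0)
    X_long = rng.integers(0, 5, size=(1_000, 1_050), dtype=np.int8)

    # After reshaping, element [k, j] of each row is longRow[k*210 + j],
    # so summing over axis 1 adds up longRow[j::210] -- the same 5 values
    # as the list comprehension above.
    X_short = X_long.reshape(-1, 5, 210).sum(axis=1, dtype=np.int8)

    assert X_short.shape == (1_000, 210)
    assert X_short.max() <= 20   # five values of at most 4 still fit in an int8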

Due to the high correlation, I wasn’t much surprised to find almost half the cells (in the 4.4M by 210 feature array) occupied by a whole multiple of 5 (i.e. 0, 5, 10, 15, or 20: what you’d get by adding 5 identical feature values). What did surprise me was that 0 and 20, which can only be 0+0+0+0+0 or 4+4+4+4+4, each occupy 1/8 of the cells!
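
For what it’s worth, those proportions are easy to check on the collapsed array (X_short is the hypothetical name from the sketch above; the fractions in the comments are what I see on the real data, not on the toy random array). If the five summed features were independent and roughly uniform on 0–4, a cell value of 0 or 20 would show up in only about (1/5)^5 ≈ 0.03% of cells, which is why 1/8 apiece really does say the five groups are near-duplicates.

    import numpy as np

    # Assumes X_short: the collapsed (rows x 210) int8 array from the sketch above.
    n_cells = X_short.size

    frac_mult5  = np.count_nonzero(X_short % 5 == 0) / n_cells   # ~1/2 on the real data
    frac_zero   = np.count_nonzero(X_short == 0)  / n_cells      # ~1/8 on the real data
    frac_twenty = np.count_nonzero(X_short == 20) / n_cells      # ~1/8 on the real data

    print(f"multiples of 5: {frac_mult5:.4f}")
    print(f"zeros:          {frac_zero:.4f}")
    print(f"twenties:       {frac_twenty:.4f}")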

So if (like me) you have just a little more than enough RAM for the legacy tourney, you can probably squeeze in this 210-feature version of the super-massive. You won’t get much advantage over legacy feature-wise, but you will be in a significantly more “target-rich” environment (a greater proportion of the rows are labeled with target values).

If you have abundant RAM, like 64 GB and up, plus a fleet of cores and GPUs: maybe a 210-feature model could contribute to your grand ensemble.
