Visualizing the New Data

gammarat · September 10, 2021, 2:53am

So I’ve started looking at the new data, first by looking at the correlations between the features and the targets for the training set. There are some interesting patterns showing up.

By way of process, what I did was take each era and Spearman correlate each feature in it with all the available targets. A small percentage of the targets, ~0.25%, had to be cooked (replaced with 0.5) as they were showing up as NaNs. There are better ways, but that’ll do for now.

There are 21 targets for each era, two of which (“target” and “target_nomi_20”) are the same, so that one gets double counted. Then, for each era and each variable, the RMS value of the correlations across the targets is taken and normalized to its median value across the 1050 variables, with values <1 set to 1. It’s a bit of a hack, but quite useful for detecting patterns in data. The results are plotted on a log10 scale, from 0 to about 0.7.
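For anyone who wants to try this, here is a rough sketch of the per-era steps described above, not the exact code behind the plots. It assumes an era's features and targets are already loaded as NumPy arrays (rows are samples); the function names are made up for illustration, the data at the bottom is synthetic, and the duplicated target is not removed.

```python
import numpy as np

def average_ranks(x):
    """Rank values 1..n, giving tied values their average rank."""
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    last_rank = np.cumsum(counts)  # highest rank occupied by each unique value
    return (last_rank - (counts - 1) / 2.0)[inverse]

def spearman_matrix(features, targets):
    """Spearman correlation of every feature column with every target column."""
    fr = np.apply_along_axis(average_ranks, 0, features)
    tr = np.apply_along_axis(average_ranks, 0, targets)
    fr = fr - fr.mean(axis=0)
    tr = tr - tr.mean(axis=0)
    denom = np.outer(np.linalg.norm(fr, axis=0), np.linalg.norm(tr, axis=0))
    with np.errstate(divide="ignore", invalid="ignore"):
        corr = (fr.T @ tr) / denom
    # Constant columns give 0/0; treat that as "no correlation" (an assumption).
    return np.nan_to_num(corr)

def era_row(features, targets):
    """One row of the image: RMS over targets, median-normalised, floored, log10."""
    targets = np.where(np.isnan(targets), 0.5, targets)  # the "cooking" step
    corr = spearman_matrix(features, targets)            # (n_features, n_targets)
    rms = np.sqrt(np.mean(corr ** 2, axis=1))            # RMS across targets
    ratio = rms / np.median(rms)                         # normalise to the median
    return np.log10(np.maximum(ratio, 1.0))              # values < 1 set to 1

# Synthetic stand-in for one era: 5-level features and targets, not real data.
rng = np.random.default_rng(0)
n_rows, n_features, n_targets = 500, 12, 3
features = rng.integers(0, 5, size=(n_rows, n_features)) / 4.0
targets = rng.integers(0, 5, size=(n_rows, n_targets)) / 4.0
targets[:, 0] = features[:, 3]                  # plant one strongly related feature
targets[rng.random(n_rows) < 0.01, 1] = np.nan  # a few missing target values
row = era_row(features, targets)
```

Stacking one such row per era, top to bottom, gives an array with the same layout as the image. The planted feature (index 3 here) should stand out as the brightest bin in the row.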

The vertical axis is era (from top to bottom) and the horizontal axis is variable (left to right).

So here’s a jpeg of that:

Now I don’t know how clearly that plot will show up for the reader, but there are some interesting artifacts. First off are the repeated patterns separated by about 210 bins horizontally. Those are clearest, at least for me, in the 5 thin vertical lines running pretty much continuously from top to bottom.

Here’s a blown up view:

which brings more into focus the way the peak correlation moves a bit from bin to bin, particularly in the area slightly to the right of centre, and the way it spreads, forming blocks or pulses. Maybe it’s time to resurrect my trackers.

What’s it mean?

Tomorrow maybe I’ll try the same for the Validation data.

jacob_stahl · September 10, 2021, 3:15am

Maybe you could try k-means clustering on the columns. It would be interesting to see how these features group together.
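A minimal sketch of what that could look like, assuming the per-era values are stacked into an array of shape (eras, features) so that each column is one point to cluster. This is a plain NumPy version of Lloyd's algorithm with a simple farthest-first initialisation, swapped in for a library routine to keep the example dependency-free. The function names and the synthetic `image` array are purely illustrative.

```python
import numpy as np

def farthest_first_centers(points, k):
    """Deterministic init: start at point 0, then repeatedly take the farthest point."""
    centers = [points[0]]
    for _ in range(1, k):
        d2 = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[d2.argmax()])
    return np.array(centers)

def kmeans(points, k, n_iter=50):
    """Plain Lloyd's algorithm; returns (labels, centers)."""
    centers = farthest_first_centers(points, k)
    for _ in range(n_iter):
        # Squared distance from every point to every center.
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Synthetic stand-in: 18 columns that follow one of 3 per-era profiles.
rng = np.random.default_rng(1)
n_eras, n_features, n_groups = 40, 18, 3
profiles = rng.random((n_eras, n_groups))
true_group = np.arange(n_features) % n_groups       # groups interleaved by column
image = profiles[:, true_group] + 0.01 * rng.normal(size=(n_eras, n_features))

labels, _ = kmeans(image.T, k=n_groups)             # cluster columns, not rows
```

On the real data the number of clusters is not known in advance, so `k` would have to be chosen by trial, for instance by comparing results for a few values.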

rigrog · September 10, 2021, 3:25am

“…repeated patterns separated by about 210 bins horizontally.”

See also: cell “Out[7]”, of analysis_and_tips.ipynb. It’s a colored plot of C[i, j] = (correlation of feature i, with feature j).

That plot looks, for all the world, as if feature[i] = feature[i + 210] = feature[i + 420] = feature[i + 630] = feature[i + 840], for 0 <= i < 210. It’s close enough that I can’t tell any difference by eyeball.
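One way to check this numerically rather than by eyeball is to average the feature-to-feature correlation along each off-diagonal of C, giving one number per lag. The sketch below assumes Pearson correlation and uses a small synthetic matrix with a built-in period of 7 as a stand-in. `lag_profile` is a made-up helper name, and on the real data the lag of interest would be 210.

```python
import numpy as np

def lag_profile(features, max_lag):
    """Mean correlation between feature i and feature i + lag, for lag = 1..max_lag."""
    corr = np.corrcoef(features, rowvar=False)       # C[i, j] as in the plot
    return np.array([
        np.diagonal(corr, offset=lag).mean()         # average over all valid i
        for lag in range(1, max_lag + 1)
    ])

# Synthetic stand-in: 7 base columns repeated 5 times with a little noise.
rng = np.random.default_rng(2)
period, repeats, n_rows = 7, 5, 1000
base = rng.normal(size=(n_rows, period))
features = np.tile(base, (1, repeats)) + 0.1 * rng.normal(size=(n_rows, period * repeats))

profile = lag_profile(features, max_lag=20)
best_lag = int(profile.argmax()) + 1                 # lags start at 1
```

If the columns really do repeat, the profile should be near 1 at multiples of the period and near 0 elsewhere. How far the peak falls below 1 gives a rough measure of how exact the repetition is.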

gammarat · September 10, 2021, 2:37pm

I agree @rigrog, the modulo 210 pattern repetition is really quite interesting. It does appear as well in the Validation correlation plot:

though maybe not as clearly since, with fewer files, the image is more smeared.

The images are also floored (to the median response in each era). For completeness, here are plots of those for the Train data:

and then the Val data:

Anyway, I think the next step for me is to look at the 210 bin cycle, and what one can derive from that.
