Visualizing the New Data

So I’ve started looking at the new data, beginning with the correlations between the features and the targets in the training set. There are some interesting patterns showing up.

By way of process: for each era, I Spearman-correlated every feature with all the available targets. A small percentage of the target values, ~0.25%, had to be cooked (replaced with 0.5) as they were showing up as NaNs. There are better ways, but that’ll do for now.
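For anyone wanting to try the same thing, here is a rough sketch of that per-era step. It is not the exact code behind the plots: the function names are made up for illustration, and it assumes one era's features and targets are already loaded as plain NumPy arrays (rows are samples, columns are features or targets). Spearman is done as Pearson on average ranks, which handles the many ties in the binned data.

```python
import numpy as np

def fill_nan_targets(targets, fill_value=0.5):
    """Replace NaN target values with a constant (0.5, as in the post)."""
    targets = np.asarray(targets, dtype=float)
    return np.where(np.isnan(targets), fill_value, targets)

def average_ranks(values):
    """Rank a 1-D array, giving tied values their average rank."""
    values = np.asarray(values).ravel()
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    last_rank = np.cumsum(counts)               # highest rank in each tie group
    mean_rank = last_rank - (counts - 1) / 2.0  # average rank of the group
    return mean_rank[inverse]

def spearman_features_vs_targets(features, targets):
    """Spearman correlation of every feature with every target for one era.

    Returns an array of shape (n_features, n_targets).
    """
    features = np.asarray(features, dtype=float)
    targets = fill_nan_targets(targets)
    f_rank = np.apply_along_axis(average_ranks, 0, features)
    t_rank = np.apply_along_axis(average_ranks, 0, targets)
    f_centered = f_rank - f_rank.mean(axis=0)
    t_centered = t_rank - t_rank.mean(axis=0)
    f_norm = np.sqrt((f_centered ** 2).sum(axis=0))
    t_norm = np.sqrt((t_centered ** 2).sum(axis=0))
    denom = np.outer(f_norm, t_norm)
    # Assumption: a constant column has no defined correlation, so leave NaN.
    denom = np.where(denom > 0, denom, np.nan)
    return (f_centered.T @ t_centered) / denom
```

Calling `spearman_features_vs_targets` once per era and stacking the results gives an (era, feature, target) array to work from.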

There are 21 targets for each era, two of which (“target” and “target_nomi_20”) are the same, so that one gets double counted. Then, for each era and each variable, the RMS of the correlations across the targets is taken and normalized to its median value across the 1050 variables in that era, with values below 1 set to 1. It’s a bit of a hack, but quite useful for detecting patterns in data. The results are plotted on a log10 scale, from 0 to about 0.7.
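The RMS, median-normalize, floor and log10 steps can be sketched as below. This is a minimal version under stated assumptions: the correlations are assumed to be stacked in an array of shape (n_eras, n_features, n_targets), the median is taken across features within each era, and `correlation_map` is a hypothetical name.

```python
import numpy as np

def correlation_map(corr):
    """Turn per-era correlations into the log10 image described above.

    corr: array of shape (n_eras, n_features, n_targets).
    Returns an array of shape (n_eras, n_features).
    """
    corr = np.asarray(corr, dtype=float)
    # RMS of the correlations across targets, per era and feature.
    rms = np.sqrt(np.mean(corr ** 2, axis=2))
    # Normalize each era (row) to its median across features.
    median = np.median(rms, axis=1, keepdims=True)
    ratio = rms / median
    # Floor at 1 so that everything at or below the median maps to log10 = 0.
    return np.log10(np.maximum(ratio, 1.0))
```

Plotting the result with era on the vertical axis and feature on the horizontal axis gives the kind of image discussed here; on this scale a value of 0.7 corresponds to about 5 times the era's median.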

The vertical axis is era (from top to bottom) and the horizontal axis is variable (left to right).

So here’s a jpeg of that:

Now I don’t know how clearly that plot will show up for the reader, but there are some interesting artifacts. First off are the repeated patterns separated by about 210 bins horizontally. Those are clearest, at least for me, in the 5 thin vertical lines running pretty much continuously from top to bottom.

Here’s a blown up view:


which brings more into focus the way the peak correlation moves a bit from bin to bin, particularly in the area slightly to the right of centre, and the way it spreads, forming blocks or pulses. Maybe it’s time to resurrect my trackers :male_detective:

What’s it mean? :thinking:

Tomorrow maybe I’ll try the same for the Validation data.


Maybe you could try k-means clustering on the columns. It would be interesting to see how these features group together.
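As a starting point, here is a small NumPy-only sketch of k-means on the columns of a 2-D array (for example the era-by-feature image, so each feature is described by its profile across eras). The function name, the farthest-point initialization and the iteration cap are illustrative choices of mine, not anything from the original analysis; a library implementation would do just as well.

```python
import numpy as np

def kmeans_columns(matrix, k, n_iter=50):
    """Cluster the columns of `matrix` into k groups with Lloyd's algorithm.

    Returns an integer label for each column.
    """
    points = np.asarray(matrix, dtype=float).T  # one row per column of matrix
    # Assumption: deterministic farthest-point initialization, starting
    # from the first column, to keep the sketch reproducible.
    centers = [points[0]]
    for _ in range(1, k):
        dist_to_nearest = np.min(
            [np.sum((points - c) ** 2, axis=1) for c in centers], axis=0
        )
        centers.append(points[np.argmax(dist_to_nearest)])
    centers = np.array(centers)

    labels = None
    for _ in range(n_iter):
        # Squared distance from every point to every center.
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dist.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = points[labels == j]
            if len(members):  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return labels
```

Sorting the feature columns by the returned label before plotting is one way to see whether the groups line up with the repeated blocks.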


“…repeated patterns separated by about 210 bins horizontally.”

See also cell “Out[7]” of analysis_and_tips.ipynb. It’s a colored plot of C[i, j] = (correlation of feature i with feature j).

That plot looks, for all the world, as if feature[i] = feature[i + 210] = feature[i + 420] = feature[i + 630] = feature[i + 840], for 0 <= i < 210. It’s close enough that I can’t tell any difference by eyeball.
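One way to go beyond eyeballing would be to correlate each feature column directly with the column a fixed number of positions later. The sketch below assumes the features sit in a 2-D NumPy array in their original column order; `lagged_column_correlation` is a made-up helper name and plain Pearson correlation is my choice here.

```python
import numpy as np

def lagged_column_correlation(features, lag):
    """Pearson correlation between column i and column i + lag, for every i.

    Returns an array of length n_columns - lag.
    """
    features = np.asarray(features, dtype=float)
    n_columns = features.shape[1]
    if not 0 < lag < n_columns:
        raise ValueError("lag must be between 1 and n_columns - 1")
    left = features[:, :-lag]
    right = features[:, lag:]
    left = left - left.mean(axis=0)
    right = right - right.mean(axis=0)
    denom = np.sqrt((left ** 2).sum(axis=0) * (right ** 2).sum(axis=0))
    # Assumption: constant columns have no defined correlation, so leave NaN.
    denom = np.where(denom > 0, denom, np.nan)
    return (left * right).sum(axis=0) / denom
```

If the features really did repeat exactly every 210 columns, `lagged_column_correlation(features, 210)` would be 1 everywhere; comparing it against a neighbouring lag such as 209 or 211 would show how much of the effect is specific to 210.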


I agree @rigrog, the modulo 210 pattern repetition is really quite interesting. It does appear as well in the Validation correlation plot:

though maybe not as clearly: with fewer files, the image is more smeared.

The images are also floored (to the median response in each era). For completeness, here are plots of those floors for the Train data:
NoiseFloorTrain

and then the Val data:
NoiseFloorVal

Anyway, I think the next step for me is to look at the 210 bin cycle, and what one can derive from that.
