Generating Feature Groups

I’ve been working on a method to group features together, like in the old dataset, by building a correlation matrix from the training set and clustering its columns with k-means. This groups features that behave similarly. I also tried doing this recursively, repeating the process within each new group to find subgroups.
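
For anyone who wants to try the idea quickly, here is a minimal sketch. It is not the notebook's code: it assumes scikit-learn's KMeans, treats each row of the feature correlation matrix as that feature's profile, and runs on synthetic data. The names `cluster_features` and `n_groups` are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_features(X, n_groups, seed=0):
    """Group the columns of X by running k-means on their correlation matrix."""
    # corr[i] is the correlation of feature i with every feature.
    corr = np.corrcoef(X, rowvar=False)
    km = KMeans(n_clusters=n_groups, n_init=10, random_state=seed)
    return km.fit_predict(corr)  # one group label per feature

# Synthetic stand-in for the training set: 3 hidden factors, 4 noisy copies each.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
X = np.repeat(latent, 4, axis=1) + 0.1 * rng.normal(size=(500, 12))

labels = cluster_features(X, n_groups=3)
print(labels)
```

The recursive variant could call the same function again on `X[:, labels == g]` for each group `g`, stopping once a group gets too small to split.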

The full experiment is in this notebook

CSV with feature groups here

I haven’t built any models that use these groups yet, but I’m curious whether any of you would find this useful.


I like the idea and I’ve been playing with that too. I never tried clustering features with k-means on the correlation matrix though. What was the motivation for that?
I did try clustering with some homegrown methods though. When I add the mean, std, etc. per group as additional features, the resulting models always show improvements out of sample (OOS). What I did to cluster features was something like:

  1. Two features that correlate above some threshold are defined as neighbors
  2. Feature groups are neighborhoods
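
The two steps above, plus the per-group mean and std features, can be sketched roughly as follows. This is one reading, not the exact homegrown method: it assumes a "neighborhood" means a connected component of the neighbor graph, and the names `neighborhood_groups`, `group_stats` and the 0.8 threshold are illustrative.

```python
import numpy as np

def neighborhood_groups(X, threshold=0.8):
    """Label features so that chains of above-threshold neighbors share a group."""
    corr = np.corrcoef(X, rowvar=False)
    neighbors = corr > threshold  # step 1: pairs above the threshold are neighbors
    n = neighbors.shape[0]
    labels = np.full(n, -1)
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        # step 2: flood-fill the neighbor graph from an unlabeled feature
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in np.flatnonzero(neighbors[i]):
                if labels[j] == -1:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

def group_stats(X, labels):
    """Append the per-row mean and std of each feature group as new columns."""
    extra = []
    for g in np.unique(labels):
        block = X[:, labels == g]
        extra.append(block.mean(axis=1))
        extra.append(block.std(axis=1))
    return np.column_stack([X] + extra)

# Synthetic data: 3 hidden factors, 4 noisy copies each.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
X = np.repeat(latent, 4, axis=1) + 0.1 * rng.normal(size=(500, 12))

labels = neighborhood_groups(X, threshold=0.8)
X_aug = group_stats(X, labels)
print(labels, X_aug.shape)
```

Using `np.abs(corr) > threshold` instead would also put strongly anti-correlated features in the same neighborhood; whether that is desirable depends on the data.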

You could also try linear regression coefficients instead of correlation values to define neighbors; that seems to give interesting results as well.
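
One possible reading of the regression variant, as a sketch only: regress each feature on all the others with ordinary least squares and call two features neighbors when either coefficient is large in absolute value. The function name, the 0.15 cutoff and the synthetic data are assumptions for illustration, not the setup described above.

```python
import numpy as np

def regression_coefficients(X):
    """B[j, i] is the coefficient of feature i when feature j is regressed on all others."""
    Xc = X - X.mean(axis=0)  # center so no intercept term is needed
    n_features = Xc.shape[1]
    B = np.zeros((n_features, n_features))
    for j in range(n_features):
        others = np.delete(np.arange(n_features), j)
        coef, *_ = np.linalg.lstsq(Xc[:, others], Xc[:, j], rcond=None)
        B[j, others] = coef
    return B

# Synthetic data: 3 hidden factors, 4 noisy copies each.
rng = np.random.default_rng(0)
latent = rng.normal(size=(2000, 3))
X = np.repeat(latent, 4, axis=1) + 0.1 * rng.normal(size=(2000, 12))

B = regression_coefficients(X)
big = np.abs(B) > 0.15      # illustrative cutoff
neighbors = big | big.T     # neighbors if either direction has a large coefficient
print(neighbors.astype(int))
```

Unlike pairwise correlation, each coefficient here is conditional on all the other features, so the weight gets shared among near-duplicates and the cutoff needs tuning per dataset.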

When I first looked at the dataset’s correlation matrix I noticed a repeating diagonal pattern that suggested every ~200 features were similar in some way. I wanted to find a way to rearrange the columns to make that pattern go away. Initially I looked at it as a kind of sorting problem and it evolved from there.

It seems to be 210, and since that’s 1/5 of 1050, my first guess is that Numerai collects 210 features for each stock each day, and then glues 5 days of those together to form a single row.
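
A quick way to check a period like this numerically, rather than by eye, is to average the absolute correlation along each off-diagonal of the correlation matrix and look for the offset where it peaks. The sketch below uses small synthetic data with a known period of 6 (5 blocks of 6 features); `offset_profile` is a made-up helper name, and nothing here is computed on the actual dataset.

```python
import numpy as np

def offset_profile(corr):
    """Mean absolute correlation between features i and i + k, for k = 1 .. n - 1."""
    n = corr.shape[0]
    return np.array([np.abs(np.diagonal(corr, offset=k)).mean() for k in range(1, n)])

# Synthetic data: the same 6 base features repeated in 5 blocks with fresh noise,
# mimicking "the same features glued together several times".
rng = np.random.default_rng(0)
period, n_blocks = 6, 5
base = rng.normal(size=(1000, period))
X = np.hstack([base + 0.5 * rng.normal(size=base.shape) for _ in range(n_blocks)])

corr = np.corrcoef(X, rowvar=False)
profile = offset_profile(corr)             # profile[k - 1] belongs to offset k
best_offset = int(np.argmax(profile)) + 1  # lands on a multiple of the period
print(best_offset, profile[period - 1], profile[0])
```

If the 210 guess holds, the same profile on the real correlation matrix should peak at offsets 210, 420, 630 and 840 and stay low elsewhere.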

Yes, 210 by my count too, the corr matrix seems to be very periodic. I like gammarat’s interpretation, at least that would make sense.
