Numerai Self-Supervised Learning & Data Augmentation Projects

Ok, some progress that warrants an update:

Re the cov-embedding:

  • Higher dimensions do not help at all.
  • Inverse transforms do not generally yield valid (i.e. positive semi-definite) covariance matrices. This basically means you want to be really careful about the maths of any of this; see the sketch below for one way to check and repair a decoded matrix.
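A minimal sketch (not from the original post) of what "being careful about the maths" can mean in practice: check whether a decoded matrix is actually a valid covariance matrix, and if it is not, project it back to the PSD cone by clipping negative eigenvalues.

```python
import numpy as np

def is_valid_cov(cov, tol=-1e-8):
    """A covariance matrix must be symmetric with non-negative eigenvalues."""
    return np.allclose(cov, cov.T) and np.linalg.eigvalsh(cov).min() >= tol

def nearest_psd(cov, eps=1e-10):
    """Symmetrize, then clip negative eigenvalues so the result is PSD."""
    sym = (cov + cov.T) / 2.0
    eigvals, eigvecs = np.linalg.eigh(sym)
    return eigvecs @ np.diag(np.clip(eigvals, eps, None)) @ eigvecs.T

decoded = np.array([[1.0, 0.9],
                    [0.9, 0.5]])           # toy output of an inverse transform
print(is_valid_cov(decoded))               # False: one eigenvalue is negative
print(is_valid_cov(nearest_psd(decoded)))  # True after the repair
```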

Re the “cov as a model or regularizer of a model” topic:
Note that covariance itself does not tell you anything about the shape of the distribution. However, when sampling from the cov matrix, you have to know what shape of distribution you want your samples to follow. In the Numerai data, the feature distributions are uniform, but the targets seem roughly normal.
To answer @jefferythewind: you can use numpy.random.multivariate_normal() to draw normally distributed samples with your cov matrix. You may write your own function to get different distributions. Also to your point: there seems to be something very peculiar about cov matrices. I tried embedding only their singular values, their SVD decomposition, and other derived quantities; none of these form any structure either. With the covs themselves, however, I could not find any parameters that do not yield a smooth string, even in higher dimensions.
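A minimal sketch of the sampling step, assuming a covariance matrix `cov` is already available (the copula trick for uniform marginals is my addition, not the poster's): numpy.random.multivariate_normal gives Gaussian samples, and pushing them through the normal CDF turns them into roughly uniform marginals while approximately preserving the dependence structure.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.6, 0.3],
                [0.6, 1.0, 0.4],
                [0.3, 0.4, 1.0]])  # toy covariance, e.g. estimated on real rows

# Normally distributed samples (suits the target-like columns)
normal_rows = rng.multivariate_normal(mean=np.zeros(3), cov=cov, size=10_000)

# Uniform marginals via a Gaussian copula (closer to the feature columns)
uniform_rows = norm.cdf(normal_rows)

print(np.corrcoef(uniform_rows, rowvar=False).round(2))  # dependence roughly preserved
```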

On V4, I could not get anything that outperforms the vanilla data yet (something something signal decay over the longer validation period?), but you still get significant decorrelation from the example predictions.

2 Likes

Is this considered solved/closed? I started working on this recently and wanna know if it’s still worth working on or not

1 Like

This is very much an open area of research!

1 Like

Haha, I was surprised to find my forum post in the investor presentation yesterday @richai :slight_smile:
For the record: I never managed to bring the deepdream idea to a satisfactory level. But I found another NN architecture that produces good augmented rows, and it has produced good results in the past months.

3 Likes

would that be a diffusion model?

Haven’t read the entire thread, but I wanted to throw out the idea that when I was last active in the main tournament I did a pretty brainless data augmentation approach. I don’t remember exactly, but I think I took 1000 random pairs of features (modifying the values so there were no zeros to throw off the logs) and made new features by taking the log of one with the other as the base, did some sensible transformations after, and then fit on that data joined with the original data. My model Urza did pretty well, not elite I think, but my point is that fairly random and brainless data augmentation seemed to squeeze some extra juice out of the data available at the time (pre big-data expansion), with no academic exploration exerted.
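A rough reconstruction of that kind of augmentation, with the details (shift constant, number of pairs, column names) guessed rather than taken from the post above:

```python
import numpy as np

def add_log_pair_features(df, feature_cols, n_pairs=1000, shift=2.0, seed=0):
    """Add log_b(a) features for random feature pairs, using log_b(a) = ln(a) / ln(b).

    The shift keeps raw values away from 0 (log of zero) and 1 (log base one),
    which would otherwise blow up the logs.
    """
    rng = np.random.default_rng(seed)
    out = df.copy()
    for i in range(n_pairs):
        a, b = rng.choice(feature_cols, size=2, replace=False)
        out[f"logpair_{i}"] = np.log(df[a] + shift) / np.log(df[b] + shift)
    return out

# Usage (hypothetical names): augmented = add_log_pair_features(train, feature_names)
# then fit the model on `augmented`, i.e. the new columns joined with the originals.
```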

1 Like

How long should the track record be for a model trained only on synth data?

Is there any analysis or examples of why PCA is not good for Numerai? Especially when Factor Analysis seems to work well for @katsu1110 [numerai] factor analysis | Kaggle

If I recall correctly, it’s related to PCA only capturing linear relationships, and purely linear stuff doesn’t perform very well on this dataset (but who knows, there may be people out there having more success with it).
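For reference, both are one line in scikit-learn, so it is easy to compare them side by side on the feature matrix (the random data here is just a stand-in for the real features):

```python
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis

X = np.random.default_rng(0).uniform(size=(5_000, 50))  # stand-in for the Numerai feature matrix

pca_components = PCA(n_components=10).fit_transform(X)            # directions of maximal linear variance
fa_components = FactorAnalysis(n_components=10).fit_transform(X)  # latent factors plus per-feature noise

print(pca_components.shape, fa_components.shape)  # (5000, 10) (5000, 10)
```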

1 Like

If you are just targeting TC, throw all such advice out the window. But with CORR, yeah, I have had a hard time with pretty much any wholesale data transformations. (Augmentation works better than replacement.)

1 Like

If anyone is interested, I developed a data generation process akin to the tournament’s data. This model sees no real data. It has very little meta model corr, decent corr, and very promising-looking TC. See the starq_synth2 model here and also starq_synth here (starq_synth seems to be more volatile; that model is only trained on some super old training data from a while ago).

1 Like

I don’t want to downplay your work as it could still be a very good model, but in my experience it takes way longer than one resolved era to fully assess the capabilities of a model. After a few months of resolved eras the picture will be more clear.

totally agree, although tbf I asked above how many rounds and no one said anything :laughing:

Rather than embedding the singular values of the cov matrix, it may be worth trying to embed a square root of the cov matrix: even if the square root is not symmetric, a sampled “square root of a cov matrix” times its own transpose is still symmetric (and positive semi-definite), so the reconstruction is always a valid covariance matrix. On the other hand, if you were trying to use something like a DAE rather than just UMAP for embeddings, you could add some sort of “asymmetry error” to do unsupervised encoder/decoder training and then fine-tune on some held-out real data.
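A small numpy sketch of that square-root idea (toy data, not tournament covariances): even if a decoded or sampled square-root factor carries some error, the product with its own transpose is still a symmetric PSD matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
cov = np.cov(rng.normal(size=(20, 100)))   # toy 20x20 covariance matrix
root = np.linalg.cholesky(cov)             # one choice of "square root"

# Pretend `root` was flattened, embedded, and decoded with some noise:
decoded_root = root + rng.normal(scale=0.01, size=root.shape)
reconstructed = decoded_root @ decoded_root.T

print(np.allclose(reconstructed, reconstructed.T))         # symmetric by construction
print(np.linalg.eigvalsh(reconstructed).min() >= -1e-10)   # eigenvalues are non-negative
```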

starq_synth is trained on synthetic data, scored against some super old v4 data, and randomly gets 99th percentiles. Pretty interesting.