Numerai Self-Supervised Learning & Data Augmentation Projects

Yes, the CV score on the training data. It is the LightGBM model. It’s all in the notebook in the link.

1 Like

Great, so your averaging method is something like this: [1710.09412] mixup: Beyond Empirical Risk Minimization?

1 Like

Yes, it appears so. What I tried would be like setting lambda = 0.5 for every example; I will try this mixup too. Here were the results from that trial. It didn’t really seem to improve. These MMC stats aren’t accurate, but I imagine it would improve the MMC.
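
For reference, a rough sketch of the difference between the two (my own illustration, assuming a NumPy feature matrix X and target vector y; mixing within eras, which probably matters here, is ignored):

import numpy as np

def mixup_batch(X, y, alpha=0.2, fixed_lambda=None, seed=None):
    # fixed_lambda=0.5 reproduces the plain pairwise average;
    # fixed_lambda=None samples lambda ~ Beta(alpha, alpha) as in the mixup paper.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    if fixed_lambda is None:
        lam = rng.beta(alpha, alpha, size=(len(X), 1))
    else:
        lam = np.full((len(X), 1), fixed_lambda)
    X_mix = lam * X + (1 - lam) * X[idx]
    y_mix = lam[:, 0] * y + (1 - lam[:, 0]) * y[idx]
    return X_mix, y_mix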

Jeffery,

TC is simple: your model has to have some correlation with the target and it must be very different from other models.

Check out the screenshot I posted above. It has ~0 correlation with the example predictions. The baseline model has 0.64 correlation. A model based on UMAP is very different from other methods. Not better, but different. That’s the key!

UMAP gives a different representation of the data; it brings extra information to the metamodel, hence its value!
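
A minimal sketch of the idea, assuming the umap-learn package and a NumPy feature matrix X (the component count is just a placeholder):

import umap

# Learn a low-dimensional representation of the raw features.
reducer = umap.UMAP(n_components=5, n_neighbors=15, random_state=0)
X_umap = reducer.fit_transform(X)

# X_umap can be used as model input on its own or concatenated with the raw
# features; models built on it tend to be decorrelated from feature-space
# models like the example model.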

5 Likes

Thanks for the explanation. An interesting thing that is a bit different from what Richard mentioned is that the models have relatively low FNC.

1 Like

Is TC actually relatively easy to achieve? I’m beginning to think so. Up until now we’ve been trying to make overall strong models, but that tends to make them converge somewhat to be alike. Whereas with TC it seems we can make quirky original models that don’t necessarily have to be generally strong (i.e. get good or even positive corr as raw predictions) as long as some part of them is both relatively unique and correlated with the target (which might not be obvious). So like arbitrage used to say with Signals “just submit a feature” – if it is a good, original, useful feature that may be enough.

So I think making generally strong models that also get decent TC is tough, because that’s essentially just one more thing we are trying to be good at on top of what we have already been trying to do (make generally strong models). But if we don’t have to make generally strong models to get decent TC (and we are allowed to stake only on TC and the payout is structured so that it is worth it), then that’s a much easier proposition, and possibly easier than what we have been doing up until now. At least at the beginning – once everybody is going for TC maybe it won’t be so easy.

Because after all, the metamodel itself is the only one that really needs to be generally strong. Using TC as the main metric turns the whole competition into boosting (in more of a way than it already was) in that the component models will probably become generally weaker. It remains to be seen whether it follows that the metamodel will then become stronger than it is under current metrics, or whether it will be about the same by a different route, or actually weaken.

6 Likes

Yes, it appears TC is “easy” to achieve in that you don’t need particularly strong Corr/MMC to get high TC. This is working the way I originally thought MMC was supposed to, but I quickly learned that MMC only goes up if your Corr is in the top part of the range; that’s how it seemed to me at least. I think the catch here might be that you also have to stake Corr and/or MMC along with the TC.

2 Likes

If they require corr staking as well then they’ll be throwing away all that “new” TC that they want to capture. I’ve got some high TC models but they are not good corr models really, so I wouldn’t stake on them unless I could only do TC. It sounds like they want to go 100% TC at some point (which is fine, but would be a mistake to do too fast). But we’ll see what happens when it happens – some balance between different metrics may be best.

3 Likes

I have added a new script to my ssl_numerai GitHub repo that creates synthetic stocks and targets from uniform noise with a deep generative model.
Training the LGBM baseline with 300,000 extra synthetic stocks improved Sharpe from 0.715 to 0.718.
There are lots of ways to improve this. Hot tip: make the model autoregressive over the features. Happy generating!
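
For anyone wondering what the noise-to-sample mapping looks like in the abstract, here is a bare-bones PyTorch skeleton (my own illustration, not the code in the repo; the training objective and the autoregressive variant are omitted, and the layer sizes are guesses):

import torch
import torch.nn as nn

class StockGenerator(nn.Module):
    # Maps uniform noise to one synthetic feature row plus a target.
    def __init__(self, noise_dim=128, n_features=1050):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, n_features + 1), nn.Sigmoid(),  # features and target in [0, 1]
        )

    def forward(self, z):
        return self.net(z)

gen = StockGenerator()
z = torch.rand(4096, 128)        # uniform noise
with torch.no_grad():
    fake_rows = gen(z)           # 4096 synthetic stocks (features + target)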

3 Likes

I agree @wigglemuse, as corr staking is the easiest thing to wrap one’s head around (and the easiest way to draw in new marks). Letting the payout factor continue to decrease, while limiting the multiplier for corr but letting the multiplier for TC (if it works out) be high, pretty much resolves the issue, afaics.

2 Likes

I used the code provided by @mdo to create fake targets for the entire data set and ran it through my previously-mentioned cross-validation pipeline, which is in fact the same pipeline offered in the example scripts from Numerai. I just wanted to share the results here. Of course, the beauty of a method like this is that we can create multiple unique copies of the data, so I am currently working on that idea; this trial used just one copy of the data. I was hopeful based on the example notebook, which showed an increase in correlation on the validation set just from training on the fake data. It appears getting a performance increase in cross validation isn’t so simple. Here we see a possible benefit of combining the fake data with a copy of the real data set. Here are the LightGBM model params:

model_params = {"n_estimators": 2000,
                "learning_rate": 0.01,
                "max_depth": 5,
                "num_leaves": 2 ** 5,
                "colsample_bytree": 0.1}
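
And a rough sketch of how the combined fit looks (assuming DataFrames train_real and train_fake that share the same feature_cols list and a "target" column; those names are placeholders, not from my actual pipeline):

import pandas as pd
from lightgbm import LGBMRegressor

combined = pd.concat([train_real, train_fake], axis=0)   # real data + one fake-target copy
model = LGBMRegressor(**model_params)
model.fit(combined[feature_cols], combined["target"])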

3 Likes

Siamese networks bring great results as well.

I created a proof-of-concept for a siamese network on the “small” subset of the Numerai dataset. The original idea comes from image similarity, but it can be adapted to the Numerai dataset as well.
The basic idea is to learn an embedding based on the similarity of training examples.
The learning objective is that the Euclidean distance between any two of the learned embeddings should be proportional to the distance between the corresponding labels.
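
A bare-bones sketch of that objective (my own illustration, assuming float tensors of features and targets; the actual architecture and scaling may differ):

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, n_features, emb_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 256), nn.ReLU(),
            nn.Linear(256, emb_dim),
        )

    def forward(self, x):
        return self.net(x)

def siamese_loss(encoder, x1, x2, y1, y2, scale=1.0):
    # The distance between two embeddings should track the distance
    # between the corresponding labels.
    d_emb = torch.norm(encoder(x1) - encoder(x2), dim=1)
    d_lab = scale * torch.abs(y1 - y2)
    return torch.mean((d_emb - d_lab) ** 2)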

Results are the following:

Concatenating the original dataset with the newly learned embeddings improves all validation metrics.
Training a model on the embeddings alone results in lower correlation with the metamodel and probably a higher TC. I don’t have it live yet.

Next task is to scale this up to the whole dataset.

Has anyone attempted a Numerai variant of DeepDream?
We could dream new training examples based on the embeddings learnt here…

4 Likes

Hey, that’s a pretty great result IMO. I’m actually quite impressed the fake targets by themselves can work as well as they do, since they are usually only about 5% correlated with the original targets for an era. So many ways to improve from here, e.g. ensembling models trained on many copies of the data, training single models much longer on more data, etc…

1 Like

Yes, I agree, it is pretty interesting. I’m about to do a trial where I use 3-4 batches of synthetic training data to see what happens. Also, to me the interesting thing is the data set of coefficients. This data set is 574 rows (# of eras) and 1050 columns (# of features). I put it through a t-SNE projection and it appears there may be some manifold structure there; hard to tell the significance, but cool to look at. Significantly, we can see that there appear to be 7-8 distinct clusters.
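
Roughly how that coefficient matrix gets built and projected (a sketch assuming per-era ridge regressions as in the approach discussed above; df, feature_cols, alpha, and perplexity are placeholders):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.manifold import TSNE

# One row of coefficients per era: 574 eras x 1050 features.
coefs = []
for era, df_era in df.groupby("era"):
    ridge = Ridge(alpha=1.0)
    ridge.fit(df_era[feature_cols], df_era["target"])
    coefs.append(ridge.coef_)
coefs = np.vstack(coefs)

# 2-D t-SNE projection of the era coefficients, colored by era order.
proj = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(coefs)
plt.scatter(proj[:, 0], proj[:, 1], c=np.arange(len(proj)), cmap="viridis")
plt.colorbar(label="era index")
plt.show()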

6 Likes

Very pretty! I assume the filament structures are ordered eras so it would be cool to color by era number just to see where the discontinuities lie. Makes me think jump-diffusion type models might be interesting here, but I don’t really know much about them. Finding better ways of modeling and sampling that manifold might also work better than a GMM.
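
For reference, the GMM route over those coefficients is only a few lines (a sketch assuming a coefficient matrix coefs like the 574 x 1050 one described above; the component count is a guess):

from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=8, covariance_type="full", random_state=0)
gmm.fit(coefs)

# Sample brand-new coefficient vectors; each can be turned into a synthetic
# target by applying it to an era's features and ranking the result.
new_coefs, _ = gmm.sample(100)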

2 Likes

The basic idea of t-SNE is that in the low-dimensional embedding both local and global distances are somewhat preserved: points that appear close together in the image should also be close together in the raw data, and vice versa. I don’t think we can say anything about the ordering, but I will check. We only have 574 data points in 1050-dimensional space, so I think it deserves further investigation. One extension could be to swap out the ridge regression for a feed-forward NN, but then our coefficients would have a much larger dimensionality than just 1050.

1 Like

I think you will find that the points next to each other in space are also next to each other in time.

2 Likes

Regression weights from adjacent eras are likely to be similar simply because the targets are based on 75% overlapping return data, and thus are likely to map to similar points in space. The filament breaks are then potentially good indicators of regime changes.
The problem with replacing linear weights with a NN is that the weights from NNs trained on different eras will have no natural correspondence and consequently a dimensionality reduction on these weights makes no sense. If this is not obvious, just consider that you can change the ordering of hidden units (and appropriately swap their corresponding input and output weights) without changing the function the NN computes at all. A polynomial or spline expansion could work, but that would create lots more weights.
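
A quick numerical illustration of that hidden-unit symmetry (a toy example, unrelated to any particular model here):

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(1050, 64)), rng.normal(size=64)
W2, b2 = rng.normal(size=(64, 1)), rng.normal(size=1)

def mlp(x, W1, b1, W2, b2):
    # One hidden ReLU layer.
    return np.maximum(x @ W1 + b1, 0) @ W2 + b2

x = rng.normal(size=(5, 1050))
perm = rng.permutation(64)

# Permuting the hidden units (and their input/output weights) leaves the
# function unchanged, so flattened weight vectors from separately trained
# nets have no natural correspondence.
same = np.allclose(mlp(x, W1, b1, W2, b2),
                   mlp(x, W1[:, perm], b1[perm], W2[perm, :], b2))
print(same)   # True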

2 Likes

Super interesting. I’m using Google’s t-SNE embedding projector. It works great with a small data set like ours. They didn’t have a way to color the points, but I can label them by era… It seems both you and @wigglemuse were right!

10 Likes

Here are the results of a trial using 4 copies of the data set with synthetic targets based on @mdo’s code (the targets are different for each copy). We see correlation (and MMC) now beating the raw data, though at the expense of variance.

2 Likes