Numerai Self-Supervised Learning & Data Augmentation Projects

Yes, the CV score on the training data. It is the LightGBM model. It’s all in the notebook in the link.

1 Like

Great, so your averaging method is something like this: [1710.09412] mixup: Beyond Empirical Risk Minimization?

1 Like

Yes, it appears so. What I tried would be like setting lambda = 0.5 for every example; I will try this mixup too. Here were the results from that trial. It didn’t really seem to improve. These MMC stats aren’t accurate, but I imagine it would improve the MMC.
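
For reference, a rough sketch of the difference between the two (my own illustration, assuming a NumPy feature matrix X and target vector y; mixing within eras, which probably matters here, is ignored):

import numpy as np

def mixup_batch(X, y, alpha=0.2, fixed_lambda=None, seed=None):
    # fixed_lambda=0.5 reproduces the plain pairwise average;
    # fixed_lambda=None samples lambda ~ Beta(alpha, alpha) as in the mixup paper.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    if fixed_lambda is None:
        lam = rng.beta(alpha, alpha, size=(len(X), 1))
    else:
        lam = np.full((len(X), 1), fixed_lambda)
    X_mix = lam * X + (1 - lam) * X[idx]
    y_mix = lam[:, 0] * y + (1 - lam[:, 0]) * y[idx]
    return X_mix, y_mix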

Jeffery,

TC is simple: your model has to have some correlation with the target and it must be very different from other models.

Check out the screenshot I posted above. It has ~0 correlation with the example predictions. The baseline model has 0.64 correlation. A model based on UMAP is very different from other methods. Not better, but different. That’s the key!

UMAP gives a different representation of the data; it brings extra information to the metamodel, hence its value!
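
A minimal sketch of the idea, assuming the umap-learn package and a NumPy feature matrix X (the component count is just a placeholder):

import umap

# Learn a low-dimensional representation of the raw features.
reducer = umap.UMAP(n_components=5, n_neighbors=15, random_state=0)
X_umap = reducer.fit_transform(X)

# X_umap can be used as model input on its own or concatenated with the raw
# features; models built on it tend to be decorrelated from feature-space
# models like the example model.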

5 Likes

Thanks for the explanation. An interesting thing that is a bit different from what Richard mentioned is that the models have relatively low FNC.

1 Like

Is TC actually relatively easy to achieve? I’m beginning to think so. Up until now we’ve been trying to make overall strong models, but that tends to make them converge somewhat to be alike. Whereas with TC it seems we can make quirky original models that don’t necessarily have to be generally strong (i.e. get good or even positive corr as raw predictions) as long as some part of them is both relatively unique and correlated with the target (which might not be obvious). So like arbitrage used to say with Signals “just submit a feature” – if it is a good, original, useful feature that may be enough.

So I think making generally strong models that also get decent TC is tough, because that’s essentially just one more thing we are trying to be good at on top of what we have already been trying to do (make generally strong models). But if we don’t have to make generally strong models to get decent TC (and we are allowed to stake only on TC and the payout is structured so that it is worth it), then that’s a much easier proposition, and possibly easier than what we have been doing up until now. At least at the beginning – once everybody is going for TC maybe it won’t be so easy.

Because after all, the metamodel itself is the only one that really needs to be generally strong. Using TC as the main metric turns the whole competition into boosting (in more of a way than it already was) in that the component models will probably become generally weaker. It remains to be seen whether it follows that the metamodel will then become stronger than it is under current metrics, or whether it will be about the same by a different route, or actually weaken.

6 Likes

Yes, it appears TC is “easy” to achieve in that you don’t need particularly strong Corr/MMC to get high TC. This is working the way I originally thought MMC was supposed to, but I quickly learned that MMC only goes up if your Corr is in the top part of the range; that’s how it seemed to me at least. I think the catch here might be that you also have to stake Corr and/or MMC along with the TC.

2 Likes

If they require corr staking as well then they’ll be throwing away all that “new” TC that they want to capture. I’ve got some high TC models but they are not good corr models really, so I wouldn’t stake on them unless I could only do TC. It sounds like they want to go 100% TC at some point (which is fine, but would be a mistake to do too fast). But we’ll see what happens when it happens – some balance between different metrics may be best.

3 Likes

I have added a new script to my ssl_numerai GitHub repo that creates synthetic stocks and targets from uniform noise with a deep generative model.
Training the LGBM baseline with 300,000 extra synthetic stocks improved Sharpe from 0.715 to 0.718.
There are lots of ways to improve this. Hot tip: make the model autoregressive over the features. Happy generating!
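
For anyone wondering what the noise-to-sample mapping looks like in the abstract, here is a bare-bones PyTorch skeleton (my own illustration, not the code in the repo; the training objective and the autoregressive variant are omitted, and the layer sizes are guesses):

import torch
import torch.nn as nn

class StockGenerator(nn.Module):
    # Maps uniform noise to one synthetic feature row plus a target.
    def __init__(self, noise_dim=128, n_features=1050):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, n_features + 1), nn.Sigmoid(),  # features and target in [0, 1]
        )

    def forward(self, z):
        return self.net(z)

gen = StockGenerator()
z = torch.rand(4096, 128)        # uniform noise
with torch.no_grad():
    fake_rows = gen(z)           # 4096 synthetic stocks (features + target)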

3 Likes

I agree @wigglemuse, as corr staking is the easiest thing to wrap one’s head around (and the easiest way to draw in new marks). Letting the payout factor continue to decrease, while limiting the multiplier for corr but letting the multiplier for TC (if it works out) be high, pretty much resolves the issue, afaics.

2 Likes

I used the code provided by @mdo to create fake targets for the entire data set and ran it through my previously-mentioned cross-validation pipeline, which is in fact the same pipeline offered in the example scripts from Numerai. I just wanted to share the results here. Of course, the beauty of a method like this is that we can create multiple unique copies of the data, so I am currently working on that idea; this trial used just one copy of the data. I was hopeful based on the example notebook, which showed an increase in correlation on the validation set just from training on the fake data. It appears getting a performance increase in cross validation isn’t so simple. Here we see a possible benefit of combining the fake data with a copy of the real data set. Here are the LightGBM model params:

model_params = {"n_estimators": 2000,
                "learning_rate": 0.01,
                "max_depth": 5,
                "num_leaves": 2 ** 5,
                "colsample_bytree": 0.1}
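
And a rough sketch of how the combined fit looks (assuming DataFrames train_real and train_fake that share the same feature_cols list and a "target" column; those names are placeholders, not from my actual pipeline):

import pandas as pd
from lightgbm import LGBMRegressor

combined = pd.concat([train_real, train_fake], axis=0)   # real data + one fake-target copy
model = LGBMRegressor(**model_params)
model.fit(combined[feature_cols], combined["target"])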

3 Likes

Siamese networks bring great results as well.

I created a proof-of-concept for a siamese network on the “small” subset of the Numerai dataset. The original idea comes from image similarity, but it can be adapted to the Numerai dataset as well.
The basic idea is to learn an embedding based on the similarity of training examples.
The learning objective is that the Euclidean distance between any two of the learned embeddings should be proportional to the distance between the corresponding labels.
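
A bare-bones sketch of that objective (my own illustration, assuming float tensors of features and targets; the actual architecture and scaling may differ):

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, n_features, emb_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 256), nn.ReLU(),
            nn.Linear(256, emb_dim),
        )

    def forward(self, x):
        return self.net(x)

def siamese_loss(encoder, x1, x2, y1, y2, scale=1.0):
    # The distance between two embeddings should track the distance
    # between the corresponding labels.
    d_emb = torch.norm(encoder(x1) - encoder(x2), dim=1)
    d_lab = scale * torch.abs(y1 - y2)
    return torch.mean((d_emb - d_lab) ** 2)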

Results are the following:

Concatenating the original dataset with the newly learned embeddings improves all validation metrics.
Training a model on the embeddings alone results in lower correlation with the metamodel and probably a higher TC. I don’t have it live yet.

Next task is to scale this up to the whole dataset.

Has anyone attempted a Numerai variant of DeepDream?
We could dream new training examples based on the embeddings learnt here…

4 Likes

Hey, that’s a pretty great result IMO. I’m actually quite impressed the fake targets by themselves can work as well as they do, since they are usually only about 5% correlated with the original targets for an era. So many ways to improve from here, e.g. ensembling models trained on many copies of the data, training single models much longer on more data, etc…

1 Like

Yes, I agree, it is pretty interesting. I’m about to do a trial where I use 3-4 batches of synthetic training data to see what happens. Also, to me the interesting thing is the data set of coefficients. This data set is 574 rows (# of eras) and 1050 columns (# of features). I put it through a t-SNE projection and it appears there may be some manifold structure there; hard to tell the significance, but cool to look at. Significantly, we can see that there appear to be 7-8 distinct clusters.
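
Roughly how that coefficient matrix gets built and projected (a sketch assuming per-era ridge regressions as in the approach discussed above; df, feature_cols, alpha, and perplexity are placeholders):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.manifold import TSNE

# One row of coefficients per era: 574 eras x 1050 features.
coefs = []
for era, df_era in df.groupby("era"):
    ridge = Ridge(alpha=1.0)
    ridge.fit(df_era[feature_cols], df_era["target"])
    coefs.append(ridge.coef_)
coefs = np.vstack(coefs)

# 2-D t-SNE projection of the era coefficients, colored by era order.
proj = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(coefs)
plt.scatter(proj[:, 0], proj[:, 1], c=np.arange(len(proj)), cmap="viridis")
plt.colorbar(label="era index")
plt.show()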

6 Likes

Very pretty! I assume the filament structures are ordered eras so it would be cool to color by era number just to see where the discontinuities lie. Makes me think jump-diffusion type models might be interesting here, but I don’t really know much about them. Finding better ways of modeling and sampling that manifold might also work better than a GMM.
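
For reference, the GMM route over those coefficients is only a few lines (a sketch assuming a coefficient matrix coefs like the 574 x 1050 one described above; the component count is a guess):

from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=8, covariance_type="full", random_state=0)
gmm.fit(coefs)

# Sample brand-new coefficient vectors; each can be turned into a synthetic
# target by applying it to an era's features and ranking the result.
new_coefs, _ = gmm.sample(100)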

2 Likes

The basic idea of t-SNE is that in the low-dimensional embedding both local and global distances are somewhat preserved: points that appear close together in the image should also be close together in the raw data, and vice versa. I don’t think we can say anything about the ordering, but I will check. We only have 574 data points in 1050-dimensional space, so I think it deserves further investigation. One extension could be to swap out the ridge regression for a feed-forward NN, but then our coefficients would have a much larger dimensionality than just 1050.

1 Like

I think you will find that the points next to each other in space are also next to each other in time.

2 Likes

Regression weights from adjacent eras are likely to be similar simply because the targets are based on 75% overlapping return data, and thus are likely to map to similar points in space. The filament breaks are then potentially good indicators of regime changes.
The problem with replacing linear weights with a NN is that the weights from NNs trained on different eras will have no natural correspondence and consequently a dimensionality reduction on these weights makes no sense. If this is not obvious, just consider that you can change the ordering of hidden units (and appropriately swap their corresponding input and output weights) without changing the function the NN computes at all. A polynomial or spline expansion could work, but that would create lots more weights.
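
A quick numerical illustration of that hidden-unit symmetry (a toy example, unrelated to any particular model here):

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(1050, 64)), rng.normal(size=64)
W2, b2 = rng.normal(size=(64, 1)), rng.normal(size=1)

def mlp(x, W1, b1, W2, b2):
    # One hidden ReLU layer.
    return np.maximum(x @ W1 + b1, 0) @ W2 + b2

x = rng.normal(size=(5, 1050))
perm = rng.permutation(64)

# Permuting the hidden units (and their input/output weights) leaves the
# function unchanged, so flattened weight vectors from separately trained
# nets have no natural correspondence.
same = np.allclose(mlp(x, W1, b1, W2, b2),
                   mlp(x, W1[:, perm], b1[perm], W2[perm, :], b2))
print(same)   # True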

2 Likes

Super interesting. I’m using Google’s t-SNE embedding projector. It works great with a small data set like ours. They didn’t have a way to color the points, but I can label them by era… It seems both you and @wigglemuse were right!

10 Likes

Here are the results of a trial using 4 copies of the data set with synthetic targets based on @mdo’s code (the targets are different for each copy). We see correlation (and MMC) now beating the raw data, though at the expense of variance.

2 Likes