Numerai Self-Supervised Learning & Data Augmentation Projects

Just wanted to mention here that we are more interested in understanding the structure of the data than in submitting linear-model predictions trained on recent data. If we want to do that, we are free to try it in Signals; in fact I think some top models there have employed various versions of simple momentum/mean reversion etc., all based on recent data each week. What’s interesting to me is to think about how the least-squares solutions (the regression weights) define the structure of an era to a certain extent, summarizing the relationships among thousands of stocks.
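
As a concrete sketch of those per-era least-squares weights (the helper and demo frame below are made up; the `era` / `feature_*` / `target` column naming just follows the tournament data convention):

```python
import numpy as np
import pandas as pd

def era_regression_weights(df, feature_cols, target_col="target"):
    """One row of least-squares regression weights per era."""
    rows = {}
    for era, group in df.groupby("era"):
        X = group[feature_cols].to_numpy()
        y = group[target_col].to_numpy()
        w, *_ = np.linalg.lstsq(X, y, rcond=None)  # w = argmin ||Xw - y||^2
        rows[era] = w
    return pd.DataFrame(rows, index=feature_cols).T  # eras x features

# Tiny synthetic stand-in for the tournament data.
rng = np.random.default_rng(0)
feature_cols = [f"feature_{i}" for i in range(5)]
demo = pd.DataFrame(rng.random((300, 5)), columns=feature_cols)
demo["target"] = rng.random(300)
demo["era"] = np.repeat([f"era{i + 1}" for i in range(3)], 100)

weights = era_regression_weights(demo, feature_cols)  # 3 eras x 5 features
```

Each row of `weights` is then a compact summary of one era.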

Hmmm, is this Numerai in a few years?

(The Raft of the Medusa, Géricault, 1818–1819).
:laughing:

3 Likes

image

me

9 Likes

Here is a trial with 10x data with synthetic targets.

Update
Here is a trial with 6x fake data; for the “both” trial I combined that with 6 copies of the training data, which seemed to balance out the effect of the fake data when combining the predictions, attaining a mean correlation of 6% on this cross-validation trial.
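
For concreteness, a sketch of that balancing step with hypothetical frames standing in for the real training data and the 6x synthetic rows:

```python
import pandas as pd

# Hypothetical stand-ins for the real training data and the 6x synthetic data.
real_df = pd.DataFrame({"feature_1": [0.25, 0.75], "target": [0.50, 0.25]})
fake_df = pd.DataFrame({"feature_1": [0.50] * 12, "target": [0.75] * 12})

# Repeating the real rows 6x keeps the real/fake ratio balanced
# when the two sets are concatenated for training.
combined = pd.concat(
    [pd.concat([real_df] * 6, ignore_index=True), fake_df],
    ignore_index=True,
)
```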

1 Like

Looking great. Would love to see some validation metrics on these too!

Might also want to check out https://pymde.org/
PyMDE is based on a simple but general framework for embedding, called Minimum-Distortion Embedding (MDE). The MDE framework generalizes well-known methods like PCA, spectral embedding, multi-dimensional scaling, LargeVis, and UMAP. With PyMDE, it is easy to recreate well-known embeddings and to create new ones, tailored to your particular application.
(From the guy behind cvxpylayers!)
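
A minimal usage sketch (the input here is random placeholder data; in practice it would be the per-era regression weights or feature rows discussed above):

```python
import torch
import pymde

# Placeholder data standing in for per-era regression weights.
weights = torch.randn(500, 310)

# Neighbor-preserving MDE embedding into 2 dimensions.
mde = pymde.preserve_neighbors(weights, embedding_dim=2, verbose=False)
embedding = mde.embed()  # (500, 2) tensor, ready to plot or cluster
```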

1 Like

Would be cool to make a generative model of these filaments and then use it to sample fake filaments in the embedding space. Then you could reverse the embedding transform to get the corresponding weights which could be used to make fake targets. This fake data might be more realistic than the GMM samples.
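
A rough sketch of that pipeline, using PCA as a stand-in embedding because it has an exact `inverse_transform` (UMAP/MDE would need an approximate inverse), and random placeholder weights:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

# Placeholder for the (n_eras, n_features) matrix of per-era regression weights.
rng = np.random.default_rng(0)
era_weights = rng.normal(size=(600, 310))

pca = PCA(n_components=2).fit(era_weights)
embedded = pca.transform(era_weights)                 # the "filament" space

gmm = GaussianMixture(n_components=10, random_state=0).fit(embedded)
fake_points, _ = gmm.sample(100)                      # sample fake filament points

fake_weights = pca.inverse_transform(fake_points)     # back to weight space
# A fake target for a feature matrix X could then be X @ fake_weights[i],
# plus noise and re-binning to match the real target distribution.
```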

1 Like

Yes, I just did a trial using 2x fake targets with 1x real data on the entire large dataset. Using 6x to 10x data wasn’t feasible in my setup with the full dataset. The cross-validation downsampled to every fifth row of data. Unfortunately, the validation metrics look worse than the baseline model trained only on real data. It isn’t the first time I’ve seen something improve cross-validation correlation without improving validation performance.

**Just Real Data - Baseline**

**2x Fake Targets with 1x Real Data**

image

2 Likes

This truism goes all the way back to RiskMetrics and the EWMA…
https://www.msci.com/documents/10199/d0905614-2771-46dc-b000-1a033146586a
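
For reference, the RiskMetrics EWMA is exactly that recency weighting in its simplest form; a minimal sketch of the variance recursion (λ = 0.94 is the classic daily decay factor):

```python
import numpy as np

def ewma_variance(returns, lam=0.94):
    """RiskMetrics-style EWMA variance: recent squared returns get the most weight."""
    var = np.empty(len(returns))
    var[0] = returns[0] ** 2  # simple initialization
    for t in range(1, len(returns)):
        var[t] = lam * var[t - 1] + (1 - lam) * returns[t - 1] ** 2
    return var
```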

1 Like

But we’re not actually talking about time-series data here. We can’t do time-series in the main tournament – not directly anyway. (And of course if you do Signals you can use all the most recent data because it is your data. But using the recent data as a necessary input is different from basing your model on the recent past.) With the data we’re given, we’re making models of the whole market as it were – predicting how we think markets are going to go when the data looks like this. So that’s different.

Obviously if you are picking the future of a specific stock you want to know what it has been doing lately, plus general current info about the stock. And presumably we are getting that in each row of the data given – we don’t know quite what it means except that in some sense it represents the current state of the stock. So when we debate whether we need the most recent data possible in the main tournament, we’re not talking about following along specific stocks and trending them out. (Again, presumably that data is more or less in the feature row given for that stock.) The debate is: is recent era X a better training example (as a market model) than less recent era Y simply because era X is more recent?

i.e. Do training eras become less useful for training the more they recede into the past, simply because they’ve receded more into the past and are teaching lessons that are no longer relevant (or relatively less relevant), and are the most recent eras therefore by definition the most useful / most relevant? And I think we all know the basic answer – the most recent eras ARE more useful MOST OF THE TIME. And we can see it in that visualization – although we do see changes and ups and downs even in the connected strings (unclear what that means in practice for results). BUT, there are sudden gaps where what is happening now doesn’t seem connected at all to the recent past – those are the points where you are going to get burned by over-reliance on recent eras.

If you could either have the last 1 year of data and that’s all, or 10 years of data but nothing from the last year, which would you choose? These are the kinds of questions I’m mulling over here.

Nobody is saying don’t use recent data (at all) if you’ve got it. But if somebody tells me they MUST have the recent data, they are also telling me they are going to weight it heavily (or else why is it so important?). And I think that is inevitably going to lead to less well-rounded, more superficial models exhibiting streaky success punctuated by fairly large sudden failures. And then what do you do in those transition periods where the market has obviously suddenly shifted and now your recent past data is not helping you at all and you don’t have any newer data yet that corresponds to the current regime? After such a failure, do you then fall back onto a more general well-rounded model you’ve been holding in reserve while waiting for a new trend to establish itself?

So the main question is – is it worth it? If the whole metamodel moves in the direction of recency being more and more heavily weighted – and when there is a decently long streak where a regime is holding steady and recency is winning, it surely will move in that direction, of course it will – then it also follows that when that regime falls apart (which is often sudden), the resulting drawdown is going to be bigger than it would have been otherwise. The drawdowns are what kill you. But maybe the previous gains will have been worth it? Maybe, but maybe not, especially since a lot of users will then be sitting on useless models (for a time) that are going to weight the now-not-so-instructive recent past and will either keep submitting bad predictions (for a time) or will have to switch to a fallback, or pull stakes, or something.

I’m well aware I’m probably overstating the dangers here to make my point, but these issues are actually real, and I have actually seen it a million times in various contexts where people are betting on things. It is a pretty reliable pattern, and it just seems like a recipe for metamodel volatility.

tl;dr I don’t trust trend following – it isn’t prediction, it is just following along until failure occurs.

1 Like

lol

Here is Richard when this is all over for everyone commenting on this post
image

1 Like

Trend following, as well as mean reversion, inherently requires a prediction that the current regime will continue. The notion of position sizing is a further testament to your main concern, that trends end and nothing persists ad infinitum. None of this negates the importance of current market conditions represented via the most recent data.

Ironically, this was the rationale behind my vocal Signals suggestion of maintaining both the 6d and 20d targets, and it is quasi-addressed with the multiple classic targets now.

Hi @richai

Google’s DeepDream inspired me to try to “dream” new rows for the dataset.

The idea behind DeepDream is to create a “dream” image that maximizes the output of certain layers by gradually modifying the original image. This is gradient ascent instead of descent.
The resulting image is similar to the original, but maximizes certain activations in the network.

Can we use a similar method to create a “dream” version of the original dataset that actually increases performance? I just did that!

My procedure was the following (a rough sketch appears after the list):

  1. Train a wide-and-deep NN with the dataset
  2. Select an appropriate layer to maximize activations for
  3. Calculate an input vector (row) that maximizes activation of a selected layer by iteratively adding the gradient to the input
  4. Repeat for all rows in the training set.
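
Roughly, steps 2–3 amount to the following sketch (simplified; a trained `model` and a chosen `layer` sub-module are assumed, and all names are illustrative):

```python
import torch

def dream_rows(model, layer, rows, steps=20, lr=0.05):
    """Gradient-ascent "dreaming": nudge each input row so that the mean
    activation of the chosen hidden layer increases (ascent, not descent)."""
    captured = {}

    def hook(_module, _inputs, output):
        captured["activation"] = output

    handle = layer.register_forward_hook(hook)
    x = rows.clone().detach().requires_grad_(True)

    for _ in range(steps):
        model(x)
        objective = captured["activation"].mean()  # activation to maximize
        objective.backward()
        with torch.no_grad():
            x += lr * x.grad        # gradient *ascent* on the input
            x.clamp_(0.0, 1.0)      # keep the original [0, 1] feature scale
            x.grad.zero_()
    handle.remove()
    return x.detach()
```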

I used a scaled-down version of the dataset (the “medium” feature set) for this proof-of-concept.

Results are the following:

XGB baseline:

XGB model with the extended dataset (10% increase in row count):

Now, there is clearly more research to be done with this idea!
The extended dataset improves all metrics by a small margin; I hope that using the full dataset will do better. Increasing the dataset size by 10% helps, but adding more doesn’t improve results any further.

Still I find the results for a quick proof-of-concept promising.

6 Likes

Very cool! I’m curious whether you tried maximizing activations for different layers and what effect they had. Lots of possible extensions there too, e.g. maximizing activations for individual hidden units or subsets of hidden units. Also, was the 10% of rows that you created alternate versions of randomly selected? And were the imagined rows re-binned or just kept as-is after the “dream”?

1 Like

I experimented with different NN architectures and layer selections. Some layers improve results to varying degrees, some don’t. There is still a lot to be tried, including individual units. The method basically generates new rows similar to the source row, where similarity is defined by the learnt parameters of the NN.

I created a “dream” version of the whole training set and randomly sampled 10% of it. This selection is then concatenated to the training set. I trained an XGBoost model on both datasets (with and without dream rows) to compare results.
I didn’t re-bin the “dream” rows; they are used as-is. I only clipped them to the [0, 1] range to keep the same scale.

These are still very early results. I just started working on it yesterday, but it looks promising and I can’t keep my mouth shut :smiley: I need to implement good cross-validation to validate the results, but that’s easier said than done with this pipeline. I only used the standard validation dataset for now.

3 Likes

Hi,

Some updates on dreaming: I open-sourced my solution. You can find it here:

With some experimentation I managed to achieve somewhat better results:

  • XGB baseline (some parameters are updated, thus the different baseline now)

  • XGB model on the extended dataset

You can generate these files by running my example script.

It’s worth noting, though, that this is just a more sophisticated way to add noise to the dataset, which may be better than Gaussian noise. It helps with regularization, but unlike the previous solution I posted here, it doesn’t add any new information that helps the model learn.
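
For comparison, plain Gaussian-noise augmentation amounts to something like this sketch (the noise scale here is arbitrary and would need tuning):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((1000, 310))  # stand-in for a feature matrix scaled to [0, 1]

# Jitter every feature value and clip back to the [0, 1] range.
noise_scale = 0.05           # arbitrary choice
X_noisy = np.clip(X + rng.normal(0.0, noise_scale, size=X.shape), 0.0, 1.0)
```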

It would also be interesting to experiment with other NNs. Anyone willing to contribute a well-trained Keras model? Let me know!

2 Likes

I like that boost in Sharpe; in my experience it seems easier to generalize an increase in Sharpe than in corr alone. It’s also impressive if this increase comes from just 10% more data. I will see if I can contribute something to your work. Recently I’ve been struggling to train PyTorch models; for some reason my NNs seem to perform especially badly on the validation set.

This is great @nyuton! I actually think this method might be closer to true data augmentation (somewhat akin to using randomly rotated, color-shifted, and zoomed versions of images to train an image classifier) than to adding noise. The line between the two ideas isn’t super clear, but lots of recent progress on various benchmarks has come from better data augmentation strategies, so I find your line of work here rather compelling.