Numerai Self-Supervised Learning & Data Augmentation Projects

Getting NNs to perform nearly as well as a GBM model is extremely challenging. NNs are super finicky and love to overfit in these low SNR, small data size regimes.

2 Likes

I agree, and when I try techniques to keep them from over-fitting, I find it hard to make progress during training, but I think that is to be expected.

I really like how much activity we have on this topic. @nyuton are you coming to NumerCon in April? We would love to help you make it.

For anyone else who has contributed here and wants to come, we or the CoE can help you get there.

Lots of Kaggle grandmasters will be there too: NUMERCON • Numerai Conference 2022 • Tickets, Fri, Apr 1, 2022 at 1:00 PM | Eventbrite

2 Likes

Yeah, I’ve been working on this idea. Unfortunately it doesn’t seem possible to do an inverse projection with t-SNE, but it is possible with UMAP, so that’s the direction I’ve been pursuing.

One other method for data augmentation is to train an NN with two objectives:

  • preserve information that predicts the labels
  • output new rows that minimize the distance to the source row based on the preserved information.

I drew inspiration from the U-Net architecture used for image segmentation tasks. The result is a two-headed NN.

  • One output sits in the middle and predicts the target label from the narrow hidden layer. This forces the first part of the NN to learn parameters that can predict the label.
  • Then there are upscaling layers, which also receive information from the earlier layers (a skip connection). The second output layer has the same dimension as the input, producing the new augmented(?) rows.

Model architecture:

import tensorflow as tf

# n_features: number of feature columns (defined elsewhere)

def getModel():

    input_layer = tf.keras.layers.Input(shape=(n_features,))

    # encoder
    x = tf.keras.layers.Dense(420, activation='relu')(input_layer)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.Dropout(0.1)(x)

    x = tf.keras.layers.Dense(128, activation='relu')(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x0 = tf.keras.layers.Dropout(0.1)(x)

    # narrow hidden layer
    o = tf.keras.layers.Dense(16, activation='relu')(x0)
    o = tf.keras.layers.BatchNormalization()(o)
    o = tf.keras.layers.Dropout(0.1)(o)

    # first head: predict the target label from the narrow layer
    label_output = tf.keras.layers.Dense(1, activation='sigmoid', name='label_output')(o)

    # decoder layers
    x = tf.keras.layers.Dense(128, activation='relu')(o)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.Dropout(0.1)(x)

    # U-Net-style skip connection from the encoder
    x = tf.keras.layers.concatenate([x, x0], name='concat')

    # second head: same dimension as the input, produces the augmented rows
    decoder_output = tf.keras.layers.Dense(n_features, activation='sigmoid', name='decoder_output')(x)

    model = tf.keras.Model(input_layer, [label_output, decoder_output])
    # MSE on both heads; the reconstruction head is weighted 3x
    model.compile(optimizer=tf.optimizers.Adam(0.001), loss='mse', loss_weights=[1, 3])

    return model
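
For reference, a minimal sketch of how the two heads can be fitted, assuming X_train / y_train / X_val / y_val are the scaled features and the target column (exact batch size and epochs are placeholders):

# Hypothetical training call: the first head regresses the target,
# the second head reconstructs the input row.
model = getModel()
model.fit(
    X_train,
    [y_train, X_train],                      # one label per output head
    validation_data=(X_val, [y_val, X_val]),
    batch_size=4096,
    epochs=50,
)

# The augmented rows are read off the decoder head.
_, augmented_rows = model.predict(X_train, batch_size=4096)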

Results are similar to the previously detailed “dream” approach.
While the improvements are not huge, these methods show that it is possible to generate augmented rows, contrary to many discussions in RocketChat. My scripts are also only at a proof-of-concept stage, with a lot of room for improvement.

  • Baseline

  • Augmented with +5% data

1 Like

@nyuton This method reminded me a bit of this one: AutoEncoder and multitask MLP on new dataset (from Kaggle Jane Street)

2 Likes

Has anyone ever worked with the Synthetic Data Vault (SDV) before, or thought about using these libraries in this application?
The CTGAN model is, according to the site, a GAN-based deep learning data synthesizer that can generate synthetic tabular data with high fidelity.
They also have the PAR model, an implementation of a probabilistic autoregressive model that learns multi-type, multivariate time-series data and can then generate new synthetic data with the same format and properties as the data it learned from.

Something like the code below will create a DataFrame for just era1, for example, with only the feature columns selected, and generate synthetic data for all features of that era. I'm sure you could do it for multiple selected eras, and also select out individual whole features.

import pandas as pd

from numerapi import NumerAPI
from halo import Halo
from ctgan import CTGANSynthesizer


napi = NumerAPI()
spinner = Halo(text='', spinner='dots')

current_round = napi.get_current_round(tournament=8)  # tournament 8 is the primary Numerai Tournament
# download_data is assumed to be defined elsewhere (e.g. the helper from Numerai's example scripts)
download_data(napi, 'numerai_training_data.parquet', 'numerai_training_data.parquet', round=current_round)

spinner.start('Reading parquet data')
training_data = pd.read_parquet('numerai_training_data.parquet')
spinner.succeed()

training_data.head()


# all feature column names
features = [c for c in training_data if c.startswith("feature")]
print(len(features))

# keep only era 1 ...
era1 = training_data.loc[training_data['era'] == '0001']
era1.head()

# ... and only its feature columns
era1_feature_columns = era1.loc[:, era1.columns.str.startswith("feature")]
era1_feature_columns.head()


# fit CTGAN on era 1, passing the feature names as discrete columns,
# then sample roughly as many synthetic rows as the era has real rows
ctgan = CTGANSynthesizer(verbose=True)
ctgan.fit(era1_feature_columns, features, epochs=10)

ctgan_synthetic_data = ctgan.sample(2070)
ctgan_synthetic_data.head()



2 Likes

Also wanted to add that I think NumerFrame | numerblox can be very helpful in this project for cutting DFs up

2 Likes

I guess we’ll keep the flow going here. I tried generating entirely new rows of data, including all 20 target columns, through a completely unsupervised method based on UMAP. It is different from what @nyuton described above, but interestingly the results look similar: super low correlation with the metamodel and also some positive corr… the result may be some decent TC.

The idea is to take all the features and targets and embed them into a low-dimensional space with UMAP. So the input here is 1070 dimensions, mapped down to 2 dimensions. We then sample uniformly from the embedded space and leverage the inverse transform to create entire synthetic rows of data.

The technique is exactly what is described in this tutorial: Inverse transforms — umap 0.5 documentation. Here is a shot showing how one can generate synthetic MNIST data.
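
Roughly, the procedure looks like this (the column selection, UMAP settings, and the 10,000-row sample size here are just illustrative, not the exact parameters I used):

import numpy as np
import umap

# Embed all features and targets (the ~1070-dimensional input mentioned above) into 2D.
cols = [c for c in training_data.columns if c.startswith(("feature", "target"))]
X = training_data[cols].fillna(0.5).to_numpy()

mapper = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1).fit(X)

# Sample uniformly from the bounding box of the embedded space ...
rng = np.random.default_rng(0)
low, high = mapper.embedding_.min(axis=0), mapper.embedding_.max(axis=0)
new_points = rng.uniform(low, high, size=(10_000, 2))

# ... and map those points back to full synthetic rows (features + all targets).
synthetic_rows = mapper.inverse_transform(new_points)   # shape (10_000, len(cols))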

The embedded geometry of the Numerai data isn’t much to look at; it could be tuned more.

Training on only the 10,000 generated rows gives the following validation stats.

It takes a while to generate the rows. I was hoping RAPIDS AI would supply an inverse-transform function, since it is really fast for computing UMAP in the forward direction, but it does not have one. It is interesting how similar the stats look to the technique based on embedding the input data and using that low-dimensional data to train the model; maybe not so surprising.

1 Like

Less is more.

We’re talking data augmentation. Instead of adding more features, why not take them away? Maybe it is something people have tried before. I tested removing one feature at a time from the data set (for all 1050 features) and running a full cross-validation on the training set (a rough sketch of the loop is at the end of this post). This way we can see whether removing a feature increases performance across the board. While it isn’t very sophisticated, it seems to work. Here we have results showing Sharpe and Corr increases when removing many of the features on CV of the training set.

Compare to the original

I was then able to verify that this improvement carries over to the validation set, even when removing just one feature. Here is the baseline training, followed by a trial where I removed the top 5 features on the list. I find it pretty compelling.
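
The leave-one-feature-out loop looks roughly like this (the CV splitter, model parameters, and per-era scoring are stand-ins, not my exact setup, and running it over all 1050 features is of course expensive):

import numpy as np
from lightgbm import LGBMRegressor
from sklearn.model_selection import GroupKFold

def era_scores(df, pred_col, target_col="target"):
    # per-era correlation between predictions and target, plus its Sharpe (mean / std)
    per_era = df.groupby("era").apply(lambda d: np.corrcoef(d[pred_col], d[target_col])[0, 1])
    return per_era.mean(), per_era.mean() / per_era.std()

results = {}
for dropped in features:                                  # `features` = all feature column names
    kept = [f for f in features if f != dropped]
    oof = np.zeros(len(training_data))
    for tr_idx, va_idx in GroupKFold(n_splits=4).split(training_data, groups=training_data["era"]):
        model = LGBMRegressor(n_estimators=2000, learning_rate=0.01,
                              max_depth=5, num_leaves=2**5, colsample_bytree=0.1)
        model.fit(training_data.iloc[tr_idx][kept], training_data.iloc[tr_idx]["target"])
        oof[va_idx] = model.predict(training_data.iloc[va_idx][kept])
    training_data["oof_pred"] = oof
    results[dropped] = era_scores(training_data, "oof_pred")   # (mean corr, Sharpe) without `dropped`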

3 Likes

Yes, I’m coming! See you there @richai

4 Likes

Better late than never!

My models are all based on representation learning techniques.

I won the TPS January Challenge on Kaggle using denoising autoencoders (DAEs) + representation learning.
Here is a short summary

All my Numerai models use different first-level autoencoder architectures (deep stack, bottleneck, transformer-based AE) as well as different noise schemes, to learn as much information from the dataset as possible. The learned weights of the DAE are then used for the final training.

Here are a few examples:
example model 1
example model 2
example model 3
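
For anyone who wants to experiment with the general idea, here is a minimal bottleneck DAE with swap noise, a common corruption scheme for tabular data. This is just a generic sketch, not my exact architectures; layer sizes, the noise rate, and training settings are placeholders, and X_train is assumed to be a NumPy array of features scaled to [0, 1]:

import numpy as np
import tensorflow as tf

def swap_noise(X, rate=0.15, seed=0):
    # Corrupt a fraction of cells by swapping in values from other rows of the same column.
    rng = np.random.default_rng(seed)
    mask = rng.random(X.shape) < rate
    shuffled = X[rng.permutation(len(X))]
    X_noisy = X.copy()
    X_noisy[mask] = shuffled[mask]
    return X_noisy

def build_dae(n_features, bottleneck=64):
    inp = tf.keras.layers.Input(shape=(n_features,))
    x = tf.keras.layers.Dense(512, activation='relu')(inp)
    code = tf.keras.layers.Dense(bottleneck, activation='relu', name='bottleneck')(x)
    x = tf.keras.layers.Dense(512, activation='relu')(code)
    out = tf.keras.layers.Dense(n_features, activation='sigmoid')(x)
    model = tf.keras.Model(inp, out)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss='mse')
    return model

# Pre-train: reconstruct the clean rows from corrupted ones, then reuse the learned
# encoder (input -> bottleneck) as a representation for the final supervised model.
dae = build_dae(n_features)
dae.fit(swap_noise(X_train), X_train, batch_size=4096, epochs=20)
encoder = tf.keras.Model(dae.input, dae.get_layer('bottleneck').output)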

Cheers,
danzell

Edit: links

8 Likes

Has anyone tested Sharpened Cosine Similarity?

I am quite fascinated by the idea of creating an alternate version of history in this context. My insights so far:
It is useful to look at the covariance matrices per era. There is a sense in which they are continuous in time, as @jefferythewind showed with t-SNE. I used a UMAP embedding, but it gives the same kind of picture. I then picked a random point on the trajectory in the embedded space and went off along a new path. Visually it looks like this:


Some remarks on this:

  • this is v3 data
  • the covariance matrix includes the target, so you can sample labeled data
  • there is a gap in the trajectory of training eras around era 100. My bet is that there is a time jump here
  • the trajectory seems to be some kind of random walk with inertia, so that's what I used as a model to make a new one.
  • I like colors

Now, thanks to inverse_transform, you can get a new covariance matrix for a completely synthetic era at each point along the new trajectory. You can use those to draw new labeled samples. So for the picture above, this gives you 200 new eras.
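
Roughly, the sampling procedure looks like this (flattening the covariance matrices via the upper triangle, a random walk with inertia for the new path, and a plain Gaussian sampler centered at 0.5; the exact path model, UMAP settings, and the binning of the sampled values back into the feature scale differ in my actual script and are not shown here):

import numpy as np
import umap

cols = features + ["target"]                     # features plus the target, as noted above
iu = np.triu_indices(len(cols))                  # upper triangle (incl. diagonal)

# One flattened covariance matrix per training era (expensive for ~1000 features).
eras = training_data["era"].unique()
era_covs = np.stack([
    np.cov(training_data.loc[training_data["era"] == e, cols].to_numpy(), rowvar=False)[iu]
    for e in eras
])

mapper = umap.UMAP(n_components=2, n_neighbors=10).fit(era_covs)

# New path in the embedded space: a random walk with inertia,
# starting from a random point on the real trajectory.
rng = np.random.default_rng(0)
point = mapper.embedding_[rng.integers(len(eras))].copy()
velocity = np.zeros(2)
trajectory = []
for _ in range(200):                             # 200 synthetic eras
    velocity = 0.9 * velocity + rng.normal(scale=0.1, size=2)
    point = point + velocity
    trajectory.append(point.copy())

# Inverse-transform each new point into a flattened covariance matrix,
# rebuild the full matrix, and sample labeled rows from a Gaussian with it.
flat_covs = mapper.inverse_transform(np.array(trajectory))

def sample_era(flat_cov, n_rows=2000):
    cov = np.zeros((len(cols), len(cols)))
    cov[iu] = flat_cov
    cov = cov + cov.T - np.diag(np.diag(cov))    # symmetrize
    # NB: the reconstructed matrix may not be exactly positive semi-definite;
    # the default check_valid='warn' will only warn in that case.
    return rng.multivariate_normal(np.full(len(cols), 0.5), cov, size=n_rows)

synthetic_eras = [sample_era(fc) for fc in flat_covs]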

Below is a summary sorted by Sharpe, comparing LGBM models on the data as-is (vanilla) and on the new data (synthetic), plus neutralized versions.

TBH, it absolutely blows my mind that training on these samples has any predictive power at all. Like, you can just make stuff up that isn’t real and use that to make better decisions in real life?
I experimented with other means of cooking up new covariance matrices, some of which seem even more promising. Super curious about live performance.

5 Likes

Very cool @slowmoe, that’s exactly the sort of thing I was imagining. A few things I would try are:

  1. just use the upper (or lower) triangle of the covariance matrix, if you aren't already, to reduce the dimension and prevent double counting the off-diagonal entries
  2. include the additional targets in the covariance matrix so the embedding space includes more feature-to-target information
  3. use a higher-than-2-dimensional embedding space (and then a path in that space) to retain more information
  4. use v4 data and then validate on the much longer test set you now have targets for

And there isn’t a time jump around era 100; it could just be a major regime change. I'm interested to see if it breaks in the same place in higher dimensions.

3 Likes

Although non-intuitive when you come at it from a “this is random data I just made up” angle, it does make perfect sense that it has predictive power. The covariance matrix itself is a model of the data, and so therefore is the fake data created from that model. And then you train a new model from that fake data and predict stuff – it is natural that it would retain a fair amount of predictive power even if somewhat watered down. But…is it going to be valuable alpha? Probably not – it seems to be more of a regularizer which I think is true of most of these types of methods. They are pure stats if that makes any sense. (Denoisers, essentially.) I immediately become interested in the stuff that is lost – is it just noise or is that where the real gold is? (Gold that will take more than summary stats to uncover.) I’d like to try using the covariance matrix approach as a neutralizer and then train on what’s left over and see if that holds anything interesting.
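
To make the neutralize-and-train-on-the-residual idea concrete, a minimal sketch, assuming a hypothetical cov_model_pred column in training_data holding predictions from a model trained on the covariance-derived synthetic data; the residual target is then whatever that model cannot explain linearly, per era:

import numpy as np
import pandas as pd

def neutralize(series, by, proportion=1.0):
    # Remove the component of `series` that is linearly explained by `by` (plus an intercept).
    scores = series.to_numpy(dtype=np.float64).reshape(-1, 1)
    exposures = np.column_stack([by.to_numpy(dtype=np.float64), np.ones(len(by))])
    correction = proportion * exposures @ np.linalg.lstsq(exposures, scores, rcond=None)[0]
    return pd.Series((scores - correction).ravel(), index=series.index)

# Residual target: neutralize the real target against the covariance-model predictions,
# era by era, then train a new model on this residual to see what (if anything) is left.
residual_target = training_data.groupby("era", group_keys=False).apply(
    lambda d: neutralize(d["target"], d["cov_model_pred"])
)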

4 Likes

happy to hear that the great @mdo identified the same next steps as I did :smiley:

On 2) I can already say that it helps (unsurprisingly) a good deal. 3) and 4) are on hold for now because I am running on a Dell laptop over Easter holidays… I hope to post an update on those once I have something.

Just as @wigglemuse hinted at, I've come around to viewing this as a regularization technique. However, one observation makes me believe there is more to it: this particular model (the covariance matrix) is, in a sense, evolving continuously in time. This suggests that if you know the recent past, you can rule out most of the space of possible states of the near future. That seems like a big deal to me.

2 Likes

Would it make sense to apply UMAP per era?

Thank you! – and some nice TC on those models.

1 Like

@slowmoe, very cool! Using the covariance matrix as a representation of the era dynamics was a neat idea. It seems you haven't completely described the data generation procedure once you have the new covariance matrix for a new era? Or maybe I just haven't picked it up. I had envisioned this idea with regression coefficients, but I couldn't get the UMAP picture to look like a string. What kind of params did you use?