AutoEncoder and multitask MLP on new dataset (from Kaggle Jane Street)

The winner of the Kaggle Jane Street competition posted their models and some discussion. Numerai and that Kaggle competition are fairly similar: both use low-signal market data, and in both you can train on multiple targets even though you are ultimately scored on just one. The initial idea for this model architecture came from this notebook and this paper (Deep Bottleneck Classifiers in Supervised Dimension Reduction).

The author of the initial code explains: “The idea of using an encoder is to denoise the data.” The competition winner, Yirun Zhang, explains the model really well (I’ve made a few edits so it’s more applicable to the Numerai dataset):

"Deep Learning Model:

  • Use autoencoder to create new features, concatenating with the original features as the input to the downstream MLP model
  • Train autoencoder and MLP together
  • Add target information to autoencoder (supervised learning) to force it to generate more relevant features, and to create a shortcut for backpropagation of gradient
  • Add Gaussian noise layer before encoder for data augmentation and to prevent overfitting
  • Use swish activation function instead of ReLU to prevent ‘dead neuron’ and smooth the gradient
  • Batch Normalisation and Dropout are used for MLP
  • Only monitor the MSE loss of MLP instead of the overall loss for early stopping"
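
To illustrate the ‘dead neuron’ point in the list above, here is a minimal pure-Python sketch (not taken from Yirun’s notebook) that compares the gradients of ReLU and swish at a negative input. The helper names are made up for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relu_grad(x):
    # ReLU passes no gradient for negative inputs
    return 1.0 if x > 0 else 0.0


def swish(x):
    # swish(x) = x * sigmoid(x)
    return x * sigmoid(x)


def swish_grad(x):
    # d/dx [x * s(x)] = s(x) + x * s(x) * (1 - s(x))
    s = sigmoid(x)
    return s + x * s * (1.0 - s)


for x in (-1.0, 1.0):
    print(x, relu_grad(x), round(swish_grad(x), 4))
```

At a negative input the ReLU gradient is exactly zero, while the swish gradient is small but nonzero, so a unit that sits in the negative region can still receive updates.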

Here is Yirun’s diagram:

The Numerai architecture is the same, except that we use regression loss functions instead of classification loss functions (i.e. MSE instead of BCE). We can also use a different number of targets. For example, you can have the model predict all of the 20-day targets at once and take the mean of those predictions as the final prediction. My artistic interpretation of how it would look for Numerai:
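
The averaging step can be sketched as follows, assuming the multi-target predictions arrive as a 2-D array with one column per target. The numbers are made up purely for illustration.

```python
import numpy as np

# Hypothetical predictions for 4 rows and 3 targets (one column per target)
multi_target_preds = np.array([
    [0.50, 0.75, 0.25],
    [0.25, 0.25, 0.25],
    [1.00, 0.50, 0.75],
    [0.00, 0.25, 0.50],
])

# Final prediction: the mean across the target columns
final_pred = multi_target_preds.mean(axis=1)
print(final_pred)
```

A weighted mean, or averaging ranks instead of raw values, would be natural variations to try.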

The model outputs 3 different vectors: 1) a reconstruction of the feature vector, produced by an autoencoder that compresses the feature space into a latent space; 2) a prediction of the targets made from the decoder output (so the autoencoder is pushed to generate more relevant features in the latent space); and 3) a prediction of the multiple targets from a normal MLP, which can be averaged or ensembled for a final prediction.
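
In Keras, when each output has its own loss and no `loss_weights` are given, the quantity being minimised is the unweighted sum of the per-output losses. With MSE on all three outputs, that can be written as follows (the symbols are illustrative notation, not from the original notebook): $x$ is the feature vector, $\hat{x}$ its reconstruction, $y$ the target vector, and $\hat{y}_{\mathrm{ae}}$, $\hat{y}_{\mathrm{mlp}}$ the two target predictions.

```latex
\mathcal{L}
  = \mathrm{MSE}(x, \hat{x})
  + \mathrm{MSE}(y, \hat{y}_{\mathrm{ae}})
  + \mathrm{MSE}(y, \hat{y}_{\mathrm{mlp}})
```

Passing `loss_weights` to `model.compile` would rescale the three terms if one of them dominates.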

With hyperparameters from a few searches (slightly different from the ones in the code below), initial results on validation, with zero feature neutralization, look quite good and fairly different from the new data’s example predictions:

Next steps and other thoughts:

  • Tune hyperparameters with Optuna
  • Ensemble CV folds and multiple models
  • Try different combinations of loss functions and targets
  • Train each era as a batch (use tf.keras.utils.Sequence)
  • Try different combinations of ensembling the target outputs
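
For the “train each era as a batch” idea in the list above, here is a framework-free sketch of just the batching logic, assuming each row carries an era label. `era_batches` is a hypothetical helper; to feed `model.fit` it would still need to be wrapped in something like a `tf.keras.utils.Sequence`, which is not shown here.

```python
import numpy as np


def era_batches(eras, shuffle=True, seed=0):
    """Yield one array of row indices per era (hypothetical helper)."""
    eras = np.asarray(eras)
    unique_eras = np.unique(eras)
    if shuffle:
        # Shuffle the order of the eras, not the rows inside an era
        unique_eras = np.random.default_rng(seed).permutation(unique_eras)
    for era in unique_eras:
        yield np.flatnonzero(eras == era)


# Toy era labels: three eras of unequal size
eras = ["era1", "era1", "era2", "era3", "era3", "era3"]
for idx in era_batches(eras, shuffle=False):
    print(idx)
```

Note that batch sizes then vary with era size, which may interact with batch normalisation and the learning rate.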

Here is the modified code for Numerai predictions from Yirun’s notebook to get you started, but you may need a few dependencies and other variable definitions. There may be errors or things that could be done better; I appreciate any input:

import tensorflow as tf
from datetime import datetime


def create_architecture(num_columns, num_labels, hidden_units, dropout_rates, lr=1e-3):
    tf.keras.backend.clear_session()

    inp = tf.keras.layers.Input(shape=(num_columns,))
    x0 = tf.keras.layers.BatchNormalization()(inp)

    # Encoder: Gaussian noise (stddev taken from dropout_rates[0]), then a dense bottleneck
    encoder = tf.keras.layers.GaussianNoise(dropout_rates[0])(x0)
    encoder = tf.keras.layers.Dense(hidden_units[0])(encoder)
    encoder = tf.keras.layers.BatchNormalization()(encoder)
    encoder = tf.keras.layers.Activation("swish")(encoder)

    decoder = tf.keras.layers.Dropout(dropout_rates[1])(encoder)
    decoder = tf.keras.layers.Dense(num_columns, name="decoder")(decoder)

    # Auxiliary head: predict the targets from the decoder's reconstruction
    x_ae = tf.keras.layers.Dense(hidden_units[1])(decoder)
    x_ae = tf.keras.layers.BatchNormalization()(x_ae)
    x_ae = tf.keras.layers.Activation("swish")(x_ae)
    x_ae = tf.keras.layers.Dropout(dropout_rates[2])(x_ae)

    out_ae = tf.keras.layers.Dense(num_labels, activation="sigmoid", name="ae_targets")(
        x_ae
    )

    # Main MLP input: normalized original features concatenated with the encoder output
    x = tf.keras.layers.Concatenate()([x0, encoder])
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.Dropout(dropout_rates[3])(x)

    for i in range(2, len(hidden_units)):
        x = tf.keras.layers.Dense(hidden_units[i])(x)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.Activation("swish")(x)
        x = tf.keras.layers.Dropout(dropout_rates[i + 2])(x)

    out = tf.keras.layers.Dense(num_labels, activation="sigmoid", name="targets")(x)

    model = tf.keras.models.Model(inputs=inp, outputs=[decoder, out_ae, out])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
        loss={
            "decoder": tf.keras.losses.MeanSquaredError(),  # how does the decoder do on translating back to features
            "ae_targets": tf.keras.losses.MeanSquaredError(),  # how does the decoder do with predicting targets
            "targets": tf.keras.losses.MeanSquaredError(),
        },
    )

    return model

targets = [
    "target_nomi_20",
    "target_jerome_20",
    "target_janet_20",
    "target_ben_20",
    "target_alan_20",
    "target_paul_20"
]

# Epoch seconds; strftime('%s') is a platform-specific extension and fails on some systems
model_name = f"keras_{int(datetime.now().timestamp())}"

params = {
    "num_columns": len(feature_names),
    "num_labels": len(targets),
    "hidden_units": [96, 96, 896, 448, 448, 256],
    "dropout_rates": [
        0.035,
        0.035,
        0.4,
        0.1,
        0.4,
        0.3,
        0.25,
        0.4,
    ],
    "lr": 1e-4,
}

model = create_architecture(**params)

history = model.fit(
    x=train_data[feature_names].values,
    # One target array per model output, in order: decoder, ae_targets, targets
    y=[
        train_data[feature_names].values,
        train_data[targets].values,
        train_data[targets].values,
    ],
    epochs=1,
)

I have been using a DAE on the legacy dataset but haven’t got around to tailoring it for the new dataset. For what it is worth, here is the Python class that I have been using in PyTorch:

import torch.nn as nn


class Denoise_Autoencoder(nn.Module):
    def __init__(self, in_dimension, embedding_dimension=10):
        super().__init__()
        # Encoder: dropout on the input corrupts it, then two hidden layers compress to the embedding
        self.encoder = nn.Sequential(
            nn.Dropout(p=0.1),
            nn.Linear(in_dimension, 256),
            nn.BatchNorm1d(256),
            nn.Hardswish(),
            nn.Dropout(p=0.1),
            nn.Linear(256, 128),
            nn.BatchNorm1d(128),
            nn.Hardswish(),
            nn.Dropout(p=0.1),
            nn.Linear(128, embedding_dimension),
        )
        # Decoder: mirrors the encoder and maps the embedding back to the input dimension
        self.decoder = nn.Sequential(
            nn.BatchNorm1d(embedding_dimension),
            nn.Hardswish(),
            nn.Dropout(p=0.1),
            nn.Linear(embedding_dimension, 128),
            nn.BatchNorm1d(128),
            nn.Hardswish(),
            nn.Dropout(p=0.1),
            nn.Linear(128, 256),
            nn.BatchNorm1d(256),
            nn.Hardswish(),
            nn.Dropout(p=0.1),
            nn.Linear(256, in_dimension),
        )

    def forward(self, x):
        # Return both the embedding (usable as features) and the reconstruction
        embedding = self.encoder(x)
        decode = self.decoder(embedding)
        return embedding, decode

From a crude first look, it seems the only significant difference from Yirun’s approach is their use of Gaussian noise, which I shall try in due time.
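
For comparison, the two corruption schemes can be sketched in NumPy: additive Gaussian noise versus dropout-style masking. This is a standalone illustration rather than code from either model; the function names and the toy input are assumptions. In the real layers (Keras `GaussianNoise`, PyTorch `nn.Dropout`) the corruption is applied only during training.

```python
import numpy as np

rng = np.random.default_rng(0)


def gaussian_corrupt(x, stddev):
    # Additive zero-mean noise: every entry is perturbed slightly
    return x + rng.normal(0.0, stddev, size=x.shape)


def dropout_corrupt(x, p):
    # Inverted dropout: zero each entry with probability p, rescale the rest
    keep = rng.random(x.shape) >= p
    return x * keep / (1.0 - p)


x = np.full((2, 5), 0.5)  # toy feature matrix
print(gaussian_corrupt(x, stddev=0.035))
print(dropout_corrupt(x, p=0.1))
```

Gaussian noise nudges all features a little, whereas dropout removes a few features entirely and leaves the others rescaled.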

Another departure in my own approach is that I extract the features, concatenate them with the original features (this was OK with the old dataset; I am not sure about the new one) and run the result through different tree models in LightGBM.


The autoencoders look very similar. Are you training on an era by era basis?

There is a lot you can do with the latent space (using it as features for other models, concatenating it to a subset of features, etc.). The multitask training and combination with autoencoder being trained at the same time just seemed particularly elegant in this case.

I kind of used a “stupid and dumb” approach: feature generation with the DAE is a completely separate step from the training pipeline. I just extract the middle layer (the last layer of the encoder). Because this is unsupervised learning, I do it on train+val+test all in one go. I then use the DAE model to generate features for each live round.

As for the actual model training in the supervised context, i.e. with targets, it is basically repeated CV on separate eras. I didn’t even do a time split for any of my legacy models, and they still do quite OK.
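
One reading of “CV on separate eras” is that whole eras are held out together, so no era contributes rows to both training and validation. Here is a minimal NumPy sketch under that assumption; `era_kfold` is a hypothetical helper, not the poster’s code. scikit-learn’s `GroupKFold` offers a similar grouping guarantee.

```python
import numpy as np


def era_kfold(eras, n_splits=3, seed=0):
    """Yield (train_idx, valid_idx) with whole eras held out together."""
    eras = np.asarray(eras)
    # Shuffle the eras, then deal them into n_splits groups
    unique_eras = np.random.default_rng(seed).permutation(np.unique(eras))
    for fold_eras in np.array_split(unique_eras, n_splits):
        valid_mask = np.isin(eras, fold_eras)
        yield np.flatnonzero(~valid_mask), np.flatnonzero(valid_mask)


# Toy data: six eras with two rows each
eras = np.repeat(["era1", "era2", "era3", "era4", "era5", "era6"], 2)
for train_idx, valid_idx in era_kfold(eras):
    print(sorted(set(eras[valid_idx])))
```

Repeating the procedure with different seeds gives the “repeated” part; this split ignores time ordering, as the post says.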

For the new dataset, yes, I completely agree there are tons of things to try along this direction :slight_smile:

That is a good idea. For this model, the autoencoder could still be pre-trained or fine-tuned on train+val+test to leverage the full unsupervised data, although that would add leakage to the validation stats.

I doubt - this is Qinong’s abysmal diagnostics :joy:

Glad to see my solution being discussed here. The unsupervised autoencoder, which uses no label information, can be trained on train+valid+test. However, the supervised version in my solution needs extra care because of label leakage, so I trained it in every fold to prevent leakage.

BTW, the highlight in my winning solution is the sample weight training which gives a boost to both public and private scores. But I am not sure if it can be useful in Numerai.


The feature vector compressed by the encoder is just another version of the original features; I don’t understand why it would make the model better.

Nice! Was pretty blown away when I saw the 1st place solution for Jane Street. Last time I saw someone win a tabular Kaggle competition with pure NNs was Porto Seguro 4 years ago, which also used denoising autoencoders.

Very cool to see that you implemented this for Numerai @jrai ! @hedgingcat, nice to see that you are also part of the Numerai community! :wink:

Depending on your cross-val scheme, I would be wary of adding the valid + test data in the autoencoder loop: you don’t want to test a model on data in 2018 that uses 2020 data for dimensionality reduction.
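
One way to respect that ordering is a walk-forward split, where everything fitted for a fold (including the autoencoder) sees only eras earlier than the validation eras. A minimal sketch, assuming eras are numbered chronologically; `walk_forward_splits` is a hypothetical helper, and no gap between training and validation eras is added here.

```python
import numpy as np


def walk_forward_splits(era_numbers, n_splits=3):
    """Yield (train_idx, valid_idx) where training eras precede validation eras."""
    era_numbers = np.asarray(era_numbers)
    unique_eras = np.unique(era_numbers)  # sorted ascending
    # Cut the eras into n_splits + 1 chronological chunks; the first is train-only
    chunks = np.array_split(unique_eras, n_splits + 1)
    for k in range(1, n_splits + 1):
        valid_eras = chunks[k]
        train_mask = era_numbers < valid_eras.min()
        valid_mask = np.isin(era_numbers, valid_eras)
        yield np.flatnonzero(train_mask), np.flatnonzero(valid_mask)


# Toy data: eras 1..8 with two rows each
era_numbers = np.repeat(np.arange(1, 9), 2)
for train_idx, valid_idx in walk_forward_splits(era_numbers):
    print(era_numbers[train_idx].max(), "<", era_numbers[valid_idx].min())
```

scikit-learn’s `TimeSeriesSplit` implements a similar expanding-window scheme on row indices.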

Here is some further autoencoder stuff. I used a DAE in TPS Jan 21 & Feb 21 Competitions @kaggle.
Write up Jan:
link to write up

Write up Feb:
link to write up
