The winner of the Kaggle Jane Street competition posted their models and some discussion. Numerai and that Kaggle competition are fairly similar: both use low-signal market data, and in both you can use multiple targets to help predict the single target on which you’re ultimately scored. The initial idea for this model architecture came from this notebook and this paper (Deep Bottleneck Classifiers in Supervised Dimension Reduction).
The author of the initial code explains: “The idea of using an encoder is to denoise the data.” The competition winner, Yirun Zhang, explains the model really well (I’ve made a few edits so it’s more applicable to the Numerai dataset):
"Deep Learning Model:
- Use autoencoder to create new features, concatenating with the original features as the input to the downstream MLP model
- Train autoencoder and MLP together
- Add target information to autoencoder (supervised learning) to force it to generate more relevant features, and to create a shortcut for backpropagation of gradient
- Add Gaussian noise layer before encoder for data augmentation and to prevent overfitting
- Use swish activation function instead of ReLU to prevent ‘dead neuron’ and smooth the gradient
- Batch Normalisation and Dropout are used for MLP
- Only monitor the MSE loss of MLP instead of the overall loss for early stopping"
Here is Yirun’s diagram:
The Numerai architecture is the same, but we can just use regression loss functions instead of classification loss functions (i.e. MSE instead of BCE). We can also use a different number of targets: for example, you can have the model predict all of the 20-day targets at once, and then the final prediction would be the mean of all of those predictions. My artistic interpretation of how it would look for Numerai:
The model outputs three different vectors: 1) it tries to recreate the feature vector after the autoencoder has compressed the feature space into a latent space, 2) it uses the decoder output from the autoencoder to try to predict the targets (which forces the latent space to contain features more relevant to the targets), and 3) it uses a normal MLP to predict the multiple targets at once, and those predictions can be averaged or ensembled for a final prediction.
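For that last ensembling step, here is a minimal sketch of what averaging the MLP head’s output could look like; it assumes a trained model built with the code below and a hypothetical live_features array of feature values:

import numpy as np

# the model returns [reconstructed_features, ae_target_preds, mlp_target_preds]
_, _, mlp_target_preds = model.predict(live_features)

# average across the target columns to get a single prediction per row
final_prediction = np.mean(mlp_target_preds, axis=1)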
With some hyperparameter searching (settings slightly different from those in the code below), initial results on validation (with zero feature neutralization) look quite good and fairly different from the new data’s example predictions:
Next steps and other thoughts:
- Tune hyperparameters with Optuna
- Ensemble CV folds and multiple models
- Try different combinations of loss functions and targets
- Train each era as a batch (use tf.keras.utils.Sequence; see the sketch after this list)
- Try different combinations of ensembling the target outputs
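On the era-as-batch point, here is a minimal sketch of a tf.keras.utils.Sequence that yields one era per batch. It assumes train_data has an “era” column, reuses the feature_names and targets lists from the code below, and shapes y as [features, targets, targets] to match the three model outputs:

import tensorflow as tf

class EraSequence(tf.keras.utils.Sequence):
    """One era per batch, shaped for the three-output model below."""

    def __init__(self, df, feature_names, target_names):
        self.df = df
        self.feature_names = feature_names
        self.target_names = target_names
        self.eras = df["era"].unique()

    def __len__(self):
        return len(self.eras)

    def __getitem__(self, idx):
        era_df = self.df[self.df["era"] == self.eras[idx]]
        X = era_df[self.feature_names].values
        y = era_df[self.target_names].values
        # outputs are [decoder, ae_targets, targets], so y is [features, targets, targets]
        return X, [X, y, y]

# hypothetical usage: model.fit(EraSequence(train_data, feature_names, targets), epochs=1)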
Here is the modified code for Numerai predictions from Yirun’s notebook to get you started; you may need a few dependencies and other variable definitions. There may be errors or things that can be done better, so I’d appreciate any input:
import tensorflow as tf
from datetime import datetime


def create_architecture(num_columns, num_labels, hidden_units, dropout_rates, lr=1e-3):
    tf.keras.backend.clear_session()

    inp = tf.keras.layers.Input(shape=(num_columns,))
    x0 = tf.keras.layers.BatchNormalization()(inp)

    # encoder: Gaussian noise for augmentation, then compress into the latent space
    encoder = tf.keras.layers.GaussianNoise(dropout_rates[0])(x0)
    encoder = tf.keras.layers.Dense(hidden_units[0])(encoder)
    encoder = tf.keras.layers.BatchNormalization()(encoder)
    encoder = tf.keras.layers.Activation("swish")(encoder)

    # decoder: reconstruct the original features from the latent space
    decoder = tf.keras.layers.Dropout(dropout_rates[1])(encoder)
    decoder = tf.keras.layers.Dense(num_columns, name="decoder")(decoder)

    # supervised head on the decoder output: predict the targets from the reconstruction
    x_ae = tf.keras.layers.Dense(hidden_units[1])(decoder)
    x_ae = tf.keras.layers.BatchNormalization()(x_ae)
    x_ae = tf.keras.layers.Activation("swish")(x_ae)
    x_ae = tf.keras.layers.Dropout(dropout_rates[2])(x_ae)
    out_ae = tf.keras.layers.Dense(num_labels, activation="sigmoid", name="ae_targets")(x_ae)

    # MLP on the original features concatenated with the encoded features
    x = tf.keras.layers.Concatenate()([x0, encoder])
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.Dropout(dropout_rates[3])(x)

    for i in range(2, len(hidden_units)):
        x = tf.keras.layers.Dense(hidden_units[i])(x)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.Activation("swish")(x)
        x = tf.keras.layers.Dropout(dropout_rates[i + 2])(x)

    out = tf.keras.layers.Dense(num_labels, activation="sigmoid", name="targets")(x)

    model = tf.keras.models.Model(inputs=inp, outputs=[decoder, out_ae, out])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
        loss={
            "decoder": tf.keras.losses.MeanSquaredError(),  # how does the decoder do on translating back to features
            "ae_targets": tf.keras.losses.MeanSquaredError(),  # how does the decoder do with predicting targets
            "targets": tf.keras.losses.MeanSquaredError(),
        },
    )
    return model
# assumes feature_names (the list of feature columns) and train_data are already defined
targets = [
    "target_nomi_20",
    "target_jerome_20",
    "target_janet_20",
    "target_ben_20",
    "target_alan_20",
    "target_paul_20",
]

model_name = f"keras_{datetime.now().strftime('%s')}"

params = {
    "num_columns": len(feature_names),
    "num_labels": len(targets),
    "hidden_units": [96, 96, 896, 448, 448, 256],
    "dropout_rates": [
        0.035,
        0.035,
        0.4,
        0.1,
        0.4,
        0.3,
        0.25,
        0.4,
    ],
    "lr": 1e-4,
}
model = create_architecture(**params)
history = model.fit(
    x=train_data[feature_names].values,
    y=[
        train_data[feature_names].values,  # decoder: reconstruct the features
        train_data[targets].values,        # ae_targets
        train_data[targets].values,        # targets
    ],
    epochs=1,
)
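Finally, on the last bullet in Yirun’s description (monitoring only the MLP’s MSE for early stopping), a minimal sketch, assuming you also pass validation data to model.fit. Keras names each output’s loss after its layer, so the MLP head shows up as val_targets_loss:

# monitor only the MLP head's loss ("targets" output), not the summed loss
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_targets_loss",
    patience=5,
    restore_best_weights=True,
)

# hypothetical usage: add validation_data=(val_X, [val_X, val_y, val_y]) and
# callbacks=[early_stopping] to the model.fit call above, with more epochs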