AutoEncoder and multitask MLP on new dataset (from Kaggle Jane Street)

@olivepossum correct me if I'm wrong, but the additive Gaussian noise doesn't seem to match the TensorFlow one. Shouldn't the sigma be absolute rather than relative?
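For reference, Keras's GaussianNoise layer uses an absolute standard deviation (noise is drawn as N(0, stddev) regardless of the input's scale). A minimal PyTorch sketch of a layer with those semantics might look like this; the class name and sigma value are just illustrative:

```python
import torch
import torch.nn as nn

class AdditiveGaussianNoise(nn.Module):
    """Adds zero-mean Gaussian noise with an *absolute* sigma
    (matching Keras GaussianNoise), active only in training mode."""
    def __init__(self, sigma: float = 0.1):
        super().__init__()
        self.sigma = sigma

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training and self.sigma > 0:
            # absolute noise: x + eps, not the relative form x * (1 + eps)
            return x + torch.randn_like(x) * self.sigma
        return x

layer = AdditiveGaussianNoise(sigma=0.1)
layer.train()
x = torch.zeros(4, 8)
noisy = layer(x)  # noise is injected only while layer.training is True
```

A relative variant would instead scale the noise by the input, e.g. `x * (1 + eps)`, which behaves differently for features far from zero.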

@jrai thanks for the comments. I thought you just did CV on the hyperparameters in “params”. I have also tried DenseNet and “autofeature” (similar to an autoencoder, but it uses a subnetwork to predict masked features); I needed to use a differentiable Spearman correlation loss to get reasonable results, since MSE did not work well in my previous network. I have also added a bunch of other losses, for example maximizing MMC for the example predictions and minimizing feature exposure.

How many epochs do you train before you see the loss going down? On my potato PC I trained for 10 epochs and the loss of the provided code was just stationary.

The loss should go down right away if everything is flowing correctly.

@jrai did you manage to get good live results with this architecture? With my ‘port’ to PyTorch I managed to get good validation scores, but I guess I have used that data way too much.

No, unfortunately not yet. Here's one account running this MLP + AE architecture plus feature neutralization: Numerai. It started off looking promising. It's definitely easy to overfit, and that's likely what I did, but I also like the model's metamodel correlation. It may still perform well in different regimes. I'm also rolling out some other models to test this architecture with other parameters.


Hi @olivepossum, I just noticed your PyTorch implementation, that's great. I am going to try the code, but I did notice something in your training loop: at the beginning of each iteration, you need to call optimizer.zero_grad(). PyTorch accumulates gradients by default, so what you're doing here is adding the gradients to the previous values at each step and backpropagating, which isn't usually what we want. For most vanilla MLP-style optimization, we need to call zero_grad at the top of the loop to reset all gradients to zero before computing the loss.

Just in general, when considering and comparing alternative models, we should always look at cross-validation scores. That way we compare out-of-sample performance on as much data as possible. After a long journey I'm fairly convinced this is really the only way to do it. For neural nets it always adds a layer of complexity to the algorithm, since you need to automate things like how long to train for, or early stopping; the training time itself should even be optimized. I think this is the only way to do it objectively. If we tune the model by just looking at validation performance after fitting to training data, we can't tell whether the improved performance holds on other folds of the data or only on validation.
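The idea above (score each model out-of-sample across folds, and let early stopping decide the epoch count per fold rather than fixing it up front) can be sketched like this. The helper names and the toy era layout are my own assumptions, not from the thread:

```python
import numpy as np

def era_folds(eras, n_folds=4):
    """Split the unique eras into contiguous folds, so each
    validation fold is a block of eras held out together."""
    return np.array_split(np.unique(eras), n_folds)

def early_stop(val_losses, patience=3):
    """Stop when the best validation loss hasn't improved
    for `patience` consecutive epochs."""
    best = int(np.argmin(val_losses))
    return len(val_losses) - 1 - best >= patience

# usage sketch: 12 toy eras with 5 rows each
eras = np.repeat(np.arange(12), 5)
folds = era_folds(eras, n_folds=4)
for fold in folds:
    val_mask = np.isin(eras, fold)   # validation rows for this fold
    # train on ~val_mask, record validation loss each epoch,
    # and break out of the epoch loop when early_stop(...) is True
```

Splitting by whole eras (rather than shuffling rows) keeps rows from the same era out of both train and validation, which matters because rows within an era are not independent.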


You mean it should look like this, right?

for epoch in range(epochs):
    acc_loss_train, batch_count = 0.0, 0
    for era in eras:
        batch_count += 1
        x, y = get_data(era)
        out_decoder, out_ae, out = aemlp(x)
        loss_out_decoder = func_loss_out_decoder(out_decoder, x)  # reconstruction loss
        loss_out_ae = func_loss_out_ae(out_ae, y)                 # autoencoder-head target loss
        loss_out = func_loss_out(out, y)                          # main-head target loss
        loss = (loss_out_decoder + loss_out_ae + loss_out) / 3
        optimizer.zero_grad()   # clear accumulated gradients before backprop
        loss.backward()
        optimizer.step()
        acc_loss_train += loss.item()  # accumulate every era, not just the last one
    loss_train = acc_loss_train / batch_count
    check_early_stopping(validation_data)  # evaluate once per epoch

Yes, exactly. I might just put it all the way at the top for good measure, but yes, that should work.


So what is the target y supposed to be? A vector with all the targets?

EDIT: More importantly, what shape of data should I feed to aemlp? I have it instantiated, but when I try to feed it one era of data, it throws errors about the shape of the data, and I can't reshape it to work at the moment.

In PyTorch, every sample in a batch needs the same shape, so I would suggest using the number of eras as the batch size (first dimension), the number of stocks as the second dimension, and the number of features as the last dimension.
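A minimal sketch of building that (eras, stocks, features) tensor, assuming each era comes as its own (stocks, features) matrix; the sizes here are toy values:

```python
import torch

# toy setup: 3 eras, 50 stocks per era, 310 features
n_eras, n_stocks, n_features = 3, 50, 310
per_era = [torch.randn(n_stocks, n_features) for _ in range(n_eras)]

# stack the per-era matrices along a new first (batch) dimension,
# giving shape (n_eras, n_stocks, n_features)
x = torch.stack(per_era, dim=0)
```

Note that torch.stack requires every era to have the same number of stocks; if era sizes differ, you would need to pad them to a common length first, or feed one era at a time with a batch size of 1.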