NN architecture for >0.03 CORR on validation set

Hi,

I want to implement a test for this in PyTorch. My idea is to select, let’s say, 50 features (based on MDA, xgboost’s feature importance, or any other criteria) and then build 50 models to create the extra features. Each model would have 49 of the 50 features as input and would try to predict the one left out. I would save the values of the last intermediate layer of each model, and these values would be the engineered features used in my final model.

The flow would be something like this:
1.- Build a model for each feature to predict that feature using the rest of the features as input, and store the last intermediate layer of each model.
2.- Merge/Join intermediate layer values with the initial dataset (train_data). It would contain the initial features and the engineered ones.
3.- Train a final model on all those features to predict the target (a rough sketch of this two-stage idea follows below).
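
For what it’s worth, here is a minimal PyTorch sketch of this two-stage idea. All class names, layer sizes, and the leave-one-out wiring are illustrative assumptions, and the stage-1 training loop is omitted; this is not the article’s method, just the separate-process version described above.

```python
import torch
import torch.nn as nn

class FeatureNet(nn.Module):
    """Predicts one held-out feature from the remaining ones."""
    def __init__(self, n_inputs, hidden=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(n_inputs, 64), nn.ReLU(),
            nn.Linear(64, hidden), nn.ReLU(),   # last intermediate layer (kept as engineered features)
        )
        self.head = nn.Linear(hidden, 1)        # predicts the held-out feature

    def forward(self, x):
        h = self.body(x)
        return self.head(h), h

# Stage 1: train one FeatureNet per selected feature (training loop omitted here),
# then collect the hidden activations as engineered features.
def engineered_features(nets, X):
    # X: tensor of shape (n_rows, n_selected_features); nets[i] was trained to predict column i
    cols = []
    with torch.no_grad():
        for i, net in enumerate(nets):
            x_rest = torch.cat([X[:, :i], X[:, i + 1:]], dim=1)  # leave feature i out
            _, h = net(x_rest)
            cols.append(h)
    return torch.cat(cols, dim=1)

# Stage 2: merge with the original features and train the final model on the target, e.g.
# X_final = torch.cat([X, engineered_features(nets, X)], dim=1)
```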

But reading the article, it says:
The trick is making sure that the feature networks train with the final model rather than a separate process.

My approach would definitely not do that so I guess I’m missing something. Why is this important?

The article also says:
Because we have several auxiliary outputs, we need to tell TensorFlow how much weight to give each one in determining how to adjust the model to improve accuracy. I personally like to give 50% weight to the auxiliary predictions (total) and 50% to the target prediction. Some might find it strange to give any weight to the auxiliary predictions since they are discarded at the loss calculation step. The problem is, if we do not give them any weight, the model will mostly ignore them, preventing it from learning useful features.
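
In PyTorch terms, that weighted combination of auxiliary and target losses might look roughly like this. It is only a sketch: the 50/50 split comes from the quote, while the MSE losses and all names are assumptions.

```python
import torch.nn.functional as F

def combined_loss(target_pred, target, aux_preds, aux_targets,
                  target_weight=0.5, aux_weight=0.5):
    # Main loss: prediction of the actual target.
    main = F.mse_loss(target_pred, target)
    # Auxiliary losses: each feature network predicting its held-out feature.
    aux = sum(F.mse_loss(p, t) for p, t in zip(aux_preds, aux_targets)) / len(aux_preds)
    # Without aux_weight > 0 the feature networks would mostly be ignored.
    return target_weight * main + aux_weight * aux
```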

Again, I wouldn’t be doing anything like that, so I’m wondering whether an approach like what I have in mind makes sense at all or I’m missing something (I do not have deep knowledge of NNs).

Thanks!

Why don’t you just try to implement what’s in the article? It works…

1 Like

My approach seems easier to implement in PyTorch to me, but as I mentioned, I’m not an expert on the subject. If my approach makes sense conceptually I would go for it, but if not, I would do more research to replicate the implementation from the article.

1 Like

So if you don’t mind me asking, how is that model doing on live data?

You can check it out here: Numerai

4 Likes

Cool. Thanks for sharing. Performance seems ok, given that the last rounds were kind of weird anyway, but maybe not as good as your validation results promised?

It needs some more time until we figure it out…

1 Like

It is a long game for sure. At a quick glance your test6 is outperforming test14. If you can say, was the corr on val2 and diagnostics with test6 higher than for 14, or is 6 returning higher corr at the moment despite having a lower corr on val?

6 has lower validation corr and higher live corr than 14. At least for the last 2 weeks, which doesn’t say much…

1 Like

Looking good, at least for the first two scores, but it needs more time.

@nyuton thanks for sharing! I am going to implement it. Can you share any details regarding your network architecture? E.g., how many layers / hidden units did you use in each of your 100 feature networks?

Hi,

sorry, I keep that to myself :slight_smile:
But the model is for sale! Someone just asked about it. Contact me in private if you are interested.

1 Like

I’m trying a variation of this, and reading the article you linked, he mentions: “The trick is making sure that the feature networks train with the final model rather than a separate process.” Does anyone here have an intuition or data to show why this makes sense? If your feature-producing networks are changing during the training of the main network, isn’t that just going to make it more difficult for the main network to find any connections? Also, what was your reasoning for using 100 features rather than something like 10?

1 Like

I’ve been trying to cache the outputs of the smaller networks to save memory and reduce training time but I am considering switching to an end-to-end model like the one in the article.

Training the feature extractors with the main network in one cohesive unit MIGHT slow down training, but it shouldn’t prevent it from converging. Each extractor is basically trying to minimize two loss functions: its given feature and the target further down the network. There are quite a few neural network architectures that optimize multiple loss functions, like VAEs, GANs, and YOLO. I suppose you could think of the extractors as trying to predict a given feature while maintaining some “relevance” to the target.

Since the hidden layers being passed from each extractor to the big network are constantly changing during training, I suspect that feeding the big network the original features too helps with stability. It has something static to learn from while the extractors converge to a more stable state. Maybe that instability also has a regularizing effect, like dropout or noise? I’m not sure.
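
One way to wire this up end to end in PyTorch, with each extractor exposing both its hidden layer (fed to the main network alongside the original features) and an auxiliary head (trained to reconstruct its held-out feature), might look roughly like the sketch below. The layer sizes and structure are assumptions, not the article’s exact architecture.

```python
import torch
import torch.nn as nn

class EndToEndModel(nn.Module):
    """Feature extractors trained jointly with the final model (end to end)."""
    def __init__(self, n_features, hidden=16):
        super().__init__()
        # One small extractor per feature; each one sees the other n_features - 1 columns.
        self.extractors = nn.ModuleList(
            [nn.Sequential(nn.Linear(n_features - 1, hidden), nn.ReLU())
             for _ in range(n_features)]
        )
        # Auxiliary heads: each extractor also predicts its held-out feature.
        self.aux_heads = nn.ModuleList(
            [nn.Linear(hidden, 1) for _ in range(n_features)]
        )
        # The main network sees the original features plus every extractor's hidden layer.
        self.main = nn.Sequential(
            nn.Linear(n_features + n_features * hidden, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, x):
        hiddens, aux_preds = [], []
        for i, (ext, head) in enumerate(zip(self.extractors, self.aux_heads)):
            x_rest = torch.cat([x[:, :i], x[:, i + 1:]], dim=1)  # leave feature i out
            h = ext(x_rest)
            hiddens.append(h)
            aux_preds.append(head(h))
        # The original (static) features are concatenated with the (moving) hidden layers.
        main_in = torch.cat([x] + hiddens, dim=1)
        return self.main(main_in), aux_preds
```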

1 Like

I have already established a set of 100 highly relevant features from previous experiments. And it’s also much faster to train with 100 than with 310.
Using all 310 features also gives good results, but 100 is slightly better and a lot faster.

10 features are not enough to get anywhere…

2 Likes

Hi nyuton,

What is your val CORR when you train a model using this set of 100 chosen features without the method described in the article? Couldn’t the selected features be the key to the high validation score, rather than the described method? I’m asking because: 1. I have a set of features which gives me 0.03 val CORR with simple LightGBM boosting; 2. that would explain why nobody else can get decent results with the method from the article.

Mark

4 Likes

Hi JackerParker,

This model is stronger than any other model I have, when trained on the full dataset. I experimented with tuned XGB, RF, and MLP models.
But the feature selection definitely improves performance! No doubt about it.

3 Likes

Am I missing something? If it’s a competition, why would you teach everyone to make the same model? In my opinion it takes away the fun of it. Just my opinion.

1 Like

@crownholder There are many aspects to a model and its training that can make a big difference to performance beyond the basic choice of architecture; e.g. choice of loss function, which activation function, the kernel initialiser, any regularisation, optimiser selection, random seed, learning rate strategy, batch size, early stopping settings… Then there’s the data itself: do you train on everything, a subset (and which subset), and so on. So describing a basic approach to an architecture and giving that to 100 people will inevitably result in 100 different models that could perform very differently. And part of the fun is becoming aware of what could be a new approach, having a go at implementing it, and perhaps putting one’s own spin on it.

5 Likes

I’m also working on a variation, precisely because I don’t know how to implement what you mention in PyTorch:
The trick is making sure that the feature networks train with the final model rather than a separate process
Any reading resource on training one NN together with another NN in PyTorch is more than welcome!
Regarding features, I select a subset of features using Marcos López de Prado’s MDA technique. My subset has slightly more than 100 features.
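
In PyTorch, the “train the feature networks with the final model” part mostly comes down to putting every sub-network’s parameters under one optimizer and backpropagating a single combined loss. Here is a minimal sketch, assuming a model that returns (target_pred, aux_preds) as in the earlier sketches in this thread, a standard DataLoader of (features, target) batches, and the 50/50 weighting from the article quote; everything else (names, MSE losses, hyperparameters) is assumed.

```python
import torch
import torch.nn as nn

def train_end_to_end(model, train_loader, epochs=10, lr=1e-3):
    """Train the feature extractors and the final model in one backward pass.

    `model` is assumed to return (target_pred, aux_preds); `train_loader` is
    assumed to yield (features, target) batches.
    """
    # One optimizer over *all* parameters, so the feature networks are updated
    # together with the final model rather than in a separate process.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()

    for _ in range(epochs):
        for x, y in train_loader:
            optimizer.zero_grad()
            target_pred, aux_preds = model(x)
            # Auxiliary losses: each extractor reconstructs its held-out feature.
            aux_loss = sum(loss_fn(p.squeeze(-1), x[:, i])
                           for i, p in enumerate(aux_preds)) / len(aux_preds)
            # 50/50 weighting between the target loss and the auxiliary losses.
            loss = 0.5 * loss_fn(target_pred.squeeze(-1), y) + 0.5 * aux_loss
            loss.backward()   # gradients flow into the extractors and the main net together
            optimizer.step()
```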