I would like to share one of my new experiments. I tried pre-training an NN on pseudo-labels: I took my ensemble's predictions and trained the model on them. To my surprise, it achieves a higher validation CORR than models trained on the training data.
What I did:
get predictions on the tournament data (test set)
cut out the validation part
minmax scale the predictions
train NN on the “new” dataset
fine-tune on training set
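The steps above can be sketched roughly as follows. This is only a minimal illustration under several assumptions, not the exact setup used: a small linear model trained with SGD stands in for the NN, "cut out the validation part" is read as dropping the validation rows so they stay held out, and the `(data_type, features)` row layout, the function names, the epoch counts and the lower fine-tuning learning rate are all hypothetical.

```python
def minmax_scale(values):
    """Rescale a list of numbers linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant predictions: avoid division by zero
        return [0.5] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def predict(weights, x):
    """Linear prediction; a stand-in for the NN forward pass."""
    return sum(w * xi for w, xi in zip(weights, x))


def sgd_fit(weights, rows, targets, epochs, lr):
    """Continue training `weights` in place with plain SGD on squared error."""
    for _ in range(epochs):
        for x, y in zip(rows, targets):
            err = predict(weights, x) - y
            for j, xi in enumerate(x):
                weights[j] -= lr * err * xi
    return weights


def build_pseudo_dataset(tournament_rows, ensemble_preds):
    """Steps 1-3: pair tournament rows with ensemble predictions,
    drop the validation rows, then min-max scale the remaining predictions.

    tournament_rows: list of (data_type, features) tuples (hypothetical layout).
    ensemble_preds:  one ensemble prediction per tournament row.
    """
    kept = [
        (features, pred)
        for (data_type, features), pred in zip(tournament_rows, ensemble_preds)
        if data_type != "validation"
    ]
    rows = [features for features, _ in kept]
    pseudo_labels = minmax_scale([pred for _, pred in kept])
    return rows, pseudo_labels


def pretrain_then_finetune(pseudo_rows, pseudo_labels, train_rows, train_targets):
    """Steps 4-5: pre-train on pseudo-labels, then fine-tune on real targets."""
    weights = [0.0] * len(pseudo_rows[0])
    # Pre-training on the "new" dataset built from the ensemble predictions.
    sgd_fit(weights, pseudo_rows, pseudo_labels, epochs=20, lr=0.05)
    # Fine-tuning on the training set; the smaller learning rate is an assumption.
    sgd_fit(weights, train_rows, train_targets, epochs=5, lr=0.01)
    return weights
```

Because the validation rows never enter the pseudo-labelled set, the validation score is still measured on data the pre-trained model has not seen, although the ensemble that produced the pseudo-labels may have been tuned against it.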
Validation score is great and the first live results are also promising.
I guess good quality predictions on the test set are key to this exercise.
The new dataset gives a great validation CORR even without fine-tuning on the training set.
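For readers who want to run the same comparison, here is a rough sketch of a validation score. It assumes CORR means a Spearman-style rank correlation between predictions and targets, which the post does not define, and the function names are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / (var_a * var_b) ** 0.5


def rank_corr(preds, targets):
    """Spearman-style correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(preds), average_ranks(targets))
```

Scoring the pseudo-label model and a model trained only on the training data with the same function, on the same held-out validation rows, keeps the comparison like for like.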
Hi Nyuton, I only just saw this post, but thanks for sharing it.
Any thoughts on why that happened? Is the trend continuing, by the way?
It seems that you generated a synthetic dataset from the test features, and that this helps the overall training. I still struggle to see how the test predictions of a model trained on the training dataset could carry any extra information that improves model performance.
I guess, however, that this is more of a philosophical question about synthetic data itself…