Self-supervised learning on pseudo labels


I would like to share one of my new experiments. I tried pre-training a NN on pseudo labels: I took the predictions of my ensemble and trained the model on them. To my surprise, it achieves a higher validation CORR than models trained on the training data.

What I did:

  • get predictions on the tournament data (test set)
  • cut out the validation part
  • minmax scale the predictions
  • train NN on the “new” dataset
  • fine-tune on training set
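
For readers who want to try the data preparation steps above, here is a minimal sketch using NumPy. The function name `build_pseudo_label_set`, the boolean `is_validation` mask, and the toy arrays are all assumptions made for illustration; they are not taken from the original pipeline.

```python
import numpy as np

def build_pseudo_label_set(features, ensemble_preds, is_validation):
    """Drop validation rows, then min-max scale the remaining predictions to [0, 1].

    All names here are hypothetical; this only illustrates the listed steps.
    """
    keep = ~np.asarray(is_validation, dtype=bool)        # cut out the validation part
    X = np.asarray(features)[keep]
    y = np.asarray(ensemble_preds, dtype=float)[keep]
    span = y.max() - y.min()
    if span == 0:
        raise ValueError("predictions are constant; cannot min-max scale")
    y = (y - y.min()) / span                             # min-max scale
    return X, y

# Toy stand-ins for tournament features and ensemble predictions.
rng = np.random.default_rng(0)
features = rng.random((8, 3))
ensemble_preds = np.array([0.48, 0.52, 0.50, 0.47, 0.55, 0.49, 0.51, 0.53])
is_validation = np.array([False, False, True, True, False, False, False, False])

X_pseudo, y_pseudo = build_pseudo_label_set(features, ensemble_preds, is_validation)
```

Scaling after the validation rows are removed means the scaled labels depend only on rows that are actually used for pre-training.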

Validation score is great and the first live results are also promising.
I guess good quality predictions on the test set are key to this exercise.
The new dataset gives a great validation CORR even without fine-tuning on the training set.
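
The pre-train then fine-tune idea can be sketched with a tiny NumPy network. Everything below is an assumption for illustration: the one-hidden-layer architecture, the learning rates and epoch counts, and the synthetic data standing in for pseudo labels and training targets. It is not the model used in the experiment.

```python
import numpy as np

def init_mlp(n_features, n_hidden, rng):
    """Small one-hidden-layer network; sizes are illustrative only."""
    return {
        "W1": rng.normal(0.0, 0.1, (n_features, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.1, (n_hidden, 1)),
        "b2": np.zeros(1),
    }

def predict(params, X):
    hidden = np.tanh(X @ params["W1"] + params["b1"])
    return (hidden @ params["W2"] + params["b2"]).ravel()

def mse(params, X, y):
    return float(np.mean((predict(params, X) - y) ** 2))

def train(params, X, y, lr, epochs):
    """Full-batch gradient descent on MSE; updates params in place."""
    n = len(y)
    for _ in range(epochs):
        hidden = np.tanh(X @ params["W1"] + params["b1"])
        err = (hidden @ params["W2"] + params["b2"]).ravel() - y
        grad_out = (2.0 / n) * err[:, None]                            # shape (n, 1)
        grad_hidden = (grad_out @ params["W2"].T) * (1 - hidden ** 2)  # shape (n, h)
        params["W2"] -= lr * hidden.T @ grad_out
        params["b2"] -= lr * grad_out.sum(axis=0)
        params["W1"] -= lr * X.T @ grad_hidden
        params["b1"] -= lr * grad_hidden.sum(axis=0)
    return params

# Synthetic stand-ins: continuous pseudo labels and 5-level "real" targets.
rng = np.random.default_rng(0)
true_w = rng.normal(size=5)
X_pseudo = rng.random((400, 5))
y_pseudo = 1 / (1 + np.exp(-(X_pseudo - 0.5) @ true_w))
X_train = rng.random((100, 5))
y_train = np.round(4 / (1 + np.exp(-(X_train - 0.5) @ true_w))) / 4

params = init_mlp(5, 16, rng)
loss_before = mse(params, X_pseudo, y_pseudo)
train(params, X_pseudo, y_pseudo, lr=0.05, epochs=300)   # pre-train on pseudo labels
loss_after_pretrain = mse(params, X_pseudo, y_pseudo)
train(params, X_train, y_train, lr=0.01, epochs=100)     # fine-tune with a smaller step
```

The smaller learning rate in the second call is one common way to keep fine-tuning from overwriting what was learned in pre-training. Skipping the second call corresponds to the "without fine-tuning" variant.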

Have a great day!


Pseudo labels would be very different from the original labels, even if you min-max scale them.

I found this in rocket chat:

I think binning the pseudo labels into [0, 0.25, 0.5, 0.75, 1.0] would work better, and I will start my own experiments too.
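
One possible reading of this suggestion, as a rough sketch: snap each min-max scaled pseudo label to the nearest of the five levels. The function name `bin_pseudo_labels` is hypothetical, and nearest-level rounding is only one of several ways to bin; rank- or quantile-based binning would give different class proportions.

```python
import numpy as np

def bin_pseudo_labels(scaled_preds):
    """Snap values in [0, 1] to the nearest of 0, 0.25, 0.5, 0.75, 1.0."""
    y = np.clip(np.asarray(scaled_preds, dtype=float), 0.0, 1.0)
    return np.round(y * 4) / 4

binned = bin_pseudo_labels([0.0, 0.1, 0.3, 0.6, 0.9, 1.0])
```

With plain rounding, the share of rows in each bin follows the shape of the prediction distribution, so it will not necessarily match the proportions of the original labels.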

Hi Nyuton, I only just saw this post. Thanks for sharing it.
Any thoughts on why that happened? Is the trend continuing, by the way?

It seems that you generated a synthetic dataset from the test features and that this helps the overall training. I still struggle to see how the test predictions of a model trained on the training dataset could add information that improves the model's performance.
I guess, however, that this is more of a philosophical question about synthetic data itself…