I’m trying to apply MLPs to Numerai and keep running into an issue where many models converge to the same score (~ -0.00311 corr). These networks output a nearly constant value for all samples (near 0.5, but not quite), and the weights in the layer(s) with L2 regularization applied are all very close to 0. The problem came up more often as I increased the depth of the network.
I believe the problem stems from L2 regularization combined with the noisy nature of Numerai data. L2 acts as a kind of weight decay, and the noisy data pushes the weights back and forth around 0 enough that eventually, one by one, they all settle near 0. Removing L2 regularization always resolves the problem, but it also hurts performance badly (corr is roughly cut in half without L2).
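To make the suspected mechanism concrete, here’s a toy one-weight sketch (pure Python, all constants arbitrary, not the actual Numerai setup): a weak linear signal buried in heavy noise, trained with SGD. With a strong enough L2 coefficient, the steady decay term dominates the weak, noisy gradient signal and drags the weight to ~0, even though there is real (if faint) signal to learn.

```python
import random

random.seed(0)

def train(lam, lr=0.005, steps=20_000):
    """SGD on a single weight. `lam` is a hypothetical L2 coefficient.

    The target has a weak linear signal (coefficient 0.5) swamped by
    noise, loosely mimicking a low signal-to-noise setting.
    """
    w = 1.0
    for _ in range(steps):
        x = random.gauss(0, 1)
        y = 0.5 * x + random.gauss(0, 1)   # weak signal + heavy noise
        grad = 2 * (w * x - y) * x         # d/dw of the squared error
        w -= lr * (grad + lam * w)         # lam * w is the L2 penalty's gradient
    return w

w_plain = train(lam=0.0)    # should settle near the true coefficient, 0.5
w_l2 = train(lam=20.0)      # decay term dominates; collapses toward 0
print(f"no L2: {w_plain:.3f}   strong L2: {w_l2:.3f}")
```

The point isn’t that L2 is wrong in general, just that when the per-sample gradients are mostly noise, the only *consistent* force on the weights is the decay term.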
I thought increasing the learning rate would help avoid this, since larger weight updates should keep the weights away from 0, but I found the opposite to be true. The problem also seems more frequent with smaller (more traditional) batch sizes below 512. My intuition was that smaller batches would help, since the stochastic gradients would bounce the weights around more, but I’m finding the best results with larger batch sizes (~4k). Both observations run counter to what I would expect, so am I missing something here?
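One mechanical detail that may bear on the learning-rate observation: with plain SGD, an L2 penalty λ enters the update as w ← (1 − ηλ)w − η∇L, so the per-step decay factor scales with the learning rate η. When the data gradient averages to roughly zero (all noise, no consistent signal), a *larger* learning rate shrinks the weights faster, not slower. A minimal sketch of just the decay term (λ and η values arbitrary):

```python
# SGD + L2 update rewritten: w <- (1 - lr*lam)*w - lr*grad.
# If the noisy data gradients average to ~0, what remains is pure
# geometric decay, and the decay rate grows with the learning rate.
lam = 1e-3  # hypothetical L2 coefficient

final = {}
for lr in (0.01, 0.1):
    w = 1.0
    for _ in range(10_000):
        w *= (1 - lr * lam)   # decay-only steps; data gradient assumed ~0
    final[lr] = w
    print(f"lr={lr}: w after 10k steps = {w:.3f}")
```

Under this reading, the higher learning rate losing to collapse is exactly what you’d expect, and adaptive optimizers (Adam etc.) complicate the picture further but don’t remove the effect.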
If L2 regularization really is to blame, what is the right solution? I tried replacing L2 with batch normalization, but that performs much worse than L2. Using large batch sizes doesn’t really seem like the right fix, but it’s the best I have right now.