Feature reversing input noise

A powerful way to regularize neural networks is by applying noise during training, whether to the inputs, hidden-unit activations, weights, or gradients. An early example of this is the additive Gaussian noise applied to the inputs of denoising autoencoders. A more recent example is Dropout, in which multiplicative binomial noise is applied to inputs or hidden-unit activations. While training a network to be invariant or robust to these types of noise can be beneficial, such noise lacks the structure of other variations we may also wish to be invariant or robust to. For example, in image classification it is common to randomly generate variants of the training images by rotating, rescaling, color shifting, etc., in order to encourage the network to learn classification rules that are invariant to semantically trivial changes in the images. Is there an analogue to this type of data augmentation that could be used with Numerai data? I don’t think there’s anything quite as conceptually clean, but I think we can do better than standard types of noise.
A major concern when modeling the data is taking on too much feature exposure, because a feature could unexpectedly reverse the sign of its correlation with the target, and over-dependence on that feature would then wreck prediction performance. Ideally, we would like our models to be robust to feature reversals. It is of course impossible to be completely robust to an extreme (and hopefully unlikely) situation where all of the features reverse their correlation with the target. But I have found that reversing the sign of a randomly selected 25% of features at each iteration is quite beneficial during training. The network naturally learns to reduce its maximum feature exposure and tends to spread exposure across many features rather than relying mostly on only a few. Interestingly, different choices of network architecture can lead to networks that perform similarly on validation but have very different feature exposure profiles. Below are plots from four models showing their feature exposure (i.e. feature correlation with prediction per era) for each validation era. It is clear that the exposure patterns are quite different and often strongly opposing. The max feature exposures were also generally < 0.2, which is usually difficult to obtain without explicitly applying a penalty on exposure or applying feature neutralization.
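For concreteness, here is a minimal sketch of how per-era feature exposure could be computed. The "era" and "prediction" column names and the feature_cols list are assumptions about your dataframe layout, not part of any official tooling:

import pandas as pd

def feature_exposures(df: pd.DataFrame, feature_cols: list) -> pd.DataFrame:
    # Per-era correlation of each feature with the model's prediction
    return df.groupby("era").apply(
        lambda era: era[feature_cols].corrwith(era["prediction"])
    )

# Max absolute exposure in each era, e.g.:
# max_exposure_per_era = feature_exposures(df, feature_cols).abs().max(axis=1)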
Try it yourself and let me know what you think. It should be combinable with other ideas as well. After training with this, I like to further reduce feature exposure as described in my response here: Model Diagnostics: Feature Exposure

Practical tips and suggestions:

  • Make sure your features and targets are centered at 0 by subtracting 0.5 from each. (You should already be doing something like this if you’re training neural networks. If not, SHAME!)
  • Use early stopping
  • Use other kinds of noise in your network as well, e.g. Dropout and/or the CoupledGaussianDropout I invented and included below because I’m feeling generous
  • Use eras as mini-batches (see the training-loop sketch after this list)
  • Try different optimizers. I really like Follow The Moving Leader (easily found using Google for your favorite NN framework)
  • Experiment with different architectures; both standard feedforward nets and nets with residual connections work IME
  • I like training neural networks like making good BBQ: low (learning rate) and slow (many epochs)
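To tie several of these tips together, here is a minimal training-loop sketch. It assumes model is a network whose first layer is the FeatureReversalNoise module defined below, and that era_batches and val_batches are lists of (features, targets) tensor pairs, one pair per era; these names and the hyperparameter values are illustrative rather than my exact setup:

import torch
import torch.nn as nn

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # low and slow (swap in an FTML implementation if you have one)
loss_fn = nn.MSELoss()

best_val, bad_epochs, patience = float("inf"), 0, 20
for epoch in range(2000):  # many epochs
    model.train()
    for features, targets in era_batches:  # one era per mini-batch
        optimizer.zero_grad()
        loss = loss_fn(model(features).squeeze(-1), targets)
        loss.backward()
        optimizer.step()

    # Early stopping on validation loss
    model.eval()
    with torch.no_grad():
        val_loss = sum(loss_fn(model(f).squeeze(-1), t).item() for f, t in val_batches)
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break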




# To be used on input to a neural network. Make sure input is centered at 0!
import torch
import torch.nn as nn

class FeatureReversalNoise(nn.Module):
    def __init__(self, p=0.25):
        super(FeatureReversalNoise, self).__init__()
        if p < 0 or p > 1:
            raise ValueError("probability has to be between 0 and 1, but got {}".format(p))
        self.p = p

    def forward(self, x):
        if self.training:
            # Sample a +1/-1 mask per feature (shared across the batch):
            # each feature is reversed with probability p
            binomial = torch.distributions.binomial.Binomial(probs=1 - self.p)
            noise = 2 * binomial.sample((1, x.shape[1])) - 1
            return x * noise.to(x.device)  # match the input's device
        else:
            return x

# This is used to add noise to neural net activations. It differs from the Gaussian Dropout
# suggested in the original Dropout paper in that the scale of the noise is proportional to
# the activation, such that the activation level equals the variance of the noise (times alpha).
# Kinda like how real neurons have Poisson-ish noise
class CoupledGaussianDropout(nn.Module):
    def __init__(self, alpha=1.0):
        super(CoupledGaussianDropout, self).__init__()
        self.alpha = alpha

    def forward(self, x):
        if self.training:
            stddev = torch.sqrt(torch.clamp(torch.abs(x), min=1e-6)).detach()
            epsilon = torch.randn_like(x) * self.alpha

            epsilon = epsilon * stddev

            return x + epsilon
        else:
            return x
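For reference, here is a hypothetical way to wire the two modules above into a simple feedforward net. The layer sizes are illustrative, not my exact architecture (the tournament data at the time had 310 features):

net = nn.Sequential(
    FeatureReversalNoise(p=0.25),       # reverse a random 25% of the (centered) input features
    nn.Linear(310, 256),
    nn.ReLU(),
    CoupledGaussianDropout(alpha=1.0),  # activation-proportional noise on hidden units
    nn.Linear(256, 1),
)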

Also, rather than choose one of the four models above, I ensembled them all, along with my XGBoost model (⅕ weight each), to produce a final prediction and then reduced maximum feature exposure down to 0.075. Given the good validation performance I was seeing for this model, I uploaded it under my NMRO account for round 245. The metrics are below and overall look better than any other model I have, so I’m fairly optimistic about it. (The exposure number below is higher than 0.075 because I reranked the predictions after doing the exposure reduction optimization.)
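For anyone wanting to reproduce the ensembling step, a minimal sketch follows. The rank transform before averaging is my assumption (consistent with the reranking mentioned above), and preds, a list of the five models' prediction arrays, is hypothetical:

import numpy as np
from scipy.stats import rankdata

# Equal-weight (1/5 each) ensemble of the four NNs and the XGBoost model:
# rank each model's predictions onto a common (0, 1] scale, then average
ensemble = np.mean([rankdata(p) / len(p) for p in preds], axis=0)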


With such a high val mean and low feature exposure, I would expect MMC to be a lot higher; that’s surprising.

Hi, it’s me! The craziest Brazilian newbie ever.

I’m from the ghetto, and ghetto boys don’t have enough compute power to run Michael Oliver’s super cool NNs. But I have imagination, and I’ve prepared some adjustments for folks like me who have limited computing power but still wanna get rich… I mean, improve the metamodel, of course.

What I did was take Michael Oliver’s idea and treat it as a general regularization method that can be used perfectly well with a boosted trees algorithm (for example). I also noticed that the conceptual structure fits well with the boosted eras algorithm.

So these are my adaptations:

  • Forget NNs and special custom loss functions; ghetto boys use XGBoost and classical Feature Neutralization

  • Before you start with the random reversing thing, you can train your XGBoost for some iterations (50, 100, or 200 is enough), as a kind of bootstrap or something

  • And when doing the random reversing training part, you can iterate more than once before reversing random features again. That’s why I said it “fits well with the boosted eras algorithm”

OK, so with those adjustments I was able to produce this little guy here:

I’ve made a function for doing the reversing thing (in R):

random_slicer <- function(features, slice_percent){
  # Feature columns live in positions 4:313 (310 features scaled to [0, 1])
  train_slice <- features[, 4:313]
  # Randomly pick slice_percent of the 310 features to reverse
  feat_list <- sort(sample(1:310, round(310 * slice_percent)))

  # Reflect each selected feature around 0.5, i.e. x -> 1 - x
  for(i in seq_along(feat_list)){
    train_slice[, feat_list[i]] <- (-1) * (train_slice[, feat_list[i]] - 0.5) + 0.5
  }
  train_slice
}
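Here is a sketch of the full bootstrap-then-reverse schedule, using Python's xgboost API for illustration (the R version is analogous). X_train and y_train are assumed to be the 0-1 scaled training features and targets, and the round counts and parameters are placeholders:

import numpy as np
import xgboost as xgb

# Mirror of random_slicer: reflect a random fraction of the (0-1 scaled)
# feature columns around 0.5, i.e. x -> 1 - x
def reverse_random_features(X, slice_percent=0.25):
    X = X.copy()
    cols = np.random.choice(X.shape[1], round(X.shape[1] * slice_percent), replace=False)
    X[:, cols] = 1.0 - X[:, cols]
    return X

params = {"objective": "reg:squarederror", "learning_rate": 0.01, "max_depth": 5}

# "Bootstrap": train normally for some iterations first
booster = xgb.train(params, xgb.DMatrix(X_train, label=y_train), num_boost_round=100)

# Then alternate: reverse a fresh random subset of features and continue boosting
# for several rounds before reversing again
for _ in range(50):
    dtrain = xgb.DMatrix(reverse_random_features(X_train), label=y_train)
    booster = xgb.train(params, dtrain, num_boost_round=10, xgb_model=booster)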

Very nice! I was hoping someone would try this with XGBoost as well :grinning: Looks like you got it working pretty nicely and have a great Feature Neutral Mean score, congrats!


Thank you Michael. I hope to contribute more to the community over the coming months and years; my quant skills have become far more professional since I joined the tournament as an active community member. So I still have a lot to give back!

I’ll work on something new related to this method, and I’ll share it if I succeed.

For now, here is the Diagnostics A/B test for what I did in my previous reply. On the left is the model without the technique and on the right the model with the technique. Both use the same xgb parameters, total iterations, and 100% FN.

Regards
Eric Reis


Hi MDO,

Thanks for the interesting post. Regarding the feature and target centering when using NNs: shouldn’t this step be unnecessary if the NN layers have biases?

Thanks

It generally helps convergence, since you don’t have to move the biases as much, and in this case you really want the features centered if they’re being multiplied by -1.

This last statement almost sounds like a commercial from Arby’s :cut_of_meat: :cowboy_hat_face:
