Feature reversing input noise

A powerful way to regularize neural networks is by applying noise during training, whether it be to the inputs, hidden-unit activations, weights, or gradients. An early example of this is additive Gaussian noise applied to the inputs of denoising autoencoders. A more recent example is Dropout, in which multiplicative binomial noise is applied to inputs or hidden unit activations. While training a network to be invariant or robust to these types of noise can be beneficial, such noise lacks structure that we may wish to be invariant/robust to as well. For example, in image classification it is common to randomly generate variants of images in the training set by rotating, rescaling, color shifting, etc. in order to encourage the network to learn classification rules that are invariant to semantically trivial changes in the images. Is there an analogue to this type of data augmentation that could be used with Numerai data? I don’t think there’s anything quite as conceptually clean, but I think we can do better than standard types of noise.
A major concern when modeling the data is taking on too much feature exposure: a feature could unexpectedly reverse the sign of its correlation with the target, and over-dependence on that feature would then wreck prediction performance. Ideally, we would like our models to be robust to feature reversals. It is of course impossible to be completely robust to an extreme (and hopefully unlikely) situation where all of the features reverse their correlation with the target. But I have found that reversing the sign of a randomly selected 25% of features at each training iteration is quite beneficial. The network naturally learns to reduce its maximum feature exposure and tends to spread exposure across many features rather than relying mostly on only a few. Interestingly, different choices of network architecture can lead to networks that perform similarly on validation but have very different feature exposure profiles. Below are plots from four models showing their feature exposure (i.e. the correlation of each feature with the prediction, per era) for each validation era. It is clear that the exposure patterns are quite different and often strongly opposing. The max feature exposures were also generally < 0.2, which is usually difficult to obtain without explicitly applying a penalty on exposure or applying feature neutralization.
Try it yourself and let me know what you think. It should be combinable with other ideas as well. I like to follow up training with this noise by further reducing feature exposure, as described in my response here: Model Diagnostics: Feature Exposure
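
For readers who want to reproduce the exposure numbers, here is a minimal sketch of feature exposure as described above (the correlation of each feature with the prediction, computed per era). It assumes plain Pearson correlation and NumPy arrays; the function names `feature_exposure_per_era` and `max_feature_exposure` and the toy data are made up for illustration and are not from the original post.

```python
import numpy as np

def feature_exposure_per_era(features, preds, eras):
    """Return {era: 1-D array of per-feature Pearson correlations with preds}."""
    exposures = {}
    for era in np.unique(eras):
        mask = eras == era
        x = features[mask]
        p = preds[mask]
        # Center both sides, then compute Pearson correlation column by column.
        xc = x - x.mean(axis=0)
        pc = p - p.mean()
        denom = np.sqrt((xc ** 2).sum(axis=0) * (pc ** 2).sum())
        exposures[era] = (xc * pc[:, None]).sum(axis=0) / denom
    return exposures

def max_feature_exposure(exposures):
    """Largest absolute exposure over all eras and features."""
    return max(np.abs(e).max() for e in exposures.values())

# Toy data (assumption): 2 eras of 100 rows, 5 features.
rng = np.random.default_rng(0)
features = rng.random((200, 5))
eras = np.repeat(np.array(["era1", "era2"]), 100)
preds = features[:, 0].copy()  # toy prediction that is just feature 0
exposures = feature_exposure_per_era(features, preds, eras)
```

In this toy case the prediction is a copy of feature 0, so that feature's exposure is 1 in every era, which is the kind of over-dependence the reversal noise is meant to discourage.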

Practical tips and suggestions:

  • Make sure your features and targets are centered at 0, by subtracting 0.5 from each. (You should already be doing something like this if you’re training neural networks. If not, SHAME!)
  • Use early stopping
  • Use other kinds of noise in your network as well, e.g. Dropout and/or the CoupledGaussianDropout I invented and put below because I’m feeling generous
  • Use eras as mini-batches
  • Try different optimizers. I really like Follow The Moving Leader (easily found using Google for your favorite NN framework)
  • Experiment with different architectures; standard feedforward nets and nets with residual connections both work IME
  • I like training neural networks like making good BBQ: low (learning rate) and slow (many epochs)

import torch
import torch.nn as nn

## To be used on input to neural network. Make sure input is centered at 0!
class FeatureReversalNoise(nn.Module):
    def __init__(self, p=0.25):
        super(FeatureReversalNoise, self).__init__()
        if p < 0 or p > 1:
            raise ValueError("probability has to be between 0 and 1, but got {}".format(p))
        self.p = p

    def forward(self, x):
        if self.training:
            # Draw one sign per feature: +1 with probability 1-p, -1 with probability p
            binomial = torch.distributions.binomial.Binomial(probs=1-self.p)
            noise = 2*binomial.sample((1, x.shape[1])) - 1
            # Same signs for every row in the batch; follow x's device instead of assuming CUDA
            return x * noise.to(x.device)
        # No noise at evaluation time
        return x

# This is used to add noise to neural net activations. It differs from the Gaussian Dropout suggested
# in the original Dropout paper in that the scale of the noise is tied to the activation: the noise
# variance is proportional to the activation magnitude (|x| times alpha squared). Kinda like how real
# neurons have Poisson-ish noise
class CoupledGaussianDropout(nn.Module):
    def __init__(self, alpha=1.0):
        super(CoupledGaussianDropout, self).__init__()
        self.alpha = alpha

    def forward(self, x):
        if self.training:
            # Noise scale is sqrt(|x|); detached so gradients do not flow through it
            stddev = torch.sqrt(torch.clamp(torch.abs(x), min=1e-6)).detach()
            epsilon = torch.randn_like(x) * self.alpha

            epsilon = epsilon * stddev

            return x + epsilon
        # No noise at evaluation time
        return x

Also, rather than choosing one of the four models above, I ensembled them all, along with my XGBoost model (⅕ weight each), to produce a final prediction and then reduced maximum feature exposure down to 0.075. Given the good validation performance I was seeing for this model, I uploaded it under my NMRO account for round 245. The metrics are below and overall look better than any other model I have, so I’m fairly optimistic about it. (The exposure number below is higher than 0.075 because I reranked the predictions after doing the exposure reduction optimization.)


With such a high val mean and low feature exposure, I would expect MMC to be a lot higher; that’s surprising.

Hi, it’s me! The craziest Brazilian newbie ever.

I’m from the ghetto, and ghettoboys don’t have enough compute power to run Michael Oliver’s super cool NNs. But I have imagination, and I prepared some adjustments for folks like me who have limited computing power but still wanna get rich… I mean improve the metamodel, of course.

What I did was take Michael Oliver’s idea and treat it as a general regularization method that can work perfectly well with a boosted trees algorithm (for example). I also noticed that the conceptual structure fits well with the boosted eras algorithm.

So these are my adaptations:

  • Forget NNs and special custom loss functions; ghettoboys use XGBoost and classical Feature Neutralization

  • Before you start with the random reversing, you can train your XGBoost for some iterations (50, 100 or 200 is enough), as a kind of bootstrap

  • During the random reversing part of training, you can run more than one iteration before reversing random features again. That’s why I said that it “fits well with the boosted eras algorithm”
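
One possible reading of the schedule in the list above, sketched in Python with NumPy only: a warm-up block with no reversal, followed by blocks of several boosting rounds that each get a fresh random set of reversed features. The function `reversal_schedule` and its parameters (`warmup`, `rounds_per_flip`, `p`, `seed`) are hypothetical names for illustration, and the actual XGBoost training calls are deliberately left out since the post does not show them.

```python
import numpy as np

def reversal_schedule(n_features, n_rounds, warmup=100, rounds_per_flip=10, p=0.25, seed=0):
    """Yield (start_round, end_round, signs) blocks covering n_rounds boosting rounds."""
    rng = np.random.default_rng(seed)
    start = 0
    if warmup > 0 and n_rounds > 0:
        # Warm-up block: train on the unmodified features.
        end = min(warmup, n_rounds)
        yield start, end, np.ones(n_features)
        start = end
    while start < n_rounds:
        end = min(start + rounds_per_flip, n_rounds)
        # Each feature is reversed (-1) with probability p for this block.
        signs = np.where(rng.random(n_features) < p, -1.0, 1.0)
        yield start, end, signs
        start = end

blocks = list(reversal_schedule(n_features=310, n_rounds=150, warmup=100, rounds_per_flip=20))
```

For each block, the features fed to the booster would be `(X - 0.5) * signs + 0.5`, so reversed columns become `1 - x` and the rest are unchanged.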

OK, so with those adjustments I was able to produce this little guy here:

I’ve made a function for doing the reversing thing (in R):

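random_slicer <- function(features, slice_percent){

  # Columns 4:313 hold the 310 feature columns
  train_slice <- features[,4:313]
  feat_list <- sort(sample(1:310,round(310*slice_percent)))

  # Reverse each selected feature around 0.5 (equivalent to 1 - x)
  for(i in seq_along(feat_list)){
    train_slice[,feat_list[i]] <- (-1)*(train_slice[,feat_list[i]] - 0.5) + 0.5
  }
  # The original snippet is cut off here; returning the modified slice is an assumption
  train_slice
}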

Very nice! I was hoping someone would try this with XGBoost as well :grinning: Looks like you got it working pretty nicely and have a great Feature Neutral Mean score, congrats!


Thank you, Michael. I hope to contribute more to the community in the coming months and years; my quant skills have become much more professional since I joined the tournament as an active community member. So I still have a lot to give back!

I’ll work on something new related to this method and will share it if I succeed.

For now, here is the Diagnostics A/B test for the results in my previous reply. On the left is the model without the technique, and on the right is the model with the technique. Both use the same xgb parameters, total iterations and 100% FN.

Eric Reis



Thanks for the interesting post. Regarding the feature and target centering when using NN shouldn’t this step be unnecessary if the NN layers have biases?


It generally helps convergence since you don’t have to move biases as much, and in this case you really want them to be centered if you’re multiplying by -1.
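
A small NumPy illustration of why centering matters for the sign flip: multiplying an uncentered feature in [0, 1] by -1 pushes it into [-1, 0], while flipping a centered feature keeps it in the same range. The toy values and variable names below are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.choice([0.0, 0.25, 0.5, 0.75, 1.0], size=(4, 6))  # toy features in [0, 1]

# Random sign per feature (column): -1 with probability p, +1 otherwise.
p = 0.25
signs = np.where(rng.random(x.shape[1]) < p, -1.0, 1.0)

uncentered = x * signs        # flipped columns land in [-1, 0]: the range shifts
centered = (x - 0.5) * signs  # flipped columns stay in [-0.5, 0.5]
restored = centered + 0.5     # back on the original [0, 1] scale

# On the original scale, a flipped column is simply 1 - x.
flipped = signs < 0
```

This is the same transformation as `(-1)*(x - 0.5) + 0.5` in the R function earlier in the thread.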


This last statement almost sounds like a commercial from Arby’s :cut_of_meat: :cowboy_hat_face:


is this still running in NMRO? rd 245 resolution didn’t look too good.

Judging anything based on one round is a bad idea. The recent rounds have also been especially weird.


These are great insights!

I’ve been working with XGBoost and sklearn a bit on numerai data, but I’m new to PyTorch.

Any pointers on how I would go about creating the mini-batches from the eras, as you suggest?

Thank you for the tips!
What is the intuition behind eras as mini-batches? Wouldn’t we want each step to be in the direction of lower loss values in more eras, as opposed to one era? I’ve been using quite large batches on shuffled data and it seemed to result in better risk metrics performance over non-shuffled data (which is different than eras as mini-batches but similar)


Can anyone point to some code with pytorch where we use the eras as mini batches? Been cracking my head about this for 1 week now.

eras = df.era.unique()
for era in eras:
   dfs = df[df.era == era]
   x = torch.from_numpy(dfs[features].values).float()
   y = torch.from_numpy(dfs.target.values).float()
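
A slightly fuller sketch of the same idea, written with NumPy only so it runs without a deep learning framework: a generator that yields one era per mini-batch and optionally shuffles the era order each epoch. The name `era_batches` and the toy arrays are assumptions; in a PyTorch loop each `X` and `y` could be converted with `torch.from_numpy(...).float()` as in the snippet above.

```python
import numpy as np

def era_batches(features, targets, eras, rng=None):
    """Yield (era, X, y) once per era; era order is shuffled when rng is given."""
    unique_eras = np.unique(eras)
    if rng is not None:
        unique_eras = rng.permutation(unique_eras)
    for era in unique_eras:
        mask = eras == era
        yield era, features[mask], targets[mask]

# Toy data (assumption): 3 eras of different sizes, 4 features.
rng = np.random.default_rng(0)
eras = np.array(["era1"] * 3 + ["era2"] * 5 + ["era3"] * 2)
features = rng.random((10, 4)) - 0.5  # toy features, centered at 0
targets = rng.random(10) - 0.5        # toy targets, centered at 0

batch_sizes = {}
for epoch in range(2):
    for era, X, y in era_batches(features, targets, eras, rng):
        batch_sizes[str(era)] = len(X)
        # A training step on (X, y) would go here.
```

Note that batch sizes vary from era to era, so any per-batch logic should not assume a fixed number of rows.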

Can a kind soul share the code that generates the mini batches by iterating through eras for a Keras model?

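import numpy as np
import tensorflow as tf

class DataSequence(tf.keras.utils.Sequence):

    def __init__(self, df, features, erasPerBatch=1, shuffle=True):
        self.df = df
        self.features = features
        self.shuffle = shuffle
        self.eras = df.era.unique()
        if self.shuffle == True:
            # The body of this branch is missing in the post; shuffling the era order is an assumption
            self.eras = np.random.permutation(self.eras)
        self.erasPerBatch = erasPerBatch
        # `target` was undefined in the post; the 'target' column used in __getitem__ is assumed
        self.df['target_aux'] = self.df['target']
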
    def __len__(self):
        return len(self.eras) // self.erasPerBatch

    def on_epoch_end(self):
        if self.shuffle == True:
            self.df = self.df.sample(frac=1).reset_index(drop=True)

    def __getitem__(self, idx):

        myEras = []
        for i in range(self.erasPerBatch):
            myEras.append( self.eras[idx*self.erasPerBatch+i] )
        X = self.df.loc[self.df.era.isin(myEras), self.features].values
        y = self.df.loc[self.df.era.isin(myEras), self.features + ['target_aux', 'target']].values
        X = np.split(X, X.shape[1], axis=1)
        y = np.split(y, y.shape[1], axis=1)
        return X, y

You say you use residual layers, which I would suppose need some temporal dimension. Do you use eras as temporal information? If so, how do you deal with having no era information for the live data? Is there another way to implement RNNs without a temporal dimension (which would seem weird to me), or is there a heuristic for inferring the live era?

Hi @mdo, when you mention nets with residual connections, do you mean skip connections? Would this forward function represent it, or are you referring to more complex nets with Blocks and Bottlenecks?

  def forward(self, x):
      x = self.linear0(x)
      x1 = x 
      x = F.relu(x)
      x = self.dropout(x)
      x = F.relu(self.linear1(x))
      x = self.dropout(x)
      x = F.relu(self.linear2(x))
      x = self.dropout(x)
      x = F.relu(self.linear3(x))
      x = self.dropout(x)
      x = torch.add(x, x1)
      x = F.relu(self.linear4(x))
      x = self.sigmoid(x)
      return x


@olivepossum I would look to use torch.cat or max pooling instead of torch.add. Cat is more versatile across a range of output dimensions.
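
To make the shape difference between adding and concatenating concrete, here is a small NumPy sketch (NumPy is used only so it runs without PyTorch; `torch.add` and `torch.cat` along dim 1 follow the same shape rules). The array names and sizes are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
skip = rng.standard_normal((8, 32))    # activations saved early: (batch, width)
deep = rng.standard_normal((8, 32))    # activations after several layers
narrow = rng.standard_normal((8, 16))  # a branch with a different width

# Addition needs equal widths, and the output width is unchanged.
added = skip + deep

# Concatenation accepts different widths, and the output widths add up.
concatenated = np.concatenate([skip, narrow], axis=1)
```

With concatenation, the layer that consumes the result needs its input size increased to match (48 here instead of 32), whereas addition leaves the next layer untouched.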