A powerful way to regularize neural networks is to apply noise during training, whether to the inputs, hidden-unit activations, weights, or gradients. An early example is the additive Gaussian noise applied to the inputs of denoising autoencoders. A more recent example is Dropout, in which multiplicative Bernoulli noise is applied to inputs or hidden-unit activations. While training a network to be invariant or robust to these types of noise can be beneficial, such noise lacks the kind of structure that we may also want the network to be invariant or robust to. For example, in image classification it is common to randomly generate variants of the training images by rotating, rescaling, color shifting, etc., in order to encourage the network to learn classification rules that are invariant to semantically trivial changes in the images. Is there an analogue to this type of data augmentation that could be used with Numerai data? I don’t think there’s anything quite as conceptually clean, but I think we can do better than standard types of noise.

A major concern when modeling the data is taking on too much feature exposure: a feature could unexpectedly reverse the sign of its correlation with the target, and over-dependence on that feature would then wreck prediction performance. Ideally, we would like our models to be robust to feature reversals. It is of course impossible to be completely robust to an extreme (and hopefully unlikely) situation where all of the features reverse their correlation with the target. But I have found that reversing the sign of a randomly selected 25% of features at each training iteration is quite beneficial. The network naturally learns to reduce its maximum feature exposure and tends to spread exposure across many features rather than relying mostly on only a few. Interestingly, different choices of network architecture can lead to networks that perform similarly on validation but have very different feature exposure profiles. Below are plots from four models showing their feature exposure (i.e. feature correlation with prediction per era) for each validation era. It is clear that the exposure patterns are quite different and often strongly opposing. The max feature exposures were also generally < 0.2, which is usually difficult to obtain without explicitly applying a penalty on exposure or applying feature neutralization.
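
For anyone who wants to check their own models, here is a minimal sketch of the per-era feature exposure described above (the correlation of each feature with the prediction, computed within each era). The function name `feature_exposure_by_era` and the synthetic data are made up for illustration; plain Pearson correlation is assumed.

```python
import numpy as np

def feature_exposure_by_era(features, predictions, eras):
    """Return {era: array of per-feature Pearson correlations with the prediction}.

    Hypothetical helper: features is (n_rows, n_features), predictions and
    eras are (n_rows,).
    """
    exposures = {}
    for era in np.unique(eras):
        mask = eras == era
        x = features[mask]
        p = predictions[mask]
        x_c = x - x.mean(axis=0)          # center each feature within the era
        p_c = p - p.mean()                # center the predictions within the era
        denom = np.sqrt((x_c ** 2).sum(axis=0) * (p_c ** 2).sum())
        exposures[era] = (x_c * p_c[:, None]).sum(axis=0) / denom
    return exposures

# Synthetic stand-in data (NOT real Numerai data): 3 eras, 5 features.
rng = np.random.default_rng(0)
n_rows, n_features = 600, 5
features = rng.integers(0, 5, size=(n_rows, n_features)) / 4.0
eras = np.repeat(np.array([1, 2, 3]), n_rows // 3)
# A "model" that leans almost entirely on feature 0.
predictions = 0.8 * features[:, 0] + 0.2 * rng.random(n_rows)

exposures = feature_exposure_by_era(features, predictions, eras)
max_exposure = {era: np.abs(e).max() for era, e in exposures.items()}
```

In this toy case the max exposure per era is dominated by feature 0, which is exactly the kind of concentration the reversal noise is meant to discourage.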

Try it yourself and let me know what you think. It should be combinable with other ideas as well. After training with this noise, I like to further reduce feature exposure as described in my response here: Model Diagnostics: Feature Exposure

Practical tips and suggestions:

- Make sure your features and targets are centered at 0, by subtracting 0.5 from each. (You should already be doing something like this if you’re training neural networks. If not, SHAME!)
- Use early stopping
- Use other kinds of noise in your network as well, e.g. Dropout and/or the CoupledGaussianDropout I invented and put below because I’m feeling generous
- Use eras as mini-batches
- Try different optimizers. I really like Follow The Moving Leader (easily found using Google for your favorite NN framework)
- Experiment with different architectures; standard feedforward nets and nets with residual connections both work IME
- I like training neural networks like making good BBQ: low (learning rate) and slow (many epochs)
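
A rough sketch of how a few of these tips could fit together in PyTorch: inputs and targets centered at 0, one era per mini-batch, a low learning rate, and early stopping on validation loss. Everything here is an assumption for illustration: the function name `train_with_era_batches`, the MSE loss, the tiny model, the random stand-in data, and Adam as a stand-in optimizer (Follow The Moving Leader is not part of `torch.optim`).

```python
import copy
import torch
import torch.nn as nn

def train_with_era_batches(model, train_eras, val_eras, lr=1e-4,
                           max_epochs=200, patience=10):
    """train_eras / val_eras: lists of (features, target) tensor pairs,
    one pair per era, already centered at 0. Hypothetical helper."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # stand-in optimizer
    loss_fn = nn.MSELoss()
    best_loss, best_state, bad_epochs = float("inf"), None, 0
    for epoch in range(max_epochs):
        model.train()
        # Each era is one mini-batch; visit the eras in random order.
        for i in torch.randperm(len(train_eras)).tolist():
            x, y = train_eras[i]
            optimizer.zero_grad()
            loss_fn(model(x).squeeze(-1), y).backward()
            optimizer.step()
        model.eval()
        with torch.no_grad():
            val_loss = sum(loss_fn(model(x).squeeze(-1), y).item()
                           for x, y in val_eras) / len(val_eras)
        if val_loss < best_loss:
            # Remember the best weights seen so far.
            best_loss, bad_epochs = val_loss, 0
            best_state = copy.deepcopy(model.state_dict())
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # early stopping
    if best_state is not None:
        model.load_state_dict(best_state)
    return model, best_loss

torch.manual_seed(0)

def make_era(n_rows=64, n_features=10):
    # Random stand-in data in {0, 0.25, ..., 1}, then centered by subtracting 0.5.
    raw_x = torch.randint(0, 5, (n_rows, n_features)).float() / 4
    raw_y = torch.randint(0, 5, (n_rows,)).float() / 4
    return raw_x - 0.5, raw_y - 0.5

train_eras = [make_era() for _ in range(8)]
val_eras = [make_era() for _ in range(3)]
model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1))
model, best_val_loss = train_with_era_batches(
    model, train_eras, val_eras, max_epochs=20, patience=3)
```

The noise layers would go inside `model`; because the loop toggles `model.train()` and `model.eval()`, any noise module that checks `self.training` is switched off during validation.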

```
import torch
import torch.nn as nn

# To be used on the input to the neural network. Make sure the input is centered at 0!
class FeatureReversalNoise(nn.Module):
    def __init__(self, p=0.25):
        super(FeatureReversalNoise, self).__init__()
        if p < 0 or p > 1:
            raise ValueError("probability has to be between 0 and 1, but got {}".format(p))
        self.p = p

    def forward(self, x):
        if self.training:
            # One sign per feature, shared across the batch:
            # -1 with probability p, +1 otherwise.
            binomial = torch.distributions.binomial.Binomial(probs=1 - self.p)
            noise = 2 * binomial.sample((1, x.shape[1])) - 1
            # Match the input's device and dtype instead of assuming CUDA.
            return x * noise.to(device=x.device, dtype=x.dtype)
        else:
            return x
```
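
To see why the centering matters, here is a small NumPy illustration of what the sign flip does on the original feature scale. It assumes raw features lie in [0, 1] (the five evenly spaced levels are an assumption for the toy data): flipping the sign of a centered feature is the same as mirroring the raw feature about 0.5, i.e. x → 1 − x, so the reversed feature stays in its valid range.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.25

# Toy raw features in {0, 0.25, 0.5, 0.75, 1} (assumed scale), 4 rows x 8 features.
raw = rng.integers(0, 5, size=(4, 8)) / 4.0
centered = raw - 0.5

# One sign per feature column, shared by all rows: -1 with probability p.
signs = np.where(rng.random((1, raw.shape[1])) < p, -1.0, 1.0)
noisy = centered * signs

flipped = signs[0] < 0
# Back on the raw scale, flipped columns are mirrored about 0.5 ...
mirrored_ok = np.allclose(noisy[:, flipped] + 0.5, 1.0 - raw[:, flipped])
# ... and untouched columns are unchanged.
unchanged_ok = np.allclose(noisy[:, ~flipped], centered[:, ~flipped])

# Without centering, the same flip would push raw values to [-1, 0],
# which is not a reversed feature, just a feature on a different range.
uncentered = raw * signs
```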

```
import torch
import torch.nn as nn

# This is used to add noise to neural net activations. It differs from the Gaussian Dropout suggested
# in the original Dropout paper in that the scale of the noise is coupled to the activation: the noise
# variance equals the absolute activation level times alpha squared. Kinda like how real neurons have
# Poisson-ish noise
class CoupledGaussianDropout(nn.Module):
    def __init__(self, alpha=1.0):
        super(CoupledGaussianDropout, self).__init__()
        self.alpha = alpha

    def forward(self, x):
        if self.training:
            # Noise scale is sqrt(|x|); detached so no gradient flows through the scale.
            stddev = torch.sqrt(torch.clamp(torch.abs(x), min=1e-6)).detach()
            epsilon = torch.randn_like(x) * self.alpha
            epsilon = epsilon * stddev
            return x + epsilon
        else:
            return x
```
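
As the code is written, the noise standard deviation is `alpha * sqrt(|x|)`, so the noise variance works out to `alpha**2 * |x|`. A quick NumPy check of that relationship, with arbitrary activation values and an arbitrary `alpha` picked purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.5
activations = np.array([0.25, 1.0, 4.0])   # arbitrary example activations
n_draws = 200_000

# Same recipe as the module's training branch: x + alpha * sqrt(|x|) * N(0, 1).
eps = rng.standard_normal((n_draws, activations.size))
noisy = activations + alpha * np.sqrt(np.abs(activations)) * eps

empirical_var = noisy.var(axis=0)
expected_var = alpha ** 2 * np.abs(activations)
```

Larger activations get proportionally noisier, while the mean of the noisy activation stays at the original value.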