Adversarial Validation

Many machine learning algorithms rest on a strong assumption: that the data consist of i.i.d. (independent and identically distributed) random variables. This assumption can be way off when the data are gathered from different time periods or from various sources (like different geolocations or markets), or when the number of samples is just too small. A violation of the i.i.d. assumption means that the training data may differ considerably from the test data in their statistical characteristics.

This situation makes it hard to devise a robust validation scheme to evaluate our ideas and models. To alleviate this problem, we could use adversarial validation. It is a method for selecting training examples that are most similar to test samples and then using them as our validation set.
To implement this idea, we can follow these steps:

  • pick a classifier (preferably one that can give us output probabilities)
  • label test samples as class one (y=1) and training data as class zero (y=0)
  • using a k-fold cross-validation scheme on the training data, calculate the out-of-fold (oof) probabilities of being labeled one for the training examples (we are interested in the training samples that the classifier finds hardest to tell apart from the test data)
  • select the most similar training samples (the ones with high probabilities) as the validation dataset and use the rest as the training data
  • throw the kitchen sink at the problem and validate your tries!

Here’s a Python implementation:

import pandas as pd
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import KFold

#  data
training_data = pd.read_csv("")

tournament_data = pd.read_csv("")

# features
feature_cols = training_data.columns[training_data.columns.str.startswith('feature')]

# training the classifier
kf = KFold(n_splits=5, shuffle=True, random_state=666)
oof = np.zeros(training_data.shape[0])
for i, (tdx, vdx) in enumerate(kf.split(training_data[feature_cols])):
    print(f'Fold : {i + 1}')
    # all tournament rows (label 1) plus this fold's training rows (label 0)
    x_train = pd.concat([tournament_data[feature_cols], training_data.iloc[tdx][feature_cols]])
    y_train = np.array([1] * tournament_data.shape[0] + [0] * len(tdx))
    # held-out training rows to score; their true label is 0, so no y_valid is needed
    x_valid = training_data.iloc[vdx][feature_cols]
    clf = ExtraTreesClassifier(n_estimators=100, max_features=0.65)
    clf.fit(x_train, y_train)
    # out-of-fold probability that each held-out row looks like tournament data
    oof[vdx] = clf.predict_proba(x_valid)[:, 1]
# picking the top 20% most test-like rows (oof at or above the 80th percentile) as validation data
threshold = np.percentile(oof, 80)
validation = training_data.loc[oof >= threshold]
train = training_data.loc[oof < threshold]
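Before relying on the split, it may help to check how well the classifier can separate the two sets at all. The sketch below is a self-contained variant, not the snippet above: it pools train and test rows and cross-validates over the pooled set, then reports the ROC AUC. An AUC close to 0.5 would suggest train and test are hard to tell apart, so the selected validation rows are unlikely to differ much from a random sample. The function name adversarial_scores, the synthetic data, and the hyperparameters are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict


def adversarial_scores(x_train, x_test, n_splits=5, seed=0):
    """Return (AUC, out-of-fold P(test) for each training row)."""
    x_all = np.vstack([x_train, x_test])
    # 0 = training row, 1 = test row
    y_all = np.concatenate([np.zeros(len(x_train), dtype=int),
                            np.ones(len(x_test), dtype=int)])
    clf = ExtraTreesClassifier(n_estimators=100, min_samples_leaf=5,
                               random_state=seed)
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # every row is scored by a model that never saw it during fitting
    proba = cross_val_predict(clf, x_all, y_all, cv=cv,
                              method="predict_proba")[:, 1]
    return roc_auc_score(y_all, proba), proba[:len(x_train)]


# Synthetic data (an assumption): the test set is shifted in feature 0
rng = np.random.default_rng(0)
x_train = rng.normal(size=(1000, 5))
x_test = rng.normal(size=(500, 5))
x_test[:, 0] += 1.5

auc, train_scores = adversarial_scores(x_train, x_test)
print(f"adversarial AUC: {auc:.3f}")

# the most test-like 20% of training rows become the validation set
threshold = np.percentile(train_scores, 80)
valid_mask = train_scores >= threshold
print(f"validation rows: {valid_mask.sum()}")
print(f"mean of feature 0 (validation): {x_train[valid_mask, 0].mean():.2f}")
print(f"mean of feature 0 (rest): {x_train[~valid_mask, 0].mean():.2f}")
```

On this toy data the selected validation rows should lean toward the shifted region of feature 0, which is the behaviour the method is after. On real data the AUC and the size of the shift will of course differ.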

I tried this several years ago and found that the omission of certain data made my models worse. That said, I used completely different training data than what we are using now. I would be careful because I recall that my scores using this method were more volatile, not less. YMMV.