Era-wise Time-series Cross Validation

In case you’re not aware, the time-series cross-validation code in sklearn takes a groups argument but doesn’t actually use it! I like time-series cross-validation because it prevents you from using any future information to predict out of sample: your out-of-sample test set is always in the future. So I wrote an sklearn-compatible cross-validation splitter that uses eras as groups, so your splits are always era-wise.

Below is example code for doing a hyperparameter grid search with XGBoost and era-wise time-series cross-validation. My models Niam, NMRO, and MDO were trained in exactly this way (but with different parameter ranges than are used below). MDO also drops some of the worst and best features (according to feature importance), with the exact choices of what to drop determined by this cross-validation strategy. See, nothing fancy needed to get a top 3 model :smiley: Now take this information and make even better models!

from sklearn.model_selection._split import _BaseKFold, indexable, _num_samples
from sklearn import model_selection, metrics 
import pandas as pd
import numpy as np
from xgboost import XGBRegressor
import csv
from scipy.stats import spearmanr 


with open('numerai_training_data.csv', 'r') as f:
    column_names = next(csv.reader(f))
    dtypes = {x: np.float32 for x in column_names if
              x.startswith(('feature', 'target'))}
data = pd.read_csv('numerai_training_data.csv', dtype=dtypes, header=0, index_col=0)


features = [f for f in data.columns if f.startswith("feature")]
target = "target_kazutsugi"
data["erano"] = data.era.str.slice(3).astype(int)
eras = data.erano

class TimeSeriesSplitGroups(_BaseKFold):
    def __init__(self, n_splits=5):
        super().__init__(n_splits, shuffle=False, random_state=None)

    def split(self, X, y=None, groups=None):
        X, y, groups = indexable(X, y, groups)
        n_samples = _num_samples(X)
        n_splits = self.n_splits
        n_folds = n_splits + 1
        group_list = np.unique(groups)
        n_groups = len(group_list)
        if n_folds > n_groups:
            raise ValueError(
                ("Cannot have number of folds = {0} greater"
                 " than the number of groups: {1}.").format(n_folds,
                                                            n_groups))
        indices = np.arange(n_samples)
        test_size = (n_groups // n_folds)
        test_starts = range(test_size + n_groups % n_folds,
                            n_groups, test_size)
        test_starts = list(test_starts)[::-1]  # yields the latest test window first
        for test_start in test_starts:
            yield (indices[groups.isin(group_list[:test_start])],
                   indices[groups.isin(group_list[test_start:test_start + test_size])])


def spearman(y_true, y_pred): 
    return spearmanr(y_pred, y_true).correlation 


cv_score = []
models = []
for lr in [0.006, 0.008, 0.01, 0.012, 0.014]:
    for cs in [0.06, 0.08, 0.1, 0.12, 0.14]:
        for md in [4, 5, 6]:
            models.append(XGBRegressor(colsample_bytree=cs, learning_rate=lr,
                                       n_estimators=2000, max_depth=md, nthread=8))



for model in models:
    score = np.mean(model_selection.cross_val_score(
                model,
                data[features],
                data[target],
                cv=TimeSeriesSplitGroups(5),
                n_jobs=1,
                groups=eras,
                scoring=metrics.make_scorer(spearman, greater_is_better=True)))
    cv_score.append(score)
    print(cv_score)
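To see what splits the class above produces, the fold-boundary arithmetic can be sketched with plain Python (the era count here is illustrative, not from the post):

```python
# Sketch of the fold boundaries TimeSeriesSplitGroups produces,
# using 12 toy eras and n_splits=5 (so 6 "folds": 5 test windows
# plus the initial training block).
n_groups, n_splits = 12, 5
n_folds = n_splits + 1
test_size = n_groups // n_folds            # 2 eras per test window
first = test_size + n_groups % n_folds     # any remainder stays in the first train block
folds = []
for test_start in range(first, n_groups, test_size):
    train_eras = list(range(test_start))                         # expanding window
    test_eras = list(range(test_start, test_start + test_size))  # always in the future
    folds.append((train_eras, test_eras))
    print(train_eras, "->", test_eras)
```

Each successive training set grows by one test window, which is the expanding-window behavior discussed further down the thread.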

Hi,
thanks for sharing again (also said that in the chat).
The results are like that:
[0.04468379562495979]
[0.04468379562495979, 0.04466911064704264]
[0.04468379562495979, 0.04466911064704264, 0.044610228323998906]

How should I read it?
Since it is a nested for loop, does that mean the first line is lr=0.006 + cs=0.06 + md=4, and the third line is lr=0.006 + cs=0.06 + md=6?

Any help, super appreciated!

I believe the print(cv_score) should be outside the for loop.

To answer your question @zempe; try verifying your assumptions by printing those arguments in the for loop, e.g. print(model.learning_rate, score).
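To make the mapping concrete: the nested loops enumerate parameters with max_depth varying fastest, then colsample_bytree, then learning_rate, which is the same order itertools.product gives (a quick stdlib-only sketch):

```python
from itertools import product

# Same ranges as the post; product() iterates in the same order as
# the nested loops: md varies fastest, then cs, then lr.
lrs = [0.006, 0.008, 0.01, 0.012, 0.014]
css = [0.06, 0.08, 0.1, 0.12, 0.14]
mds = [4, 5, 6]
grid = list(product(lrs, css, mds))

print(grid[0])   # (0.006, 0.06, 4) -> first score in cv_score
print(grid[2])   # (0.006, 0.06, 6) -> third score
```

So yes, the first printed score is lr=0.006, cs=0.06, md=4 and the third is lr=0.006, cs=0.06, md=6.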

@mdo can you elaborate on how your custom class differs from GroupKFold?

Thank you for sharing this code snippet!
It got me thinking about how to do proper cross validation on this dataset.

However I might have found a bug in your code.
Since the eras are strings in the pandas DataFrame, taking unique values with NumPy produces the following group_list variable:
['era1' 'era10' 'era100' 'era101' 'era102' … 'era96' 'era97' 'era98' 'era99']

The eras are not properly ordered.
This can be fixed by changing the definition of eras to take only the integer part into account:
eras = pd.Series([int(era[3:]) for era in data.era])
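The lexicographic ordering is easy to see on a small example:

```python
import numpy as np

# String eras sort lexicographically, so "era10" comes before "era2":
eras_str = np.array(["era1", "era2", "era10", "era100"])
print(np.unique(eras_str))   # order: era1, era10, era100, era2

# Stripping the "era" prefix to ints restores chronological order:
eras_int = np.array([int(e[3:]) for e in eras_str])
print(np.unique(eras_int))   # order: 1, 2, 10, 100
```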


Thanks @koerrie, good catch! It’s fixed now. I actually had it right like it is now in my original code, but I guess that’s what I get for trying to simplify code I haven’t looked at in a while :smiley:


You can just look at the corresponding object in the models list to find the parameters that go with any of the scores

[two images: cross-validation fold diagrams]
This is like the latter, but splits are only between consecutive eras/groups, which is not true for the sklearn version.
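That difference is easy to demonstrate: with equal-sized groups in time order, sklearn’s GroupKFold balances fold sizes, so its test folds can mix early and late groups (a small illustration, not from the original post):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# 6 equal-sized groups in time order (0 is earliest, 5 is latest)
groups = np.repeat(np.arange(6), 10)
X = np.zeros((60, 1))

gkf_folds = []
for _, test_idx in GroupKFold(n_splits=3).split(X, groups=groups):
    gkf_folds.append(sorted(set(groups[test_idx])))
    print(gkf_folds[-1])
```

Each GroupKFold test fold here contains non-consecutive groups, so train folds mix past and future relative to the test fold, which is exactly what the era-wise time-series splitter avoids.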


I’m fairly new to this so sorry if this is a dumb question. Once you find the optimal set of parameters, and you’re ready to fit the model, is there a way to incorporate the era groups? Or would you just use fit() as normal?

@mdo thanks for the original post and for the plot, really illustrative. I’ve been participating in Numerai for a while, and I also work as a data scientist on time-series problems. I’ve always used the TimeSeriesSplit approach to force testing on “future” data, but I’ve always tried to keep the same amount of training data in each fold.

I mean, if you are using 4 eras for training and 2 for testing in iteration 0, I’d prefer to use 4+2 also in iteration 1 and so on, but it seems you’re using 8+2. I don’t have a strong opinion here, and since you are using the same approach for all models it should be OK. To my mind, the average error metric won’t be accurate, since I’d expect iteration 3 to be better (or worse, if your model overfits easily) than iteration 0 because it uses more training data.

Edit: I’ve found an illustrative image for both approaches.

Is there a reason to prefer the “expanding window” over the “sliding window”?

Many of my base models use time-series CV. Until @mdo made that post, I did not think anybody else used it. For the models I have using time-series CV, I also used an expanding window. My reasoning is that I want the fold size to approach the size of the final training data set: if you use a fixed size much smaller than the final training set, you might train models that are too greedy, with a much greater chance of over-fitting. When we were only allowed 3 models, I weighted those ensembles much more toward the time-series-CVed models, and they reached the top 100 very quickly.
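For comparison, a sliding-window variant of the splitter might look like this (a sketch only: the function name, window sizes, and toy eras are illustrative, not from the post):

```python
import numpy as np

def sliding_window_splits(groups, train_size=4, test_size=2):
    """Hypothetical sliding-window alternative: the training window keeps
    a fixed size (train_size groups) and moves forward, instead of the
    expanding window used in the original splitter."""
    groups = np.asarray(groups)
    group_list = np.unique(groups)  # assumes groups sort correctly (e.g. ints)
    indices = np.arange(len(groups))
    start = 0
    while start + train_size + test_size <= len(group_list):
        train_groups = group_list[start:start + train_size]
        test_groups = group_list[start + train_size:start + train_size + test_size]
        yield (indices[np.isin(groups, train_groups)],
               indices[np.isin(groups, test_groups)])
        start += test_size

eras = np.repeat(np.arange(1, 11), 3)  # 10 toy eras, 3 rows each
for tr, te in sliding_window_splits(eras):
    print(np.unique(eras[tr]), "->", np.unique(eras[te]))
```

Every fold trains on the same number of eras, which makes the per-fold scores more comparable, at the cost of never training on as much data as the final model will see.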

Edit: I clearly fail at reading and in-line responses. Zempe was already given an answer to the posed question.