Era-wise Time-series Cross Validation

Here’s a slightly modified version that also introduces purging. In my understanding, the embargo only makes sense for KFold CV, so simply purging the periods between the train and test sets should be enough to avoid leakage in the time-series CV case.
I haven’t had the chance to test it yet, but I’ll post an update once I’ve run the tests.

import numpy as np
from sklearn.model_selection._split import _BaseKFold, indexable, _num_samples


class PurgedTimeSeriesSplitGroups(_BaseKFold):
    """Time-series CV that splits on group (era) boundaries and purges
    `purge_groups` groups between each train set and its test set."""

    def __init__(self, n_splits=5, purge_groups=0):
        super().__init__(n_splits, shuffle=False, random_state=None)
        self.purge_groups = purge_groups

    def split(self, X, y=None, groups=None):
        X, y, groups = indexable(X, y, groups)
        n_samples = _num_samples(X)
        n_folds = self.n_splits + 1
        groups = np.asarray(groups)  # accept plain arrays as well as pandas Series
        group_list = np.unique(groups)
        n_groups = len(group_list)
        if n_folds + self.purge_groups > n_groups:
            raise ValueError(f"Cannot have number of folds plus purged groups "
                             f"({n_folds + self.purge_groups}) greater than "
                             f"the number of groups ({n_groups}).")
        indices = np.arange(n_samples)
        test_size = (n_groups - self.purge_groups) // n_folds
        # First group of each successive test fold, walking forward in time.
        test_starts = [n_groups - test_size * c for c in range(n_folds - 1, 0, -1)]
        for test_start in test_starts:
            # Train on everything before the purge window; test on the
            # `test_size` groups that follow it.
            train_mask = np.isin(groups, group_list[:test_start - self.purge_groups])
            test_mask = np.isin(groups, group_list[test_start:test_start + test_size])
            yield indices[train_mask], indices[test_mask]
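
For anyone who wants to try it, a minimal usage sketch (df, feature_cols, "era", "target", and model are placeholder names, not part of the class):

cv = PurgedTimeSeriesSplitGroups(n_splits=5, purge_groups=4)
for train_idx, test_idx in cv.split(df[feature_cols], df["target"], groups=df["era"]):
    # Fit on the older eras, evaluate on the newer eras after the purge window.
    model.fit(df[feature_cols].iloc[train_idx], df["target"].iloc[train_idx])
    preds = model.predict(df[feature_cols].iloc[test_idx])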

Hi @mdo and others -

This is a cool way to ensure that eras/groups aren’t split across the train/test border, but I don’t see anywhere in the cross_val_score() docs that group info is taken into account for scoring. If it’s not, the scorer is computing a Spearman correlation over the whole test set instead of averaging it over the eras. Since we are scored on single eras in the tourney, a more tournament-accurate score from cross_val_score() would be the mean Spearman over the various test eras. Does anyone know how to accomplish this with sklearn’s routines?
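
One workaround I can think of (untested, and the names here are illustrative rather than from sklearn): keep the era labels in the index of X so a custom scoring callable can group by them, since cross_val_score never passes groups to the scorer:

import numpy as np
import pandas as pd
from scipy.stats import spearmanr

def era_wise_spearman(estimator, X, y):
    # Assumes X is a DataFrame whose index holds the era labels.
    preds = pd.Series(estimator.predict(X), index=X.index)
    targets = pd.Series(np.asarray(y), index=X.index)
    # Score each era separately, then average, mirroring tournament scoring.
    return np.mean([spearmanr(preds.loc[era], targets.loc[era]).correlation
                    for era in X.index.unique()])

# scores = cross_val_score(model, X_df, y, cv=PurgedTimeSeriesSplitGroups(),
#                          groups=eras, scoring=era_wise_spearman)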

Thanks

Even after removing eras on the border, I’m seeing significantly better results on all subsequent folds (0.05ish) than on the base fold where train=train and validation=validation (0.025ish). Is this a sign of a bug in my code, or is there something about the training vs. validation sets that makes the training data easier to predict in general?

I would guess the training and validation data are selected based on their properties. For example, the validation data seems to be harder than the real data; at least that was the case in the old dataset.

For those using both neutralization and cross-validation: do you apply neutralization to the predictions of each of your folds, or only at the end, after averaging the fold predictions? I think it works out the same in the end, but it does impact the cross-validation score we use to tune hyperparameters, and I don’t have a strong intuition on whether that would be desirable or not.
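
For reference, here’s my reading of the usual linear feature neutralization as a sketch (the function and names are illustrative, not an official implementation):

import numpy as np

def neutralize(preds, features, proportion=1.0):
    # Remove the component of preds that is linearly explained by features.
    exposures = np.column_stack([features, np.ones(len(features))])  # add intercept
    correction = exposures @ np.linalg.lstsq(exposures, preds, rcond=None)[0]
    return preds - proportion * correction

If the neutralization is purely linear like this, applied with the same features and proportion to predictions on the same rows, then neutralizing each fold’s predictions and averaging should give the same result as averaging first and neutralizing once. If you rank or gaussianize the predictions before neutralizing, though, the order will matter.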