Fast Combinatorial Cross Validation

Hi everyone,

Combinatorial purged cross-validation (CPCV) is a concept proposed by Marcos Lopez de Prado in the book “Advances in Financial Machine Learning”, and it was discussed here on the forum (Cross-validation done right). This post will mainly be useful for those already familiar with the CPCV concept.

In standard K-fold CV, you estimate AUC, RMSE, or other metrics on a small test set from every fold, which is fine for datasets you can split randomly. But in finance, you usually split data by date periods, and you cannot expect similar metric values across different periods. The common solution is therefore to combine the test sets with predicted targets from every fold and calculate a single metric, such as returns (COR in the case of Numerai), on the combined test sets. With CPCV you combine test sets into multiple paths and obtain multiple values of the COR/Sharpe/… metrics.
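As a toy illustration of the “combine all test sets, then score once” idea: instead of averaging per-fold correlations, concatenate the out-of-fold predictions and compute one correlation. The data and helper below are hypothetical, and numpy is assumed:

```python
import numpy as np

def combined_metric(fold_predictions, fold_targets):
    """Concatenate out-of-fold predictions from every fold and
    compute ONE correlation on the combined test set, instead of
    averaging per-fold correlations."""
    preds = np.concatenate(fold_predictions)
    targets = np.concatenate(fold_targets)
    return np.corrcoef(preds, targets)[0, 1]

# Toy example: 3 folds with synthetic targets and noisy predictions.
rng = np.random.default_rng(0)
folds_y = [rng.normal(size=50) for _ in range(3)]
folds_pred = [y + rng.normal(scale=2.0, size=50) for y in folds_y]
cor = combined_metric(folds_pred, folds_y)
```

In CPCV, this scoring is repeated once per path, so you get a distribution of COR values rather than a single number.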

Here I present an idea for obtaining a ~75% reduction in the time required for a random hyperparameter search combined with CPCV. My assumption is that the COR values from different paths in CPCV should follow a normal distribution. Here is an example of average COR values for eras 1-132, obtained using an 11-paths/12-groups/66-splits (N=12, k=2) CPCV I performed on the Numerai dataset: [0.05233829, 0.05250924, 0.05166095, 0.05273881, 0.05253379, 0.05207856, 0.05245192, 0.05234268, 0.05219343, 0.05202705, 0.05215169]. My goal is to find the hyperparameters that maximize the average (over paths) COR.

On the first step, I calculate COR values using all 66 splits of the train data for all 11 paths. On every subsequent step, when I check the next random hyperparameters, I first train models only for the first 11 splits to obtain the COR value for the first path. If this COR value is below a threshold, I simply skip further calculations for these hyperparameters. If not, I do the same for the second path (the next 10 splits), and so on.

The threshold is defined as the average (over paths) COR minus 3 standard deviations of the COR for the best hyperparameters found so far. For the example above, the threshold is 0.05143 = 0.05227 - 3 * 0.00028. Indeed, if the COR values for different paths follow a normal distribution, the probability that at least one of the 11 COR values of some genuinely “better” hyperparameters (by mean COR) falls below this threshold can be estimated with the formula 1 - (1 - 0.0013)**11 = 1.4%. Here 0.0013 is the probability that a random sample from a normal distribution falls below “mean - 3 std”, which can be obtained from a Z-score table. Thus, only about 1.4% of genuinely useful hyperparameters will be skipped by chance with this approach.
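A minimal sketch of the skip rule, using the COR values quoted above. The hypothetical `path_cor_fn` stands in for training the splits a path needs and returning its COR; the population standard deviation (`statistics.pstdev`) reproduces the 0.00028 figure, and `statistics.NormalDist` reproduces the 0.0013 tail probability:

```python
import statistics

# COR values per path for the current best hyperparameters.
best_path_cors = [0.05233829, 0.05250924, 0.05166095, 0.05273881,
                  0.05253379, 0.05207856, 0.05245192, 0.05234268,
                  0.05219343, 0.05202705, 0.05215169]

mean = statistics.mean(best_path_cors)       # ~0.05227
std = statistics.pstdev(best_path_cors)      # ~0.00028 (population std)
threshold = mean - 3 * std                   # ~0.05143

def evaluate_candidate(path_cor_fn, n_paths=11):
    """Score paths one by one; abandon the candidate as soon as any
    single path's COR falls below the threshold."""
    cors = []
    for path in range(n_paths):
        cor = path_cor_fn(path)   # trains only the splits this path needs
        if cor < threshold:
            return None           # skip these hyperparameters
        cors.append(cor)
    return statistics.mean(cors)

# Chance of wrongly skipping an equally good candidate:
p_tail = statistics.NormalDist().cdf(-3)     # ~0.0013
p_skip = 1 - (1 - p_tail) ** 11              # ~1.4%
```

Note that the 1.4% estimate assumes the better candidate has the same per-path spread as the current best; if its mean COR is higher, the actual skip probability is lower.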

In the ideal case, when your very first hyperparameters happen to be the best, checking 110 random hyperparameter sets requires calculating only 66 + 109 * 11 = 1265 splits (out of the 7260 splits required for plain CPCV). In my real run, I calculated only 2079 splits for 110 hyperparameter sets, which means a ~71% reduction in time.
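The best-case arithmetic can be checked with a few lines. The split counts follow the N=12, k=2 setup, and the “first path costs 11 splits” counting is taken from the description above:

```python
from math import comb

n_groups, k = 12, 2
splits_per_cpcv = comb(n_groups, k)          # 66 splits per full CPCV run
paths = k * splits_per_cpcv // n_groups      # 11 backtest paths

# Per the post, scoring the first path of a candidate costs 11 splits.
first_path_splits = n_groups - 1             # 11

n_candidates = 110
full_cost = n_candidates * splits_per_cpcv   # 7260 splits for plain CPCV
# Best case: candidate 1 gets all 66 splits; each of the other 109 is
# rejected after the 11 splits of its first path.
best_case = splits_per_cpcv + (n_candidates - 1) * first_path_splits  # 1265
```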

Regards,
Mark


Example Python code is available here: Github


Good idea! CPCV is good but very time-consuming!

@jackerparker was kind enough to accept a PR linking CPCV with optuna for a supposed improvement of the parameter sampling over random search.


Nice job! Noob question: how did you come up with the N and k parameters? Was it by experiment, a heuristic, random choice, or an educated guess?


Hi sneaky! Short answer: random.

I don’t know what benefits k > 2 could provide. As for N, I have two models, with N = 6 (as in the original book example) and N = 12 (just for more accurate post-analysis). In my opinion, there is a huge difference between having a single-path metric (“standard” k-fold CV) and a multi-path metric (CPCV), but the difference between 5 and 11 paths is not that important.

I’ve already attached this figure in the chat: the validation performance for my models generated with N=6 and N=12 is pretty much the same (CV was slightly better for the 11-path model; validation was the opposite).

For my next experimental workflow I’m using N=6, k=2 CPCV, and with the algorithm described in this post it is close in computational time to standard 5-fold CV.
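For comparison, the N=6, k=2 configuration works out as follows, using the same C(N, k) split count and k·C(N, k)/N path count as the N=12 setup:

```python
from math import comb

n_groups, k = 6, 2
splits = comb(n_groups, k)        # 15 splits per full CPCV run
paths = k * splits // n_groups    # 5 backtest paths
first_path_splits = n_groups - 1  # 5 splits to score the first path
# With the early-stopping rule, most rejected candidates cost only
# first_path_splits model fits -- comparable to plain 5-fold CV.
```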

Regards,
Mark
