Hi everyone,
Combinatorial purged cross fold validation (CPCV) is a concept proposed by Marcos Lopez de Prado in the “Advances in financial machine learning” book and it was discussed here on the forum (Cross-validation done right). My post can be useful only for those who are already familiar with the CPCV concept.
In a normal K-fold CV, you estimate AUC/RMSE/ or other metrics using a small test set from every fold and it’s ok for data sets where you can split data randomly. But in finance, you usually split data grouping by date periods and you cannot expect similar metrics values on the different date periods. Thus, the common solution is to combine all your test sets with predicted targets from every fold and calculate only 1 metric such as returns (COR in case of Numerai) for combined test sets. Using CPCV you combine test sets in multiple paths and get multiple values of COR/Sharpe/… metrics.
Here I present an idea on how to obtain ~75% reduction in time required for random hyperparameters search combined with CPCV. My assumption here is that COR values from different paths in CPCV should follow the normal distribution. And here is an example of average COR values for 1-132 eras obtained using 11paths/12Groups/66splits (N=12, k=2) CPCV I’ve performed on the Numerai dataset: [0.05233829, 0.05250924, 0.05166095, 0.05273881, 0.05253379, 0.05207856, 0.05245192, 0.05234268, 0.05219343, 0.05202705, 0.05215169]. My goal is to find the best hyperparameters in terms of maximal average (by paths) COR. On the first step, I calculate COR values using all 66 splits of train data for 11 paths. But on the next step when I check the next random hyperparameters, I train models only for the first 11 splits to obtain COR value for the first path. If this COR value is less than threshold, I just skip calculations for these hyperparameters. If not - just do the same for the second path (+10 next splits) and so on. Threshold is defined as average (by paths) COR - 3 standard deviation of COR for best hyperparameters at the current iteration. For my example above, the threshold is 0.05143 = 0.05227 - 3 * 0.00028. Indeed, if the COR values for different paths follow the normal distribution, one can estimate probability that at least one of 11 COR values from another “better” hyperparameters (by mean COR) has lower value than defined threshold using the formula: 1 - (1 - 0.0013)**11 = 1.4%. 0.0013 here is the probability that a random sample from normal distribution has value less than “mean - 3 std”, which can be obtained from the Z score table. Thus, only 1.4% of really useful hyperparameters will be skipped by chance in this approach.
In the ideal case when your first hyperparameters will be the best, for checking of 110 random hyperparams you have to calculate only 66 + 109 * 11 splits = 1265 (of 7260 splits required for normal CPCV). In the real case, I’ve calculated only 2079 splits for 110 hyperparams which means 72% reduction in time.
Regards,
Mark