Which Model is Better?

Perhaps; that is my initial reaction. Do you find otherwise? I will see soon with the bootstrap.

It's not just that it's too big to compute; how do you handle the panel-data correlations (across eras and firms) in the training data when you split it up? (See the two questions earlier.) I couldn't work that out either, and I have found that my validation on the training data is always more optimistic than my validation on the given validation data.

Have you tried stacking them?

Not sure how stacking them would help. Do you guys understand why having repeated observations in different folds is weird?

Here are the results from the bootstrap. The min/max difference between a single fold and all folds combined is about +/- 0.007 corr, which is not small. (All numbers are corr.)

The mean/std for all folds combined (100 trials) is 0.04646 / 0.0004. For the 500 individual folds it is 0.04655 / 0.0026.

The 2-std range for all folds combined is [0.04564, 0.04728]; for the 500 individual folds it is [0.04128, 0.0518].
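If anyone wants to sanity-check these summaries, here is a minimal sketch of how they can be recomputed from the table posted further down, assuming it is saved as results.csv (the filename and the std convention are just my choices, so the last digit may differ):

import pandas as pd

# Assumes the table posted further down is saved as "results.csv"
# with columns: trial, fold1..fold5, all.
results = pd.read_csv("results.csv")

fold_cols = [c for c in results.columns if c.startswith("fold")]
folds = results[fold_cols].to_numpy().ravel()   # 500 individual fold corrs
combined = results["all"].to_numpy()            # 100 all-folds-combined corrs

for name, x in [("combined", combined), ("folds", folds)]:
    lo, hi = x.mean() - 2 * x.std(), x.mean() + 2 * x.std()
    print(f"{name}: mean={x.mean():.5f} std={x.std():.4f} "
          f"2-std range=[{lo:.5f}, {hi:.5f}]")

# largest deviation of a single fold from its trial's combined score
dev = results[fold_cols].sub(results["all"], axis=0)
print("fold vs combined difference, min/max:", dev.min().min(), dev.max().max())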

Here is the code in case anyone wants to try it with another model and compare results. I don't want to keep running this (and I only have one main model right now), but I would be curious how often the model choice would flip on these metrics over the bootstrap. I posted the data in the next message.

from numerapi import NumerAPI
import pandas as pd
import numpy as np
import xgboost as xgb 
import json 
from utils import get_latest_round_data, load_data, numerai_score
import time
import random

napi = NumerAPI()
current_round = napi.get_current_round(tournament=8)  
folder_path = 'round'+str(current_round)
get_latest_round_data(folder_path)

train_data = load_data(folder_path, 'training')
valid_data = load_data(folder_path, 'validation')

targets = [i for i in train_data.columns if i[0:7] == 'target_']
with open(folder_path+"/features.json", "r") as f:
    feature_metadata = json.load(f)
features = feature_metadata['feature_sets']
features['all'] = [i for i in train_data.columns if i[0:7] == 'feature']

# pool the training and validation data and cross-validate over the pool
train_data = pd.concat((train_data, valid_data))
train_data['erano'] = train_data['era'].astype(int)

# keep every 4th era so weekly eras with overlapping monthly targets
# never appear together
eras = np.unique(train_data['erano'])
keep_eras = [i for i in range(max(eras)) if (i-1) % 4 == 0]
eras = list(set(eras).intersection(set(keep_eras)))
eras.sort()

train_data = train_data.loc[train_data['erano'].isin(eras)]

param = {'objective': 'reg:squarederror', 'max_depth': 5, 'eta': 0.01}  # placeholder params -- substitute your own
target = 'target'       # placeholder -- any of the target columns
feature_set = 'all'     # placeholder -- any key of the features dict
num_round = 2000        # placeholder -- substitute your own
    
ii = 100
jj = 5
results = np.zeros((ii, 1 + jj + 1))  # columns: trial id, 5 fold corrs, combined corr
for i in range(ii):
    
    start = time.time()
    
    random.seed(i)
    # assign every row (not every era) to one of the 5 folds at random
    fold_ids = [int(random.random()*5) for _ in range(train_data.shape[0])]
    
    all_preds = pd.DataFrame()
    
    for j in range(jj):
        
        train_fold_data = train_data.loc[[ids != j for ids in fold_ids]]
        valid_fold_data = train_data.loc[[ids == j for ids in fold_ids]]
        
        dtrain = xgb.DMatrix(train_fold_data.loc[:,features[feature_set]], 
                             label = train_fold_data.loc[:,target])
        dvalid = xgb.DMatrix(valid_fold_data.loc[:,features[feature_set]], 
                             label = valid_fold_data.loc[:,target])
        
        bst = xgb.train(param, dtrain, num_boost_round=num_round)
    
        preds = bst.predict(dvalid)
        preds = pd.DataFrame(preds, index=valid_fold_data.index)
        corr = numerai_score(valid_fold_data['target'], 
                             preds, 
                             valid_fold_data['era'])

        results[i,0] = i
        results[i,1+j] = corr
              
        all_preds = pd.concat((all_preds,preds))
            
    # score all out-of-fold predictions together (the "all" column)
    all_preds['target'] = train_data['target']
    all_preds['era'] = train_data['era']
    corr = numerai_score(all_preds['target'], all_preds[0], all_preds['era'])
    
    results[i,1+jj] = corr

    # print(i,
    #       " ".join(results[i,:].round(4).astype(str)), 
    #       round((time.time()-start)/60))

Here, I also divided the eras into 4 groups to avoid the problem of weekly eras with overlapping monthly targets. So if you want to train on all the data with CV, it ends up being 20 models with k=5 folds. The ~0.045 corr is about what I was getting before when I looked at training-validation numbers, and it is always higher than my 0.02 to 0.03 diagnostic validation metric. I assume that is because of autocorrelation in the assets over time, which I cannot identify and handle in the cross validation with the data as given. But regardless of the metric inflation, perhaps the model choice would still be consistent even if we were able to properly cross validate across different assets as well.
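To make the era split concrete, here is a small standalone sketch of how the four non-overlapping era groups can be built (era_subsets is just an illustrative helper name, not part of the code above):

import numpy as np

def era_subsets(era_numbers, step=4):
    """Split era numbers into `step` non-overlapping groups (every 4th era),
    so weekly eras with overlapping monthly targets never share a group."""
    groups = {}
    for offset in range(step):
        groups[offset] = sorted(int(e) for e in np.unique(era_numbers)
                                if (e - 1 - offset) % step == 0)
    return groups

# e.g. eras 1..12 -> {0: [1, 5, 9], 1: [2, 6, 10], 2: [3, 7, 11], 3: [4, 8, 12]}
print(era_subsets(range(1, 13)))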

trial,fold1,fold2,fold3,fold4,fold5,all
0,0.0454512,0.0463021,0.0481612,0.0475084,0.046036,0.0465768
1,0.0461641,0.0426798,0.0488758,0.0466342,0.0485836,0.046368
2,0.0481643,0.0473197,0.0450287,0.0465363,0.046393,0.0466854
3,0.0467998,0.0498067,0.0473212,0.0443931,0.0456131,0.0465906
4,0.0454342,0.0471746,0.0467139,0.0508233,0.0451697,0.0470217
5,0.0430922,0.042396,0.0483958,0.047381,0.0513269,0.0464404
6,0.0494867,0.0431321,0.0475262,0.0469005,0.0448159,0.0461766
7,0.0479359,0.0458342,0.0472137,0.0416656,0.0514214,0.0466997
8,0.0490363,0.0501262,0.0415602,0.0432651,0.0507932,0.0469314
9,0.0526103,0.0439338,0.0447898,0.0456008,0.0463078,0.0466632
10,0.0478756,0.0439681,0.0475459,0.0455541,0.0466427,0.0461699
11,0.0453465,0.0463051,0.0492987,0.0449121,0.0424048,0.0455972
12,0.0486342,0.0460027,0.0459584,0.0411065,0.0522588,0.0467448
13,0.0468255,0.0491371,0.0439315,0.0444841,0.0478756,0.0464577
14,0.0470846,0.0489082,0.0488789,0.0463103,0.0433617,0.046888
15,0.048792,0.0433608,0.0462839,0.0492664,0.047397,0.0469982
16,0.046701,0.0435104,0.049601,0.0505646,0.0423483,0.0464501
17,0.046993,0.0455555,0.0433257,0.0480617,0.0459516,0.0459497
18,0.0474933,0.0456222,0.0480716,0.0471274,0.0445115,0.0465076
19,0.0444521,0.0450144,0.0467531,0.0478124,0.0470413,0.0461556
20,0.0493849,0.0441109,0.0487016,0.047246,0.0450537,0.0468194
21,0.049017,0.0482838,0.0446609,0.0440899,0.0440323,0.0459504
22,0.0480687,0.0507559,0.0459764,0.0435936,0.0483386,0.0472242
23,0.0417204,0.0477246,0.0448494,0.052641,0.0459125,0.0464731
24,0.0409186,0.0485285,0.0479908,0.0513518,0.0453517,0.0467985
25,0.0483,0.0482696,0.0511694,0.0442082,0.0418532,0.0466986
26,0.0510052,0.046943,0.0444164,0.0424078,0.0437092,0.0455664
27,0.0484662,0.0475047,0.0442272,0.0450285,0.0475818,0.0465954
28,0.0473677,0.0469808,0.0442228,0.0471263,0.0457046,0.0462523
29,0.0470475,0.0424258,0.0441187,0.0494812,0.048361,0.0462131
30,0.0450089,0.0481456,0.0483964,0.0430841,0.0467578,0.0462786
31,0.044192,0.0468426,0.0470397,0.0469078,0.0470955,0.0464195
32,0.0498815,0.0471341,0.0424716,0.0479667,0.045624,0.0465243
33,0.0407067,0.0484569,0.0473392,0.0475696,0.0462711,0.0458958
34,0.0505618,0.0470625,0.0443338,0.0482,0.0453517,0.0471104
35,0.0483278,0.0494009,0.0426818,0.0481715,0.0442333,0.0465486
36,0.0472073,0.0481577,0.0446825,0.044485,0.0467739,0.0461266
37,0.043594,0.0463735,0.0477529,0.0479444,0.0445199,0.046037
38,0.0446921,0.0461256,0.0466182,0.0446218,0.0487841,0.0461291
39,0.0458596,0.045189,0.0473348,0.0474093,0.0493812,0.0471188
40,0.0477954,0.0431284,0.0492554,0.0464612,0.0459006,0.0462556
41,0.0435226,0.0465053,0.04756,0.0447214,0.0473549,0.0459075
42,0.045941,0.0433099,0.0459535,0.0446809,0.0532554,0.0463989
43,0.047135,0.0436759,0.0438512,0.0480062,0.049624,0.0464179
44,0.0492976,0.0492129,0.0436938,0.0452712,0.0456339,0.046624
45,0.0516194,0.0471187,0.0473453,0.041776,0.047129,0.046929
46,0.0433419,0.0480354,0.0495389,0.0470666,0.0476467,0.0469895
47,0.0465759,0.0481465,0.0505586,0.0451065,0.0456455,0.0472998
48,0.0465002,0.051564,0.0451451,0.0474905,0.0424634,0.0465274
49,0.0468831,0.0457768,0.0450393,0.0490985,0.0487772,0.0469825
50,0.0417888,0.0455056,0.0509021,0.047003,0.0454096,0.0460697
51,0.0442714,0.0478697,0.0424965,0.0521136,0.0460873,0.0465415
52,0.0495826,0.0458567,0.0497032,0.0459093,0.0432857,0.0467512
53,0.0408756,0.0462859,0.0507189,0.0445478,0.0495572,0.0462834
54,0.0519681,0.0425537,0.0494473,0.0458247,0.0441632,0.0466197
55,0.0459439,0.042358,0.0503978,0.0500509,0.0442735,0.0465332
56,0.0462911,0.044802,0.0478479,0.0460618,0.0455183,0.0459834
57,0.0472916,0.0513738,0.0505625,0.0415692,0.044628,0.0469703
58,0.048421,0.0464198,0.0467036,0.0431291,0.0459133,0.0461762
59,0.0466142,0.0460279,0.045562,0.0396132,0.0483752,0.0451058
60,0.0485735,0.0397927,0.0538569,0.0487737,0.0444589,0.0469261
61,0.0491468,0.0451239,0.0462121,0.0419088,0.0477186,0.0459002
62,0.0430902,0.04388,0.0505072,0.0450501,0.0479288,0.0460039
63,0.0464864,0.0484427,0.0474004,0.0460825,0.0419726,0.046067
64,0.0448574,0.0442045,0.0498019,0.0502939,0.0467484,0.0469565
65,0.0487501,0.0436807,0.047649,0.0479983,0.0451955,0.0466779
66,0.0472595,0.0457648,0.0502793,0.0445404,0.0450672,0.0465302
67,0.0466916,0.0491404,0.0446774,0.0429084,0.0483207,0.0460811
68,0.049345,0.0484095,0.0444826,0.0479081,0.0453472,0.0468229
69,0.0465495,0.0487521,0.0448203,0.0447768,0.0458559,0.0459055
70,0.0449307,0.049067,0.0444911,0.04666,0.0428952,0.0455729
71,0.0497617,0.0482966,0.0451918,0.0450548,0.0460682,0.0466547
72,0.0421529,0.0503088,0.0455051,0.0466188,0.0480857,0.0465094
73,0.0477716,0.0497149,0.0470153,0.0430442,0.0419668,0.0459256
74,0.0400881,0.0487076,0.0455716,0.0499832,0.0498025,0.0467876
75,0.0521937,0.0469653,0.0440891,0.041514,0.0477371,0.0463776
76,0.0506272,0.0501738,0.0410215,0.0441536,0.0476548,0.0466554
77,0.0480731,0.0473878,0.0502569,0.0424778,0.0466857,0.0468076
78,0.0443044,0.0454224,0.0450349,0.0455673,0.0494592,0.0458684
79,0.0509537,0.0440325,0.043193,0.0495951,0.0446791,0.0463975
80,0.0457146,0.0482137,0.0431033,0.0463431,0.0503114,0.0465895
81,0.0444321,0.0542167,0.0484362,0.0419455,0.0445467,0.0466185
82,0.0463462,0.0444241,0.0520595,0.0444156,0.044644,0.0464431
83,0.0443932,0.0497169,0.0458551,0.0443577,0.0497161,0.0466788
84,0.0497253,0.0410755,0.0491929,0.0506663,0.0423233,0.0464948
85,0.0461457,0.0457695,0.0468747,0.0484608,0.048471,0.0471038
86,0.0434023,0.0473817,0.0464394,0.0480758,0.0493162,0.0469385
87,0.0440076,0.0410904,0.0478069,0.0508067,0.0483574,0.0462676
88,0.0469968,0.04731,0.0437514,0.0460746,0.0489373,0.0465023
89,0.0435701,0.0490209,0.0476565,0.048394,0.0443097,0.0464222
90,0.0467009,0.0531009,0.0454824,0.0421911,0.0467472,0.046691
91,0.0447045,0.0481898,0.0485401,0.0447496,0.0465588,0.046507
92,0.0470614,0.0443226,0.0477666,0.0474712,0.0450678,0.0463706
93,0.0446403,0.0448446,0.0491012,0.0468921,0.0439495,0.0457443
94,0.0545186,0.0467705,0.0475951,0.0423732,0.0440008,0.0469763
95,0.0427576,0.0482073,0.0501902,0.0412603,0.0497724,0.0464288
96,0.0489274,0.0407707,0.047828,0.0465981,0.0488622,0.0463332
97,0.046451,0.0492799,0.047756,0.0451662,0.047543,0.0471633
98,0.0449395,0.0422609,0.0486628,0.0502684,0.0453263,0.0461156
99,0.0475424,0.0434008,0.0463483,0.0448359,0.0490639,0.0462389

You won’t know unless you give it a try. It’s helped some of my models.

Sorry man, not sure what you are referring to

Hi @dzheng1887, I'm a little puzzled by your code.
You split the dataframe into quarters by era and kept one of the four groups (and that's fine, provided you also train on the other three).
What's not clear to me is the split between training and validation data. You made a completely random split of about 80/20 without taking eras into account, so your training and validation data are completely mixed up (and, IMHO, that's not what you want for a time series). I think you'd be better off following some kind of “era-wise k-fold”, as suggested here on the forum. What do you think?
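For example, a rough sketch of an era-wise k-fold using scikit-learn's GroupKFold, grouping on the era number so that no era is split across folds (variable names are assumed to match the code you posted; this is only one possible implementation):

from sklearn.model_selection import GroupKFold
import xgboost as xgb

# Group on the era number so every row of a given era lands in exactly
# one fold; train_data, features, feature_set and target are assumed to
# be defined as in the code posted above.
gkf = GroupKFold(n_splits=5)

for fold, (tr_idx, va_idx) in enumerate(
        gkf.split(train_data, groups=train_data['erano'])):
    train_fold_data = train_data.iloc[tr_idx]
    valid_fold_data = train_data.iloc[va_idx]

    dtrain = xgb.DMatrix(train_fold_data.loc[:, features[feature_set]],
                         label=train_fold_data.loc[:, target])
    dvalid = xgb.DMatrix(valid_fold_data.loc[:, features[feature_set]],
                         label=valid_fold_data.loc[:, target])
    # ...train and score exactly as in the fold loop above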

Definitely, the sample size is not large (and it will probably never be enough for this kind of data). When I say not enough, I mean that whatever model you use, the gap between the training and validation results is huge.
Also, you will notice that the validation result stays more or less the same even if you downsample the rows used for training (say, every 20th row, so the sample size is reduced to 1/20). I think that is because the era effect is stronger than individual stock information when it comes to measuring era-wise corr, which is our tournament goal. So I would say the "effective sample size" is the number of eras in the training set, which is very small. If you have trained a fairly shallow deep learning model before, you will notice that validation peaks after around one epoch (at a reasonably high learning rate), which I haven't seen in my other deep learning projects even with much deeper models.
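For illustration, a sketch of that kind of downsampling, keeping every era but only every 20th row within each era (train_data and the 'era' column are assumed to be as in the code posted earlier; the 1/20 factor is just the example above):

# Keep every era, but only every 20th row within each era, then retrain
# and compare the validation corr.
downsampled = (train_data
               .groupby('era', group_keys=False)
               .apply(lambda g: g.iloc[::20]))
print(len(downsampled), "of", len(train_data), "rows kept")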

Hey adalseno, thank you for the message. That definitely would seem to alleviate some of my concern about any autocorrelation between eras for a given firm. But besides the full training vs validation split, I would think there is still some leakage. I will take a look and give it some more thought.

Ah, that is an interesting thought. In that case, would you then expect model degradation if you removed, say, 300 of the 600 training eras, but no degradation when you remove half of the stocks in each era? And then you would do well in the validation period if the eras in your training sample are representative of the types of eras in validation?

I have not tried a deep learning model yet. I am thinking of implementing something Bayes-like though. I like Bayes :slight_smile: But I do plan to try some deeper models one day and I will look for that as well.

I like this technique; it will definitely help with the autocorrelation between eras. I am not sure what the optimal gap is, though.
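For reference, a rough sketch of what gapped, era-wise folds could look like; the gap of 12 eras below is an arbitrary placeholder, since the right size is exactly the open question:

import numpy as np

def gapped_era_folds(era_numbers, n_splits=5, gap=12):
    """Era-wise folds with a gap: drop `gap` eras on each side of every
    contiguous validation block so training eras never touch it.
    The gap size is a free parameter, not a recommendation."""
    eras = np.array(sorted(set(int(e) for e in era_numbers)))
    for block in np.array_split(eras, n_splits):
        lo, hi = block.min(), block.max()
        train_eras = eras[(eras < lo - gap) | (eras > hi + gap)]
        yield list(train_eras), list(block)

# e.g. for tr_eras, va_eras in gapped_era_folds(train_data['erano']):
#     train on rows whose era is in tr_eras, validate on va_eras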

I have a model with a sliding window and one with an expanding window. Currently, it seems the expanding window is better.
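For reference, a sketch of the two schemes over era numbers: both walk forward in time, but the expanding window keeps all history while the sliding window keeps only the most recent eras (the sizes below are illustrative placeholders):

def walk_forward_splits(eras, valid_size=50, window=200, expanding=True):
    """Walk-forward era splits. With expanding=True the training window
    grows from the start of the data; with expanding=False only the most
    recent `window` eras before each validation block are used (sliding)."""
    eras = sorted(set(eras))
    start = window
    while start + valid_size <= len(eras):
        valid_eras = eras[start:start + valid_size]
        train_eras = eras[:start] if expanding else eras[start - window:start]
        yield train_eras, valid_eras
        start += valid_size

# e.g. compare list(walk_forward_splits(range(1, 575), expanding=True))
#      with    list(walk_forward_splits(range(1, 575), expanding=False))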

It has happened to me that some models do not learn at all and the final score is almost equal to predicting the simple mean, using RMSE as the loss/scoring function. So it's no wonder that, in such a situation, removing some rows or changing the sample size (or the era distribution) does not affect performance (and if it does, it's purely by chance). There is simply no performance at all. Check whether that is your case too.

Predictions are ranked before being scored; it does not matter how much they deviate from 0.5 in absolute value.
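For concreteness, a sketch of what era-wise, rank-based scoring looks like (this is a reading of a numerai_score-style helper on pandas Series inputs, not the official implementation):

import numpy as np
import pandas as pd

def era_wise_rank_corr(target, preds, eras):
    """Rank predictions within each era, correlate the ranks with the
    target, and average over eras. Inputs are assumed to be aligned
    pandas Series; the official scoring may differ in details."""
    df = pd.DataFrame({'target': np.asarray(target),
                       'pred': np.asarray(preds),
                       'era': np.asarray(eras)})

    def one_era(g):
        ranked = g['pred'].rank(pct=True, method='first')
        return np.corrcoef(g['target'], ranked)[0, 1]

    return df.groupby('era').apply(one_era).mean()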

You may be right, but what I saw is that the model does not converge (it does not learn), so it's simply not working (at least using RMSE as the loss/metric). In such a situation, the more you train, the more you overfit. You may get decent scores with an ensemble (especially using different targets), but essentially you are getting there by chance, not because the model has learned (you are averaging a bunch of simple means if the models are not learning).

I also tried using Spearman as the metric/loss, but the results, for now, are not encouraging.

On the other hand, as far as I can see, the dataset looks like a time series (from the eras' point of view), but it's actually NOT a time series, since every id is unique, so we don't really have values for the same asset over time (and therefore no trend, seasonality, and so on). A big dilemma, isn't it?

Yes, I was thinking similarly; the metric is the within-era asset ranking.

Yes, that is what I was perplexed about too. The era-wise CV may correct for it, but we just don't know how much leakage is occurring if we don't understand each firm's autocorrelation between eras. I don't know how large a gap you need to be comfortable that no leakage occurs. It's really a panel dataset, but the asset ids are mixed up.

I imagine a year is probably enough for momentum?

The notion of being “almost equal to the simple mean” does not apply, because predictions are ranked. As long as they are not actually constant, there will be a ranking, and the model has learned something if it produces reasonable ranking performance.

The tournament by design prevents users from modeling time series. Time series info is embedded in the features given to you.

Yes, the tournament is designed to prevent you from modelling time series, that’s why I said that it’s actually NOT a time series.

From my point of view, time series features (lags, rolling and expanding windows, and so on) are meaningful as long as you can order the observations in time (and learn from them); otherwise they have little use.

I'm not saying one can't find decent models, but in my opinion the models are not learning enough to be reliable (for a reasonably long time, at least).

This one, for example, doesn’t look bad, but it’s not based enough on learning IMHO! So I don’t like it.

PS: Don't pay attention to the color codes (they are based on old values found here on the forum and are not up to date), nor to the autocorrelation score (I haven't yet found a way to calculate it properly; any hint is more than welcome).