16GB Intermediate solution: XGB Era Boosting

Just a heads up: using the current script on alternate targets results in an "empty dataset" error with XGB. I'll try to get this sorted as soon as I can; I've got some other things on my plate I need to jump on once the Boruta run is finished.

Boruta Shap has wrapped up. You can find the raw results here, and the best features here.


Working on some plug-and-play feature sets. You’ll be able to drop these right in your features.json file.

This is probably the list most of us need. It includes all of the "important" and "tentative" features for all targets. This should work on a 32GB machine as well (I'm fixin' to find out).

It’s also in the “feature_sets.txt” file, ready to drop into features.json
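If you want to wire a set in by hand, here is a minimal sketch of the round trip. The "feature_sets" key and file layout are assumptions based on the example scripts, so check them against your local features.json; the feature names below are placeholders only.

```python
import json

# Assumed layout: named feature sets live under a "feature_sets" key.
# Verify against your own features.json before relying on this.
features = {
    "feature_sets": {
        # paste the list from feature_sets.txt here; these two names
        # are placeholders for illustration
        "bestplus": ["feature_placeholder_one", "feature_placeholder_two"],
    }
}

with open("features.json", "w") as fp:
    json.dump(features, fp, indent=2)

# A training script can then pull the set by name:
with open("features.json") as fp:
    feature_cols = json.load(fp)["feature_sets"]["bestplus"]

print(len(feature_cols))  # prints 2
```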


I’ve tested and added a 32GB script to the repo. This uses the “bestplus” feature set, int8 data, and no era paring. It hits a 90% commit charge while processing.

Baseline using minimal parameters, compared against the 16GB "xlsmall" features:
md: 3
ne: 500
lr: 0.001
cs: 0.1
ni: 3

                                      mean        sharpe
16GB using "xlsmall" feature set      0.0170511   0.482840
32GB using "bestplus" feature set     0.0146597   0.610938
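For anyone decoding the abbreviations: my reading is that they map onto the usual XGBoost parameter names as below. The mapping is my interpretation, and "ni" is the era-boosting iteration count handled by the script's own loop rather than an XGBoost argument.

```python
# md = max_depth, ne = n_estimators, lr = learning_rate,
# cs = colsample_bytree; pass these straight to xgboost.XGBRegressor(**params)
params = {
    "max_depth": 3,           # md: 3
    "n_estimators": 500,      # ne: 500
    "learning_rate": 0.001,   # lr: 0.001
    "colsample_bytree": 0.1,  # cs: 0.1
}
# ni: 3 -> number of era-boosting iterations, a loop in the script
# itself rather than an XGBoost parameter.
print(params)
```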

Sharing the results from an advanced script run this week. These are the parameters originally recommended by MDO, so they should get everyone on the same page. Big-box participants can use this as a baseline to explore different params, and the small-box crew can, hopefully, steer their efforts more effectively when using the example/eb scripts and creating ensembles.

NE: 20,000
LR: 0.001
MD: 6
NL: 2**6
CB: 0.1
Cross Val Downsample = 1
Full Train Downsample = 1

Validation metrics for out-of-sample training (columns: mean, sharpe):
preds_model_target_neutral_riskiest_50 0.0494102 2.09365
preds_model_target_jerome_20_neutral_riskiest_50 0.048734 2.06158
preds_model_target_thomas_20_neutral_riskiest_50 0.047007 2.04556
preds_model_target_william_20_neutral_riskiest_50 0.0489308 2.03005
preds_model_target_arthur_20_neutral_riskiest_50 0.0495472 2.01925
preds_model_target 0.0599232 2.01545
preds_model_target_ben_20_neutral_riskiest_50 0.046293 2.00648
ensemble_all 0.0532359 1.97708
preds_model_target_thomas_20 0.054444 1.96829
ensemble_neutral_riskiest_50 0.0467554 1.93253
preds_model_target_ben_20 0.0540396 1.92106
ensemble_not_neutral 0.0555182 1.9054
preds_model_target_william_20 0.0567741 1.8844
preds_model_target_nomi_60 0.054169 1.87865
preds_model_target_jerome_20 0.0570743 1.86585
preds_model_target_arthur_20 0.0578719 1.86147
preds_model_target_alan_20 0.0432463 1.81026
preds_model_target_thomas_60 0.047932 1.75095
preds_model_target_nomi_60_neutral_riskiest_50 0.0425226 1.73909
preds_model_target_ben_60 0.0480247 1.73496
preds_model_target_jerome_60 0.0506499 1.71836
preds_model_target_william_60 0.0493479 1.71613
preds_model_target_arthur_60 0.050831 1.71575
preds_model_target_janet_20 0.0437985 1.7139
preds_model_target_paul_20_neutral_riskiest_50 0.0310498 1.70455
preds_model_target_jerome_60_neutral_riskiest_50 0.0413955 1.67698
preds_model_target_thomas_60_neutral_riskiest_50 0.0394446 1.66946
preds_model_target_george_20_neutral_riskiest_50 0.0324472 1.66639
preds_model_target_ben_60_neutral_riskiest_50 0.0392669 1.66213
preds_model_target_alan_20_neutral_riskiest_50 0.0358959 1.65524
preds_model_target_george_20 0.0312737 1.64467
preds_model_target_arthur_60_neutral_riskiest_50 0.0413451 1.64148
preds_model_target_janet_20_neutral_riskiest_50 0.0358536 1.63857
preds_model_target_william_60_neutral_riskiest_50 0.0406023 1.63626
preds_model_target_paul_20 0.0269739 1.57079
preds_model_target_paul_60_neutral_riskiest_50 0.0278925 1.51302
preds_model_target_paul_60 0.0254812 1.48692
preds_model_target_alan_60 0.03706 1.46467
preds_model_target_george_60 0.0261036 1.42696
preds_model_target_george_60_neutral_riskiest_50 0.0271003 1.38735
preds_model_target_janet_60 0.0366974 1.37472
preds_model_target_alan_60_neutral_riskiest_50 0.030381 1.35703
preds_model_target_janet_60_neutral_riskiest_50 0.0295743 1.25600

NE is the number of estimators, and 20k is not a typo?

@bigbertha Correct. If you take a peek at the official "example_model_advanced.py" file in Numerai's example scripts, you'll see the "ideal" params listed, which is what I used for that run. Truthfully, those are probably just starter params. I know some of us were using much higher estimator counts in older versions of the tourney with good results.
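One way to see why 20k trees at a 0.001 learning rate isn't as extreme as it sounds: as a rough heuristic (not an exact law), the effective "learning budget" scales with n_estimators × learning_rate, so 20,000 trees at lr 0.001 sits in the same ballpark as 2,000 trees at lr 0.01:

```python
# Heuristic only: lowering the learning rate by 10x typically needs
# roughly 10x more trees for a comparable fit.
configs = [
    {"n_estimators": 2_000, "learning_rate": 0.01},    # older example scripts
    {"n_estimators": 20_000, "learning_rate": 0.001},  # advanced example
]
for c in configs:
    print(c, "budget =", c["n_estimators"] * c["learning_rate"])
# both configs work out to a budget of 20.0
```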

I’ll try to get some plug-and-play versions of these ready so people can focus more on creative ensembles and less on duplicate processes.

I was just curious about the number of trees because I remember the example predictions used 2k trees, not 20k.
Only because of your comment did I just now see the comment in the official file listing these parameters… thx!


Getting the error below when using python example_advanced_32GB.py:

    Entering model selection loop. This may take awhile.
    loading model config for advanced_example_model
    Traceback (most recent call last):
      File "example_advanced_32GB.py", line 189, in <module>
        feature_cols = model_config["feature_cols"]
    TypeError: 'bool' object is not subscriptable

Hi @mesomachukwu12, it looks like the model config was not loaded, so False is returned. See utils.py:

    def load_model_config(model_name):
        path_str = f"{MODEL_CONFIGS_FOLDER}/{model_name}.json"
        path = Path(path_str)
        if path.is_file():
            with open(path_str, 'r') as fp:
                model_config = json.load(fp)
        else:
            model_config = False
        return model_config
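A suggested guard (not part of the repo) that turns that confusing TypeError into a clear message:

```python
def require_model_config(model_config, model_name):
    """Fail with a clear message instead of
    "TypeError: 'bool' object is not subscriptable"."""
    if not isinstance(model_config, dict):
        raise FileNotFoundError(
            f"No saved config found for '{model_name}'. Run the script "
            "once with model_selection_loop = True to train and save it."
        )
    return model_config

# In example_advanced_32GB.py you would then write:
# model_config = require_model_config(load_model_config(model_name), model_name)
# feature_cols = model_config["feature_cols"]
```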

So how can it be addressed?

Hi @mesomachukwu12 , I apologise for the late reply but I’ve been busy during the week.
If you have already trained a model you should have a models and model_configs folder. If not, you need to train your model.
In the file example_advanced_32GB.py, on line 40, set model_selection_loop = True. Then on line 199 add exit(0) and run the script. After that you should have your model and the configuration file in the right directories (the program will create them if they do not exist). Switch model_selection_loop back to False, remove exit(0) from line 199, then run the script again. It should work. Let me know.

You may also want to download one of the pre-trained models listed in link_list.csv (they use xgboost) and try one of those for predictions. You will need to tweak the code a little to make it work with xgboost, though not by much.

Here is a very simple way to use the pre-trained model(s). Make sure you have created and activated a virtual environment with at least Python 3.8 and the required packages in the right versions.

You need to have xgboost 1.4.2 installed: pip install xgboost==1.4.2 (1.5.1 won't work).

    import pickle  # you need at least python 3.8

    import pandas as pd
    from numerapi import NumerAPI

    # Create a folder named pre_trained_models, then download and save
    # md3_ne500_ni0_target_nomi_20.pkl there:
    # https://numermodels.s3.us-west-1.amazonaws.com/md3_ne500_ni0_target_nomi_20.pkl

    # Load Numerai API
    napi = NumerAPI()

    # Get current tournament round
    current_round = napi.get_current_round(tournament=8)

    # if you have an old pickle version you will get an error: Unsupported Pickle Protocol 5
    with open('pre_trained_models/md3_ne500_ni0_target_nomi_20.pkl', 'rb') as fp:
        model = pickle.load(fp)

    print("downloading tournament_data")
    napi.download_dataset(
        "numerai_tournament_data_int8.parquet",
        f"numerai_tournament_data_int8_{current_round}.parquet",
    )
    tournament_data = pd.read_parquet(f"numerai_tournament_data_int8_{current_round}.parquet")

    # Check that everything is fine
    print(tournament_data.head())

    # Get the feature names from the model
    feature_names = model.get_booster().feature_names

    # Create predictions
    predictions = model.predict(tournament_data[feature_names])

    # Save to file
    predictions = pd.DataFrame(predictions, index=tournament_data.index)
    predictions.to_csv("predictions.csv")  # use a more meaningful name if you want

It should work fine but it's bare-bones. You can build on it with different models, neutralisation, ensembles and so on. Good luck (and remember to thank @objectscience for his work).
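One small extra step worth considering before uploading (common practice, though check the current submission rules yourself): rank-transform the raw model output into (0, 1] so the scale is uniform regardless of which model produced it. A sketch with made-up ids and values:

```python
import pandas as pd

# Hypothetical raw model output indexed by id
predictions = pd.DataFrame(
    {"prediction": [0.52, 0.49, 0.51, 0.50]},
    index=["id_a", "id_b", "id_c", "id_d"],
)

# Percentile rank maps any monotone scale onto (0, 1]
predictions["prediction"] = predictions["prediction"].rank(pct=True)

print(predictions["prediction"].tolist())  # [1.0, 0.25, 0.75, 0.5]
predictions.to_csv("predictions.csv", index_label="id")
```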


Yes, that fixed the issue.
Thanks


Had the error below when I used the parameters
NE: 20,000
LR: 0.001
MD: 6
NL: 2**6
CB: 0.1
Cross Val Downsample = 1
Full Train Downsample = 1

KeyError: 'preds_model_target_neutral_riskiest_50'

preds_model_target_neutral_riskiest_50 is the highest performing model

I fixed the issue by making sure preds_model_target_neutral_riskiest_50 is included in the validation statistics.
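If anyone else hits the same KeyError, one defensive pattern (a sketch; the exact validation-stats code in the script may differ) is to keep only the prediction columns that actually exist before computing metrics:

```python
import pandas as pd

def select_pred_cols(validation_df, pred_cols):
    """Drop requested prediction columns that were never written,
    instead of raising KeyError."""
    missing = [c for c in pred_cols if c not in validation_df.columns]
    if missing:
        print("warning: skipping missing columns:", missing)
    return validation_df[[c for c in pred_cols if c in validation_df.columns]]

df = pd.DataFrame({"preds_model_target": [0.5, 0.6]})
out = select_pred_cols(
    df,
    ["preds_model_target", "preds_model_target_neutral_riskiest_50"],
)
print(list(out.columns))  # ['preds_model_target']
```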

@adalseno Thank you for all your help! I’ve been buried the last couple of weeks and completely missed all this.

@adalseno put together a function that will let you create unique feature sets based on the number of times they appear in the raw results. You can find it here: "create_features_dict()".

(Really appreciate the addition, Thank you!)


Could you point me in the direction of your fix? I'm having the exact same issue and can't see where it's hanging up.