16GB Intermediate solution: XGB Era Boosting

I’ve combined the example script with the era boosting script to create a low-memory, “high-performance” solution. I’ve tested it several times and total memory usage is around 13GB (edit: 16GB on data load, see post below). On an 8-core machine (no threads), the initial run takes a little over an hour.

  • Data is “int8”
  • The feature set is “medium”
  • Training era = every 4th
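For reference, the data-loading step looks roughly like the sketch below. The file name and the features.json layout are assumptions based on the standard example-script data package, so adjust to wherever your data lives.

import json
import pandas as pd

# assumed file names / layout from the standard data package
with open("features.json", "r") as f:
    feature_metadata = json.load(f)
features = feature_metadata["feature_sets"]["medium"]

columns = ["era", "data_type", "target"] + features
train = pd.read_parquet("train_int8.parquet", columns=columns)

# keep every 4th era to cut memory and training time
every_4th = train["era"].unique()[::4]
train = train[train["era"].isin(every_4th)]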

The era boosting portion saves each iteration as a model for additional testing and analysis. The parameters here are essentially random; you’ll want to do additional testing.
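The loop itself is conceptually along these lines. This is a sketch only, not the exact script: the parameters and file naming are placeholders, and continuing the boosting via xgb_model is just one way to keep adding trees on the worst-scoring eras.

import joblib
import numpy as np
import pandas as pd
from xgboost import XGBRegressor

def era_boost_train(train, features, target="target", era_col="era",
                    proportion=0.5, num_iters=22):
    # initial fit on the full training slice
    model = XGBRegressor(max_depth=5, learning_rate=0.001, n_estimators=500,
                         colsample_bytree=0.1, n_jobs=-1)
    model.fit(train[features], train[target])
    joblib.dump(model, f"md5_ne500_ni0_{target}.pkl")

    for i in range(1, num_iters):
        preds = model.predict(train[features])
        scored = pd.DataFrame({"era": train[era_col].values,
                               "pred": preds,
                               "target": train[target].values})
        # per-era rank correlation between predictions and target
        era_scores = scored.groupby("era").apply(
            lambda d: np.corrcoef(d["pred"].rank(pct=True), d["target"])[0, 1])
        # keep only the worst-scoring eras for the next round of trees
        worst_eras = era_scores[era_scores <= era_scores.quantile(proportion)].index
        worst = train[train[era_col].isin(worst_eras)]
        # continue boosting from the existing trees on just those eras,
        # saving each iteration as its own model
        model.fit(worst[features], worst[target], xgb_model=model.get_booster())
        joblib.dump(model, f"md5_ne500_ni{i}_{target}.pkl")
    return model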

Results from one of the iterative models.

I’ll be cleaning the code up this weekend, moving things off to utils where they belong, etc.

Just tested the script again on the bigger box which has some better diagnostics.

Using the “medium” feature set, memory will just touch 100% of 16GB when reading the training data, drop off to about 13GB, and then creep back up to 15.4GB during the run. If people run into issues, we can create a slightly smaller feature set to avoid topping out.

Using the “small” feature set, you’ll see just under 10GB of memory utilization during a run.

medium feature run

small feature run (cleaned up the output a little…)

I’ve cleaned up the code a little and put things where they belong. I still have one thing to sort out and will get to that this week.

I’ve started working on an optimized feature set that will target around 15GB, leaving a little more headroom when loading the data. I’m using MDO’s BorutaShap code for this, and it’s going to take a while: I’m estimating around 250 hours to process all the targets. I’m going to publish all the results, as I feel the community at large will benefit from the knowledge; it also keeps us from duplicating work in parallel, which I’m not a fan of. There’s no reason to waste compute cycles on the same stuff when we should be focusing on original/different ensembles. I’ll drop the first half this week and the balance next.
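For anyone who wants to run something similar against their own targets, a per-target BorutaShap pass looks roughly like the sketch below. This is based on my reading of the BorutaShap interface and is not the exact configuration behind the published results, so double-check the arguments against the version you have installed.

from BorutaShap import BorutaShap
from xgboost import XGBRegressor

# illustrative base model; not the settings used for the published run
base_model = XGBRegressor(max_depth=5, n_estimators=200,
                          colsample_bytree=0.1, n_jobs=-1)

selector = BorutaShap(model=base_model,
                      importance_measure="shap",
                      classification=False)

# train[features] / train["target_nomi_20"] as loaded earlier;
# sample=True trades some precision for speed on the big dataset
selector.fit(X=train[features], y=train["target_nomi_20"],
             n_trials=50, sample=True, verbose=True)

confirmed = selector.accepted    # 'confirmed important'
rejected = selector.rejected     # 'confirmed unimportant'
tentative = selector.tentative   # undecided after n_trials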

I’ll make this a priority the first of the year when the new data drops, so we can hit the ground running.

~ OS, a.k.a ‘feature_baldish_cognitional_naha’

Speaking of ensembles, it’s never too early to begin thinking about the possibilities. Between the current number of targets and the growing feature set, we should be able to generate a large number of “unique” submissions. They’ll still be correlated to some degree or another, but the opportunity here to generate “true contribution” should be high (once we know what it is, of course).

This code was ripped from codegrepper.com; I remain little more than a Python sneak-thief. :slight_smile:

from itertools import combinations

# all currently available targets
targets = [
    "target",
    "target_jerome_20",
    "target_janet_20",
    "target_ben_20",
    "target_alan_20",
    "target_paul_20",
    "target_george_20",
    "target_william_20",
    "target_arthur_20",
    "target_thomas_20",
    "target_nomi_60",
    "target_jerome_60",
    "target_janet_60",
    "target_ben_60",
    "target_alan_60",
    "target_paul_60",
    "target_george_60",
    "target_william_60",
    "target_arthur_60",
    "target_thomas_60",
]

ensemble_length = 6  # pick any number here from 1 to 20
for combo in combinations(targets, ensemble_length):
    print(combo)

Boruta classifies features as ‘confirmed important’, ‘confirmed unimportant’, and ‘tentative’. I grabbed 400 random features from the 1015 ‘confirmed unimportant’ group on nomi_20 and generated a model with decent results. Unimportant isn’t the same as useless.
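As an illustration of that point, the “400 random rejected features” experiment could be reproduced with something like the following. The file name is a placeholder for wherever the raw Boruta output lives, and the training frame would need the full feature set loaded rather than just “medium”.

import json
import random
from xgboost import XGBRegressor

# placeholder path; substitute the published raw Boruta output
with open("boruta_rejected_nomi_20.json", "r") as f:
    rejected = json.load(f)

random.seed(42)
unimportant_sample = random.sample(rejected, 400)

# train must contain the full feature set here, not just "medium"
model = XGBRegressor(max_depth=5, n_estimators=500, colsample_bytree=0.1,
                     learning_rate=0.01, n_jobs=-1)
model.fit(train[unimportant_sample], train["target_nomi_20"])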

Started pushing some of the raw output to git.

There have been a couple of targets where the actual features weren’t more predictive than the shadow features, so there are no “important” features in those (Paul, Janet). Seven targets down, 13 to go.

As of right now, there are just over 300 features that fall into the “strong” and “weak” Boruta classifications.

I’ve just updated the repo with a new feature file that contains 300 features from the Boruta run. These features should work on a 16GB machine and leave headroom in the commit charge on a Windows machine.

From “features2.json” use “xlsmall” for the 16GB feature set.

I’ll continue to add alternative target output from the Boruta run as it drops. This feature set should get you going though.
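Loading it is the same pattern as before, just pointed at the new file and set name. This assumes features2.json mirrors the layout of the standard features.json.

import json
import pandas as pd

with open("features2.json", "r") as f:
    features = json.load(f)["feature_sets"]["xlsmall"]

train = pd.read_parquet("train_int8.parquet",
                        columns=["era", "data_type", "target"] + features)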

An example of the 16GB feature set run through the XGB EB script. It is, of course, cherry-picked to look good-ish.

When working with int8 data, you should fill NaN with 2, not 0.5.

got ya… fixing right now
thanks for the catch
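For anyone following along, the fix is a one-liner (train and features as in the loading sketches above). The int8 features are binned 0-4, so 2 is the middle bin, corresponding to 0.5 in the float (0.0-1.0) encoding.

# middle bin for the int8 (0-4) encoding; use 0.5 only for float data
train[features] = train[features].fillna(2).astype("int8")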

84 Model Dump

I generated a small batch of nomi_20 models last night with the intermediate script and the new feature set. I’ve pushed these up to a public S3 bucket for anyone to grab and investigate. They will run from extremely under-fit to (probably) extremely over-fit and should give deeper insight into the Boruta optimized feature set and the characteristics of this modeling approach.

This should help reduce some of the initial time spent creating and researching the models and get you closer to generating your own work, interesting ensembles, and alternative target modeling.

Models developed with:

  • Max Depth of 3, 4, 5 & 6
  • Num Estimators at 500*
  • Col Sample at 0.1
  • Learning Rate of 0.001
  • Num of Iterations at 22*
  • These are completely random; there are likely better parameters.
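In XGBRegressor terms, those settings map roughly to the snippet below. The keyword names are my mapping (e.g. “Col Sample” as colsample_bytree), and the iteration count comes from the era boosting loop rather than from an XGBRegressor argument.

from xgboost import XGBRegressor

# one model per max_depth value; shown here for max_depth=3
model = XGBRegressor(
    max_depth=3,            # also run with 4, 5 and 6
    n_estimators=500,       # Num Estimators
    colsample_bytree=0.1,   # Col Sample
    learning_rate=0.001,
    n_jobs=-1,
)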

Each file follows a naming convention of:

  • Max Depth as md
  • Num Estimators as ne
  • Num of Iterations as ni
  • Target name, e.g. md3_ne500_ni0_target_nomi_20

Files range in size from 5MB to almost 500MB

File URLs:
https://numermodels.s3.us-west-1.amazonaws.com/md3_ne500_ni0_target_nomi_20.pkl
https://numermodels.s3.us-west-1.amazonaws.com/md3_ne500_ni1_target_nomi_20.pkl
https://numermodels.s3.us-west-1.amazonaws.com/md3_ne500_ni2_target_nomi_20.pkl
.
.
.

https://numermodels.s3.us-west-1.amazonaws.com/md3_ne500_ni20_target_nomi_20.pkl
.
.
.
https://numermodels.s3.us-west-1.amazonaws.com/md6_ne500_ni20_target_nomi_20.pkl

Use these at your own risk, nothing here is financial advice and no recommendations are being made.
These are strictly for research purposes.

Thank you very much for your work.
The URL listing is truncated. I am sure it is possible to construct them all (semi) manually, but is there a way to bulk download them all (one file) or have the URLs listed sequentially?

I’m not aware of a way to bulk-download the files outside of the CLI/console. That doesn’t mean they can’t be scraped, though; I’m just not sure how.
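That said, because the file names follow the convention above, they could probably be constructed and fetched with something like this rough, untested sketch. The ranges are inferred from the 84-model count and the URLs listed, so adjust if any are missing.

import urllib.request

base = "https://numermodels.s3.us-west-1.amazonaws.com"

# 4 depths x 21 era-boosting iterations (ni0-ni20) = 84 files
for md in (3, 4, 5, 6):
    for ni in range(21):
        name = f"md{md}_ne500_ni{ni}_target_nomi_20.pkl"
        try:
            urllib.request.urlretrieve(f"{base}/{name}", name)
            print("downloaded", name)
        except Exception as exc:
            print("skipped", name, exc)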

Here is a list of all the current models. If this is beneficial, I’ll do a deeper dive and push some more of these out.

That list works very well. Thanks again!

This is really cool, and a lot of great work. One thing that stands out to me is the trial you posted above that attains almost 5% average correlation; that is huge, right?

That was part of the TB200 diagnostics; those are always pretty high.
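For anyone unfamiliar with it, my understanding of TB200 is a per-era correlation computed only on the 200 highest and 200 lowest predictions, which is why the numbers run hot. Treat the exact definition as an assumption; a rough sketch:

import numpy as np
import pandas as pd

def tb200_corr(era_df, pred_col="prediction", target_col="target"):
    # take the 200 most extreme predictions on each side for this era
    ranked = era_df.sort_values(pred_col)
    tb = pd.concat([ranked.head(200), ranked.tail(200)])
    return np.corrcoef(tb[pred_col].rank(pct=True), tb[target_col])[0, 1]

# mean TB200 correlation across validation eras
# (validation / column names as in your own pipeline)
tb_scores = validation.groupby("era").apply(tb200_corr)
print(tb_scores.mean())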

Creating a Feature Neutral Ensemble

I’ve been playing around with feature neutralization in my own work and wanted to pass along some ideas that will work with our sample models. These aren’t recommendations, just random outtakes to get you thinking in new directions.

From our models I grabbed:

  • ni20_target_nomi_20
  • ni15_target_nomi_20
  • ni10_target_nomi_20
  • ni5_target_nomi_20

I ran each of these models through a series of neutralizations:

  • 100% of the features neutralized by a factor of 1.0
  • 100% of the features neutralized by a factor of 0.75
  • 100% of the features neutralized by a factor of 0.5
  • 100% of the features neutralized by a factor of 0.25
I repeated this process for 75%, 50%, and 25% of the features for all of the selected models. You can see sample results for each model here.
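The neutralization itself is roughly the per-era sketch below, written in the spirit of the example scripts rather than the exact code behind the linked table; how you choose the 25/50/75% feature subsets is up to you.

import numpy as np
import pandas as pd

def neutralize(df, pred_col, feature_cols, proportion=1.0):
    # remove `proportion` of the predictions' linear exposure to the
    # given features, one era at a time
    out = []
    for _, era_df in df.groupby("era"):
        scores = era_df[pred_col].values.reshape(-1, 1).astype(np.float64)
        exposures = era_df[feature_cols].values.astype(np.float64)
        neutral = scores - proportion * exposures @ (np.linalg.pinv(exposures) @ scores)
        out.append(pd.Series(neutral.ravel() / neutral.std(), index=era_df.index))
    return pd.concat(out)

# e.g. neutralize 100% of the features by a factor of 0.5
# (validation, features and the "prediction" column as in your own pipeline)
validation["pred_neutral_50"] = neutralize(validation, "prediction", features, proportion=0.5)

The surviving neutralized predictions can then be combined, e.g. by rank-averaging them per era.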

From here I selected a single iteration (ni) from each modeling group, based on its highest APY with a minimum Sharpe of +1.0. These selections are noted by an "*" in the linked table.

The only exception is the selection at "**". When I reviewed the choices, I had two from “Neutralize 0.25 features…” and none from “Neutralize all features…”, and the OCD in me insisted I must have one from each group, so "**" was used instead… :slight_smile:

Those results were ensembled and this is the result:

Wow. Thanks for sharing.
You saved me tons of man-hours and kernel restarts while working with the massive new dataset. I almost gave up.

Eventually I will need to invest in more RAM; I’ve heard something massive^2 is coming our way.

Cheers

I hope there will be room for models like this in the competition for a long time. When the new data drops and I get my initial work finished, I plan on duplicating this if at all possible. In a perfect world, we can generate two or three 16GB feature sets and corresponding model sets. This will allow for a lot of creative ensembling.