My machine learning journey started with Kaggle data, but it just wasn't giving me the right grounding for my long-term aspiration of building AI that would take over the Universe and destroy the Borg.
Then I heard of this magical tournament of computer-wielding smart arses, where models battle amongst the silicon-burning RTX 3090s. A retro gaming machine may never compete against these powerhouses, but fear not, because you can use Colab.
Going through the Numer.ai tutorial, you are given a basic XGBRegressor model that works surprisingly well. I set up an automated system that lets me download the data and run the model, but I needed time to go through the code fully and understand the whole process. Over time the model did OK, but knowing everyone has mostly the same model, I knew I'd have to go back.
So my latest project was to demystify some of this, mainly for myself.
So, without any further credibility:
Here is my early work on GitHub: gnellany/numerai (Stuff I'm working on)
If you are using the model that numer.ai supplies and running it in Python, it looks like this:

from xgboost import XGBRegressor

model = XGBRegressor(
    n_estimators=10000,
    learning_rate=0.01,
    subsample=0.3,
    colsample_bytree=0.1,
    max_depth=5,
)
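For reference, here is my own plain-language gloss of what each of those knobs does (my reading of them, not official documentation), collected into a dict so the settings can be reused:

```python
# Gloss of the starter model's hyperparameters (my understanding, not Numerai's docs).
params = {
    "n_estimators": 10000,    # number of boosting rounds (trees) to fit
    "learning_rate": 0.01,    # shrinkage per tree; a small rate plus many trees = slow, steady fit
    "subsample": 0.3,         # fraction of rows sampled for each tree (adds randomness)
    "colsample_bytree": 0.1,  # fraction of features sampled per tree (few features per tree)
    "max_depth": 5,           # depth cap per tree, limiting interaction complexity
}

# With xgboost installed, these can be passed as XGBRegressor(**params).
print(params["n_estimators"])  # → 10000
```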
To you new users: remember, if you are using the starter code, to delete example_model.xgb so your own model actually gets trained.
After building a few different models with this, I decided to automate a little more, to find the best variables to use with this code. If you have kept reading this far, the code you are looking for is called Long_train.
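The idea is simple: expand a grid of candidate hyperparameters and train one model per combination. A minimal sketch of that sweep (the real Long_train script differs; train_and_score here is a hypothetical stand-in for "fit an XGBRegressor with these parameters and return its validation score"):

```python
# Sketch of a hyperparameter sweep: expand a grid of settings into
# one parameter dict per training run.
from itertools import product

grid = {
    "learning_rate": [0.1, 0.01, 0.001],
    "max_depth": [3, 5, 7],
    "colsample_bytree": [0.1, 0.3],
}

def all_combos(grid):
    """Expand a dict of lists into a list of per-run parameter dicts."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]

combos = all_combos(grid)
print(len(combos))  # → 18 (3 * 3 * 2 runs)

# In the real loop, each combo would be trained and scored, e.g.:
# best = max(combos, key=train_and_score)  # train_and_score is hypothetical
```

The sweep grows multiplicatively with each parameter added, which is exactly why it needs automating rather than editing numbers by hand.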
The current results of the model trainings can be found:
The next issue I face: what is MMC, and what should my target be for it?