Introductory Colab Notebook Addressing Common Challenges


I’ve put together a Google Colab notebook covering the following topics:

  • Handling numerapi API keys in Colab without too much friction
  • Managing memory (reading the tournament dataset in one go tends to crash Colab)
  • Fitting models: the notebook includes a fastai tabular learner and a scikit-learn regression model
  • Evaluating model performance with average per-era correlation and Sharpe score
  • Generating, formatting, and submitting predictions to the competition with numerapi
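For the evaluation step, the per-era correlation and Sharpe score can be computed along these lines (a sketch with assumed column names `era`, `prediction`, and `target`; the notebook’s exact implementation may differ):

```python
import pandas as pd

def era_scores(df: pd.DataFrame) -> pd.Series:
    """Spearman correlation between predictions and targets, one score per era."""
    return df.groupby("era").apply(
        lambda g: g["prediction"].corr(g["target"], method="spearman")
    )

def sharpe(scores: pd.Series) -> float:
    """Mean per-era score divided by its standard deviation."""
    return scores.mean() / scores.std()
```

The mean of `era_scores` is the average per-era correlation, and dividing it by the standard deviation across eras gives the Sharpe-style score.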

You can access the notebook from the link below or from the GitHub repository.

Open In Colab

Any feedback on how to make this a more useful resource would be appreciated. It focuses mostly on topics I found challenging during my first few weeks of working on Numerai.

The intent of this project is twofold:

  1. Walking new users through the whole process of getting the current data, fitting a model, making predictions, and submitting to the competition; and
  2. Helping those who want to take advantage of Colab GPUs but have found it inconvenient or difficult to deal with secret keys and/or Colab’s memory limitations.

I copied and worked through this notebook, and it all worked for me out of the box. I thought it was a great introduction that covered every step from data to submission.

Instead of saving the API credentials in a .env file, I found it easier to save them as a JSON file and then just add that filename to my .gitignore.
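For anyone following this approach, a minimal sketch (the filename `numerai_keys.json` is just an example; remember to list it in `.gitignore` so it is never committed):

```python
import json

KEY_FILE = "numerai_keys.json"  # example name; add it to .gitignore

# One-time: write your keys to the file (never commit it)
with open(KEY_FILE, "w") as f:
    json.dump({"public_id": "YOUR_PUBLIC_ID", "secret_key": "YOUR_SECRET_KEY"}, f)

# In the notebook: load them back
with open(KEY_FILE) as f:
    keys = json.load(f)
# napi = numerapi.NumerAPI(keys["public_id"], keys["secret_key"])
```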


Thanks for the comment! That’s a good point – that’s what I do with the .env file (include it in .gitignore), but I didn’t mention it in this notebook. I’ll add a note about that; I think it would be a helpful addition.


Thanks for the tips! I also found that I needed to batch the predictions to avoid crashing Colab. Currently, I’m training XGBoost on smaller subsets of the data (which isn’t ideal) to prevent Colab from crashing. Any tips for getting around this, especially if I want to do cross-validation without resorting to very small training subsets or excessive computation times?
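Batched prediction can be sketched generically like this (model-agnostic; the `batch_size` value is an assumption to tune against Colab’s RAM):

```python
import numpy as np

def predict_in_batches(model, X, batch_size=50_000):
    """Run model.predict on fixed-size slices so the intermediate
    arrays for the full dataset never have to fit in memory at once."""
    preds = [
        model.predict(X[start:start + batch_size])
        for start in range(0, len(X), batch_size)
    ]
    return np.concatenate(preds)
```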


Just wanted to say that I did see your question @feroxolas . I haven’t tried XGBoost in Colab yet, but I’ll let you know if I find a way to run it without crashing Colab or compromising on training sample size.


I found and ran this notebook, Numerai tournament baseline | Kaggle, in Google Colab and it worked fine. It uses LightGBM instead of XGBoost and casts the features to small integers to save space. The XGBoost cells are commented out, but the rest of the code worked, and the LGBMRegressor ran fine for me.

import numpy as np

# Map the five feature values to small unsigned ints to save memory
mapping = {0.0: 0, 0.25: 1, 0.5: 2, 0.75: 3, 1.0: 4}
for c in feature_cols:  # feature_cols: the "feature_*" columns of df
    df[c] = df[c].map(mapping).astype(np.uint8)


I’ve made an updated version of the notebook (leaving the old one as-is for reference) that improves the initial data-processing stages. In both notebooks, the “validation” eras from the tournament CSV are concatenated to the training dataset; this format is, in my opinion, easier to use with common statistics and ML libraries (especially since memory constraints make it impossible to hold both datasets in Colab’s memory at the same time). The new version makes two changes to this process:

  1. The processed dataset (training + validation eras) is saved as a .pkl file instead of a csv, which makes saving and loading much faster.
  2. The processing step – loading the tournament data, extracting the validation eras, and concatenating them with the training data – is now done with the dask library, which is substantially faster than reading the csv in chunks with pandas.

Overall, this cut the data preprocessing stage from a rather ridiculous ~4.5 minutes down to under 1.5 minutes, and it still works within Colab’s RAM limits.

A few other modest changes: I commented out the submission cells to prevent accidental submissions when running through the whole notebook, changed the metric used by the fastai model from Pearson’s to Spearman’s correlation, and added a brief note about putting the .env file in .gitignore for those who choose to use python-dotenv.

Link to the new notebook: Open In Colab

And the GitHub repository.


Unfortunately, with the latest v4 dataset this no longer works in Colab due to RAM limits. I have stopped using Colab altogether; most of my notebooks now run on Kaggle, which has a 16 GB RAM limit. Even then, they require either downcasting or the “_int8” datasets.
The official First Submission notebook also won’t finish in Colab due to RAM. I have created a copy of that notebook on Kaggle with downcasting to fit into 16 GB.
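A generic downcasting sketch in pandas (the dtype choices are assumptions; Numerai’s own “_int8” files avoid the need for this on the features):

```python
import numpy as np
import pandas as pd

def downcast(df: pd.DataFrame) -> pd.DataFrame:
    """Shrink memory use: float64 -> float32, int64 -> smallest fitting int."""
    for c in df.columns:
        if df[c].dtype == np.float64:
            df[c] = df[c].astype(np.float32)
        elif df[c].dtype == np.int64:
            df[c] = pd.to_numeric(df[c], downcast="integer")
    return df
```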