About 1.5 years ago, @jrai and I created an open source library called NumerBlox to simplify the software engineering around Numerai inference pipelines. After hearing your feedback and using it internally for all CrowdCent models, we have integrated everything we learned into NumerBlox 1.0. We invite you to try out this preview version and give feedback before we merge it.
Quickstart (v4.2 data): GitHub - crowdcent/numerblox at rewrite/numerbloxv1
Overview of NumerBlox functionality: https://github.com/crowdcent/numerblox/blob/rewrite/numerbloxv1/docs/index.md
Advanced new features for end-to-end pipelines: https://github.com/crowdcent/numerblox/blob/rewrite/numerbloxv1/docs/end_to_end.md
Why a new version of NumerBlox?
After working with NumerBlox for almost 2 years, we decided to focus on end-to-end pipelines to improve reproducibility and robustness. NumerBlox 1.0 centers on the following topics:
1. End-to-end pipelines and full scikit-learn compatibility
- Every component can now be used with scikit-learn pipelines. We even developed meta-estimators so ensembling and feature neutralization can be fitted end to end in one pipeline.
- All components can be used standalone, but we believe the most robust models are fully reproducible and can be saved and loaded as one (cloudpickle) file. They should accept raw input and output fully processed predictions. NumerBlox v1 allows for this end-to-end modelling, even if you have a cross-validation setup, ensemble multiple models and apply feature neutralization.
- An additional benefit is that NumerBlox components now integrate with not only scikit-learn, but also extension libraries like scikit-lego, scikit-llm and @jefferythewind’s Era Splitting models.
Below is an example of what you could build with the new NumerBlox setup. Here multiple cross-validation schemes are fitted with an underlying estimator (XGBRegressor in this case). These models are ensembled and the final prediction is neutralized. An implementation of this example can be found in this example notebook (Section 3).
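To make the idea concrete, here is a minimal sketch of an ensemble plus neutralization step fitted and saved as a single scikit-learn pipeline. It uses only scikit-learn and NumPy. `NeutralizedRegressor` is a hypothetical class written for this illustration and is not NumerBlox's own meta-estimator. As further assumptions, scikit-learn's `GradientBoostingRegressor` stands in for `XGBRegressor`, the data is synthetic, the standard library's `pickle` stands in for cloudpickle, and neutralization is applied over all rows at once rather than per era.

```python
import pickle

import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.ensemble import GradientBoostingRegressor, VotingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


class NeutralizedRegressor(RegressorMixin, BaseEstimator):
    """Hypothetical meta-estimator (not the NumerBlox API): fits `estimator`,
    then removes `proportion` of the predictions' linear exposure to the
    input features at predict time."""

    def __init__(self, estimator, proportion=0.5):
        self.estimator = estimator
        self.proportion = proportion

    def fit(self, X, y):
        self.estimator_ = clone(self.estimator).fit(X, y)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        preds = self.estimator_.predict(X)
        # Least-squares projection of the predictions onto the features.
        exposure = X @ (np.linalg.pinv(X) @ preds)
        return preds - self.proportion * exposure


# Synthetic data with a nonlinear target, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sin(X[:, 0]) + X[:, 1] * X[:, 2] + rng.normal(scale=0.1, size=200)

# Ensemble two differently configured models, then neutralize the result.
ensemble = VotingRegressor(
    [
        ("shallow", GradientBoostingRegressor(max_depth=2, n_estimators=20, random_state=0)),
        ("deep", GradientBoostingRegressor(max_depth=4, n_estimators=20, random_state=0)),
    ]
)
model = make_pipeline(StandardScaler(), NeutralizedRegressor(ensemble, proportion=1.0))
model.fit(X, y)

# The whole pipeline round-trips through a single serialized object.
restored = pickle.loads(pickle.dumps(model))
preds = restored.predict(X)
```

Because the neutralizer is the final step of the pipeline, it sees the scaled features. With `proportion=1.0` the output has no remaining linear exposure to them, while the nonlinear part of the signal is kept.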
NOTE: This is also a heads-up for people currently using NumerBlox 0.x. The new version will have some breaking changes for current NumerBlox 0.x pipelines. The old system of Model and ModelPipeline objects will be deprecated in favor of full compatibility with scikit-learn.
2. Simplify!
- We’ve greatly simplified the package structure and reduced the number of mandatory dependencies. Bulky dependencies like TensorFlow are now optional and only needed if you use specific components like Feature Penalization. Our custom DataFrame structure, NumerFrame, is now completely optional and no processor depends on it.
- An additional benefit of these simplifications to the library is that it allowed us to build a more robust test suite to make sure all components behave as expected.
3. Leverage new v4.2 data to the fullest
- After the deprecation of v2/v3 data was announced, we moved quickly to update our pipelines. The refined downloaders make it easy to pull the newest data, along with auxiliary data like feature groups and meta model predictions, which smooths the transition to v4 data.
- The new v4.2 data reintroduced feature groups. We added functionality to NumerFrame that allows you to retrieve feature groups with one line of code to prepare your data for training and inference.
NumerFrame Examples
import pandas as pd
from numerblox.download import NumeraiClassicDownloader
from numerblox.numerframe import NumerFrame
from numerblox.prediction_loaders import ExamplePredictions
downloader = NumeraiClassicDownloader("data")
# Training and validation data
downloader.download_training_data("train_val", version="4.2", int8=True)
df = NumerFrame(pd.read_parquet("data/train_val/train_int8.parquet"))
# Era column
eras = df.get_era_data
# Get small feature set
small_df = df.get_small_feature_data
# Get last 100 eras of rain features
rain_df = df.get_last_n_eras(100).get_feature_group("rain")
# Get v3 equivalent features
v3_df = df.get_v3_equivalent_features
# FNCv3 features
fncv3 = df.get_fncv3_feature_data
- The feature groups allow for new feature engineering techniques like creating aggregate statistics of the groups. GroupStatsPreProcessor implements this and integrates with scikit-learn pipelines.
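As a rough illustration of group-level aggregate statistics, the sketch below computes a per-row mean and standard deviation for each feature group with pandas and wraps the function in scikit-learn's `FunctionTransformer`. This is not the `GroupStatsPreProcessor` implementation. The helper `add_group_stats`, the column names and the group-to-column mapping are all hypothetical and chosen for this example.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Hypothetical mapping from feature group to column names (illustration only).
FEATURE_GROUPS = {
    "rain": ["feature_a", "feature_b"],
    "sunshine": ["feature_c", "feature_d"],
}


def add_group_stats(df, groups):
    """Return one mean and one std column per feature group, computed row-wise."""
    stats = {}
    for name, cols in groups.items():
        stats[f"feature_{name}_mean"] = df[cols].mean(axis=1)
        stats[f"feature_{name}_std"] = df[cols].std(axis=1)
    return pd.DataFrame(stats, index=df.index)


# Tiny made-up frame with int8-style feature values.
df = pd.DataFrame(
    {
        "feature_a": [0, 4],
        "feature_b": [2, 4],
        "feature_c": [1, 3],
        "feature_d": [3, 3],
    }
)

# FunctionTransformer makes the helper usable as a pipeline step.
group_stats = FunctionTransformer(add_group_stats, kw_args={"groups": FEATURE_GROUPS})
stats = group_stats.fit_transform(df)
```

A transformer like this can be placed ahead of an estimator in a scikit-learn pipeline, so the group statistics are recomputed from raw features at inference time.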
Installation of Preview
Library
The new NumerBlox version will work with Python 3.9+. Because it is still in development, you can clone NumerBlox v1 from the dev branch using the commands below.
git clone --single-branch --branch rewrite/numerbloxv1 https://github.com/crowdcent/numerblox.git
pip install poetry
cd numerblox
poetry install
As an alternative to Poetry, you can run pip install . after cloning the library.
Documentation
The new documentation will eventually be hosted on a GitHub Pages website. For now, the docs can be found here or can be built locally with mkdocs:
pip install mkdocs
mkdocs build
mkdocs serve
Contributing
I always welcome new contributions and feature suggestions. If you are eager to contribute, check out the new contributing docs and create a PR that merges into the rewrite/numerbloxv1 branch.
Looking forward to hearing your feedback and refining NumerBlox before release! Very grateful for any suggestions on how we can improve this library and simplify competing in both Numerai Classic and Signals for all levels.