About 1.5 years ago @jrai and I created an open source library called NumerBlox to simplify the software engineering around Numerai inference pipelines. After hearing your feedback and using it internally for all CrowdCent models we had many insights and integrated everything we have learned into NumerBlox 1.0. We invite you to try out this preview version and give feedback before we merge it.
Quickstart (v4.2 data): GitHub - crowdcent/numerblox at rewrite/numerbloxv1
Overview of NumerBlox functionality: https://github.com/crowdcent/numerblox/blob/rewrite/numerbloxv1/docs/index.md
Advanced new features for end-to-end pipelines: https://github.com/crowdcent/numerblox/blob/rewrite/numerbloxv1/docs/end_to_end.md
After working with NumerBlox for almost 2 years, we decided to shift our focus toward end-to-end pipelines to improve reproducibility and robustness. For NumerBlox 1.0 we focused on the following topics:
1. End-to-end pipelines and full scikit-learn compatibility
- Every component can now be used with scikit-learn pipelines. We even developed meta-estimators so ensembling and feature neutralization can be fitted end to end in one pipeline.
- All components can be used standalone, but we believe the most robust models are fully reproducible and can be saved and loaded as one (cloudpickle) file. They should accept raw input and output fully processed predictions. NumerBlox v1 allows for this end-to-end modelling, even if you have a cross-validation setup, are ensembling multiple models and doing feature neutralization.
- An additional benefit is that NumerBlox components now integrate with not only scikit-learn, but also extension libraries like scikit-lego, scikit-llm and @jefferythewind’s Era Splitting models.
Below is an example of what you could build with the new NumerBlox setup. Here multiple cross-validation schemes are fitted with an underlying estimator (`XGBRegressor` in this case). These models are ensembled and the final prediction is neutralized. An implementation of this example can be found in this example notebook (Section 3).
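To make the architecture described above more concrete, here is a minimal sketch of the same idea using only scikit-learn and NumPy. It does not use NumerBlox's own meta-estimators. The class name `CVEnsembleNeutralizer` and its parameters are hypothetical, a `DecisionTreeRegressor` stands in for `XGBRegressor` to keep dependencies small, and neutralization is applied across all rows at once rather than per era, which is a simplification.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.model_selection import TimeSeriesSplit
from sklearn.tree import DecisionTreeRegressor


class CVEnsembleNeutralizer(RegressorMixin, BaseEstimator):
    """Hypothetical meta-estimator (not part of NumerBlox).

    Fits one clone of the base estimator per cross-validation training fold,
    averages their predictions, and neutralizes the average against the features.
    """

    def __init__(self, estimator=None, n_splits=3, proportion=0.5):
        self.estimator = estimator
        self.n_splits = n_splits
        self.proportion = proportion

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        base = self.estimator if self.estimator is not None else DecisionTreeRegressor(max_depth=3)
        cv = TimeSeriesSplit(n_splits=self.n_splits)
        # One fitted model per training fold.
        self.models_ = [clone(base).fit(X[train], y[train]) for train, _ in cv.split(X)]
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # Ensemble step: simple average over the fold models.
        preds = np.mean([m.predict(X) for m in self.models_], axis=0)
        # Neutralization step: subtract a proportion of the least-squares
        # projection of the predictions onto the feature columns.
        coefs = np.linalg.lstsq(X, preds, rcond=None)[0]
        return preds - self.proportion * (X @ coefs)


# Toy data standing in for Numerai features and targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 0.5 * X[:, 0] + rng.normal(size=200)

model = CVEnsembleNeutralizer(
    estimator=DecisionTreeRegressor(max_depth=3), n_splits=3, proportion=1.0
).fit(X, y)
neutral_preds = model.predict(X)
```

Because everything lives in one estimator object that takes raw features and returns processed predictions, the whole thing can be serialized and reloaded as a single artifact, which is the property the end-to-end design is aiming for.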
NOTE: This is also a heads-up for people currently using NumerBlox 0.x. The new version will have some breaking changes for current NumerBlox 0.x pipelines. The old system of `ModelPipeline` objects will be deprecated in favor of full compatibility with scikit-learn.
- We’ve greatly simplified the package structure and reduced the number of mandatory dependencies. Bulky dependencies like Tensorflow are now optional and only needed if you use specific components like Feature Penalization. Our custom DataFrame structure, NumerFrame, is now completely optional and no processors depend on using NumerFrame.
- An additional benefit of these simplifications to the library is that it allowed us to build a more robust test suite to make sure all components behave as expected.
- After deprecation of v2/v3 data was announced we moved quickly to update our pipelines. The downloaders we refined make it easy to pull the newest data and auxiliary data like feature groups and meta model predictions to make the transition to v4 data easier.
- The new v4.2 data reintroduced feature groups. We added functionality to `NumerFrame` that allows you to retrieve feature groups with one line of code to prepare your data for training and inference.
```python
import pandas as pd
from numerblox.download import NumeraiClassicDownloader
from numerblox.numerframe import NumerFrame
from numerblox.prediction_loaders import ExamplePredictions

downloader = NumeraiClassicDownloader("data")
# Training and validation data
downloader.download_training_data("train_val", version="4.2", int8=True)
df = NumerFrame(pd.read_parquet("data/train_val/train_int8.parquet"))

# Era column
eras = df.get_era_data
# Get small feature set
small_df = df.get_small_feature_data
# Get last 100 eras of rain features
rain_df = df.get_last_n_eras(100).get_feature_group("rain")
# Get v3 equivalent features
v3_df = df.get_v3_equivalent_features
# FNCv3 features
fncv3 = df.get_fncv3_feature_data
```
- The feature groups allow for new feature engineering techniques like creating aggregate statistics of the groups. GroupStatsPreProcessor implements this and integrates with scikit-learn pipelines.
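As a rough illustration of group-level aggregate statistics, the sketch below computes a per-row mean and standard deviation for each feature group with plain pandas and wraps the function in scikit-learn's `FunctionTransformer`. This is not the `GroupStatsPreProcessor` implementation. The `group_stats` function, the `FEATURE_GROUPS` mapping, and the column names are all made up for the example; the real v4.2 groups come from Numerai's feature metadata.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Hypothetical mapping of group name -> feature columns (illustrative only).
FEATURE_GROUPS = {
    "rain": ["feature_a", "feature_b"],
    "sunshine": ["feature_c", "feature_d"],
}


def group_stats(df: pd.DataFrame, groups: dict) -> pd.DataFrame:
    """Return per-row mean and sample standard deviation for each feature group."""
    stats = {}
    for name, cols in groups.items():
        stats[f"feature_{name}_mean"] = df[cols].mean(axis=1)
        stats[f"feature_{name}_std"] = df[cols].std(axis=1)
    return pd.DataFrame(stats, index=df.index)


# Tiny toy frame standing in for Numerai data.
toy_df = pd.DataFrame(
    {
        "feature_a": [0, 2],
        "feature_b": [4, 2],
        "feature_c": [1, 1],
        "feature_d": [3, 5],
    }
)

# Wrapping the function makes it usable as a step in a scikit-learn pipeline.
transformer = FunctionTransformer(group_stats, kw_args={"groups": FEATURE_GROUPS})
stats_df = transformer.fit_transform(toy_df)
```

The resulting columns can be used on their own or concatenated with the raw features before fitting a model.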
The new NumerBlox version will work with Python 3.9+. Because we are still in development you can clone NumerBlox v1 from the dev branch using the code below.
```bash
git clone --single-branch --branch rewrite/numerbloxv1 https://github.com/crowdcent/numerblox.git
pip install poetry
cd numerblox
poetry install
```
As an alternative to Poetry, it is possible to run `pip install .` after you have cloned the library.
The new documentation will eventually be hosted on a GitHub Pages website. For now, the docs can be found here or can be built locally with
```bash
pip install mkdocs
mkdocs build
mkdocs serve
```
I always welcome new contributions and feature suggestions. If you are eager to contribute check out the new contributing docs and create a PR that merges to the
Looking forward to hearing your feedback and to refining NumerBlox before release! I'm very grateful for any suggestions on how we can improve this library and simplify competing in both Numerai Classic and Signals for all levels.