Open Source Datasets for Numerai Signals

thomasxthomas · March 3, 2022, 4:12pm

2022-03-03: version 1.0

Preparing data sources is the biggest challenge for Numerai Signals as it is extremely time-consuming. In view of this, I want to start a community project to share and document open-source datasets for Numerai Signals.

The current goal is to maintain high-quality training and validation datasets for Numerai Signals and provide benchmark performance on these datasets. This benchmark can be used to decide whether a new data source or features can improve model performance.

I have created a historical stock metadata mapping of US stocks in the Numerai universe, covering the period from 2003 to 2020. The stock metadata is mapped against academic-research-quality databases so that it does not suffer from survivorship bias. I also created a very simple dataset of normalised price and financial features using data from CRSP, Compustat and OptionMetrics, obtained through WRDS.
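For readers unfamiliar with what "normalised" features can look like, here is a minimal sketch of one common approach: cross-sectional rank normalisation, where each stock's value on a given date is replaced by its rank scaled into (0, 1). This is an assumption for illustration only, not necessarily the normalisation used in the repository; the function name `rank_normalise` and the sample prices are hypothetical.

```python
def rank_normalise(values):
    """Map one date's cross-section of values into (0, 1) by average rank.

    Tied values share the mean of the ranks they span.
    Illustrative only: not confirmed to match the repository's method.
    """
    n = len(values)
    # Indices of the stocks, sorted by their raw value.
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    # Centre each rank inside its bucket so outputs stay strictly in (0, 1).
    return [(r - 0.5) / n for r in ranks]


# Hypothetical closing prices for four stocks on a single date.
closes = [10.0, 250.0, 35.0, 35.0]
print(rank_normalise(closes))  # [0.125, 0.875, 0.5, 0.5]
```

Applying this separately to each date keeps features comparable across time, since the raw price level drops out and only the relative ordering within the date remains.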

The data can be found in the following GitHub repository:

numerai-signals/data at main · ThomasWong2022/numerai-signals (github.com)

This is a work in progress. Please open an issue on GitHub for data sources and feature engineering suggestions.

maxchu · March 4, 2022, 12:01am

Thanks for the great work! Is it possible that this project can be integrated into https://github.com/councilofelders/opensignals ?

thomasxthomas · March 4, 2022, 10:35am

As I am using data from academic research databases (which are not available in real time), the dataset cannot be used to submit live predictions for Numerai Signals. It can be integrated into the opensignals repo once I finish the draft of my paper on this dataset.

maxchu · March 5, 2022, 12:43am

Oh yes, it cannot be used for live predictions. But it is still a very good resource for ppl who want to test their ideas on the dataset; if the results turn out good, they can then buy similar data from data companies.

I am not sure if Numerai can help to expand/improve this type of free historical data to lower the entry barrier.

thomasxthomas · March 5, 2022, 12:16pm

The purpose of this dataset and benchmark is to encourage the community to maintain open-source/low-cost data feeds by sharing the time and costs of preparing data feeds.

While most data license terms do not allow sharing of the raw data itself, it is usually allowed to share processed data as in Open Source Asset Pricing.

Data – Open Source Asset Pricing (openassetpricing.com)

The Numerai features are created in a conceptually similar process with better data sources and different factors to neutralise features and targets against, such as sector and country. The big difference is that the data generation process will be open-source for my dataset.
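To make "neutralise against sector" concrete, here is a minimal sketch under stated assumptions. Neutralising a feature against a single categorical factor such as sector amounts to taking the residuals of a regression on one-hot sector dummies, which is the same as subtracting each sector's mean. The function name `neutralise_by_group` and the sample values are hypothetical; this is neither Numerai's actual pipeline nor the code in my repository.

```python
from collections import defaultdict


def neutralise_by_group(values, groups):
    """Subtract each group's mean from the values of its members.

    Equivalent to the residuals of an OLS regression of `values`
    on one-hot dummies for `groups` (e.g. sector or country).
    """
    if len(values) != len(groups):
        raise ValueError("values and groups must have the same length")
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, g in zip(values, groups):
        sums[g] += v
        counts[g] += 1
    means = {g: sums[g] / counts[g] for g in sums}
    return [v - means[g] for v, g in zip(values, groups)]


# Hypothetical feature values for four stocks on one date.
feature = [0.9, 0.7, 0.2, 0.4]
sector = ["tech", "tech", "energy", "energy"]
neutral = neutralise_by_group(feature, sector)
print(neutral)
```

After this step the feature averages to zero within every sector, so a model can no longer use it as a pure sector bet. Neutralising against several factors at once (say sector and country together) would need a joint regression rather than sequential demeaning.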

thomasxthomas · March 8, 2022, 2:04pm

2022-03-08: version 2.0
Added sentiment data from RavenPack
