A few simple newb questions

blinkin · December 20, 2021, 6:52pm

Hi everyone! I’m fairly new to Numerai, and am working on getting my first model up and running. I’d like to try the NN approach, for my own learning, though I know XGB and other methods are more popular. I’m confused about a couple of things in the dataset:

1. The “ID”. Is that unique to each asset & consistent throughout eras? Or in the obfuscation does each asset get a unique ID every new era? Just curious if those could be used to construct a history of a particular asset and train a time-based RNN or something
2. The “target” values we are training to are binned to one of 5 values (or 7 values for some targets), i.e. 0, .25, .5, .75, 1. How should I construct the output of my neural network when predicting targets for submission? Should they be:
   a. Also binned to match the training set bins? So construct a multi-class classifier that only outputs 0, .25, etc.
   b. Or a continuous prediction between 0 and 1. Most of the examples I’ve seen use a Sigmoid activation as the final layer to achieve this.
3. I’ve noticed my starter models VERY quickly overfit compared to validation. I’m not subsampling the eras as of yet, so is this likely the main cause of the overfitting? Any good strategies for tackling this? Is shuffling the data in my DataLoader a good first step? Or do I need to subsample? Do you use all 4 eras every epoch? Or cycle eras through epochs? Or throw out 3/4 of the data altogether?

Thanks for any tips you’re willing to offer!

rigrog · December 20, 2021, 8:27pm

1. If/when the same stock appears in several eras, its IDs will differ as randomly as all the other IDs.
2. Your predictions can be any real number x with 0 < x < 1. Because your predictions are scored ONLY by their ordering, you don’t want them in 5 or 7 bins: you can’t control how the resulting 5 or 7 massive multiway ties will be broken (Numerai will break them for you, according to their order in the text file).
3. [opinion] That overfitting is a consequence of all the given target data being years out of date. That may be about to change, on 12/25/21.
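
To illustrate the point about ordering: if only the order of the predictions matters, then any raw model output (sigmoid or not) can be mapped to distinct values strictly between 0 and 1 by ranking. The following is a minimal sketch under that assumption, not Numerai's own code; the function name `rank_to_unit_interval` and the sample scores are hypothetical.

```python
def rank_to_unit_interval(scores):
    """Map raw scores to evenly spaced values strictly inside (0, 1).

    The relative order of the scores is preserved. If two raw scores are
    equal, the one that appears first in the list gets the lower rank,
    because Python's sort is stable.
    """
    n = len(scores)
    # Positions of the scores, ordered from lowest to highest score.
    order = sorted(range(n), key=lambda i: scores[i])
    ranked = [0.0] * n
    for rank, i in enumerate(order, start=1):
        # Dividing by n + 1 keeps every value away from exactly 0 and 1.
        ranked[i] = rank / (n + 1)
    return ranked


# Hypothetical raw network outputs for the rows of a single era.
raw = [2.3, -0.7, 0.1, 5.0]
preds = rank_to_unit_interval(raw)
print(preds)  # [0.6, 0.2, 0.4, 0.8]
```

Because every output value is distinct, no ties are left for anyone else to break. In practice one would probably apply this separately to the rows of each era, but that is an assumption about workflow rather than something stated in this thread.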

blinkin · December 21, 2021, 10:16pm

Got it, thank you! That sounds a bit different than what I was picturing, super helpful. So it sounds like our “target” numbers are more like a rank ordering prediction of stocks from best to worst, rather than a more direct prediction of performance?

Looking forward to the newest data, I hope that helps. Didn’t we get a bunch of new data this summer as well? Or was that mostly old data, just with extra features and rows interspersed?

rigrog · December 21, 2021, 10:39pm

According to the latest post on “# announcements” (in the chat room), the newest data was just put off until March. It is to include a few new features, and “all of the test targets”.

There was new (“super massive”) data released in late summer. It changed the number of features from 310 to 1050, and also added rows (mostly, I think, by making four era-specific rows out of each “train” or “validation” row). But the 1050 features seem to be quite redundant: [I tried to put a hyperlink here, to the “tips and tricks” GitHub page. See the graph that follows “Out[7]”.]
