Digging into Sunshine Data

From what I recall, there was not a way to identify assets in the numerai data. Was this to help protect the numerai data assets?

Otherwise, can anyone inform me if it is possible to identify the same asset over time through ordering somehow or via some score matching? Or if the data folks at numerai can give us the asset id on these rows, that would be great.

I would mainly be interested in this to measure the volatility of my predictions. Now that I think more about it, perhaps a lot of assets move in and out of the outermost prediction buckets because of random volatility in my model rather than some true change in the asset. I believe my predictions are usually pretty tight in the middle, and the predictions in the outer buckets may be too unstable. However, I think others have mentioned that this is where most TC/Corr can be earned. For example, I’d like to track one asset over time as it changes buckets in the target and observe which buckets my predictions put it in.

It would also be nice to have this to try some sequence models. I’m too lazy to get my own data for signals. Thanks.

I don’t expect that anyone at Numerai will give you any of the data you’re asking for. However, I have found that if you compare each sample in era X with all samples in era X+1, you can generally find one that is surprisingly similar compared with all the others, in an L2 norm sense.
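A minimal sketch of that comparison, assuming each era has already been loaded as a NumPy array of shape (n_samples, n_features). The function name `nearest_in_next_era` and the toy data are made up for illustration; nothing here is a Numerai API.

```python
import numpy as np

def nearest_in_next_era(era_x, era_next):
    """For each row of era_x, find the closest row of era_next (L2 distance).

    Returns (nearest, dist): the index of the closest row in era_next and
    the distance to it, one entry per row of era_x.
    """
    # Broadcast to shape (n_x, n_next, n_features), then reduce over features.
    diff = era_x[:, None, :] - era_next[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    nearest = dist.argmin(axis=1)
    return nearest, dist[np.arange(len(era_x)), nearest]

# Toy data (hypothetical): the "next era" is a shuffled, slightly shifted copy.
era_x = np.eye(5, 8)
perm = np.array([3, 0, 4, 1, 2])
era_next = era_x[perm] + 0.05
nearest, dist = nearest_in_next_era(era_x, era_next)
```

The broadcasted difference holds n_x * n_next * n_features values at once, so on real eras it would need to be computed in blocks of rows rather than in one shot.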


Thanks for the tip. I was thinking that could be the case for 90% of assets. Confirmation from others helps.

For each asset (thousands per era) in each era, I would need to calculate the L2 norm of the difference between that asset and every asset in the next era. Doing this for 1000 eras means on the order of 1000^3 (about a billion) comparisons to map it out completely.

Any other ideas? I’ll probably make a function to do one asset at a time from start to end. Maybe I’ll just have it running in the background.

I’d start with a recent era and compare every sample in that to the previous era. Identify which ones have clear matches, and ignore the others (I believe the set of securities considered in each era changes, so some will have no obvious close match). Then iterate.
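Roughly, that could look like the sketch below. It assumes `eras` is a list of NumPy arrays ordered oldest first, each with at least two rows, and it treats a match as "clear" when the best distance is well below the second-best one. The `ratio` threshold and the names `clear_matches` and `trace_back` are my own choices for illustration, not something established for this data.

```python
import numpy as np

def pairwise_l2(a, b):
    """L2 distance between every row of a and every row of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def clear_matches(curr, prev, ratio=0.5):
    """For each row of curr, the index of its match in prev, or -1.

    A match is kept only if the best distance is smaller than
    ratio * second-best distance (an assumed heuristic for "clear").
    """
    d = pairwise_l2(curr, prev)
    order = np.argsort(d, axis=1)
    rows = np.arange(len(curr))
    best = d[rows, order[:, 0]]
    second = d[rows, order[:, 1]]
    match = order[:, 0].copy()
    match[best >= ratio * second] = -1  # ambiguous: ignore this sample
    return match

def trace_back(eras, ratio=0.5):
    """Follow each sample of the most recent era back through earlier eras.

    Returns an int array of shape (n_last, n_eras); column t holds the row
    index in era t, or -1 once the chain has been lost.
    """
    n_last = len(eras[-1])
    paths = np.full((n_last, len(eras)), -1, dtype=int)
    paths[:, -1] = np.arange(n_last)
    for t in range(len(eras) - 1, 0, -1):
        match = clear_matches(eras[t], eras[t - 1], ratio)
        alive = paths[:, t] >= 0  # chains that still have a known row in era t
        paths[alive, t - 1] = match[paths[alive, t]]
    return paths
```

Samples that enter the universe partway through simply end their chain with -1, which fits the idea of ignoring the ones with no obvious close match.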

I recommend leveraging a GPU to compute these norms in parallel.
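The GPU-friendly way to write this is as a matrix multiply, using ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a·b, processed in blocks of rows so memory stays bounded. The sketch below runs on the CPU with NumPy; a NumPy-compatible GPU array library such as CuPy should accept much the same expressions, but I have not tested that here. The function name `nearest_chunked` and the `chunk` size are illustrative assumptions.

```python
import numpy as np

def nearest_chunked(era_x, era_next, chunk=1024):
    """Nearest row of era_next for each row of era_x, computed block by block.

    Uses the expansion ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b so the heavy
    work is one matrix multiply per block.
    """
    b = era_next.astype(np.float32)
    b_sq = (b ** 2).sum(axis=1)
    nearest = np.empty(len(era_x), dtype=np.int64)
    dist = np.empty(len(era_x), dtype=np.float32)
    for start in range(0, len(era_x), chunk):
        a = era_x[start:start + chunk].astype(np.float32)
        d2 = (a ** 2).sum(axis=1)[:, None] + b_sq[None, :] - 2.0 * (a @ b.T)
        # Rounding in the expansion can give tiny negative values; clip them.
        np.maximum(d2, 0.0, out=d2)
        idx = d2.argmin(axis=1)
        nearest[start:start + chunk] = idx
        dist[start:start + chunk] = np.sqrt(d2[np.arange(len(a)), idx])
    return nearest, dist
```

One caveat: the expansion loses some precision for very close pairs in float32, so if exact distances matter it may be worth recomputing them directly for the matched pairs only.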