I’m a data scientist, new to the stock market, trying to wrap my head around Numerai.
The first shock as a complete newbie has been that the data is so anonymized. After reading the docs and the forum, I understand: the data that the hedge fund provides is actually one of its assets and apparently expensive to get. Ok, fair enough.
However, the fact that the stock ids are not only anonymized but also change randomly from era to era has me very confused. Aren’t we losing an extremely important piece of information? To give a sense of the magnitude of what I am (mis)understanding: the current deep learning revolution started, and continues, in large part because people took advantage first of the geometric structure of the data (CNNs applied to images) and then of its temporal/sequential structure (RNNs and their variants applied to NLP and other sequence-type data).
Aren’t we losing a lot by removing this information? Just imagine image classification with the position of the pixels randomized, or NLP with the words shuffled.
What is the reason for removing id consistency over time? Is it that consistent ids would somehow make de-anonymization of the data possible?
If the answer is that the temporal information of each stock is somehow embedded in the row, spread through the features, I would argue that just as we want collective intelligence applying data science over the features, it might be equally important to apply it over time, rather than having that axis of the information handled for us in a fixed way.
In addition, any possible correlation between different stocks is lost, since we can’t track them from one era to the next. Maybe this is better from a stock prediction point of view, I don’t know.
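To make the concern concrete, here is a toy sketch in Python. It is only my own illustration of what I think is happening, not how Numerai actually generates its ids; the tickers, the helper names, and the id format are all made up. Each stock gets a fresh random id in every era, and then I count how many eras each id can be followed for.

```python
import random


def anonymize_per_era(true_ids, n_eras, seed=0):
    """Give every stock a fresh, unique random id in each era.

    Toy model only: the real anonymization scheme is unknown to me.
    """
    rng = random.Random(seed)
    # Draw all ids without replacement so no id is ever reused.
    pool = rng.sample(range(10**9), n_eras * len(true_ids))
    rows = []
    for era in range(n_eras):
        for true_id in true_ids:
            rows.append({"era": era, "id": f"n{pool.pop():09d}", "true_id": true_id})
    return rows


def history_lengths(rows, key):
    """Count in how many rows (here: eras) each value of `key` appears."""
    counts = {}
    for row in rows:
        counts[row[key]] = counts.get(row[key], 0) + 1
    return counts


rows = anonymize_per_era(["AAA", "BBB", "CCC"], n_eras=4)
by_anon = history_lengths(rows, "id")       # what a participant can see
by_true = history_lengths(rows, "true_id")  # what stable ids would allow

print(max(by_anon.values()))  # 1: no id can be followed past a single era
print(max(by_true.values()))  # 4: with stable ids, each stock spans all eras
```

If my understanding is right, the first number is what we get today: every per-id history has length one, so there is nothing to feed into a sequence model or a cross-stock correlation estimate.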
Just trying to understand here, I hope there is something that I missed!
Thank you in advance!