An absolute beginner


As a high school student interested in learning how these models work, how to build them, and eventually contributing to Numerai, where should I start if I want to build such models and eventually compete? What is the most basic starting point for someone like me, and what basic steps, requirements, and knowledge would I need to acquire in order to get started?


At minimum you’ll want knowledge of three areas to some degree:

  1. machine learning
  2. statistics more broadly
  3. programming

There are lots of basic courses in these areas on sites like Coursera, where you can take (or at least audit) most courses for no charge – though obviously what you get out of them is proportional to the effort you put in.

Numerai probably isn’t great as a first project in data science: here we are blind to the data sources and their meaning, so some aspects of data analysis aren’t really applicable, and additionally the signal to noise ratio is very low.


Ah, thank you for the response! Will start delving into these areas…


Programming-wise, would Python be a good place to start, and would it be sufficient to learn only this programming language to get started? Or would I need to learn programming in a broader sense?
Also, what is meant by a ‘signal to noise ratio’?


Python is a good place to start if you are new to programming. There are many tutorials online for Python and it’s a fairly easy language to pick up.

I will also echo the comment from Daenris. This is not a good project for a first-time programmer. This is a very difficult machine learning problem, and many of the techniques you will learn from online tutorials and machine learning material do not apply here. The principles are the same, but the execution is totally different.

I think what Daenris was alluding to with the “signal to noise” comment was that in the Numerai dataset, it’s really difficult to find the “correct” answer. Think of it this way: imagine you were to play three pieces of music at the same time. It’s fairly easy to pick out the song you want from the resulting noise and “extract” it. This is a high signal to noise ratio. Now imagine playing thousands of pieces of music at the same time. Now it’s much harder to “extract” the song you want. This is a low signal to noise ratio (this is a very simplified explanation).

In the typical machine learning tutorials you see online, the algorithms can predict results to a very high degree of accuracy (i.e., the “signal” is much larger than the noise). In the Numerai dataset, the algorithms can barely predict better than 50% accuracy (i.e., the “signal” is barely distinguishable from the noise).
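
If it helps to see the idea in numbers, here is a minimal Python sketch. It is a toy simulation, not the Numerai data: the function name `sign_accuracy`, the two signal strengths, and the sample size are all made up for illustration. It hides a yes/no answer inside random noise and checks how often the simplest possible guess gets it right.

```python
import numpy as np

def sign_accuracy(signal_strength, n_samples=100_000, seed=0):
    """Accuracy of guessing a binary target from one noisy feature (toy example)."""
    rng = np.random.default_rng(seed)
    # Hidden "correct" answer: +1 or -1 with equal probability.
    target = rng.choice([-1.0, 1.0], size=n_samples)
    # What we actually observe: the answer scaled by signal_strength,
    # buried in Gaussian noise with standard deviation 1.
    feature = signal_strength * target + rng.normal(size=n_samples)
    # Simplest possible "model": guess the sign of the observed feature.
    prediction = np.where(feature >= 0, 1.0, -1.0)
    return float(np.mean(prediction == target))

# Assumed values: 3.0 stands in for a "tutorial-like" problem,
# 0.03 for a problem where the signal is almost drowned out.
high_snr = sign_accuracy(signal_strength=3.0)
low_snr = sign_accuracy(signal_strength=0.03)

print(f"high signal to noise accuracy: {high_snr:.3f}")
print(f"low signal to noise accuracy:  {low_snr:.3f}")
```

With the strong signal the guess is right almost every time; with the weak one it lands only slightly above a coin flip, which is the regime described above.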

I hope that helps. Good luck with the competition.


Thank you for this, themicon. Much appreciated!


A novice question here: why does a hedge fund need a token to reward its data scientists?
Is it to raise funds to run the project, or to manage the rewards with smart contracts?
Also, are speculators affecting the market price of NMR, or is it the NMR holders themselves?