To put it plainly, the primary issue with the recent performance of the fund is TC, the second issue is TC, and the third issue is also TC…
TC may be an optimal metric from Numerai’s perspective, but it has several serious problems:
- What TC claims to be and what it actually is are not the same at all. TC should represent the true contribution of a model to the metamodel. For that, it is essential that TC be calculated "leave one out" (a minimal sketch of what that looks like follows this list). Initially, TC was calculated as an average over several out-of-bag (OOB) evaluations. This made some sense, but if a model had the misfortune of landing out-of-bag alongside other poor models, its score suffered. An accurate average OOB estimate would require extensive resampling, and both leave one out (LOO) and OOB consume excessive CPU resources.
- Then came the new TC calculation, based on the gradients of a layer in the optimization software used to build the metamodel. All good, right? Well, NO. This way of calculating TC is what's known as "in-data": it uses the same data that was used to fit the metamodel. Any in-data performance calculation is overfit and biased toward the variables (here, each variable is a model) that give the most degrees of freedom to the process being tuned/optimized. Models that are very different from the rest and/or carry a lot of random noise offer many degrees of freedom, so the process overfits to them. That overfitting shows up as large gradients, and Numerai reads those large gradients as good models with high TC (a toy illustration of this effect also follows the list).
- Let's draw a parallel with something we all know. Consider a gradient boosting model tuned with lightgbm in which each variable is a model; TC would then be equivalent to the relative importance of each variable (model). What happens with the built-in importances lightgbm reports? We all know they are biased toward variables with higher cardinality or greater variability, simply because those variables give the process more degrees of freedom (more candidate split points). This is easy to see by adding a few purely random variables to the model and computing their built-in importance, only to find that some of them come out surprisingly "important" (see the lightgbm comparison after this list). So how should the importance of a variable (model) actually be measured? That is why other techniques exist, such as randomly permuting a variable's values and observing the impact on performance. The same thing happens with TC as it is currently calculated: it is an overfit measure, biased toward models that differ from the rest and that carry variability (random noise). Many data scientists have achieved high TC by playing by these rules, without paying attention to CORR, even forcing CORR to be negative as another way to stand out from the crowd and look attractive to the metamodel optimization, thereby obtaining high TC. The fund has bet on those models, and the results are what we see. In a way, Numerai has achieved "WYGIWYP" (what you get is what you paid for).
- Last but not least, even if TC were the best metric and it were calculated well, what good is a metric that data scientists can neither reproduce nor calculate? Those two properties are essential for any metric; without them it cannot be optimized. So what have people done? Test, test, and test. We have started a "genetic" optimization race in which the successful experiments seed the next generation of experiments. Models with a correlation of 0.10 with the metamodel, some even negative. Anything goes if the TC lottery yields a positive value. We have stopped optimizing reasonable metrics and started searching for the philosopher's stone.
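To make the leave-one-out idea from the first point concrete, here is a minimal sketch. It is purely my own illustration, not Numerai's code; the function name `loo_contribution` and the plain correlation-weighted blend are assumptions for the example:

```python
import numpy as np

def loo_contribution(predictions, target, weights):
    """Leave-one-out contribution of each model to a stake-weighted blend.

    predictions: (n_rows, n_models) array of model predictions
    target:      (n_rows,) realized target
    weights:     (n_models,) stake weights used to blend the metamodel
    """
    def corr(a, b):
        return np.corrcoef(a, b)[0, 1]

    n_models = predictions.shape[1]
    full_score = corr(predictions @ weights, target)
    contributions = np.empty(n_models)
    for i in range(n_models):
        keep = np.arange(n_models) != i
        # Rebuild the blend without model i and measure the drop in score.
        reduced_score = corr(predictions[:, keep] @ weights[keep], target)
        contributions[i] = full_score - reduced_score
    return contributions
```

The loop also makes the cost obvious: one full re-blend (and, in a realistic setup, a re-optimization) per model, which is exactly why LOO and resampled OOB estimates eat so much CPU.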
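On the second point, the sketch below is a deliberately crude analogue of "in-data" attribution. It uses a least-squares blend rather than the gradients of Numerai's optimizer, so it only aims to show that any attribution measured on the same data used to fit the blend can hand credit to pure noise:

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test = 2_000, 2_000
target = rng.normal(size=n_train + n_test)

# Three "models": two weakly informative, one pure noise.
model_a = target + rng.normal(scale=3.0, size=target.shape)
model_b = target + rng.normal(scale=3.0, size=target.shape)
noise   = rng.normal(size=target.shape)
preds = np.column_stack([model_a, model_b, noise])

train, test = slice(0, n_train), slice(n_train, None)

# "In-data" attribution: fit blend weights on the training rows and read them off.
w_in, *_ = np.linalg.lstsq(preds[train], target[train], rcond=None)
print("in-sample blend weights:", np.round(w_in, 3))  # the noise column typically gets a small but non-zero weight

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

# Out-of-sample check: dropping the noise model costs essentially nothing.
with_noise    = corr(preds[test] @ w_in, target[test])
without_noise = corr(preds[test][:, :2] @ w_in[:2], target[test])
print("test corr with noise model:   ", round(with_noise, 4))
print("test corr without noise model:", round(without_noise, 4))
```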
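And for the lightgbm parallel in the third point, here is the experiment described above: built-in importance versus permutation importance after adding purely random features. The data and hyperparameters are made up for the illustration:

```python
import numpy as np
import lightgbm as lgb
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 5_000

# Two genuinely useful features plus three purely random ones.
X = np.column_stack([
    rng.integers(0, 3, n),   # low-cardinality, informative
    rng.normal(size=n),      # continuous, informative
    rng.normal(size=n),      # pure noise, high "variability"
    rng.normal(size=n),      # pure noise
    rng.normal(size=n),      # pure noise
])
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
model = lgb.LGBMRegressor(n_estimators=200).fit(X_train, y_train)

# Built-in importance (split counts) typically ranks the continuous noise
# features highly, simply because they offer many candidate split points.
print("built-in importance:", model.feature_importances_)

# Permutation importance on held-out data tells a different story:
# shuffling a noise feature barely changes the score.
perm = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
print("permutation importance:", np.round(perm.importances_mean, 4))
```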
Now, my two cents: Numerai needs CORR and diversity. Why not reward precisely that? CORR plus low correlation with the metamodel: something easy to calculate and easy to optimize (a back-of-the-envelope sketch is below). I believe it is time to acknowledge that the experiment has not gone well, and common sense should prevail.
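For what it is worth, the kind of score I have in mind is as simple as this; the name `diversified_score`, the plain Pearson correlation, and the `penalty` weight are all illustrative assumptions, not a precise proposal:

```python
import numpy as np

def diversified_score(pred, target, metamodel, penalty=0.5):
    """Reward correlation with the target, penalize correlation with the metamodel."""
    def corr(a, b):
        return np.corrcoef(a, b)[0, 1]
    return corr(pred, target) - penalty * abs(corr(pred, metamodel))
```

Both terms are things a data scientist can compute locally before submitting, which is the whole point: a metric you can reproduce is a metric you can optimize.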