Hello! I’m trying to understand the targets and how our predictions are scored. If I understand correctly, would it be better to discretize our predictions into 5 buckets, just as the targets are given? Or is the score computed against a non-discretized version of the target?
If the score is against the discretized target, doesn’t discretizing it so heavily make it artificially hard to get good predictions?
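To make the question concrete, by "discretize into 5 buckets" I mean something like equal-frequency binning back onto the 5 canonical target values; `pd.qcut` here is just one hypothetical way to do it, not anything from the Numerai tooling:

```python
import numpy as np
import pandas as pd

# Hypothetical example of the bucketing in question: equal-frequency bins
# mapped to the 5 canonical target values {0, 0.25, 0.5, 0.75, 1}.
raw = pd.Series(np.random.default_rng(0).normal(size=10))
# Ranking first (method='first' breaks ties) guarantees qcut gets unique edges.
buckets = pd.qcut(raw.rank(method='first'), q=5,
                  labels=[0, 0.25, 0.5, 0.75, 1]).astype(float)
print(sorted(buckets.tolist()))
# -> [0.0, 0.0, 0.25, 0.25, 0.5, 0.5, 0.75, 0.75, 1.0, 1.0]
```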
Here is a bit of code showing how a target with many tied values, scored against float predictions (so even near-perfect predictions form a dense cloud around the target bucket), leads to a pretty bad score:
import numpy as np
import pandas as pd
from numerai_tools.scoring import numerai_corr
def compare(pred, target):
    pred = pd.DataFrame({'predictions': pred})
    target = pd.DataFrame({'target': target})['target']
    return numerai_corr(pred, target)
# 1: identical values, perfect ordering
out1 = compare([1, 2, 3, 4, 5],
               [1, 2, 3, 4, 5])
# 2: same ordering on a different scale
out2 = compare([1, 2, 3, 4, 5],
               [0, .25, .5, .75, 1])
# Now a target distribution similar to the actual one:
# many tied points at the center.
target = [1] + [50] * 100 + [100]
# 3: strictly increasing (sorted) predictions
pred = list(range(len(target)))
out3 = compare(pred, target)
# 4 problem: predictions very dense around the middle bucket, but in order.
# Still a pretty bad score.
pred = [0] + list(np.linspace(0.45, 0.55, len(target) - 2)) + [1]
out4 = compare(pred, target)
# 5: should we just discretize our output to 5 values for a better score?
# That can't be good for the hedge fund using the output, can it?
# A lot of information is being dropped.
# Here is the max score again:
pred = target.copy()
out5 = compare(pred, target)
print(f'output 1: {out1.iloc[0]}')
print(f'output 2: {out2.iloc[0]}')
print(f'output 3: {out3.iloc[0]}')
print(f'output 4: {out4.iloc[0]}')
print(f'output 5: {out5.iloc[0]}')
The output:
output 1: 0.9964884741438282
output 2: 0.9964884741438282
output 3: 0.4684796965347025
output 4: 0.4684796965347025
output 5: 0.9998918220813134
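For what it's worth, the identical scores in cases 1/2 and 3/4 suggest the metric is purely rank-based: any strictly monotone transform of the predictions yields the same score. Below is a minimal sketch of the recipe Numerai has described (rank the predictions, gaussianize, apply a signed power of 1.5, Pearson-correlate with the power-1.5 centered target); `approx_numerai_corr` is my own approximation, not the library function:

```python
import numpy as np
from scipy import stats

def approx_numerai_corr(pred, target):
    # Rank predictions (ties get average rank), map to (0, 1), gaussianize.
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    ranked = (stats.rankdata(pred, method='average') - 0.5) / len(pred)
    gauss = stats.norm.ppf(ranked)
    # Accentuate the tails with a signed power of 1.5 on both sides.
    pred_p15 = np.sign(gauss) * np.abs(gauss) ** 1.5
    centered = target - target.mean()
    target_p15 = np.sign(centered) * np.abs(centered) ** 1.5
    return np.corrcoef(pred_p15, target_p15)[0, 1]

# Cases 3 and 4 from above: same ordering, very different spacing.
target = [1] + [50] * 100 + [100]
a = approx_numerai_corr(list(range(len(target))), target)
b = approx_numerai_corr([0] + list(np.linspace(0.45, 0.55, len(target) - 2)) + [1], target)
print(a, b)  # identical: only the ranks of the predictions matter
```

So the dense cloud in case 4 is not penalized relative to case 3; the low score in both comes from the 100 tied targets, which put most of the centered target near zero. Discretizing the predictions would only change the score through the ties it introduces.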