You want the performance of your model to be, as much as possible, a stationary process. A model that goes up for 9 months in a row but then down for the last 3 months is less preferable than a model whose 3 down months are interspersed evenly throughout the year. These two models could have the same Sharpe ratio, but the one with three consecutive down months would have a higher drawdown. A sophisticated investor would much prefer to see a model with a stationary track record, because such models tend to be more robust and more likely to continue to work in the future.
When I say stationary, I mean that the performance of your model is statistically similar to flipping a biased coin. If your model does well in 80% of eras, then your performance should look like flipping a coin with an 80% bias toward heads. It should look something like HHHHTHHHTHHHTHHHHT, not something like TTTHHHHHHHHHTTTTTHHHHHHHH; i.e. it should lack autocorrelation, be memoryless, and not have any long burn periods.
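As a rough illustration of the coin-flip analogy (the numbers and names below are stand-ins, not real model data), you can check whether your per-era wins and losses look memoryless by measuring their lag-1 autocorrelation:

import numpy as np

era_scores = np.random.default_rng(0).normal(0.02, 0.03, size=120)  # stand-in era scores
wins = (era_scores > 0).astype(float)                                # the H/T sequence as 1/0
lag1 = np.corrcoef(wins[:-1], wins[1:])[0, 1]                        # ~0 for a memoryless record
print(f"win rate {wins.mean():.2f}, lag-1 autocorrelation {lag1:.2f}")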
The challenge with stock market data is that almost all of the stock features are not stationary, but the goal is for the model built with those features to be stationary. Quant features like value or momentum can work well for years and then stop working, or work in the opposite direction for the next few years. Models trained on these non-stationary features will tend not to have stationary performance either, and this is why so many quant models don't generalize well out of sample: they have fit to regimes, they have not found stationary signals.
In a previous post, Michael gave code for neutralizing models to feature exposures. While there's no guarantee that this creates stationarity in performance out of sample, in tests it tends to help, because feature neutralization reduces to zero any linear bets on the non-stationary factors. See MMC2 and Feature Neutralization.
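For readers who haven't seen that post, a rough sketch of the idea (not Michael's exact code; the Series/DataFrame interface here is my own assumption) is to subtract, era by era, the part of the predictions that a linear regression on the features can explain:

import numpy as np
import pandas as pd

def neutralize(predictions, features, proportion=1.0):
    # predictions: pd.Series for one era; features: pd.DataFrame of that era's feature exposures.
    # Remove the linear component of the predictions explained by the features, so the model
    # carries no linear bet on those (potentially non-stationary) factors.
    exposures = features.values
    beta, *_ = np.linalg.lstsq(exposures, predictions.values, rcond=None)
    neutralized = predictions.values - proportion * exposures @ beta
    return pd.Series(neutralized / np.std(neutralized), index=predictions.index)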
I wanted to open up discussion on this topic as it’s unusual in most machine learning contexts to care about stationarity or the ordering of your performance. I think many Numerai users cared about getting the highest possible mean correlation score and then began to care about getting the best possible Sharpe. I think the next frontier will be reaching stationarity.
Does anyone explicitly try to learn a model to optimize for stationarity? How?
Does anyone look at ADF tests on their performance or on the features' performance in their model construction? Or remove features with too much autocorrelation in their correlation with the target from era to era?
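To make that concrete, here is one way it could be done (a sketch only; the era/target/feature column names are assumptions about your data layout): compute each feature's per-era correlation with the target, then run an ADF test and a lag-1 autocorrelation check on that series, and drop features that look too regime-like:

import numpy as np
from statsmodels.tsa.stattools import adfuller

def era_corrs(df, feature, target="target"):
    # per-era correlation of one feature with the target
    return df.groupby("era").apply(lambda d: np.corrcoef(d[feature], d[target])[0, 1])

def feature_diagnostics(df, feature):
    corrs = era_corrs(df, feature).values
    adf_pvalue = adfuller(corrs)[1]                  # small p-value: series looks stationary
    lag1 = np.corrcoef(corrs[:-1], corrs[1:])[0, 1]  # era-to-era autocorrelation
    return adf_pvalue, lag1

# e.g. keep only features whose era-to-era correlation with the target is weakly autocorrelated:
# keep = [f for f in feature_cols if abs(feature_diagnostics(df, f)[1]) < 0.2]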
How can you train a model on the Numerai training data to ensure stationarity at least over the training set i.e. enforce that you don’t have especially long periods of strong performance or underperformance over the training eras? Bonus: does a model with stationarity over the training set work out of sample better than one without? Extra bonus: if you optimize for stationarity in the training of your model is that better than optimizing for Sharpe?
Great post Richard, much appreciated! These issues have been on my mind recently as I’ve been playing around with fitting models to feature neutral targets. I’ve been testing out the Sortino ratio as an alternative to Sharpe for doing hyperparameter selection, because it makes sense to me to only penalize downside volatility/variance. Interestingly I’m finding that Sortino does favor different and narrower ranges of hyperparameters than Sharpe.
After reading your post and doing some internet searching I came across this document, which proposes a modification to Sharpe, which they call Smart Sharpe, that takes autocorrelation into account. If anyone is interested, I threw together a simple implementation to help clarify it to myself. I also created a "Smart" version of Sortino by including the same autocorrelation penalty term, to perhaps get the best of both worlds.
import numpy as np

def ar1(x):
    # lag-1 autocorrelation of the era scores
    return np.corrcoef(x[:-1], x[1:])[0, 1]

def autocorr_penalty(x):
    # penalty term from the Smart Sharpe paper: inflates volatility when scores are autocorrelated
    n = len(x)
    p = ar1(x)
    return np.sqrt(1 + 2*np.sum([((n - i)/n)*p**i for i in range(1, n)]))

def smart_sharpe(x):
    return np.mean(x)/(np.std(x, ddof=1)*autocorr_penalty(x))

def smart_sortino_ratio(x, target=0.02):
    xt = x - target
    return np.mean(xt)/(((np.sum(np.minimum(0, xt)**2)/(len(xt)-1))**.5)*autocorr_penalty(x))
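For example, on per-era validation correlations (the dataframe and column names below are just placeholders for however you store your validation results):

era_scores = validation_df.groupby("era").apply(
    lambda d: np.corrcoef(d["prediction"], d["target"])[0, 1])
print(smart_sharpe(era_scores.values), smart_sortino_ratio(era_scores.values))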
Amazing response! This is why we have a forum! I hadn't heard of Smart Sharpe but that paper makes a lot of sense. Maybe we should use your code and switch to showing Smart Sharpe over validation when uploading predictions. @master_key
I played around for a few rounds with the suggested smart versions of Sharpe and Sortino.
I didn't realize at first from the math that the suggested autocorr_penalty(x) favours negative autocorrelation. Negative autocorrelation means the era correlation jumps up and down around its mean from one era to the next. Correct me if I'm wrong, but the desired property for stationarity is to have AR1 close to 0, not to -1.
I'm trying a loss function that uses the inverse of the original autocorr_penalty() in the case of negative autocorrelation:
# In R style
autocorr_penalty2 <- function(x) {
  ap <- autocorr_penalty(x)
  if (ap < 1) {
    return(1 / ap)  # ap == 0 when AR1(x) == -1
  } else {
    return(ap)
  }
}
Yeah, I had wondered about that too, and after thinking more I think you're right. I'm guessing the paper didn't address this because negative AR1 coefficients just don't happen in the long time-series data they are analyzing. To prevent wonkiness when using it as a penalty, I agree with @of_s that you should just modify the function to:
def autocorr_penalty(x):
    n = len(x)
    p = np.abs(ar1(x))  # penalize autocorrelation in either direction
    return np.sqrt(1 + 2*np.sum([((n - i)/n)*p**i for i in range(1, n)]))
It is a mystery to me how a model could know about the order of eras. How?
Let me answer that question. By introducing an era variable.
Then, if you want to use out-of-sample eras to train parameters via CV, the only correct CV to use is time-series CV, otherwise a data leak is introduced (a minimal era-wise split is sketched after this list). There are numerous problems with time-series CV, e.g.:
inefficiency
the chosen parameters are not optimal for the size of the final dataset
only about 0.02 percent (percent!) of people actually do it or even know how to do it.
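For what it's worth, a minimal era-wise time-series CV could look something like this (a sketch only, assuming era labels sort in chronological order):

import numpy as np

def era_time_series_splits(eras, n_splits=4, embargo=4):
    # Expanding-window CV over eras: train on the earliest eras, validate on the next block,
    # with an embargo of a few eras in between to limit leakage from overlapping targets.
    # `eras` is the per-row era column; yields boolean row masks for train and validation.
    unique_eras = np.array(sorted(set(eras)))  # assumes labels sort chronologically
    edges = np.linspace(0, len(unique_eras), n_splits + 2, dtype=int)[1:]
    for i in range(n_splits):
        train_eras = unique_eras[:edges[i]]
        valid_eras = unique_eras[edges[i] + embargo:edges[i + 1]]
        yield np.isin(eras, train_eras), np.isin(eras, valid_eras)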
Let me give a little more detail on some types of era variables and their significance (a toy construction of each is sketched after the list):
A categorical variable: it's just a grouping variable. In theory it cannot tell you anything about the order of eras. Some kind of grouping variable is used for ranking. Loss functions that utilize autocorrelation cannot make use of it for optimization, and there is no problem with data leakage, so any kind of CV scheme can be used.
A real or integer ordinal variable: this type of variable introduces a data leak. Only time-series CV can be used or you will overfit. Loss functions that utilize autocorrelation will definitely show improved CV scores, at your peril.
A real non-ordinal context variable: context variables can be engineered in any way, so I am talking about context variables that are specifically not ordinal with respect to time by design. But one has to be cautious with them. If they are well designed, any type of CV can be used to get improved model parameters. Loss functions that utilize time-based autocorrelation probably will not see improvement from them. But if they do, then your model is approaching chaos, since eras that are nearby in time may have similar context variables.
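To make the three cases concrete, a toy construction of each could look like this (the file name, feature name, and new column names are all hypothetical; only the intent of each variable matters):

import pandas as pd

df = pd.read_parquet("train.parquet")  # assumed to contain an "era" column and features

# 1. Categorical era variable: a pure grouping key, e.g. for per-era ranking or grouped CV.
df["era_group"] = df["era"].astype("category")

# 2. Ordinal era variable: encodes the time order of eras. Only time-series CV is safe here.
df["era_index"] = pd.factorize(df["era"])[0]  # assumes rows are already sorted by era

# 3. Non-ordinal context variable: an era-level summary that is not ordered in time by design,
#    e.g. the cross-sectional dispersion of some feature within the era.
df["era_dispersion"] = df.groupby("era")["some_feature"].transform("std")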
Great thread! Does the current Sharpe calculation in the diagnostics panel (Validation Sharpe) already include this autocorrelation penalty? Also, are there any plans to add (smart) Sortino to the diagnostics panel?
I can tell you that the Validation Sharpe currently displayed on the website does not include autocorrelation penalty. And that’s because I calculate the metrics locally and the numbers displayed on the website match my local results. I’m not in a position to answer your second question, I hope someone from the team will.
No plans for Sortino right now. It did seem to be quite good but also very similar to Sharpe. I remember at one point we had some success with models trained on a custom loss function based on smart Sortino.
Help me understand why you use p**i. The paper uses the subscript i to indicate that it's the autocorrelation coefficient p at lag i, which would be the correlation between x[:-i] and x[i:], not a power of the lag-1 coefficient.
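For reference, using p**i amounts to assuming the era scores follow an AR(1) process, in which the lag-i autocorrelation equals the lag-1 coefficient raised to the power i; computing the lag-i coefficient directly would look more like this:

def lag_autocorr(x, i):
    # direct lag-i autocorrelation (the rho_i in the paper), rather than ar1(x)**i
    return np.corrcoef(x[:-i], x[i:])[0, 1]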