The discussion about old models no longer working led me to think about a quick analysis of distribution shift. There are libraries out there for this kind of thing, so there may well be better ways of doing this. As you will see, this is not AI-assisted, other than in the autocomplete sense. What I have done is this:
- take the classic feature data, medium
- ignore the targets
- create a semi-supervised type of label, `era//52`, as a proxy for year
- use a holdout and train a gradient boosted model to predict year
- the prediction is a vector of probabilities for each row in the dataset
- group by the known year
- calculate the average probability vector for each group
- cluster the resulting per-year average vectors
- visualise
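
For anyone who wants to try the steps above, here is a rough sketch on synthetic data. It is not my exact code: the drifting random features, scikit-learn's `GradientBoostingClassifier`, the 50/50 holdout split, and average-linkage hierarchical clustering from SciPy are all stand-ins chosen for illustration, and names like `mean_proba` are made up for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature data (assumption, not the real dataset):
# 260 eras, 40 rows per era, 10 features whose mean drifts slowly over time.
n_eras, rows_per_era, n_features = 260, 40, 10
era = np.repeat(np.arange(n_eras), rows_per_era)
drift = (era / n_eras)[:, None] * np.linspace(0.0, 2.0, n_features)
X = rng.normal(size=(era.size, n_features)) + drift

# Proxy label for year: integer-divide the era number by 52.
year = era // 52

# Hold out half of the rows, train on the rest, predict on the holdout.
X_tr, X_ho, y_tr, y_ho = train_test_split(
    X, year, test_size=0.5, random_state=0, stratify=year
)
model = GradientBoostingClassifier(n_estimators=50, random_state=0)
model.fit(X_tr, y_tr)
proba = model.predict_proba(X_ho)  # one probability vector per holdout row

# Group by the known year and average the probability vectors in each group.
years = model.classes_
mean_proba = np.vstack([proba[y_ho == y].mean(axis=0) for y in years])

# Hierarchical clustering of the per-year average vectors (one possible choice).
Z = linkage(mean_proba, method="average")
clusters = fcluster(Z, t=2, criterion="maxclust")

print(np.round(mean_proba, 2))
print(dict(zip(years.tolist(), clusters.tolist())))
```

Row `i` of `mean_proba` shows how the model spreads its guesses for rows that really belong to year `i`, so mass away from the diagonal hints at years that look alike. For the visual step, `scipy.cluster.hierarchy.dendrogram(Z)` draws the tree if matplotlib is installed.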
Here is the result:
