Meta-analysis: clustering model performances

I’ve been doing some analysis of live models that I thought I would share. It could help you understand how diversified your models are relative to those of other competitors. It’s a work in progress, so please feel free to share ideas on how this can be improved.

Overview of analysis

I embed live models into a 2D space using the UMAP algorithm. The end-of-round correlations for a set of resolved rounds are the variables used for the embedding. The 2D space can, therefore, be considered an abstract, long-term correlation space between models. In this space, models closer in proximity share more similar end-of-round correlations (note, this does not necessarily mean their predictions are similar, just that the overall correlation by round is similar).
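
A minimal sketch of this step, assuming the per-round scores are already available as a nested dict (the `scores` layout and both function names are hypothetical, and the UMAP hyperparameters are just the library defaults, not necessarily the ones used for the figures). It needs the `umap-learn` package for the embedding itself:

```python
import numpy as np


def build_round_matrix(scores, rounds):
    """Stack round correlations into a (n_models, n_rounds) array.

    `scores` is a hypothetical mapping {model_name: {round_number: corr}}.
    Rounds a model did not submit to are left as NaN.
    """
    names = sorted(scores)
    matrix = np.full((len(names), len(rounds)), np.nan)
    for i, name in enumerate(names):
        for j, rnd in enumerate(rounds):
            matrix[i, j] = scores[name].get(rnd, np.nan)
    return names, matrix


def embed_models(matrix, n_neighbors=15, min_dist=0.1, random_state=0):
    """Embed models into 2D; each row is one model, each column one round."""
    if np.isnan(matrix).any():
        raise ValueError("UMAP cannot handle NaNs; drop or impute missing rounds first")
    import umap  # provided by the umap-learn package

    reducer = umap.UMAP(
        n_components=2,
        n_neighbors=n_neighbors,
        min_dist=min_dist,
        random_state=random_state,  # fixed seed so repeated runs are comparable
    )
    return reducer.fit_transform(matrix)
```

Each row of the returned array is the 2D position of the model at the same index in `names`.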

I did the analysis twice: once for rounds 215-237 and once for rounds 221-245. The two windows overlap in time, but comparing them gives some insight into how model performances are evolving. In future this could (easily) be recalculated on a rolling window, one round at a time, to give an idea of temporal changes in this space.
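
As a rough sketch of the rolling idea, the round windows could be generated like this. A fixed window width is an assumption here (the two windows above are 23 and 25 rounds wide), and `rolling_windows` is a hypothetical helper name:

```python
def rolling_windows(rounds, width, step=1):
    """Return consecutive windows of `width` rounds, advancing `step` rounds each time."""
    rounds = sorted(rounds)
    return [rounds[i:i + width] for i in range(0, len(rounds) - width + 1, step)]


# Example: 23-round windows over rounds 215..245, one embedding per window.
windows = rolling_windows(range(215, 246), width=23)
```

Each window would then be embedded separately with the same hyperparameters, so that differences between windows reflect the data rather than the settings.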

Some key caveats:

  • Currently, I only take models which submit every round. UMAP can’t deal with NaNs, so this avoids imputation. For models missing only one round, it seems reasonable to assume that mean imputation would not have a significant impact on the structure.
  • The exact structure of the embeddings should be taken with a pinch of salt, as it depends heavily on hyperparameter selection. The hyperparameters can, however, be fixed in order to analyse temporal changes.
  • I believe UMAP does a ‘better’ job at preserving global structure than t-SNE, but this needs to be investigated.
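
The first caveat can be sketched as two small filters over a models-by-rounds array with NaN for missing rounds. Function names are hypothetical, and "mean imputation" is taken here to mean filling a gap with that model's own mean over the rounds it did submit, which is an assumption:

```python
import numpy as np


def keep_complete_models(names, matrix):
    """Keep only models with a score in every round (no NaNs in their row)."""
    mask = ~np.isnan(matrix).any(axis=1)
    return [n for n, keep in zip(names, mask) if keep], matrix[mask]


def impute_sparse_gaps(names, matrix, max_missing=1):
    """Keep models missing at most `max_missing` rounds and fill the gaps
    with each model's own mean over its observed rounds."""
    missing = np.isnan(matrix).sum(axis=1)
    mask = missing <= max_missing
    kept = matrix[mask].copy()
    row_means = np.nanmean(kept, axis=1)
    rows, cols = np.where(np.isnan(kept))
    kept[rows, cols] = row_means[rows]
    return [n for n, keep in zip(names, mask) if keep], kept
```

Running the embedding on both versions and comparing the results would be one way to check whether imputation really leaves the structure intact.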

Model embeddings

The following figures are the UMAP embeddings in the 2D space. The left panels are coloured by mean end-of-round CORR and the right panels by mean end-of-round MMC.
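
A two-panel plot of this kind could be produced roughly as follows. This is a sketch only: the function names are hypothetical, `corr` and `mmc` are assumed to be models-by-rounds arrays in the same row order as the embedding, and matplotlib is required for the plotting part:

```python
import numpy as np


def mean_per_model(round_scores):
    """Average each model's scores over rounds, ignoring missing rounds."""
    return np.nanmean(np.asarray(round_scores, dtype=float), axis=1)


def plot_embedding(embedding, corr, mmc, title=""):
    """Left panel coloured by mean CORR, right panel by mean MMC."""
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(1, 2, figsize=(12, 5), sharex=True, sharey=True)
    panels = [("mean CORR", mean_per_model(corr)), ("mean MMC", mean_per_model(mmc))]
    for ax, (label, values) in zip(axes, panels):
        # Darker blue means a higher value.
        points = ax.scatter(embedding[:, 0], embedding[:, 1], c=values, cmap="Blues", s=10)
        fig.colorbar(points, ax=ax, label=label)
        ax.set_title(f"{title} {label}".strip())
    return fig
```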

Global structure

So far I have high level information on the following models:

model_name | Type | Feature neutralization
krat | NN | ?
trivial | NN | ?
floury_kerril_moodle | LinearRegression | No
integration_test_7 | GBT | No
budbot_7 | GBT | Yes

I’ve labelled (where possible) where these models fall in the embeddings. Noting the differences between these models can help you understand the structure of the embedded space.

For example, I don’t think it is a coincidence that budbot_7 (a 100% feature neutralised model) is diametrically opposed to the linear model (floury_kerril_moodle) in the round 215–>237 figure. Additionally, it seems likely that models near integration_test are gradient boosted models without substantial feature neutralisation.

I’ve found a group of models (robprofit, wwmodel2, wwmodel3, wwmodel4, wwmodel5,…) which are somewhat anomalous: they are doing well on MMC and CORR, but with round correlations that are relatively dissimilar to those of other models.
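
One way to look for models like these without relying on the picture is to flag models that score well but whose round-by-round correlations track few other models. The sketch below is an assumption about how that could be done, not how the group above was found; the function names and quantile thresholds are hypothetical:

```python
import numpy as np


def mean_similarity(matrix):
    """Average Pearson correlation between each model's round scores and
    every other model's round scores (rows are models, columns are rounds)."""
    sim = np.corrcoef(matrix)
    np.fill_diagonal(sim, np.nan)  # ignore each model's correlation with itself
    return np.nanmean(sim, axis=1)


def flag_anomalous(names, matrix, score_quantile=0.75, similarity_quantile=0.25):
    """Return models in the top quartile by mean score and the bottom
    quartile by mean similarity to other models."""
    scores = matrix.mean(axis=1)
    similarity = mean_similarity(matrix)
    strong = scores >= np.quantile(scores, score_quantile)
    unusual = similarity <= np.quantile(similarity, similarity_quantile)
    return [n for n, s, u in zip(names, strong, unusual) if s and u]
```

The same function could be run once on CORR and once on MMC to see whether the two flagged sets agree.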

Temporal changes

There are several changes between the two figures (round 215–>237 and 221–>245):

  • Note how the mean correlation is much higher in later rounds (left panel of 221–>245 is consistently a darker shade of blue)
  • In the earlier rounds (215–>237), MMC tended to be concentrated in fewer models (particularly in feature neutralized models around budbot_7). In the later rounds (221–>245), MMC tends to be more ‘spread out’ (light shades of blue).
  • I assume there is a greater number of feature neutralized GBTs being used now (blob around budbot_7 is bigger in 221–>245 compared to 215–>237).
  • Models similar to integration_test_7 have a low MMC contribution - but there are some exceptions.

What can this be used for?

Ideally, this visualisation would be interactive so that it can be explored (previously discussed in chat). I’m keen to work on this if there is sufficient merit in it. Such a visualisation can help with meta-analysis of the competition and give you an understanding of the diversity of your models relative to other competitors. Interestingly, it is also plausible that you could predict what type of model a competitor is using by projecting it into this space. More data is required, but this could get quite involved. For example, I wouldn’t be surprised if a certain dimension corresponds to the degree of feature neutralisation, or to how linear the model is in the features.
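
To make the projection idea concrete, here is a speculative sketch. It assumes `reducer` is an already fitted `umap.UMAP` object (which exposes a `transform` method for new rows) and that a handful of embedded models have known types. The helper names and the nearest-neighbour vote are illustrative assumptions, and with only a few labelled models any guess would be very rough:

```python
import numpy as np


def project_new_model(reducer, round_correlations):
    """Place one new model in an existing embedding.

    `reducer` is a fitted umap.UMAP instance; `round_correlations` must cover
    the same rounds, in the same order, as the data it was fitted on.
    """
    row = np.asarray(round_correlations, dtype=float).reshape(1, -1)
    return reducer.transform(row)[0]


def guess_model_type(point, labelled_points, labels, k=3):
    """Majority vote among the k nearest labelled models in the 2D space."""
    distances = np.linalg.norm(np.asarray(labelled_points) - np.asarray(point), axis=1)
    votes = {}
    for idx in np.argsort(distances)[:k]:
        votes[labels[idx]] = votes.get(labels[idx], 0) + 1
    return max(votes, key=votes.get)
```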


Have you thought about splitting up models that clearly changed during the time interval you looked at? I’m not sure we have a simple way of doing that, other than the rather crude “correlation with metamodel”.

It’s somewhat of a shame that the diagnostics stats are not publicly available. They would help enormously in teasing apart when models changed. @richardcraib, maybe the diagnostics could be opened up?

I was concerned about this, but decided the only mitigation available for now was to use as ‘short’ a time window as possible and hope competitors didn’t change their models.

I’ve run these with a round-by-round rolling window, and you could identify models which have changed by looking for big movements through the space over time. This needs some thought, though, as the rotational invariance of UMAP is causing me problems. I have some potential solutions for this but have not had a chance to implement them.

Will share some code snippets to reproduce these soon.
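
On the rotation problem mentioned above, one candidate fix (an assumption on my part, not necessarily one of the solutions being considered) is to align each window's embedding to the previous one with an orthogonal Procrustes fit before measuring how far each model moved. The sketch uses plain NumPy, and both function names are hypothetical:

```python
import numpy as np


def align_embedding(reference, target):
    """Rotate/reflect and shift `target` so it best matches `reference`.

    Both arrays are (n_models, 2) with rows in the same model order.
    """
    ref_mean = reference.mean(axis=0)
    tgt_mean = target.mean(axis=0)
    # Orthogonal Procrustes: best orthogonal map from centred target to centred reference.
    u, _, vt = np.linalg.svd((target - tgt_mean).T @ (reference - ref_mean))
    rotation = u @ vt
    return (target - tgt_mean) @ rotation + ref_mean


def displacement(reference, target):
    """Distance each model moved between two windows after alignment."""
    aligned = align_embedding(reference, target)
    return np.linalg.norm(aligned - reference, axis=1)
```

Models with an unusually large displacement between consecutive windows would be candidates for having changed. Because the fit only removes rotation, reflection and translation, differences in overall scale between two UMAP runs would still show up as movement.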