You may be familiar with the idea of “removing risky features” from around cell 102 in the analysis_and_tips notebook or from the example_script.
These analyses aim to find features which behave very differently from one era to the next, or which are behaving in recent times very differently from how they behaved in the past.
The risk with features like these is that your model might learn to rely heavily on one of them, only for that feature to behave completely differently in live data, leaving your model to flounder.
There are 5 features which are particularly bad offenders:
['feature_unsustaining_chewier_adnoun',
'feature_coastal_edible_whang',
'feature_trisomic_hagiographic_fragrance',
'feature_censorial_leachier_rickshaw',
'feature_steric_coxcombic_relinquishment']
If you look at these features’ correlations with the target over time, you will see that they are very consistently negatively correlated for most of the data, but in more recent times have almost 0 correlation with the target.
Here is a cumulative sum of these features' correlations with the target, to make the shift easy to see.
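If you want to reproduce this kind of view yourself, here is a minimal sketch. It assumes a pandas DataFrame with an "era" column and a "target" column (adjust the names to your data), and it uses plain Pearson correlation, which may differ from the exact method used for the analysis above. The function names are just for illustration.

```python
import numpy as np
import pandas as pd


def per_era_corr(df, features, era_col="era", target_col="target"):
    """Correlation of each feature with the target, computed separately per era.

    Returns a DataFrame with one row per era and one column per feature.
    Column names "era" and "target" are assumptions about your data layout.
    """
    rows = {
        era: era_df[features].corrwith(era_df[target_col])
        for era, era_df in df.groupby(era_col)  # groups come back sorted by era
    }
    # The dict of Series builds a features-by-eras frame, so transpose it.
    return pd.DataFrame(rows).T


def cumulative_era_corr(df, features, era_col="era", target_col="target"):
    """Running sum of the per-era correlations, in era order."""
    return per_era_corr(df, features, era_col, target_col).cumsum()
```

A feature that is consistently negatively correlated shows up as a steadily falling line when you plot the result of cumulative_era_corr, and the line goes flat once the correlation drops to around 0.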
Typically, we try to stick to a philosophy of giving the users all of the features, even if we don’t think it’s necessarily good to use them, with the hope that the users will be better at deciding which features to use, and how to use them, than we are.
However, in this case, the change in these features' behavior is so problematic that no model should be using them in any way.
In one simple test, we find that removing these features entirely boosts correlation over the validation eras from about 0.023 to 0.025, and we expect a similar boost to continue into the future.
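The details of that test are not given here, but if you want to run a similar comparison on your own models, one way to score a set of predictions is the mean of the per-era correlations with the target. The sketch below is an assumption about how you might do this, not the exact scoring used for the numbers above: it takes a rank correlation within each era and averages across eras, and the column names are placeholders.

```python
import numpy as np
import pandas as pd


def mean_era_corr(df, pred_col, era_col="era", target_col="target"):
    """Mean over eras of the rank correlation between predictions and target.

    Pearson correlation of the ranks is used as the rank correlation.
    Column names are assumptions; change them to match your data.
    """
    scores = [
        era_df[pred_col].rank().corr(era_df[target_col].rank())
        for _, era_df in df.groupby(era_col)
    ]
    return float(np.mean(scores))
```

For example, with hypothetical columns "pred_all" (a model trained on every feature) and "pred_reduced" (the same model trained without the risky features), you would compare mean_era_corr(validation, "pred_all") against mean_era_corr(validation, "pred_reduced").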
These 5 are the worst offenders, but we also suggest removing 5 more features for a similar reason. Below are the complete lists of features to remove from your models in the v4 and v3 datasets.
v4:
['feature_palpebral_univalve_pennoncel',
'feature_unsustaining_chewier_adnoun',
'feature_brainish_nonabsorbent_assurance',
'feature_coastal_edible_whang',
'feature_disprovable_topmost_burrower',
'feature_trisomic_hagiographic_fragrance',
'feature_queenliest_childing_ritual',
'feature_censorial_leachier_rickshaw',
'feature_daylong_ecumenic_lucina',
'feature_steric_coxcombic_relinquishment']
v3:
['feature_base_ingrain_calligrapher',
'feature_unvaried_social_bangkok',
'feature_deliberative_connatural_kinetoscope',
'feature_haziest_lifelike_horseback',
'feature_accusatory_disinfectant_deportment',
'feature_exorbitant_myeloid_crinkle',
'feature_jerkwater_eustatic_electrocardiograph',
'feature_undivorced_unsatisfying_praetorium',
'feature_direst_interrupted_paloma',
'feature_lofty_acceptable_challenge']
These features are not present in v2 data.
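As a rough sketch of how you might apply the v4 list in your own pipeline: filter the risky names out of your feature columns before training. The helper name and the way you obtain your feature columns are assumptions for illustration; the same approach works for v3 by swapping in that list.

```python
# The v4 features suggested for removal, copied from the list above.
RISKY_FEATURES_V4 = [
    "feature_palpebral_univalve_pennoncel",
    "feature_unsustaining_chewier_adnoun",
    "feature_brainish_nonabsorbent_assurance",
    "feature_coastal_edible_whang",
    "feature_disprovable_topmost_burrower",
    "feature_trisomic_hagiographic_fragrance",
    "feature_queenliest_childing_ritual",
    "feature_censorial_leachier_rickshaw",
    "feature_daylong_ecumenic_lucina",
    "feature_steric_coxcombic_relinquishment",
]


def drop_risky_features(feature_cols, risky=RISKY_FEATURES_V4):
    """Return feature_cols without the risky features, keeping the original order.

    drop_risky_features is a hypothetical helper, not part of any library.
    """
    risky_set = set(risky)
    return [col for col in feature_cols if col not in risky_set]
```

With a pandas DataFrame you could then select columns with something like df[drop_risky_features(list(df.columns))], or pass the filtered list as the columns to load when reading the data, so the risky features never reach your model.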
We will not be changing the construction of the v3 and v4 datasets, because we want to avoid disrupting user pipelines.
Future data releases will, of course, not include any of these features.