Feature Selection with BorutaShap

I first read an article on the BorutaShap feature selection algorithm a while ago, but had never gotten it to work properly with the Numerai data. I thought that with the release of the new data it might be time to try again. After some initial failures, I dug into the code to see if I could improve things. I found and addressed a couple of issues, and now I think you may find it useful:

  1. When set to calculate SHAP feature importance values on the test set, it doesn't respect era boundaries when splitting the data.
  2. It doesn't actually calculate SHAP values using just the test set, even when set to do so; rather, it uses the whole dataset.

My fixed-up version of the code is here. I'll try to get a PR into the main branch, but this should work for now. (Caveat: the sampling option will not respect eras, so I recommend leaving sample=False.)
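To illustrate issue 1: a plain row-wise split can put rows from the same era on both sides of the train/test boundary. A minimal sketch of an era-respecting split using scikit-learn's GroupShuffleSplit (toy data; the column names are illustrative, not the actual Numerai schema):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# toy frame: 2 features, an era column, and a target
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature_a": rng.normal(size=100),
    "feature_b": rng.normal(size=100),
    "era": np.repeat(np.arange(1, 11), 10),  # 10 eras of 10 rows each
    "target": rng.normal(size=100),
})

# split on era labels so no era is shared between train and test
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, test_idx = next(splitter.split(df, groups=df["era"]))

train_eras = set(df.iloc[train_idx]["era"])
test_eras = set(df.iloc[test_idx]["era"])
assert train_eras.isdisjoint(test_eras)  # no era straddles the boundary
```

This is the behavior the `groups` argument in the fixed version below is meant to enforce.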

import numpy as np
import pandas as pd
from numerapi import NumerAPI
import sklearn
import lightgbm
from BorutaShap import BorutaShap

napi = NumerAPI()

current_round = napi.get_current_round(tournament=8)

# load int8 version of the data
napi.download_dataset("numerai_training_data_int8.parquet", "numerai_training_data_int8.parquet")
df = pd.read_parquet('numerai_training_data_int8.parquet')

# create era integer column for convenience
df["erano"] = df.era.astype(int)
eras = df.erano

# create model to be used by BorutaShap feature selector
# changes to the model choice affect the features that are chosen, so there's lots of room to experiment here
model = lightgbm.LGBMRegressor(n_jobs=-1, colsample_bytree=0.1, learning_rate=0.01, n_estimators=2000, max_depth=5)

# initialize the feature selector
Feature_Selector = BorutaShap(model=model,
                              importance_measure='shap',
                              classification=False)

# here I iterate over the 4 non-overlapping sets of eras and perform feature selection in each, then take the union of the selected features
# I'm just using standard 'target' for now, but it would be interesting to investigate other targets as well
# It may also be useful to look at the borderline features that aren't accepted or eliminated
good_features = []
for i in range(1,5):
    df_tmp = df[eras.isin(np.arange(i, 575, 4))]
    eras_tmp = eras[eras.isin(np.arange(i, 575, 4))]
    Feature_Selector.fit(X=df_tmp.filter(like='feature'), y=df_tmp['target'],
                         groups=eras_tmp, n_trials=50, sample=False,
                         train_or_test='test', normalize=True, verbose=True)
    good_features += Feature_Selector.accepted
good_features = list(set(good_features))

The features I got out of running the above are:

good_features = [
                'feature_unwonted_trusted_fixative',
                'feature_introvert_symphysial_assegai',
                'feature_jerkwater_eustatic_electrocardiograph',
                'feature_canalicular_peeling_lilienthal',
                'feature_unvaried_social_bangkok',
                'feature_crowning_frustrate_kampala',
                'feature_store_apteral_isocheim',
                'feature_haziest_lifelike_horseback',
                'feature_grandmotherly_circumnavigable_homonymity',
                'feature_assenting_darn_arthropod',
                'feature_beery_somatologic_elimination',
                'feature_cambial_bigoted_bacterioid',
                'feature_unaired_operose_lactoprotein',
                'feature_moralistic_heartier_typhoid',
                'feature_twisty_adequate_minutia',
                'feature_unsealed_suffixal_babar',
                'feature_planned_superimposed_bend',
                'feature_winsome_irreproachable_milkfish',
                'feature_flintier_enslaved_borsch',
                'feature_agile_unrespited_gaucho',
                'feature_glare_factional_assessment',
                'feature_slack_calefacient_tableau',
                'feature_undivorced_unsatisfying_praetorium',
                'feature_silver_handworked_scauper',
                'feature_communicatory_unrecommended_velure',
                'feature_stylistic_honduran_comprador',
                'feature_travelled_semipermeable_perruquier',
                'feature_bhutan_imagism_dolerite',
                'feature_lofty_acceptable_challenge',
                'feature_antichristian_slangiest_idyllist',
                'feature_apomictical_motorized_vaporisation',
                'feature_buxom_curtained_sienna',
                'feature_gullable_sanguine_incongruity',
                'feature_unforbidden_highbrow_kafir',
                'feature_chuffier_analectic_conchiolin',
                'feature_branched_dilatory_sunbelt',
                'feature_univalve_abdicant_distrail',
                'feature_exorbitant_myeloid_crinkle'
                ]

Using just those 38 features plus 80% neutralization, I was able to train a pretty nice model. I'm sure it could be improved further by performing feature selection and training with the alternative targets and then ensembling. Let me know how it goes!
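For anyone unfamiliar with the neutralization step, here is a minimal sketch of what 80% feature neutralization can look like: subtract 80% of the least-squares projection of the predictions onto the feature matrix. This is my own illustrative implementation, not necessarily the exact code used above, and in practice it is usually applied per era rather than over the whole set:

```python
import numpy as np

def neutralize(predictions: np.ndarray, features: np.ndarray, proportion: float = 0.8) -> np.ndarray:
    """Remove `proportion` of the component of predictions explained by features."""
    # projection of predictions onto the column space of the feature matrix
    exposure = features @ (np.linalg.pinv(features) @ predictions)
    return predictions - proportion * exposure

# toy example: predictions built mostly from the features themselves
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 10))
preds = X @ rng.normal(size=10) + 0.1 * rng.normal(size=500)
preds = preds - preds.mean()  # center before neutralizing

neutral = neutralize(preds, X, proportion=0.8)
```

The `proportion` argument corresponds to the 80% mentioned above; after neutralization the remaining linear exposure to the features is scaled down by a factor of (1 - proportion).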


Wow, very nice and detailed explanation including sample code! Afaik that's worthy of a bonus. I'm curious whether this BorutaShap approach is model-agnostic, or whether you need to re-run it for other types of models. The other thing I'm curious about is whether this feature selection had a (slight) negative performance impact compared to your model using all features.

Also one of the benefits of feature selection I guess, saves a lot of memory (and thus solves memory issues lol)

Please fix the title: s/Bortua/Boruta/


fixed, that’s embarrassing :man_facepalming:


On what machine have you tried it?
I tried it on Google Colab with 25 GB of RAM and it crashed due to lack of memory.


Except that you need to be able to load the entire dataset into memory first and be able to run LightGBM at least a few times before you can actually get the “good features”.

Does anyone else get LightGBM warnings about num_leaves with the code provided?

[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
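That warning fires because the model above sets max_depth=5 without setting num_leaves, and 2**5 = 32 exceeds the default of 31. One way to silence it (a sketch; the exact value to use is a tuning choice) is to set num_leaves explicitly so the depth cap is the binding constraint:

```python
# Hyperparameters from the original model, with num_leaves set explicitly.
# Passing these to lightgbm.LGBMRegressor(**params) should silence the warning.
max_depth = 5
params = dict(
    n_jobs=-1,
    colsample_bytree=0.1,
    learning_rate=0.01,
    n_estimators=2000,
    max_depth=max_depth,
    # LightGBM warns when 2**max_depth > num_leaves (default 31);
    # setting num_leaves = 2**max_depth removes that mismatch
    num_leaves=2 ** max_depth,
)
```

The warning is about a configuration mismatch rather than an error, so results with the default are still valid, just potentially suboptimal.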

When I compare feature importance to its standard deviation, the std is often higher than the importance (which is the mean). Do you experience the same? Does it make sense to view this as some sort of Sharpe ratio of the feature?
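The mean-over-std ratio suggested above can be computed directly from per-trial importances. A minimal sketch with simulated data (the array of per-trial SHAP importances is a stand-in, not BorutaShap's actual internals):

```python
import numpy as np

# toy per-trial SHAP importances: rows = trials, cols = features
rng = np.random.default_rng(2)
trial_importances = np.abs(rng.normal(loc=1.0, scale=0.8, size=(50, 5)))

mean_imp = trial_importances.mean(axis=0)
std_imp = trial_importances.std(axis=0, ddof=1)

# Sharpe-like ratio: features with high mean AND low variability rank highest
sharpe_like = mean_imp / std_imp
ranking = np.argsort(sharpe_like)[::-1]  # feature indices, best first
```

A std larger than the mean (ratio below 1) just means the importance estimate is noisy across trials; ranking by the ratio favors consistently important features over occasionally important ones.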