I first read an article on the BorutaShap feature selection algorithm a while ago but never got it to work properly with the Numerai data. I thought the release of the new data might be a good time to try again. After some initial failures, I dug into the code to see if I could improve things. I found and addressed a couple of issues, and now I think you all may find it useful:
- When set to calculate SHAP feature importances on the test set, it doesn’t respect era boundaries when splitting the data
- It doesn’t actually calculate SHAP values using just the test set, even when set to do so; it uses the whole dataset instead
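For context on the first issue: an era-respecting split keeps every row of an era on the same side of the train/test boundary. A minimal sketch of such a split using scikit-learn's `GroupShuffleSplit` (the toy DataFrame here is an illustrative stand-in, not the actual Numerai data):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# toy frame: 6 eras, 4 rows each (stand-in for the real data)
df = pd.DataFrame({
    "era": np.repeat(np.arange(1, 7), 4),
    "feature_a": np.random.rand(24),
    "target": np.random.rand(24),
})

# split on era groups so no era is divided between train and test
gss = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, test_idx = next(gss.split(df, groups=df["era"]))

train_eras = set(df.iloc[train_idx]["era"])
test_eras = set(df.iloc[test_idx]["era"])
assert train_eras.isdisjoint(test_eras)  # no era straddles the boundary
```

Because Numerai targets overlap across nearby eras, splitting on rows instead of eras leaks information between train and test, which is what the fix avoids.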
My fixed-up version of the code is here. I’ll try to get a PR into the main branch, but this should work for now. (Caveat: the sampling option will not respect eras, so I recommend leaving `sample=False`.)
```python
import numpy as np
import pandas as pd
from numerapi import NumerAPI
import sklearn
import lightgbm
from BorutaShap import BorutaShap

napi = NumerAPI()
current_round = napi.get_current_round(tournament=8)

# load int8 version of the data
napi.download_dataset("numerai_training_data_int8.parquet", "numerai_training_data_int8.parquet")
df = pd.read_parquet('numerai_training_data_int8.parquet')

# create era integer column for convenience
df["erano"] = df.era.astype(int)
eras = df.erano

# create model to be used by BorutaShap feature selector
# changes to the model choice affect the features that are chosen, so there's lots of room to experiment here
model = lightgbm.LGBMRegressor(n_jobs=-1, colsample_bytree=0.1, learning_rate=0.01,
                               n_estimators=2000, max_depth=5)

# initialize the feature selector
Feature_Selector = BorutaShap(model=model, importance_measure='shap', classification=False)

# here I iterate over the 4 non-overlapping sets of eras and perform feature selection in each,
# then take the union of the selected features
# I'm just using the standard 'target' for now, but it would be interesting to investigate other targets as well
# it may also be useful to look at the borderline features that aren't accepted or eliminated
good_features = []
for i in range(1, 5):
    df_tmp = df[eras.isin(np.arange(i, 575, 4))]
    eras_tmp = eras[eras.isin(np.arange(i, 575, 4))]
    Feature_Selector.fit(X=df_tmp.filter(like='feature'), y=df_tmp['target'], groups=eras_tmp,
                         n_trials=50, sample=False, train_or_test='test',
                         normalize=True, verbose=True)
    good_features += Feature_Selector.accepted
good_features = list(set(good_features))
```
The features I got out of running the above are:
```python
good_features = [
    'feature_unwonted_trusted_fixative',
    'feature_introvert_symphysial_assegai',
    'feature_jerkwater_eustatic_electrocardiograph',
    'feature_canalicular_peeling_lilienthal',
    'feature_unvaried_social_bangkok',
    'feature_crowning_frustrate_kampala',
    'feature_store_apteral_isocheim',
    'feature_haziest_lifelike_horseback',
    'feature_grandmotherly_circumnavigable_homonymity',
    'feature_assenting_darn_arthropod',
    'feature_beery_somatologic_elimination',
    'feature_cambial_bigoted_bacterioid',
    'feature_unaired_operose_lactoprotein',
    'feature_moralistic_heartier_typhoid',
    'feature_twisty_adequate_minutia',
    'feature_unsealed_suffixal_babar',
    'feature_planned_superimposed_bend',
    'feature_winsome_irreproachable_milkfish',
    'feature_flintier_enslaved_borsch',
    'feature_agile_unrespited_gaucho',
    'feature_glare_factional_assessment',
    'feature_slack_calefacient_tableau',
    'feature_undivorced_unsatisfying_praetorium',
    'feature_silver_handworked_scauper',
    'feature_communicatory_unrecommended_velure',
    'feature_stylistic_honduran_comprador',
    'feature_travelled_semipermeable_perruquier',
    'feature_bhutan_imagism_dolerite',
    'feature_lofty_acceptable_challenge',
    'feature_antichristian_slangiest_idyllist',
    'feature_apomictical_motorized_vaporisation',
    'feature_buxom_curtained_sienna',
    'feature_gullable_sanguine_incongruity',
    'feature_unforbidden_highbrow_kafir',
    'feature_chuffier_analectic_conchiolin',
    'feature_branched_dilatory_sunbelt',
    'feature_univalve_abdicant_distrail',
    'feature_exorbitant_myeloid_crinkle',
]
```
Making a model using just those 38 features + 80% neutralization, I was able to get a pretty nice model. I’m sure it could be improved further by performing feature selection and training with the alternative targets and then ensembling. Let me know how it goes!
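For anyone unfamiliar with the neutralization step: "80% neutralization" means subtracting 80% of the predictions' linear exposure to the features. A minimal sketch with a helper of my own (`neutralize` is not code from this post, and in practice this is usually applied per era rather than over the whole dataset at once):

```python
import numpy as np

def neutralize(predictions, features, proportion=0.8):
    """Remove `proportion` of the linear exposure of predictions to features."""
    # least-squares projection of the predictions onto the feature columns
    exposure = features @ np.linalg.pinv(features) @ predictions
    neutralized = predictions - proportion * exposure
    # rescale to unit standard deviation for comparability
    return neutralized / neutralized.std()

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # stand-in feature matrix
preds = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)  # feature-heavy predictions

neutral = neutralize(preds, X, proportion=1.0)
# with proportion=1.0 the result is (numerically) orthogonal to every feature column
print(np.abs(X.T @ neutral).max())
```

Partial proportions (like the 0.8 used above) keep some feature exposure while reducing feature risk, which is the trade-off behind the "80%" choice.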