Saving memory with uint8 features

jrb · April 23, 2020, 9:14pm

I’ve noticed quite a few complaints in the chat rooms about memory usage on compute since the test set got bigger. A couple of months ago I contributed the float16 patch to example_model.py to help alleviate this. I just realized that there’s a simple way to improve on this and nearly halve the memory usage of your inference pipeline if you’re using a tree-based model (XGBoost, CatBoost, LightGBM, etc.).

We just need to rescale the features and cast them to uint8. Relative to float16, this halves the memory used by the feature columns again (1 byte per value instead of 2). Note that this trick won’t work on the target column, because that’s still a floating-point value.
The intuition is that tree-based models like XGBoost split only on thresholds, so they depend on the ordering of the feature values rather than their scale. Rescaling the inputs (and training the model on the rescaled features) therefore won’t affect the model’s performance.
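
To make that intuition concrete, here is a minimal sketch in plain Python. It is not how XGBoost works internally; `best_split` is a hypothetical helper that exhaustively searches one threshold split on one feature, and the feature and target values are made up. It only shows that multiplying a feature by 4 and casting to integers selects the same partition of the rows.

```python
def best_split(xs, ys):
    """Return the set of row indices sent left by the best single threshold split.

    'Best' means lowest sum of squared errors around each side's mean target.
    """
    best = None
    thresholds = sorted(set(xs))
    # Skip the largest value so the right side is never empty.
    for t in thresholds[:-1]:
        left = [i for i, x in enumerate(xs) if x <= t]
        right = [i for i, x in enumerate(xs) if x > t]
        sse = 0.0
        for side in (left, right):
            mean = sum(ys[i] for i in side) / len(side)
            sse += sum((ys[i] - mean) ** 2 for i in side)
        if best is None or sse < best[0]:
            best = (sse, frozenset(left))
    return best[1]


# Made-up data: features on a 0.25 grid, arbitrary float targets.
features = [0.0, 0.25, 0.5, 0.75, 1.0, 0.25, 0.75, 0.5]
targets = [0.1, 0.2, 0.4, 0.8, 0.9, 0.3, 0.7, 0.5]

# The same rescaling as in the post: multiply by 4 and store as small integers.
rescaled = [int(x * 4) for x in features]

same_partition = best_split(features, targets) == best_split(rescaled, targets)
print(same_partition)
```

The thresholds themselves change (0.25 becomes 1, and so on), but the rows that end up on each side of the split do not, which is all a tree cares about.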

I’ve tested this on example_model.py and verified that it generates the same predictions with and without the change. To try this for yourself, simply replace the read_csv function in example_predictions.py (or your derivative thereof) with the following function:

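import csv
import numpy as np
import pandas as pd

def read_csv(file_path):
    # Assumes each feature is a multiple of 0.25 in [0, 1], so x * 4 is a small integer.
    to_uint8 = lambda x: np.uint8(float(x) * 4)
    # Read only the header row to get the column names.
    with open(file_path) as f:
        column_names = next(csv.reader(f))
    # TOURNAMENT_NAME is defined elsewhere in the example script.
    dtypes = {TOURNAMENT_NAME: np.float16}
    converters = {x: to_uint8 for x in column_names if x.startswith('feature')}
    return pd.read_csv(file_path, dtype=dtypes, converters=converters).set_index("id")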
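
If you want to sanity-check the saving without loading the real data, something like the following sketch should do. The array shape and the random values are made up for illustration, and it assumes the features sit on a 0.25 grid in [0, 1]; it only compares raw NumPy buffers, not a full pandas DataFrame.

```python
import numpy as np

# Made-up feature matrix: 1000 rows x 310 columns on a 0.25 grid in [0, 1].
rng = np.random.default_rng(0)
features_f16 = (rng.integers(0, 5, size=(1000, 310)) / 4).astype(np.float16)

# Rescale by 4 and cast to uint8, as in the post.
features_u8 = (features_f16 * 4).astype(np.uint8)

# uint8 uses 1 byte per value, float16 uses 2.
print(features_f16.nbytes, features_u8.nbytes)

# The conversion is lossless under the grid assumption: dividing by 4 restores the originals.
restored = features_u8.astype(np.float16) / 4
lossless = bool(np.array_equal(restored, features_f16))
print(lossless)
```

If your features are not on a 0.25 grid, the cast to uint8 would truncate them, so check that first.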

anakin_sky_walker · December 20, 2022, 3:33am

Can I ask whether this also works for the training process? It seems to me that 32GB of RAM is still not sufficient for the full datasets.
