# Reducing Memory

When we read the tournament train and test data into memory, it consumes a lot of RAM, which can be a problem even on a beefy machine. To reduce the memory footprint, we can use the following function (a popular one from Kaggle notebooks):

```python
import numpy as np
import pandas as pd

def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # Downcast to the smallest integer type whose range covers the column
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # Downcast to the smallest float type whose range covers the column
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)

    end_mem = df.memory_usage().sum() / 1024**2
    if verbose:
        print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
        print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    return df
```

Reducing the precision of the features may hurt model performance, but here we are dealing with sets of numerical features with low cardinality, so the loss should be negligible.
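As a quick sanity check of the range logic, here is the same downcast applied by hand to a toy frame (the column names are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame; pandas defaults to 64-bit dtypes
df = pd.DataFrame({'feature_a': [1, 2, 3], 'feature_b': [0.25, 0.5, 0.75]})

# The same range check the function performs for the int8 branch
c_min, c_max = df['feature_a'].min(), df['feature_a'].max()
if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
    df['feature_a'] = df['feature_a'].astype(np.int8)
df['feature_b'] = df['feature_b'].astype(np.float16)

print(df.dtypes)  # feature_a: int8, feature_b: float16
```

Each int64 column shrinks to one eighth of its size and each float64 column to one quarter, which is where the big savings come from on wide frames.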


Interesting! I tried to solve this problem a bit differently, as outlined here. The problem with the approach you've outlined is that `pd.read_csv` has to load the entire dataset into memory at full precision. Although you do get a reduction in memory usage after the transform in `reduce_mem_usage`, the peak memory usage remains the same. That means you're still bottlenecked on memory (and glacially slow swap, if the machine has it enabled). My way around that problem is to use converters to ensure that the data is converted to the more succinct dtype at load time. It isn't entirely free, the load time goes up a bit, but the memory usage never peaks.
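The linked code isn't reproduced in this thread; as a minimal sketch of the parse-time narrowing idea, `read_csv`'s `dtype` mapping can assign each feature/target column a small dtype as it is read (the in-memory CSV and column names below are made up):

```python
import io
import numpy as np
import pandas as pd

# Tiny stand-in for the tournament CSV (hypothetical columns)
csv_text = "id,feature_a,feature_b,target\nr1,0.25,0.5,0.75\nr2,0.5,0.75,1.0\n"

# Read just the header row to learn the column names
column_names = pd.read_csv(io.StringIO(csv_text), nrows=0).columns

# Map every feature/target column to float16 at load time, so a
# full-precision copy of the whole frame is never materialised
dtypes = {c: np.float16 for c in column_names
          if c.startswith(('feature', 'target'))}
df = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)
print(df.dtypes)
```

This is a sketch, not the exact code from the linked post, but it illustrates why peak usage stays low: the narrowing happens during parsing rather than after the fact.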


Thanks for sharing, nice solution! Another approach could be to use a Dask DataFrame instead of pandas.


Nice discussion! Indeed that function works very well for Kaggle competitions, but seems a bit too elaborate for Numerai. It may still be helpful for Numerai Signals, though. Here is a nice compact way to do the memory reduction. If I remember correctly it is from a function by @mdo or @jrb.

```python
import csv
import numpy as np
import pandas as pd

# Read only the header row to get the column names
with open(path, 'r') as f:
    column_names = next(csv.reader(f))

# Load every feature/target column as float16 at read time
dtypes = {x: np.float16 for x in column_names if x.startswith(('feature', 'target'))}
df = pd.read_csv(path, dtype=dtypes)
```

Hope it helps!


I get âno such file or directoryâ for basic paths such as :

tournament_path = âhttps://numerai-public-datasets.s3-us-west-2.amazonaws.com/latest_numerai_tournament_data.csv.xzâ

âŚ same as for jrbâs approach. Any fix?


Tried, itâs the same, paths work with Pandas.read_csv(), you canât really open a url path with a regular open, but I donât have need or time to refactor the method, kainsamaâs one is working, at least up until the point I need to train my Catboost model :).


Try this:

```python
import csv
import io
import lzma
from pathlib import Path
from typing import IO, Union
from urllib.request import urlopen

import numpy as np
import pandas as pd

def open_file(file_path: Path, mode: str = "rt") -> IO:
    # Transparently decompress .xz files
    if file_path.suffix == '.xz':
        return lzma.open(file_path, mode)
    else:
        return open(file_path, mode)

def read_csv(file_path: Union[Path, str], dtype=np.float16) -> pd.DataFrame:
    if isinstance(file_path, str) and file_path.startswith("https://"):
        # A regular open() can't take a URL, so download (and decompress)
        # into memory first
        raw = urlopen(file_path).read()
        if file_path.endswith('.xz'):
            raw = lzma.decompress(raw)
        buf = io.StringIO(raw.decode())
        column_names = next(csv.reader(buf))
        buf.seek(0)
    else:
        if isinstance(file_path, str):
            file_path = Path(file_path)
        with open_file(file_path) as f:
            column_names = next(csv.reader(f))
        buf = file_path  # pandas infers .xz compression from the suffix
    dtypes = {x: dtype for x in column_names
              if x.startswith(('feature', 'target'))}
    return pd.read_csv(buf, dtype=dtypes)
```

Thanks, appreciate it. It will be friendly to all the other "hackers", I'm sure.

They suggest using more efficient datatypes by checking the existing data types and the memory usage:

```python
print(training_set.dtypes)
print(training_set.memory_usage(deep=True))
```

And setting the data type to something smaller by downcasting:

```python
training_set['feature_whatever'] = pd.to_numeric(training_set['feature_whatever'], downcast='unsigned')
print(training_set.memory_usage(deep=True))
```