Reading the tournament training and test data into memory takes a lot of RAM, which can be a problem even on a beefy machine. To reduce the footprint, we can use the following function (a popular one from Kaggle notebooks):
import numpy as np
import pandas as pd

def reduce_mem_usage(df, verbose=True):
    """Downcast each numeric column to the smallest dtype that holds its range."""
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2  # size in MB before downcasting
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # Try integer widths from narrowest to widest.
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # Floats: only the value range is checked, not the precision.
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024**2  # size in MB after downcasting
    if verbose:
        print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
        print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df

training_data = reduce_mem_usage(pd.read_csv("https://numerai-public-datasets.s3-us-west-2.amazonaws.com/latest_numerai_training_data.csv.xz"))
training_data.head()
Reducing the precision of the features may hurt model performance in general, but here we are dealing with sets of numerical features with low cardinality, so the loss will be negligible.
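A quick way to sanity-check the "negligible loss" claim is to round-trip the feature values through float16. The sketch below assumes the features take five evenly spaced levels (0, 0.25, 0.5, 0.75, 1); those levels are an assumption for illustration, not something stated above. Multiples of 0.25 in that range are exactly representable in half precision, so the round trip should lose nothing.

```python
import numpy as np

# Assumed feature levels (hypothetical, for illustration only).
levels = np.array([0.0, 0.25, 0.5, 0.75, 1.0], dtype=np.float64)

# Downcast to half precision, then widen again to compare against the original.
as_half = levels.astype(np.float16)
lossless = bool(np.array_equal(as_half.astype(np.float64), levels))

# Per-value storage cost: 8 bytes for float64 versus 2 bytes for float16.
bytes_ratio = levels.itemsize / as_half.itemsize

print(lossless, bytes_ratio)
```

If your columns hold other values, the same round-trip check tells you whether float16 is safe for them.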
Interesting! I tried to solve this problem a bit differently, as outlined here. The problem with the approach you've outlined is that pd.read_csv has to load the entire dataset into memory at full precision. Although you do get a reduction in memory usage after the transform in reduce_mem_usage, the peak memory usage remains the same, which means that you're still bottlenecked on memory (and on glacially slow swap, if the machine has it enabled). My way around that problem is to use converters to ensure that the data is converted to the more succinct dtype at load time. It isn't entirely free: the load time goes up a bit, but the memory usage never peaks.
Nice discussion! Indeed that function works very well for Kaggle competitions, but seems a bit too elaborate for Numerai. It may still be helpful for Numerai Signals, though. Here is a nice compact way to do the memory reduction. If I remember correctly it is from a function by @mdo or @jrb.
import csv
import numpy as np
import pandas as pd

def read_csv(path):
    with open(path, 'r') as f:
        column_names = next(csv.reader(f))
    dtypes = {x: np.float16 for x in column_names if x.startswith(('feature', 'target'))}
    return pd.read_csv(path, dtype=dtypes)

Tried it, and it's the same. URL paths work with pandas.read_csv(), but you can't really open a URL with a regular open(). I don't have the need or the time to refactor the method, though; kainsama's one is working, at least up until the point where I need to train my CatBoost model :).
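For anyone who does want the compact version to accept URLs, one possible refactor is to let pandas read the header as well, since pd.read_csv handles both local paths and URLs. This is only a sketch: the name read_csv_float16 is made up here, and for a remote file the header pass likely means the file is fetched twice.

```python
import numpy as np
import pandas as pd

def read_csv_float16(path):
    """Hypothetical helper: load feature/target columns as float16 at parse time."""
    # nrows=0 parses only the header row, so no data rows are kept from this pass.
    column_names = pd.read_csv(path, nrows=0).columns
    # Same column selection as the snippet above: names starting with feature/target.
    dtypes = {
        name: np.float16
        for name in column_names
        if name.startswith(("feature", "target"))
    }
    return pd.read_csv(path, dtype=dtypes)
```

Usage is the same as before, for example training_data = read_csv_float16(path_or_url), with any other columns left at whatever dtype pandas infers.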