Reducing Memory

When we read the tournament training and test data into memory, it uses a lot of RAM, which can be a problem even on a beefy machine. To reduce the memory footprint, we can use the following function (a popular one from Kaggle notebooks):

import numpy as np
import pandas as pd

def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2
    # Downcast each numeric column to the smallest dtype that can hold its min/max
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)

    end_mem = df.memory_usage().sum() / 1024**2
    if verbose:
        print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
        print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    return df

training_data = reduce_mem_usage(pd.read_csv("https://numerai-public-datasets.s3-us-west-2.amazonaws.com/latest_numerai_training_data.csv.xz"))
training_data.head()

Reducing the precision of the features may hurt model performance, but here we are dealing with numerical features of low cardinality, so the loss should be negligible.
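
To sanity-check that claim, you can count the distinct values per feature before committing to float16 (a quick sketch on the training_data loaded above; the 'feature' prefix matches the tournament column naming):

feature_cols = [c for c in training_data.columns if c.startswith('feature')]
# If the largest distinct-value count is tiny, float16 can represent the features exactly
print(training_data[feature_cols].nunique().max())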

7 Likes

Interesting! I tried to solve this problem a bit differently, as outlined here. The problem with the approach you’ve outlined is that pd.read_csv has to load the entire dataset into memory at full precision. Although you do get a reduction in memory usage after the transform in reduce_mem_usage, the peak memory usage stays the same, which means you’re still bottlenecked on memory (and on glacially slow swap, if the machine has it enabled). My way around that problem is to use converters so that the data is converted to the more succinct dtype at load time. It isn’t entirely free: the load time goes up a bit, but the memory usage never spikes.

4 Likes

Thanks for sharing, nice solution! Another approach could be using a Dask DataFrame instead of pandas.
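
A minimal sketch of that idea, assuming the training file has already been downloaded and decompressed to a local CSV (compressed .xz files can’t be split into Dask partitions):

import dask.dataframe as dd

# Lazily partition the CSV on disk; nothing is loaded until .compute() is called
ddf = dd.read_csv("numerai_training_data.csv")  # hypothetical local, decompressed copy

# Column names are available from the header, so features can be selected up front
feature_cols = [c for c in ddf.columns if c.startswith("feature")]

# The work runs per partition; only the small aggregate result is materialized
feature_means = ddf[feature_cols].mean().compute()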

1 Like

Nice discussion! Indeed that function works very well for Kaggle competitions, but seems a bit too elaborate for Numerai. It may still be helpful for Numerai Signals, though. Here is a nice compact way to do the memory reduction. If I remember correctly it is from a function by @mdo or @jrb.

import csv
import numpy as np
import pandas as pd

def read_csv(path):
    with open(path, 'r') as f:
        column_names = next(csv.reader(f))
        dtypes = {x: np.float16 for x in column_names if x.startswith(('feature', 'target'))}
    return pd.read_csv(path, dtype=dtypes)
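
For example, assuming the training file has already been downloaded and decompressed to a local CSV (plain open() won’t handle URLs or .xz archives, as discussed below):

training_data = read_csv("numerai_training_data.csv")  # hypothetical local copy
print(training_data.memory_usage().sum() / 1024**2, "MB")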

Hope it helps!

1 Like

I get “no such file or directory” for basic paths such as:

training_path = 'https://numerai-public-datasets.s3-us-w2.amazonaws.com/latest_numerai_training_data.csv.xz'

tournament_path = 'https://numerai-public-datasets.s3-us-west-2.amazonaws.com/latest_numerai_tournament_data.csv.xz'

… same as for jrb’s approach. Any fix?

1 Like

Try these:
https://numerai-public-datasets.s3-us-west-2.amazonaws.com/latest_numerai_training_data.csv.xz
https://numerai-public-datasets.s3-us-west-2.amazonaws.com/latest_numerai_tournament_data.csv.xz

Tried them, it’s the same: the paths work with pandas.read_csv(), but you can’t really open a URL with a regular open(). I don’t have the need or time to refactor the method, though; kainsama’s version is working, at least up until the point where I need to train my CatBoost model :).

If someone is using plain open(), it’s implied that the file has already been downloaded beforehand.

1 Like

Try this:

import csv
import lzma
from pathlib import Path
from typing import IO, Union

import numpy as np
import pandas as pd


def open_file(file_path: Path, mode: str = "rt") -> IO:
    if file_path.suffix == '.xz':
        return lzma.open(file_path, mode)
    else:
        return open(file_path, mode)


def read_csv(file_path: Union[Path, str], dtype=np.float16) -> pd.DataFrame:
    if isinstance(file_path, str) and file_path.startswith("https://"):
        column_names = list(pd.read_csv(file_path, nrows=1).columns)
    else:
        if isinstance(file_path, str):
            file_path = Path(file_path)
        with open_file(file_path) as f:
            column_names = next(csv.reader(f))
    dtypes = {x: dtype for x in column_names if
              x.startswith(('feature', 'target'))}
    return pd.read_csv(file_path, dtype=dtypes, index_col=0)
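
A quick usage sketch, pointing it at the hosted training file from earlier in the thread:

training_data = read_csv("https://numerai-public-datasets.s3-us-west-2.amazonaws.com/latest_numerai_training_data.csv.xz")
print(training_data.memory_usage().sum() / 1024**2, "MB")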

2 Likes

Thanks, appreciate it. It will be friendly to all the other “hackers” I’m sure.

Pandas has a section on loading large datasets: https://pandas.pydata.org/docs/user_guide/scale.html

They suggest using more efficient data types; start by checking the existing dtypes and memory usage:

print(training_set.dtypes)
print(training_set.memory_usage(deep=True))

Then set the data type to something smaller by downcasting:

training_set['feature_whatever'] = pd.to_numeric(training_set['feature_whatever'], downcast='unsigned')
print(training_set.memory_usage(deep=True))
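
For Numerai’s float-valued feature columns, the same idea can be applied to all of them at once with downcast='float' (a sketch, assuming the data is already loaded as training_set):

feature_cols = [c for c in training_set.columns if c.startswith('feature')]
# downcast='float' goes to the smallest float dtype pandas will downcast to (float32)
training_set[feature_cols] = training_set[feature_cols].apply(pd.to_numeric, downcast='float')
print(training_set.memory_usage(deep=True).sum() / 1024**2, "MB")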

Another option that’s mentioned is using Dask: https://docs.dask.org/en/latest/dataframe.html

1 Like