I guess I’m not the only one here who doesn’t have 128GB of RAM at hand, so it might be helpful to share how to use the full dataset with less than 8GB of RAM.
The basic idea is to split the full dataset into chunks (one per era), save these chunks as separate parquet files, and then load them on the fly in parallel threads.
The result is no significant compromise on speed. It requires very little RAM and needs only 5 threads to continuously read the data from disk. The number of threads may need to be adjusted based on the speed of your disk, the number of CPU cores available, etc.
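A minimal sketch of the loading side, in case it helps. It assumes the per-era files already exist on disk; `iter_prefetched` and `load_fn` are made-up names for illustration, not the code actually used above. A small pool of threads keeps a few eras loading in the background while the current one is being consumed.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Iterable, Iterator

_DONE = object()  # sentinel marking the end of the path list


def iter_prefetched(paths: Iterable[str],
                    load_fn: Callable[[str], Any],
                    n_threads: int = 5) -> Iterator[Any]:
    """Yield load_fn(path) for every path, in the original order.

    At most n_threads chunks are loading or waiting in the queue at any
    time, so memory use is bounded by a few eras, not the full dataset.
    """
    path_iter = iter(paths)
    pending = deque()  # futures, oldest first

    with ThreadPoolExecutor(max_workers=n_threads) as pool:

        def submit_next() -> None:
            # Schedule the next file, if there is one left.
            path = next(path_iter, _DONE)
            if path is not _DONE:
                pending.append(pool.submit(load_fn, path))

        # Start with n_threads loads in flight.
        for _ in range(n_threads):
            submit_next()

        while pending:
            # Block until the oldest chunk is ready, then top up the queue
            # before handing the chunk to the caller.
            chunk = pending.popleft().result()
            submit_next()
            yield chunk
```

With pandas installed, something like `for era_df in iter_prefetched(sorted(glob.glob("eras/*.parquet")), pd.read_parquet): ...` would stream one era at a time. The `eras/` directory name is hypothetical.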
Maybe saving as .npy, .pkl, or .feather would make loading faster; at least that is the case for big tables. Casting to a smaller dtype before saving could also help. What’s your take on this?
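To illustrate the casting point, here is a small NumPy sketch. It assumes the features take integer values from 0 to 4 (which the division by 4 in the Keras class further down suggests); the array shape and variable names are invented for the example.

```python
import numpy as np

# Hypothetical stand-in for one era: 1,000 rows x 50 features in {0, ..., 4}.
rng = np.random.default_rng(0)
features_f64 = rng.integers(0, 5, size=(1000, 50)).astype(np.float64)

# Casting to int8 before saving is lossless for values this small.
features_i8 = features_f64.astype(np.int8)

# Rescale to [0, 1] only at load time, and only for the chunk in memory.
features_f32 = (features_i8 / 4).astype(np.float32)

# How much smaller the int8 copy is than the float64 one.
size_ratio = features_f64.nbytes // features_i8.nbytes
```

Whether the smaller files also load faster depends on the file format and the disk, so that part is worth measuring rather than assuming.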
I’m not sure about their load times; please try them if you wish.
But load time is not very relevant. The threads run in parallel, and each reads an era’s worth of data in the background. As long as you have enough threads, load speed doesn’t matter much.
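A rough way to put a number on "enough threads": if loading one era takes L seconds and training on it takes T seconds, then roughly L / T readers are needed to keep the trainer from waiting. The sketch below assumes the disk can actually serve that many parallel reads at the same speed, which is not guaranteed; `min_reader_threads` is a hypothetical helper.

```python
import math


def min_reader_threads(load_seconds: float, train_seconds: float) -> int:
    """Smallest number of reader threads that keeps up with training.

    n threads deliver about n / load_seconds eras per second, and the
    trainer consumes 1 / train_seconds eras per second.
    """
    if load_seconds <= 0 or train_seconds <= 0:
        raise ValueError("both times must be positive")
    return max(1, math.ceil(load_seconds / train_seconds))
```

For example, with made-up timings of 2.5 s to load an era and 0.5 s to train on it, this gives 5 threads.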
I’ve been trying to train some models on Colab, so I run into memory issues a lot there, even with the int datasets. I found that with Parquet there is a way to read chunks of data and select features, but doing any filtering based on rows takes up too much RAM.
I wrote a class for Keras that streams data in batches. It keeps the memory footprint low, and you can select the number of rows (the batch size).
import pandas as pd
import pyarrow.parquet as pq
import numpy as np
from tensorflow.keras.utils import Sequence


class TrainDataBatches(Sequence):
    """Streams batches from a parquet file so the full table never sits in RAM."""

    def __init__(self, feature_cols: list, target_cols: list, era_col: str, batch_size: int = 256,
                 file_path: str = 'train.parquet'):
        self.batch_size = batch_size
        self.features = feature_cols
        self.targets = target_cols
        self.era = era_col
        self.file_path = file_path
        # Only the ids are kept in memory; they are used by shuffle_data().
        self._index = pq.read_table(file_path, columns=["id"]).to_pandas().reset_index()
        self.on_epoch_end()

    def __len__(self) -> int:
        # Number of batches per epoch, taken from the parquet metadata.
        m = pq.read_metadata(self.file_path)
        return int(np.ceil(m.num_rows / self.batch_size))

    def on_epoch_end(self):
        # Restart the batch iterator. era_col is a str, so wrap it in a list
        # before concatenating it with the other column lists.
        self.df_train = pq.ParquetFile(self.file_path).iter_batches(
            self.batch_size,
            columns=self.features + self.targets + [self.era])
        # self._index = self._index.iloc[np.random.permutation(len(self._index))]

    def __getitem__(self, idx) -> tuple:
        # idx is ignored: batches are always served in file order.
        return self.return_data()

    def shuffle_data(self, idx: int):
        """
        Shuffles training data out of core
        """
        # takes up too much ram
        ID = list(self._index.iloc[idx * self.batch_size:(idx + 1) * self.batch_size]['id'])
        df = pq.read_table(self.file_path,
                           columns=self.features + self.targets + [self.era],
                           filters=[('id', 'in', ID)]).to_pandas()
        # Rescale the integer features to [0, 1].
        df[self.features] = df[self.features].astype(float) / 4.
        df[self.era] = df[self.era].astype(float)
        return (df[self.features + self.targets], df[self.era]), df[self.features + self.targets]

    def return_data(self):
        """
        reads data in batches as is.
        """
        try:
            chunk = next(self.df_train)
            ch = chunk.to_pandas()
            # Rescale the integer features to [0, 1].
            ch[self.features] = ch[self.features].astype(float) / 4.
            ch[self.era] = ch[self.era].astype(float)
            return (ch[self.features + self.targets].values, ch[self.era]), ch[self.features + self.targets].values
        except StopIteration:
            # Iterator exhausted: start over from the top of the file.
            self.on_epoch_end()
            return self.return_data()
I have never tried it myself, but I recently had to replicate several of my coworkers’ virtual environments and processing scripts, and one of them used dask to train his models and predict with them.
I am not sure how it works, and I failed to get it working when I was messing around with it one day, but I noticed that his RAM usage when making predictions was much lower than that of my other 3 coworkers, and they all used XGBoost to model the same thing.