Hi everybody,
I saw that some users complained about RAM usage after the “super massive data” release. I experienced some crashes when trying to fit LightGBM models, but I finally managed to make it work with 16 GB of memory. I wanted to share my experience because it might help somebody here, one day…
At first, I was just loading the `int8` Parquet training file with Pandas (so far so good). Then, when passing the DataFrame to the `train` method of LightGBM, the memory usage went above my machine’s limits. According to issue #1032, with the Python package, LightGBM internally converts values to `float32`, hence the memory explosion. A way around this is to save the data to a CSV file and construct a `Dataset` object directly from that file.
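To make that concrete, here is a minimal sketch of the file-based construction (the CSV itself still has to be prepared, as described below). The file name is a placeholder, and the `header`, `label_column` and `two_round` settings are the LightGBM I/O parameters I believe are relevant here, so double-check them against the docs:

```python
import lightgbm as lgb

# Building the Dataset from a file path lets LightGBM parse the CSV itself,
# instead of going through an in-memory float32 copy of the DataFrame.
dataset = lgb.Dataset(
    "train_for_lgbm.csv",               # placeholder path, prepared below
    params={
        "header": True,                  # the CSV has a header row
        "label_column": "name:target",   # column holding the label
        "two_round": True,               # slower loading, lower peak memory
    },
)
```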
Now, one might say “ok, that’s easy, Numerai already provides training data in CSV format”. Well, even when set to be ignored, the `id` and `era` columns trigger errors because, as per the documentation: “despite the fact that specified columns will be completely ignored during the training, they still should have a valid format allowing LightGBM to load file successfully”.
So, I simply selected the target and the features from the Parquet file and saved them to CSV using the PyArrow library (which was much faster than using Pandas). However, I am pretty sure there’s a clever way to generate an adequate CSV file by filtering columns from the Numerai CSV file with a Linux command (but I haven’t checked it yet).
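Roughly, the PyArrow step can look like this (file names are placeholders, and I’m assuming the feature columns all start with `feature` and the label column is named `target`):

```python
import pyarrow.parquet as pq
import pyarrow.csv as pacsv

PARQUET_PATH = "train_int8.parquet"   # placeholder: the int8 training file
CSV_PATH = "train_for_lgbm.csv"       # placeholder: CSV fed to LightGBM

# Keep only the features and the target, dropping id and era entirely.
schema = pq.read_schema(PARQUET_PATH)
keep = [name for name in schema.names if name.startswith("feature")] + ["target"]

table = pq.read_table(PARQUET_PATH, columns=keep)
pacsv.write_csv(table, CSV_PATH)
```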
And I could finally perform some hyperparameter search.
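The search itself is then just a loop over parameter values; below is a toy sketch using LightGBM’s built-in cross-validation on the file-backed `Dataset` from above (the grid and parameter values are illustrative only):

```python
import lightgbm as lgb

# Try a few values of num_leaves with 3-fold CV on the file-backed Dataset.
for num_leaves in (31, 63, 127):
    params = {
        "objective": "regression",
        "learning_rate": 0.01,
        "num_leaves": num_leaves,
    }
    cv_result = lgb.cv(params, dataset, num_boost_round=200, nfold=3)
    # Report the final mean value of each CV metric for this setting.
    print(num_leaves, {name: values[-1] for name, values in cv_result.items()})
```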
On a side note, if you want to delete a `Dataset` object and create a new one, you might run into high memory usage because of issue #4239: a memory leak has been present since version 3.0.
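For context, this is the kind of pattern that can trigger it: re-creating the `Dataset` whenever a dataset-level parameter such as `max_bin` changes. Explicitly deleting the object and calling `gc.collect()` is the obvious thing to try, but per the issue the memory may still not be fully released:

```python
import gc
import lightgbm as lgb

# Re-creating the Dataset for each dataset-level parameter (here max_bin).
for max_bin in (63, 127, 255):
    dataset = lgb.Dataset(
        "train_for_lgbm.csv",
        params={"header": True, "label_column": "name:target", "max_bin": max_bin},
    )
    dataset.construct()
    # ... train and evaluate a model here ...
    del dataset
    gc.collect()  # per issue #4239, memory may still creep up across iterations
```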
That’s it!