The new data files are more than double the size:
v5.1 validation.parquet is 7.3 GB today versus 3.3 GB for v5.0
This will impact models that are memory-constrained (in RAM or GPU memory) during training. A model that trains on all v5.0 features and is already close to a memory limit will likely run out of memory on v5.1 unless it selects a subset of features or reduces the number of eras in the training data.
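As a rough sketch of the feature-subset workaround (the file path and the choice of 500 features are just placeholders), parquet lets you read only the columns you need instead of the whole table:

```python
import pyarrow.parquet as pq

# Inspect the schema without loading any data, then read only a
# subset of feature columns (plus era and target) to keep memory down.
schema = pq.read_schema("v5.1/validation.parquet")
feature_cols = [c for c in schema.names if c.startswith("feature")]
subset = feature_cols[:500]  # e.g. keep only the first 500 features

df = pq.read_table(
    "v5.1/validation.parquet",
    columns=["era", "target"] + subset,
).to_pandas()
```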
When new data is added to validation.parquet each week, the only way for participants to fetch the new data is to re-download the entire 7.3 GB file.
Has anyone considered that if the data format were CSV instead of parquet, an HTTP range request (partial GET) could let the client download just the newly appended rows, saving a lot of time and network bandwidth on the Numerai server?
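A minimal sketch of what that could look like on the client side, assuming a hypothetical CSV endpoint, an append-only file, and that the client knows how many bytes it already has locally:

```python
import requests

# Hypothetical URL; the real download location would come from the Numerai API.
URL = "https://example.com/v5.1/validation.csv"
local_size = 3_500_000_000  # bytes already downloaded in a previous week

# Ask the server for only the bytes past what we already have.
resp = requests.get(URL, headers={"Range": f"bytes={local_size}-"}, stream=True)

if resp.status_code == 206:  # 206 Partial Content: server honoured the range
    with open("validation.csv", "ab") as f:
        for chunk in resp.iter_content(chunk_size=1 << 20):
            f.write(chunk)
else:
    # Server ignored the Range header; fall back to a full download.
    ...
```

This only works because CSV appends rows at the end of a plain text file; a parquet file's footer metadata changes on every write, which is why a simple byte-range fetch doesn't apply to the current format.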