@jrb Thank you for the write-up! Let me expand on my motivation for doing this.
Because I focus on new users, who are more likely hobbyists or scientists from other fields (in other words, not pure ML folks), I want to streamline all the things. Having taught undergraduates participating in the tournament for two semesters, I am very aware of how confusing the workflow can be. Further, @master_key was quoted as saying that Pandas and SKLearn cover “99%” of the typical user’s needs, and I think he’s right.
I want people to be able to use at most 4 packages, with the data living in Pandas dataframes rather than scattered numpy objects throughout their pipeline. I don’t want a new user to have to fiddle with reducing memory usage or converting between file formats. Additionally, whatever format is used should be directly plug-and-play with R, so that those users have a similar “path to victory” to Python users.
Parquet meets my requirements at first glance; so does Feather. There are likely other formats, too. What I like most about Feather is that Wes McKinney is the developer of both Pandas and Feather, and his collaboration with Hadley Wickham gives me a lot of confidence that a change to Feather will be applied to R and Pandas simultaneously.
Here’s Wes on Feather and Parquet:
What I am not very concerned with is storage size. I am indifferent between loading speed and download speed, wanting only that the downloaded file reads into memory as a dataframe with its dtypes preserved. This write-up shows a concerning growth in memory consumption when loading Parquet files, which Feather does not suffer from.
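To make the “one-liner into a dataframe” point concrete, here is a minimal sketch of what I have in mind. The file names here are hypothetical placeholders; substitute whatever the team actually publishes:

```python
import pandas as pd

# Hypothetical file names for illustration only.
# Both readers return a DataFrame with the stored dtypes intact,
# so there is no CSV-style type inference or manual downcasting.
df_parquet = pd.read_parquet("train.parquet")   # needs pyarrow or fastparquet installed
df_feather = pd.read_feather("train.feather")   # needs pyarrow installed

print(df_parquet.dtypes.head())
print(df_feather.dtypes.head())
```

On the R side, `arrow::read_parquet()` and `arrow::read_feather()` give the same one-liner experience, which is the plug-and-play parity I mentioned above.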
In the end, either option is significantly better than the CSV files, and now users can choose for themselves. Hopefully someday soon the team will offer the data files in several formats with direct links (and maybe even shortened URLs).