I have 20,000 ~1000-row dataframes, each of which has a name, in a 170GB pickle file at the moment. I'd like to write these to a file so I can load them individually by name. I won't need to query for subsets of these dataframes, only read them into memory in their entirety.
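For concreteness, this is roughly the current situation and the by-name access I'm after (sketch; the path is a placeholder):

```python
import pickle

# current setup: one ~170GB pickle holding a dict of {name: DataFrame}
with open("all_dfs.pkl", "rb") as f:       # placeholder path
    dataframes = pickle.load(f)            # must deserialize all 20,000 frames

# the by-name lookup I want, ideally without loading the whole 170GB first
df = dataframes["some_name"]
```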
I've considered:
- shelve (slow, not portable)
- hdf5 (not well suited to many small dataframes?)
- parquet / feather / pyarrow (no way to consolidate these in one file?)
- SQL?
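With parquet/feather it seems I'd end up with one file per dataframe rather than a single consolidated file, i.e. something like this (directory and file names are illustrative; `dataframes` is the dict of named frames described above):

```python
import pandas as pd
from pathlib import Path

out = Path("frames_parquet")               # placeholder directory
out.mkdir(exist_ok=True)

# one parquet file per named dataframe => ~20,000 files on disk
for name, df in dataframes.items():
    df.to_parquet(out / f"{name}.parquet")

# read a single dataframe back by name
df = pd.read_parquet(out / "some_name.parquet")
```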
Goals:
- I'm looking for something I can set up in just a few lines of code.
- I'd like a file I can `scp` around
- Decent read/write speeds
- Parallel read/write (e.g. with multiprocessing)
Tried so far:
- HDF5 via pandas `df.to_hdf()` (see the sketch after this list):
  - It takes up more space on disk than CSVs (to be fair, I'm not using compression yet). (Why do my hdf5 files seem so unnecessarily large?)
  - It's slow to write many keys. (Limit on number of HDF5 Datasets)
- Many parquet files:
  - Ran into quota limits for inodes on the cluster.
- HDF5 via `h5py`: serializing strings is a pain.
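For reference, the `to_hdf` attempt mentioned above looked roughly like this (file name and keys are illustrative; no compression yet):

```python
import pandas as pd

# write: one key per dataframe in a single HDF5 file
# (same effect as calling df.to_hdf("frames.h5", key=name) per frame)
with pd.HDFStore("frames.h5", mode="w") as store:
    for name, df in dataframes.items():   # ~20,000 small frames
        store.put(name, df)

# read a single dataframe back by name
with pd.HDFStore("frames.h5", mode="r") as store:
    df = store.get("some_name")
```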
Trying next:
- SQL via `df.to_sql()`
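A minimal sketch of what I have in mind here, assuming SQLite with one table per dataframe (file and table names are made up):

```python
import sqlite3
import pandas as pd

con = sqlite3.connect("frames.db")         # a single file I could scp around

# write: one table per dataframe, keyed by its name
for name, df in dataframes.items():
    df.to_sql(name, con, if_exists="replace", index=False)

# read one dataframe back by name
df = pd.read_sql('SELECT * FROM "some_name"', con)
con.close()
```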