
I have 20,000 ~1000-row dataframes, each of which has a name, currently stored in a 170GB pickle file. I'd like to write these to a single file so I can load them individually, by name. I won't need to query for subsets of these dataframes, only read them into memory in their entirety.

I've considered:

  • shelve (slow, not portable)
  • hdf5 (not well suited to many small dataframes?)
  • parquet / feather / pyarrow (no way to consolidate these in one file?)
  • SQL?

Goals:

  • I'm looking for something I can set up in just a few lines of code.
  • I'd like a file I can scp around
  • Decent read/write speeds
  • Parallel read/write (e.g. with multiprocessing)

Tried so far:

  1. HDF5 via pandas df.to_hdf()
  2. Many parquet files: ran into quota limits for inodes on the cluster.
  3. HDF5 via h5py: serializing strings is a pain.

Trying next:

SQL via df.to_sql()
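Roughly what I have in mind (a minimal sketch only; the file and table names are placeholders, and the dataframes dict stands in for my real data):

import sqlite3
import pandas as pd

# stand-in for the real {name: dataframe} mapping loaded from the pickle
dataframes = {f"name_{i}": pd.DataFrame({"col_1": range(5), "col_2": range(5)}) for i in range(3)}

con = sqlite3.connect("dataframes.sqlite")
for name, df in dataframes.items():
    df.to_sql(name, con, if_exists="replace", index=False)   # one table per name

# later: load a single dataframe back by name
df = pd.read_sql('SELECT * FROM "name_1"', con)
con.close()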

Alex Lenail
  • Did you consider SQLite via to_sql? – Micah Kornfield May 14 '22 at 03:26
  • @MicahKornfield I'm starting to feel like that's what I'm looking for. Will try it next. How has it gone for you? Any tutorials to recommend with tips for how to make it performant? Assuming all the tables have the same columns, would you concatenate them into a table with an extra column, or make 100k tables? – Alex Lenail May 15 '22 at 15:24
  • I haven't done performance tuning with SQLite for a while. Given the requirements, I would try individual tables in SQLite (I haven't verified whether there is a table limit there); otherwise you will want to create an index on the filename column, which can slow down writes, or use a single extra column with the parquet solution below (100,000 row groups kind of stretches assumptions on parquet files, though). – Micah Kornfield May 15 '22 at 17:30
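A minimal sketch of the single-table variant Micah Kornfield describes (the file name, table name, name column, and index are illustrative choices, not a prescribed schema; the dataframes dict stands in for the real data):

import sqlite3
import pandas as pd

# stand-in for the real {name: dataframe} mapping
dataframes = {f"name_{i}": pd.DataFrame({"col_1": range(5), "col_2": range(5)}) for i in range(3)}

con = sqlite3.connect("frames_single_table.sqlite")

# append every dataframe to one table, tagging each row with its dataframe's name
for name, df in dataframes.items():
    df.assign(name=name).to_sql("frames", con, if_exists="append", index=False)

# an index on the name column keeps per-name reads fast (at some cost to write speed)
con.execute("CREATE INDEX IF NOT EXISTS idx_frames_name ON frames(name)")
con.commit()

# load one dataframe back by name
one = pd.read_sql("SELECT col_1, col_2 FROM frames WHERE name = ?", con, params=("name_1",))
con.close()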

2 Answers


I like HDF5. It is simple enough to write a Pandas dataframe to HDF5 and read the data back in. (Note: I prefer using numpy and h5py to work with HDF5; they create a smaller file, which is worth looking at if you don't have to use Pandas. A sketch of that approach follows the pandas example.) Here is a very basic example with pandas. I used variables to size the dataframe and the number of names/keys so you can experiment.

Code to create the dataframes and write to HDF5:

import pandas as pd
import numpy as np

a0, a1 = 100, 2   # rows and columns per dataframe
ndf = 20          # number of dataframes / names
for df_name in ['name_' + str(i) for i in range(ndf)]:
    arr = np.random.randint(19, 99, size=(a0, a1))
    df = pd.DataFrame(data=arr, columns=["col_1", "col_2"])
    df.to_hdf('SO_72178611.h5', key=df_name, mode='a')

Code to load the HDF5 data back into dataframes (note that you don't need to know the names (keys) a priori; comment out the print statements if you run with large numbers):

with pd.HDFStore('SO_72178611.h5') as store:
    for df_name in store.keys():   # keys are the names used in to_hdf (with a leading '/')
        df = store.get(df_name)
        print(df_name)
        print(df)
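For the numpy/h5py route mentioned above, a minimal sketch assuming purely numeric dataframes (h5py stores plain arrays, not pandas objects, so the column names have to be reattached on read; file and dataset names are illustrative):

import h5py
import numpy as np
import pandas as pd

cols = ["col_1", "col_2"]

# write: one HDF5 dataset per name, holding just the dataframe's numeric values
with h5py.File('SO_72178611_h5py.h5', 'w') as h5f:
    for i in range(20):
        arr = np.random.randint(19, 99, size=(100, 2))
        h5f.create_dataset('name_' + str(i), data=arr)

# read: pull one dataset back by name and rebuild a dataframe around it
with h5py.File('SO_72178611_h5py.h5', 'r') as h5f:
    df = pd.DataFrame(h5f['name_5'][()], columns=cols)
    print(df.head())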
kcw78

You could try reading in all the CSVs as one big dataframe in pandas and then converting that dataframe to parquet format using pyarrow:

import pyarrow as pa
import pyarrow.parquet as pq

# df is the single pandas dataframe built by concatenating all of the inputs
table = pa.Table.from_pandas(df)
pq.write_table(table, 'output_path/example.parquet', compression='snappy')

Here df is the pandas dataframe into which you've concatenated all the CSVs. This may not work if the CSVs have different shapes. You may also want to experiment with creating a parquet dataset to increase read speeds. Additionally, you can try different compression types (e.g. gzip) to get a smaller file at the cost of read speed. More information can be found here: https://arrow.apache.org/docs/python/parquet.html
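If you need to pull individual frames back out (see the comments below), one option is to tag each frame with a name column before concatenating and filter on it when reading. A rough sketch, where the dataframes dict is a placeholder for your data and the filters argument assumes a reasonably recent pyarrow:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# stand-in for the real {name: dataframe} mapping
dataframes = {f"name_{i}": pd.DataFrame({"col_1": range(5), "col_2": range(5)}) for i in range(3)}

# tag each dataframe with its name, then concatenate into one big frame
big = pd.concat(
    [df.assign(name=name) for name, df in dataframes.items()],
    ignore_index=True,
)

table = pa.Table.from_pandas(big)
pq.write_table(table, 'example.parquet', compression='snappy')

# read back only the rows belonging to one name
one = pq.read_table('example.parquet', filters=[('name', '==', 'name_1')]).to_pandas()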

djo
  • Something to note. Creating a parquet dataset (creating partitions) will yield multiple files, but it can easily be read using pyarrow: "dataset = pq.ParquetDataset('dataset_name/')" then "table = dataset.read()". Read the link I sent in the answer; it'll clarify things much better. If you do not partition it and simply use the solution I suggest above, it'll only yield one file. – djo May 12 '22 at 15:58
  • Is there a way to read back out specific rows from the parquet file which you're suggesting I concatenate? – Alex Lenail May 12 '22 at 22:05
  • If you add file name as a column to the dataframe you should be able to fairly efficiently retrieve the corresponding rows – Micah Kornfield May 14 '22 at 03:27
  • You can also read in parquet as a data frame using pyarrow or pyspark and then specify a row you'd like to select. – djo May 15 '22 at 03:58