
Does Parquet support storing data frames of different widths (numbers of columns) in a single file? In HDF5, for example, it is possible to store multiple such data frames and access them by key. From my reading so far it looks like Parquet does not support this, so an alternative would be storing multiple Parquet files in the file system. I have a rather large number (say 10,000) of relatively small frames (~1-5 MB each) to process, so I'm not sure whether this could become a concern.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Two data frames with different widths (2 vs. 3 columns)
df1 = pd.DataFrame(data={"A": [1, 2, 3], "B": [4, 5, 6]},
                   columns=["A", "B"])
df2 = pd.DataFrame(data={"X": [1, 2], "Y": [3, 4], "Z": [5, 6]},
                   columns=["X", "Y", "Z"])
dfs = [df1, df2]

# Currently each frame ends up in its own Parquet file
for i, df in enumerate(dfs):
    table = pa.Table.from_pandas(df)
    pq.write_table(table, f"my_parq_{i}.parquet")
Turo
  • Can't you add dummy columns to df1? – human May 22 '18 at 02:21
  • Hi bigdadamann. I tried that, but I end up with a frame of 10k+ columns where most values are NaN. In my case each chunk only uses ~100-200 columns, so these dummy columns add a lot of overhead. – Turo May 22 '18 at 07:04
  • I am thinking of simpler alternatives: is it possible to use a column type of collections? E.g. a List – human May 22 '18 at 12:12
  • I don't think so. I will certainly experiment further, but I wanted to learn whether Parquet could be the way to go. – Turo May 22 '18 at 13:03
  • Parquet does have a property to merge schemas. Did you take a look? This is implemented in Apache Spark. – human May 23 '18 at 01:14
  • [Dask SQL](https://dask-sql.readthedocs.io/en/latest/) has a [create_table](https://dask-sql.readthedocs.io/en/latest/api.html?highlight=create_table#dask_sql.Context.create_table) method which can create many tables of different widths. Each table will be a different parquet file, and the `dask_sql` context will manage them. – Paul Rougieux Aug 12 '22 at 15:45

1 Answer


No, this is not possible as Parquet files have a single schema. They normally also don't appear as single files but as multiple files in a directory, all sharing the same schema. This enables tools to read these files as if they were one, either fully into local RAM, distributed over multiple nodes, or by evaluating an (SQL) query on them.
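As a small sketch of that pattern (the directory name frames/ is just an example), pyarrow can read a whole directory of same-schema Parquet files back as if it were a single table:

import pyarrow.parquet as pq

# "frames/" is a placeholder directory containing same-schema Parquet files.
dataset = pq.ParquetDataset("frames/")
table = dataset.read()        # one pyarrow.Table spanning all files
df = table.to_pandas()        # back to pandas if needed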

Parquet will also store these data frames efficiently even at this small size, so it should be a suitable serialization format for your use case. In contrast to HDF5, Parquet is only a serialization format for tabular data. As mentioned in your question, HDF5 also supports file-system-like key-value access. Since you have a large number of files and this might be problematic for the underlying filesystem, you should look at finding a replacement for this layer. A possible approach is to first serialize each DataFrame to Parquet in memory and then store it in a key-value container; this could be either a simple zip archive or a real key-value store such as LevelDB.
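A minimal sketch of the zip-archive variant (the archive name frames.zip, the frame keys, and the helper df_to_parquet_bytes are made up for illustration): each frame is serialized to Parquet in memory with pyarrow and the resulting bytes are stored under a key in the archive.

import zipfile

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def df_to_parquet_bytes(df: pd.DataFrame) -> bytes:
    # Serialize a single DataFrame to Parquet entirely in memory.
    sink = pa.BufferOutputStream()
    pq.write_table(pa.Table.from_pandas(df), sink)
    return sink.getvalue().to_pybytes()

# Hypothetical frames of different widths, keyed by name.
frames = {"frame_0": pd.DataFrame({"A": [1, 2], "B": [3, 4]}),
          "frame_1": pd.DataFrame({"X": [1], "Y": [2], "Z": [3]})}

# Write all frames into one zip archive, one Parquet payload per key.
with zipfile.ZipFile("frames.zip", "w") as zf:
    for key, df in frames.items():
        zf.writestr(key + ".parquet", df_to_parquet_bytes(df))

# Read a single frame back by key.
with zipfile.ZipFile("frames.zip") as zf:
    buf = zf.read("frame_1.parquet")
    df_back = pq.read_table(pa.BufferReader(buf)).to_pandas()

A real key-value store like LevelDB would follow the same shape: store the in-memory Parquet bytes under a key and read them back with pa.BufferReader.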

Uwe L. Korn
  • Hi xhochy, does extracting it from the zip archive still allow lazy evaluation? Would you see it as a proper way of doing things, the Spark way? – Turo May 23 '18 at 19:39
  • No, I wouldn't see it that way. Using a `zip` archive would only be a work-around when your underlying file system cannot cope with the number of files. If you want to exchange these files between tools, use the filesystem and not the archive. – Uwe L. Korn May 24 '18 at 14:53