
I want to read a folder full of parquet files into pandas DataFrames. In addition to the data itself, I want to store the name of the file each row was read from in a column "file_origin". In pandas I am able to do it like this:

import pandas as pd
from pathlib import Path

data_dir = Path("path_of_folder_with_files")
df = pd.concat(
    pd.read_parquet(parquet_file).assign(file_origin=parquet_file.name)
    for parquet_file in data_dir.glob("*")
)


Unfortunately this is quite slow. Is there a similar way to do this with pyarrow (or any other efficient package)?

I can read the whole folder at once with pyarrow, but then I lose track of which file each row came from:

import pyarrow.parquet as pq

table = pq.read_table(data_dir, use_threads=True)
df = table.to_pandas()
Carsten
1 Answer

You could implement it with pyarrow directly instead of pandas:

import pyarrow as pa
import pyarrow.parquet as pq

batches = []
for file_name in data_dir.glob("*"):
    table = pq.read_table(file_name)
    table = table.append_column(
        "file_origin", pa.array([file_name.name] * len(table), pa.string())
    )
    batches.extend(table.to_batches())
table = pa.Table.from_batches(batches)
df = table.to_pandas()

I don't expect it to be significantly faster, unless you have a lot of strings and objects in your table (which are slow in pandas).

0x26res