I want to read a folder full of parquet files that contain pandas DataFrames. In addition to the data that I'm reading I want to store the filenames from which the data is read in the column "file_origin". In pandas I am able to do it like this:
import pandas as pd
from pathlib import Path
data_dir = Path("path_of_folder_with_files")
df = pd.concat(
pd.read_parquet(parquet_file).assign(file_origin=parquet_file.name)
for parquet_file in data_dir.glob("*")
)
Unfortunately this is quite slow. Is there a similar way to do this with pyarrow (or any other efficient package)?
import pyarrow.parquet as pq
table = pq.read_table(data_dir, use_threads=True)
df = table.to_pandas()