
I want to read a folder full of parquet files into pandas DataFrames. In addition to the data itself, I want to store the name of the file each row was read from in a column "file_origin". In pandas I am able to do it like this:

import pandas as pd
from pathlib import Path

data_dir = Path("path_of_folder_with_files")
df = pd.concat(
    pd.read_parquet(parquet_file).assign(file_origin=parquet_file.name)
    for parquet_file in data_dir.glob("*")
)


Unfortunately this is quite slow. Is there a similar way to do this with pyarrow (or any other efficient package)?

I can read the whole folder at once with pyarrow, but then I lose track of which file each row came from:

import pyarrow.parquet as pq

table = pq.read_table(data_dir, use_threads=True)
df = table.to_pandas()
Carsten
1 Answer

You could implement it with pyarrow directly instead of pandas:

import pyarrow as pa
import pyarrow.parquet as pq

batches = []
for file_name in data_dir.glob("*"):
    table = pq.read_table(file_name)
    table = table.append_column(
        "file_origin", pa.array([file_name.name] * len(table), pa.string())
    )
    batches.extend(table.to_batches())
table = pa.Table.from_batches(batches)
df = table.to_pandas()

I don't expect it to be significantly faster, unless you have a lot of strings and objects in your table (which are slow in pandas).

0x26res