Given a parquet dataset with a job_id abc, saved in parts as follows:
my_dataset/
    part0.abc.parquet
    part1.abc.parquet
    part2.abc.parquet
It is possible to read the dataset with vaex or pandas:
# with vaex
import vaex
df = vaex.open('my_dataset')

# with pandas
import pandas as pd
df = pd.read_parquet('my_dataset')
But sometimes our ETL pipeline appends parquet parts from another job_id xyz to the my_dataset directory, which causes the directory to become something like:
my_dataset/
    part0.abc.parquet
    part0.xyz.parquet
    part1.abc.parquet
    part1.xyz.parquet
    part2.abc.parquet
    part2.xyz.parquet
The main problem is that we don't know the job IDs created by the ETL pipeline, only that they are unique.
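That said, the job IDs can be recovered from the filenames themselves. A minimal sketch, assuming the part<N>.<job_id>.parquet naming shown above is stable:

import glob
import os

# Collect the distinct job IDs from the part filenames,
# taking the second dot-separated token of each name.
job_ids = {os.path.basename(p).split('.')[1]
           for p in glob.glob('my_dataset/part*.parquet')}
print(job_ids)  # e.g. {'abc', 'xyz'}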
Is there some method in pandas.read_parquet to automatically group the parts by job ID? E.g.
import pandas as pd
dfs = pd.read_parquet('my_dataset')

[out]:

{
    'abc': pd.DataFrame,  # read from `part*.abc.parquet`
    'xyz': pd.DataFrame   # read from `part*.xyz.parquet`
}
I've tried doing some glob reading, roughly as sketched below.
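A minimal sketch of that glob approach (again assuming the job ID is always the second dot-separated token in the filename):

import glob
import os
from collections import defaultdict

import pandas as pd

# Group the part files by the job ID embedded in each filename.
parts_by_job = defaultdict(list)
for path in sorted(glob.glob('my_dataset/part*.parquet')):
    job_id = os.path.basename(path).split('.')[1]
    parts_by_job[job_id].append(path)

# Read each group of parts into its own DataFrame.
dfs = {
    job_id: pd.concat(map(pd.read_parquet, paths), ignore_index=True)
    for job_id, paths in parts_by_job.items()
}

But this feels like reimplementing something the library may already provide.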