
Given a parquet dataset with a job_id abc, saved in parts as follows:

my_dataset/
  part0.abc.parquet
  part1.abc.parquet
  part2.abc.parquet

It is possible to read the dataset with vaex or pandas:

import vaex
df = vaex.open('my_dataset')

import pandas as pd
df = pd.read_parquet('my_dataset')

But sometimes our ETL pipeline appends parquet parts from another job ID, xyz, to the my_dataset directory, which causes the directory to become something like:

my_dataset/
  part0.abc.parquet
  part0.xyz.parquet
  part1.abc.parquet
  part1.xyz.parquet
  part2.abc.parquet
  part2.xyz.parquet

The main problem is that we don't know the job IDs created by the ETL pipeline; we only know that they are unique.

Is there some method in pandas.read_parquet to automatically group the parts together? E.g.

import pandas as pd
dfs = pd.read_parquet('my_dataset')

[out]:

{
 'abc': pd.DataFrame, # That reads from `part*.abc.parquet`
 'xyz': pd.DataFrame  # That reads from `part*.xyz.parquet`
} 

I've tried doing some glob reading.
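
Presumably something like this (a minimal sketch with a hard-coded job ID, which only works when the job ID is already known, which is exactly what isn't guaranteed here):

import glob
import pandas as pd

# Assumption: the job ID is known in advance.
job_id = 'abc'
parts = sorted(glob.glob(f'my_dataset/part*.{job_id}.parquet'))
df = pd.concat((pd.read_parquet(p) for p in parts), ignore_index=True)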

  • I don't think it's currently supported out of the box. See [this answer](https://stackoverflow.com/a/60140780/4727702) for the usage of `glob` – Yevhen Kuzmovych Feb 28 '23 at 13:38
  • 1
    Wouldn't it be worth fixing the directory structure (e.g. with subdirectories?) – mozway Feb 28 '23 at 13:47
  • Not that I know of. If I were in your position and couldn't change the dir structure, I'd go with `glob` -> `sort` -> `groupby`. – Timus Feb 28 '23 at 15:47
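
The `glob` -> `sort` -> `groupby` route suggested in the comments could look roughly like this (a minimal sketch; the job-ID extraction assumes the `part<N>.<job_id>.parquet` naming shown above):

import glob
import os
from itertools import groupby

import pandas as pd

def job_id_of(path):
    # Assumption: filenames follow the part<N>.<job_id>.parquet pattern above.
    return os.path.basename(path).split('.')[-2]

# Sort by job ID first so that itertools.groupby sees each ID as one contiguous run.
paths = sorted(glob.glob('my_dataset/part*.parquet'), key=job_id_of)
dfs = {
    job_id: pd.concat((pd.read_parquet(p) for p in parts), ignore_index=True)
    for job_id, parts in groupby(paths, key=job_id_of)
}

dfs should then map each job ID to a DataFrame built from its parts, i.e. dfs['abc'] and dfs['xyz'] in the example above.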

0 Answers