I am trying to create a Dask dataframe from a large number of parquet files stored in different HDFS directories. I have tried two approaches, but both of them seem to take a very long time.
Approach 1: call the read_parquet API with a glob path, e.g.:
dd.read_parquet("some path/**/*.parquet")
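For context, a fuller version of what I run looks roughly like this (the HDFS path and the pyarrow engine here are placeholders, not my real values):

import dask.dataframe as dd

# Placeholder path; the real one points at many HDFS directories,
# each containing many parquet files.
df = dd.read_parquet(
    "hdfs:///data/some_path/**/*.parquet",
    engine="pyarrow",  # assuming pyarrow is installed
)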
Approach 2: create a Dask dataframe from each directory and then call dd.concat on the list of all the dataframes. For each directory:
dd.read_parquet("some path/dirx/*.parquet")
and then concat:
dd.concat([list of df from each dir], interleave_partitions=True)
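Again for context, a sketch of this second approach (the directory names below are hypothetical):

import dask.dataframe as dd

# Hypothetical directory list; in reality there are many HDFS directories.
dirs = ["hdfs:///data/some_path/dir1", "hdfs:///data/some_path/dir2"]

# One dataframe per directory, then a single concatenated dataframe.
dfs = [dd.read_parquet(f"{d}/*.parquet") for d in dirs]
df = dd.concat(dfs, interleave_partitions=True)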
With both approaches it takes a very long time to create the dataframe.
Please suggest the best approach to read these parquet files.