
I am trying to create a dask dataframe from a large number of parquet files stored in different HDFS directories. I have tried two approaches, but both of them seem to take a very long time.

Approach 1: call the read_parquet API with a glob path, e.g.:

dd.read_parquet("some path/**/*.parquet")

Approach 2: create a dask dataframe from each directory and then call dd.concat on the list of all the dataframes. For each directory:

dd.read_parquet("some path/dirx/*.parquet")

and then concat:

dd.concat([list of df from each dir], interleave_partitions=True)
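
For reference, a minimal end-to-end sketch of what Approach 2 looks like, assuming hdfs:// URLs that dask resolves through hdfs3 or pyarrow; the directory names are hypothetical placeholders for the real paths:

import dask.dataframe as dd

# Hypothetical directory list; substitute the real HDFS paths.
dirs = ["hdfs:///some/path/dir1", "hdfs:///some/path/dir2"]

# Build one lazy dataframe per directory, then stitch them together.
dfs = [dd.read_parquet(d + "/*.parquet") for d in dirs]
df = dd.concat(dfs, interleave_partitions=True)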

With both approaches it takes a very long time to create the dataframe.

Please suggest the best approach for reading these parquet files.

Santosh Kumar
  • Whatever you do, dask will have to touch every single file and read its metadata before you can construct a dataframe, which in this case means establishing a connection to some HDFS node for each file. – mdurant Mar 22 '18 at 14:13
  • Thanks @mdurant for your comment. I have so far seen that spark reads these parquet files in multiple directories much faster than dask. Is it possible to convert the spark dataframe to a dask dataframe, or is there a way I can convert a spark dataframe to a dask dataframe using pyarrow? – Santosh Kumar Mar 23 '18 at 06:12
  • You may want to experiment with the difference between the parquet readers (fastparquet and pyarrow) and hdfs interfaces (hdfs3 and arrow's libhdfs). One massive improvement you can make is to write _metadata files when creating the data. – mdurant Mar 23 '18 at 13:05
  • Also, fastparquet includes a [function](https://github.com/dask/fastparquet/blob/master/fastparquet/writer.py#L993) to create a metadata file per directory, or a global metadata file; you only have to do this once, and subsequent reads can use that file rather than opening all the individual data files (see the sketch after these comments). – mdurant Mar 23 '18 at 14:38
  • Tried creating a dask dataframe using the pyarrow engine on multiple directories, but it fails with an error (ValueError: Found non-unique column index) when running any action on the dataframe. – Santosh Kumar Mar 23 '18 at 22:24
  • Then I recommend using the merge function to make metadata files - the initial action may be slow, but subsequent reads will be faster. Note that fastparquet also works with arrow's HDFS. You could profile what is taking the time; there may be a quick improvement available. – mdurant Mar 23 '18 at 22:32
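
Following up on the comments above, here is a rough sketch of the consolidated-metadata idea, assuming the hdfs3 client and fastparquet's merge function (fastparquet.writer.merge); the host, port, and paths are hypothetical, and exact call signatures may differ between versions:

import hdfs3
import fastparquet
import dask.dataframe as dd

# Assumed connection details for the HDFS namenode.
hdfs = hdfs3.HDFileSystem(host="namenode", port=8020)

# Collect all part files once (hypothetical directory layout).
files = hdfs.glob("/some/path/*/*.parquet")

# Write a consolidated _metadata file at the common root; this only has to be done once.
fastparquet.writer.merge(files, open_with=hdfs.open)

# Subsequent reads can use the consolidated footer instead of opening every data file.
df = dd.read_parquet("hdfs:///some/path", engine="fastparquet")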

0 Answers