
Is it possible to write to the same Parquet folder from different processes in Python?

I use fastparquet.

It seems to work, but I'm wondering how it is possible for the _metadata file not to have conflicts when two processes write to it at the same time.

Also, to make it work I had to use ignore_divisions=True, which is not ideal for read performance later when you load the Parquet file, right?
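Roughly, each process does something like this (a simplified sketch; the real data, paths and chunking differ):

```python
import dask.dataframe as dd
import pandas as pd

def write_chunk(df: pd.DataFrame, path: str) -> None:
    ddf = dd.from_pandas(df, npartitions=1)
    # append=True adds new row groups to the existing dataset;
    # ignore_divisions=True is needed because the index divisions of
    # chunks written by different processes may overlap.
    ddf.to_parquet(path, engine="fastparquet", append=True, ignore_divisions=True)
```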

hadim

1 Answer


Dask consolidates the metadata from the separate processes, so that it only writes the _metadata file once the rest is complete, and this happens in a single thread.

If you were writing separate parquet files to a single folder using your own multiprocessing setup, each process would typically write its own data file and no _metadata at all. You could either gather the pieces as Dask does, or consolidate the metadata from the data files after they are ready.
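A minimal sketch of that kind of setup (file names, chunking and the output directory are illustrative, not a prescription): each worker writes its own data file into the folder and never touches _metadata.

```python
import os
from multiprocessing import Pool

import pandas as pd
import fastparquet

OUT_DIR = "dataset.parquet"  # hypothetical output folder

def write_part(args):
    i, df = args
    # Each process writes a single, independent parquet file.
    fastparquet.write(os.path.join(OUT_DIR, f"part.{i}.parquet"), df)

if __name__ == "__main__":
    os.makedirs(OUT_DIR, exist_ok=True)
    chunks = [(i, pd.DataFrame({"x": range(i * 10, (i + 1) * 10)})) for i in range(4)]
    with Pool(4) as pool:
        pool.map(write_part, chunks)
```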

mdurant
  • Thanks! How would you create the metadata file after having written all the data files? – hadim Nov 23 '19 at 17:11
  • Or maybe it's more efficient to create an empty dask dataframe and use map_fn to populate it? But then how do I control the writing to disk? – hadim Nov 23 '19 at 17:14
  • There isn't a single function, but you would start with fastparquet.util.metadata_from_many – mdurant Nov 24 '19 at 00:47
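A sketch of that consolidation step, following on from the multiprocessing example above (hedged: exact signatures may vary between fastparquet versions; fastparquet.writer.merge is a convenience wrapper around metadata_from_many):

```python
import glob

from fastparquet import writer
from fastparquet.util import metadata_from_many

parts = sorted(glob.glob("dataset.parquet/part.*.parquet"))

# One call: read the footer of every data file and write a combined
# _metadata (and _common_metadata) next to them.
writer.merge(parts)

# Lower level: build the combined metadata object yourself, e.g. to
# inspect it before deciding where and how to write it out.
basepath, fmd = metadata_from_many(parts)
```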