
I have thousands of Parquet files that I need to process. Before processing them, I'm trying to gather various information from the Parquet metadata, such as the number of rows in each partition and the per-column mins and maxes.

I tried reading the metadata using dask.delayed, hoping to distribute the metadata-gathering tasks across my cluster, but this seems to destabilize Dask. See an example code snippet and a node-timeout error below.

Is there a way to read the Parquet metadata from Dask? I know Dask's "read_parquet" function has a "gather_statistics" option, which can be set to False to speed up the file reads, but I don't see a way to access all of the Parquet metadata/statistics when it is set to True.

Example code:


import dask
import fastparquet

@dask.delayed
def get_pf(item_to_read):
    # Open one file and pull out just the metadata pieces we care about.
    pf = fastparquet.ParquetFile(item_to_read)
    row_groups = pf.row_groups.copy()
    all_stats = pf.statistics.copy()
    cols = pf.info['columns'].copy()
    return [row_groups, all_stats, cols]

stats_arr = get_pf(item_to_read)  # a Delayed object; nothing runs until compute
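
These delayed calls are then computed across the cluster along these lines (a sketch; `paths` stands in for my actual list of Parquet files, and the scheduler address is a placeholder):

import dask
from dask.distributed import Client

client = Client("tcp://scheduler-address:8786")  # placeholder scheduler address

# One delayed metadata read per file, gathered in a single round trip.
tasks = [get_pf(p) for p in paths]  # `paths`: placeholder file list
results = dask.compute(*tasks)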

Example error:

2019-10-03 01:43:51,202 - INFO - 192.168.0.167 - distributed.worker - ERROR - Worker stream died during communication: tcp://192.168.0.223:34623
2019-10-03 01:43:51,203 - INFO - 192.168.0.167 - Traceback (most recent call last):
2019-10-03 01:43:51,204 - INFO - 192.168.0.167 -   File "/usr/local/lib/python3.7/dist-packages/distributed/comm/core.py", line 218, in connect
2019-10-03 01:43:51,206 - INFO - 192.168.0.167 -     quiet_exceptions=EnvironmentError,
2019-10-03 01:43:51,207 - INFO - 192.168.0.167 -   File "/usr/local/lib/python3.7/dist-packages/tornado/gen.py", line 729, in run
2019-10-03 01:43:51,210 - INFO - 192.168.0.167 -     value = future.result()
2019-10-03 01:43:51,211 - INFO - 192.168.0.167 - tornado.util.TimeoutError: Timeout
2019-10-03 01:43:51,212 - INFO - 192.168.0.167 -
2019-10-03 01:43:51,213 - INFO - 192.168.0.167 - During handling of the above exception, another exception occurred:
2019-10-03 01:43:51,214 - INFO - 192.168.0.167 -
2019-10-03 01:43:51,215 - INFO - 192.168.0.167 - Traceback (most recent call last):
2019-10-03 01:43:51,217 - INFO - 192.168.0.167 -   File "/usr/local/lib/python3.7/dist-packages/distributed/worker.py", line 1841, in gather_dep
2019-10-03 01:43:51,218 - INFO - 192.168.0.167 -     self.rpc, deps, worker, who=self.address
2019-10-03 01:43:51,219 - INFO - 192.168.0.167 -   File "/usr/local/lib/python3.7/dist-packages/tornado/gen.py", line 729, in run
2019-10-03 01:43:51,220 - INFO - 192.168.0.167 -     value = future.result()
2019-10-03 01:43:51,222 - INFO - 192.168.0.167 -   File "/usr/local/lib/python3.7/dist-packages/tornado/gen.py", line 736, in run
2019-10-03 01:43:51,223 - INFO - 192.168.0.167 -     yielded = self.gen.throw(*exc_info)  # type: ignore
2019-10-03 01:43:51,224 - INFO - 192.168.0.167 -   File "/usr/local/lib/python3.7/dist-packages/distributed/worker.py", line 3029, in get_data_from_worker
2019-10-03 01:43:51,225 - INFO - 192.168.0.167 -     comm = yield rpc.connect(worker)
2019-10-03 01:43:51,640 - INFO - 192.168.0.167 -   File "/usr/local/lib/python3.7/dist-packages/tornado/gen.py", line 729, in run
2019-10-03 01:43:51,641 - INFO - 192.168.0.167 -     value = future.result()
2019-10-03 01:43:51,643 - INFO - 192.168.0.167 -   File "/usr/local/lib/python3.7/dist-packages/tornado/gen.py", line 736, in run
2019-10-03 01:43:51,644 - INFO - 192.168.0.167 -     yielded = self.gen.throw(*exc_info)  # type: ignore
2019-10-03 01:43:51,645 - INFO - 192.168.0.167 -   File "/usr/local/lib/python3.7/dist-packages/distributed/core.py", line 866, in connect
2019-10-03 01:43:51,646 - INFO - 192.168.0.167 -     connection_args=self.connection_args,
2019-10-03 01:43:51,647 - INFO - 192.168.0.167 -   File "/usr/local/lib/python3.7/dist-packages/tornado/gen.py", line 729, in run
2019-10-03 01:43:51,649 - INFO - 192.168.0.167 -     value = future.result()
2019-10-03 01:43:51,650 - INFO - 192.168.0.167 -   File "/usr/local/lib/python3.7/dist-packages/tornado/gen.py", line 736, in run
2019-10-03 01:43:51,651 - INFO - 192.168.0.167 -     yielded = self.gen.throw(*exc_info)  # type: ignore
2019-10-03 01:43:51,652 - INFO - 192.168.0.167 -   File "/usr/local/lib/python3.7/dist-packages/distributed/comm/core.py", line 230, in connect
2019-10-03 01:43:51,653 - INFO - 192.168.0.167 -     _raise(error)
2019-10-03 01:43:51,654 - INFO - 192.168.0.167 -   File "/usr/local/lib/python3.7/dist-packages/distributed/comm/core.py", line 207, in _raise
2019-10-03 01:43:51,656 - INFO - 192.168.0.167 -     raise IOError(msg)
2019-10-03 01:43:51,657 - INFO - 192.168.0.167 - OSError: Timed out trying to connect to 'tcp://192.168.0.223:34623' after 10 s: connect() didn't finish in time
dan

1 Answer


Does dd.read_parquet take a long time? If not, you can follow whatever strategy it uses internally and do the reading in the client.

If the data has a single _metadata file in the root directory, then you can simply open this with fastparquet, which is exactly what Dask would do. It contains all the details of all of the data pieces.
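
For example (a sketch; the path is a placeholder for your dataset root):

import fastparquet

# Opening the consolidated _metadata file reads footer metadata only,
# never the data pages themselves.
pf = fastparquet.ParquetFile("data_root/_metadata")  # placeholder path

rows_per_piece = [rg.num_rows for rg in pf.row_groups]  # row count per row group
stats = pf.statistics  # per-column min/max/null-count statistics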

There is no particular reason distributing the metadata reads should be a problem, but you should be aware that in some cases the total metadata items can add up to a substantial size.
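
If you do keep the reads distributed, one way to bound both per-task overhead and result size is to batch files per task and return only plain Python data (a sketch; `paths` and the batch size of 50 are placeholders):

import dask
import fastparquet

@dask.delayed
def get_batch_metadata(batch):
    # Return small, plain values rather than ParquetFile objects so the
    # results shipped back to the client stay compact.
    out = []
    for path in batch:
        pf = fastparquet.ParquetFile(path)
        n_rows = sum(rg.num_rows for rg in pf.row_groups)
        out.append((path, n_rows, pf.statistics))
    return out

batches = [paths[i:i + 50] for i in range(0, len(paths), 50)]  # `paths`: placeholder file list
metadata = dask.compute(*[get_batch_metadata(b) for b in batches])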

mdurant
  • Thanks, @mdurant. dd.read_parquet itself is fast and gives me a lot of information directly (e.g., column names), but getting information like the number of rows in each partition is much slower than reading the metadata directly, because you basically have to persist/compute the entire Dask read operation. So it sounds like delaying the fastparquet metadata reads and computing them to distribute the work, as I've been doing, is the best approach. Perhaps I have other issues that are causing the instability. – dan Oct 03 '19 at 22:50