
I'm trying to process a dataset and make incremental updates as I write it out with Dask. The Dask metadata file would help a lot when it comes to rereading the processed data. However, as I write new partitions/subsets to the same path, the metadata file there gets overwritten to describe only the new partitions/subsets rather than updated to include them.

import dask.dataframe as dd

# Read the existing dataset.
df = dd.read_parquet(read_path)
# some transformations
df = …
# Write the result partitioned by column values, with a _metadata file.
df.to_parquet(write_path, partition_on=[col1, col2, …], write_metadata_file=True)

I've looked in a few places and haven't found an obvious way to do this. Has anyone handled such a use case, either by incrementally updating the metadata file or by editing/combining a few of them? Any suggestions would be appreciated.

Shi Fan

2 Answers


Dask's to_parquet() method has an append mode, which I think is exactly what you want here:

append : bool, optional

    If False (default), construct data-set from scratch.
    If True, add new row-group(s) to an existing data-set.
    In the latter case, the data-set must exist, and the schema must match the input data.

I have used this successfully with the pyarrow engine, version 1.0.1.
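
As a minimal sketch of the incremental pattern (the input paths and column names below are placeholders, not from the question):

import dask.dataframe as dd

# First batch: create the dataset from scratch, including its _metadata file.
df_a = dd.read_parquet(read_path_a)  # placeholder input path
df_a.to_parquet(
    write_path,
    engine="pyarrow",
    partition_on=["col1", "col2"],
    write_metadata_file=True,
)

# Later batch: append=True adds new row groups to the existing dataset.
# The schema must match the data already on disk.
df_b = dd.read_parquet(read_path_b)  # placeholder input path
df_b.to_parquet(
    write_path,
    engine="pyarrow",
    partition_on=["col1", "col2"],
    write_metadata_file=True,
    append=True,
)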

Krishan
  • Thanks! I've tried using `append`. While it does incrementally update the partitions/subsets, it doesn't incrementally update the `_metadata` file, which is what I hope to do. The `_metadata` gets overwritten even in `append` mode. – Shi Fan Oct 13 '20 at 14:05
  • Huh, my usage of append mode leads to the metadata file being updated correctly. What engine are you using? I've had no problems using `pyarrow`. – Krishan Oct 14 '20 at 15:26
  • Just to be clear - the `_metadata` file will always be overwritten, but I understand you mean that the new `_metadata` file after doing an `append` ignores the pre-existing partitions? – Krishan Oct 14 '20 at 15:27
  • I was using `fastparquet`. And yes, I mean the new `_metadata` file would ignore the pre-existing partitions rather than add the new partitions to the pre-existing ones. To be clear, did you get it to work correctly (i.e. `_metadata` gets overwritten consistently with all partitions in the directory) while using the `pyarrow` engine? – Shi Fan Oct 14 '20 at 19:19
  • Yep, I have written/appended/read many test datasets and it has worked for me every time, using `pyarrow==1.0.1`. – Krishan Oct 15 '20 at 08:16
  • Yeah, can confirm `pyarrow` works fine. Seems to be a `fastparquet`-specific issue. Thanks for the info! – Shi Fan Oct 15 '20 at 20:11
  • It would be great if you could raise an issue on the `fastparquet` repo to track this! https://github.com/dask/fastparquet – Krishan Oct 16 '20 at 08:18

This problem is specific to the fastparquet engine (it works fine in pyarrow).
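
If you want to double-check that appended partitions made it into `_metadata`, one way (a sketch, assuming write_path is a local filesystem path) is to read the footer directly with pyarrow:

import pyarrow.parquet as pq

# Read the _metadata footer and count the row groups it describes;
# the count should grow after each successful append.
meta = pq.read_metadata(f"{write_path}/_metadata")
print(meta.num_row_groups)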

Shi Fan