
I am trying to use the pyarrow.dataset.write_dataset function to write data into HDFS. However, if I write into a directory that already exists and has some data, that data is overwritten instead of a new file being created. Is there a way to conveniently "append" to an already existing dataset without having to read in all the data first? I do not need the data to be in one file, I just don't want to delete the old one.

What I currently do, which doesn't work:

import pyarrow.dataset as ds

parquet_format = ds.ParquetFileFormat()
write_options = parquet_format.make_write_options(
    use_deprecated_int96_timestamps=True,
    coerce_timestamps=None,
    allow_truncated_timestamps=True)
ds.write_dataset(data=data, base_dir='my_path', filesystem=hdfs_filesystem,
                 format=parquet_format, file_options=write_options)
ira

2 Answers


Currently, the write_dataset function uses a fixed file name template (part-{i}.parquet, where i is a counter if you are writing multiple batches; when writing a single Table, i will always be 0).

This means that when writing multiple times to the same directory, it might indeed overwrite pre-existing files if those are named part-0.parquet.

You can solve this by making sure write_dataset uses a unique file name for each write through the basename_template argument, e.g.:

ds.write_dataset(data=data, base_dir='my_path',
                 basename_template='my-unique-name-{i}.parquet', ...)

If you want a unique name generated automatically for each write, you could, for example, include a random string in the file name. One option for this is the Python uuid stdlib module: basename_template = "part-{i}-" + uuid.uuid4().hex + ".parquet". Another option is to include the current time of writing in the file name, e.g. basename_template = "part-{:%Y%m%d}-{{i}}.parquet".format(datetime.datetime.now())
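A minimal, self-contained sketch of both options, writing a toy table to a local path (swap in the data, base_dir and hdfs_filesystem from the question):

import datetime
import uuid

import pyarrow as pa
import pyarrow.dataset as ds

# Hypothetical stand-in for the data from the question.
table = pa.table({"a": [1, 2, 3]})

# Option 1: a random uuid suffix, so repeated writes never reuse a file name.
basename_template = "part-{i}-" + uuid.uuid4().hex + ".parquet"

# Option 2: the current date instead (add time components to the format
# string if you write more than once per day).
# basename_template = "part-{:%Y%m%d}-{{i}}.parquet".format(datetime.datetime.now())

ds.write_dataset(data=table, base_dir="my_path",
                 basename_template=basename_template, format="parquet")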

See https://issues.apache.org/jira/browse/ARROW-10695 for some more discussion about customizing the template, and I opened a new issue specifically about silently overwriting data: https://issues.apache.org/jira/browse/ARROW-12358

joris
  • Thank you for the answer and opening the ticket! Do I therefore understand correctly that the current behavior for partitioned datasets will lead to sometimes overwriting and sometimes appending data, depending on whether the name matches a file already present in a given partition? (the default behavior is, I think, to increment the counter by 1 for each partition) – ira Apr 13 '21 at 12:42
  • 1
  • Indeed, depending on how many parts it would write / are already present, you can get a mixture of overwriting and appending (which I think is certainly something we should fix, and for which I opened https://issues.apache.org/jira/browse/ARROW-12358). In practice, it depends on the `data` you pass to `write_dataset`. If it's an in-memory table, only "part-0.parquet" will ever be written. If it's a partitioned dataset itself (eg from a different file format), there can indeed be many parts written. – joris Apr 13 '21 at 13:33
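To make the partitioned case discussed in the comments concrete, here is a minimal sketch (using a hypothetical hive-style partitioning on a toy year column, not taken from the thread) of how a single write can produce one part file per partition directory; a second write with the default basename_template would then collide with exactly those part-0.parquet names:

import pyarrow as pa
import pyarrow.dataset as ds

# Toy table with a column to partition on.
table = pa.table({
    "year": [2019, 2019, 2020, 2020],
    "value": [1.0, 2.0, 3.0, 4.0],
})

# Hive-style partitioning on "year": one directory per distinct value.
partitioning = ds.partitioning(pa.schema([("year", pa.int64())]), flavor="hive")

ds.write_dataset(data=table, base_dir="my_path", format="parquet",
                 partitioning=partitioning)

# Rough resulting layout (the exact counters depend on the pyarrow version):
# my_path/year=2019/part-0.parquet
# my_path/year=2020/part-0.parquet
# A second write with the default template would target the same
# part-0.parquet names and overwrite them.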

For those who are here to work out how to use make_write_options() with write_dataset, try this:

import pyarrow.dataset as ds

parquet_format = ds.ParquetFileFormat()
write_options = parquet_format.make_write_options(
    use_deprecated_int96_timestamps=False, coerce_timestamps='us',
    allow_truncated_timestamps=True)
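The resulting write_options are then passed to write_dataset through the file_options argument, just as in the question; in this sketch the table and "some_path" are hypothetical placeholders:

import pyarrow as pa
import pyarrow.dataset as ds

# parquet_format and write_options come from the snippet above.
table = pa.table({"ts": pa.array([0, 1, 2], type=pa.timestamp("us"))})

ds.write_dataset(data=table, base_dir="some_path",
                 format=parquet_format, file_options=write_options)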
Contango