I am trying to use the pyarrow.dataset.write_dataset function to write data into HDFS. However, if I write into a directory that already exists and contains some data, the data is overwritten instead of new files being created. Is there a way to conveniently "append" to an already existing dataset without having to read in all the data first? I do not need the data to be in one file; I just don't want the old data deleted.
What I currently do, which doesn't work:
import pyarrow.dataset as ds

parquet_format = ds.ParquetFileFormat()
write_options = parquet_format.make_write_options(
    use_deprecated_int96_timestamps=True,
    coerce_timestamps=None,
    allow_truncated_timestamps=True)

ds.write_dataset(data=data, base_dir='my_path',
                 filesystem=hdfs_filesystem,
                 format=parquet_format,
                 file_options=write_options)
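One thing I have been considering is giving each write a unique basename_template so the new files don't collide with the existing part files, combined with existing_data_behavior='overwrite_or_ignore' so existing files are left in place. This is just a sketch of what I have in mind, assuming those parameters are available in my pyarrow version (I believe existing_data_behavior needs pyarrow >= 7.0); data, my_path, hdfs_filesystem, parquet_format and write_options are the same as above:

import uuid
import pyarrow.dataset as ds

# Sketch: use a unique file-name prefix per write so nothing gets overwritten,
# and tell write_dataset to ignore files that are already in the directory.
ds.write_dataset(
    data=data,
    base_dir='my_path',
    basename_template=f'part-{uuid.uuid4().hex}-{{i}}.parquet',
    existing_data_behavior='overwrite_or_ignore',
    filesystem=hdfs_filesystem,
    format=parquet_format,
    file_options=write_options)

Is something like this the intended way to append to a dataset, or is there a cleaner mechanism I am missing?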