
I am working on a project that includes cleaning a large dataset. I learned how to create a dataset from multiple parquet files, but I have not found a way to make changes to that dataset: overwriting it, deleting rows, or adding/mutating columns.

Hope you can help me with that.

  • You should provide more details, better yet a reproducible example. Anyway, the only way you can _persist_ changes in a parquet file or parquet dataset is saving another one, because parquet files are immutable. You can load a parquet file (or a parquet dataset, with multiple files in a Hive-partition style) into an Arrow Table and use Arrow or DuckDB to clean it, but those changes will not be saved until you either put all this data in a DuckDB table or save it again in another parquet file (or parquet dataset). – Fabio Vaz Sep 05 '22 at 14:53
  • The key takeaway from that well-informed comment: ***parquet files are immutable***. Thanks @FabioVaz – r2evans Sep 09 '22 at 15:55

0 Answers