
I am converting my table into Parquet file format using Azure Data Factory, and querying the Parquet files with Databricks for reporting. I want to update only the existing records that were updated in the original SQL Server table. Since I am running this daily on a very big table, I don't want to truncate and reload the entire table, as that would be costly.

Is there any way I can update those Parquet files without performing a truncate-and-reload operation?


3 Answers


Parquet is immutable by default, so the only way to change the data is to rewrite the table. But in-place updates become possible if you switch to the Delta file format, which supports updating and deleting entries and also supports the MERGE operation.

You can still use the Parquet format for producing the data, but then you need to use that data to update the Delta table.
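For illustration, here is a minimal PySpark sketch of such a MERGE, assuming the changed rows exported from Data Factory land as Parquet under a hypothetical path and the reporting table is a Delta table keyed on an `id` column; the paths, table location, and join key are placeholders.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Read the latest batch of changed rows exported as Parquet (hypothetical path)
updates_df = spark.read.parquet("/mnt/landing/updates")

# Merge the changes into the Delta table: update matching rows, insert new ones
target = DeltaTable.forPath(spark, "/mnt/delta/my_table")
(target.alias("t")
    .merge(updates_df.alias("u"), "t.id = u.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```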

Alex Ott

I have found a workaround to this problem.

  1. Read the Parquet files into a DataFrame using any tool or a Python script.
  2. Create a temporary table or view from the DataFrame.
  3. Run SQL queries to update or delete the records as needed.
  4. Convert the table back into a DataFrame.
  5. Overwrite the existing Parquet files with the new data.
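
For illustration, a minimal PySpark sketch of these steps, assuming a hypothetical Parquet dataset with `id` and `status` columns; the paths, column names, and the SQL transformation are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 1. Read the Parquet files into a DataFrame
df = spark.read.parquet("/mnt/parquet/my_table")

# 2. Expose it as a temporary view
df.createOrReplaceTempView("my_table")

# 3./4. Apply the modifications with SQL and get the result back as a DataFrame
updated_df = spark.sql("""
    SELECT id,
           CASE WHEN status = 'old' THEN 'new' ELSE status END AS status
    FROM my_table
    WHERE status <> 'deleted'
""")

# 5. Write the result back as Parquet. Because Spark reads lazily, write to a
# staging path first rather than overwriting the source path you are still reading.
updated_df.write.mode("overwrite").parquet("/mnt/parquet/my_table_staging")
```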
Aniket Kumar

Always go for a soft delete when working with NoSQL or immutable storage. A hard delete is very costly.

Also, with a soft delete, downstream pipelines can consume the update and act upon it.
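
For illustration, a minimal PySpark sketch of a soft delete, assuming a hypothetical `is_deleted` flag column and a list of keys flagged as deleted upstream; all paths, column names, and keys are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("/mnt/parquet/my_table")
keys_to_delete = [101, 102, 103]  # hypothetical keys marked deleted upstream

# Flag the rows instead of physically removing them; downstream jobs can
# filter on is_deleted (or react to the change) as needed.
flagged_df = df.withColumn(
    "is_deleted",
    F.when(F.col("id").isin(keys_to_delete), F.lit(True)).otherwise(F.lit(False))
)

flagged_df.write.mode("overwrite").parquet("/mnt/parquet/my_table_staging")
```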