
I am converting my table into Parquet file format using Azure Data Factory, and querying the Parquet files with Databricks for reporting. I want to update only the existing records that were updated in the original SQL Server table. Since I am running this daily on a very big table, I don't want to truncate and reload the entire table, as that would be costly.

Is there any way I can update those Parquet files without performing a truncate-and-reload operation?


3 Answers


Parquet is immutable by default, so the only way to change the data is to rewrite the table. But in-place updates become possible if you switch to the Delta file format, which supports updating and deleting entries and also supports the MERGE operation.

You can still use the Parquet format for producing the data, but then you need to use that data to update the Delta table.
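For illustration, here is a minimal PySpark sketch of such a MERGE, assuming the changed rows exported from Data Factory land as Parquet under a hypothetical path and the reporting table is a Delta table keyed on an `id` column; the paths, table location, and join key are placeholders.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Read the latest batch of changed rows exported as Parquet (hypothetical path)
updates_df = spark.read.parquet("/mnt/landing/updates")

# Merge the changes into the Delta table: update matching rows, insert new ones
target = DeltaTable.forPath(spark, "/mnt/delta/my_table")
(target.alias("t")
    .merge(updates_df.alias("u"), "t.id = u.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```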

Alex Ott

I have found a workaround to this problem.

  1. Read the Parquet files into a DataFrame using any tool or a Python script.
  2. Create a temporary table or view from the DataFrame.
  3. Run SQL queries to update or delete the records as needed.
  4. Convert the table back into a DataFrame.
  5. Overwrite the existing Parquet files with the new data.
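
For illustration, a minimal PySpark sketch of these steps, assuming a hypothetical Parquet dataset with `id` and `status` columns; the paths, column names, and the SQL transformation are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 1. Read the Parquet files into a DataFrame
df = spark.read.parquet("/mnt/parquet/my_table")

# 2. Expose it as a temporary view
df.createOrReplaceTempView("my_table")

# 3./4. Apply the modifications with SQL and get the result back as a DataFrame
updated_df = spark.sql("""
    SELECT id,
           CASE WHEN status = 'old' THEN 'new' ELSE status END AS status
    FROM my_table
    WHERE status <> 'deleted'
""")

# 5. Write the result back as Parquet. Because Spark reads lazily, write to a
# staging path first rather than overwriting the source path you are still reading.
updated_df.write.mode("overwrite").parquet("/mnt/parquet/my_table_staging")
```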
Aniket Kumar

Always go for a soft delete when working with NoSQL or immutable storage. A hard delete is very costly.

Also, with a soft delete, downstream pipelines can consume the update and act upon it.
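
For illustration, a minimal PySpark sketch of a soft delete, assuming a hypothetical `is_deleted` flag column and a list of keys flagged as deleted upstream; all paths, column names, and keys are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("/mnt/parquet/my_table")
keys_to_delete = [101, 102, 103]  # hypothetical keys marked deleted upstream

# Flag the rows instead of physically removing them; downstream jobs can
# filter on is_deleted (or react to the change) as needed.
flagged_df = df.withColumn(
    "is_deleted",
    F.when(F.col("id").isin(keys_to_delete), F.lit(True)).otherwise(F.lit(False))
)

flagged_df.write.mode("overwrite").parquet("/mnt/parquet/my_table_staging")
```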