In Delta Lake everything works on file level. So there are no 'in place' updates or deletes. Say a single record gets deleted (or updated) then roughly the following happens:
- Read in the parquet file with the relevant record (+ the other records which happen to be in the file)
- Write out all records except for the deleted record into a new parquet file
- Update the transaction log with a new version, marking the old parquet file as removed and the new parquet file as added. Note the old parquet file doesn't get physically deleted until you run the
VACUUM
command.
The process for an update is basically the same.
To answer your questions more specifically:
In case record is deleted, what are the chances that the delta log
would point to a new file , and that this deleted record would be read
as a new one?
The delta log will point to a new file, but the deleted record will not be in there. There will be all the other records which happened to be in the original file.
In case record is updated, what would be the chance that delta log
would not point to a new file, and that this updated record would not
be considered ?
Files are not updated in place, so this doesn't happen. A new file is written containing the updated record (+ any other other records in the original file). The transaction log is updated to 'point' to this new file.
In case some maintenance is happening on the delta table, what are the
chances that some new files are written out of nowhere ? Which may
cause a record to be re-ingested
This is possible, although not 'out of nowhere'. For example if you run OPTIMIZE
existing parquet files get reshuffled/combined to improve performance. Basically this means a number of new parquet files will be written and a new version in the transaction log will point to these parquet files. If you pickup all new files after this you will re-ingest data.
Some considerations: if your delta table is append only you could use structured streaming to read from it instead. If not then Databricks offers Change Data Feed giving your record level details of inserts, updates and deletes.