I am running into an issue while using Delta Lake on AWS. We are using EMR on EKS to run Spark jobs and saving the data on S3.
The project in which I am using Delta Lake has late-arriving records coming in every hour. The data is partitioned (time-sorted) and we do an "Append" of the data to different partitions every hour.
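For reference, the hourly job is a plain partitioned append, roughly like the sketch below (the table path, input path, and partition column are illustrative, not our real names):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hourly-append").getOrCreate()

// Hourly batch of records; late arrivals mean a batch can touch
// partitions older than the current hour.
val hourlyBatch = spark.read.parquet("s3://bucket/incoming/2021-01-01-00/")

hourlyBatch.write
  .format("delta")
  .mode("append")              // plain append, may write to several partitions
  .partitionBy("event_date")   // illustrative time-based partition column
  .save("s3://bucket/events")  // illustrative Delta table path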
At the end of the day we launch a parallel compaction job to compact the data and avoid the small-files problem, but in the compaction run we only compact partitions that have been untouched for the last 48 hours, to avoid any conflict between the append and compact operations.
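The compaction pass follows the pattern from the Delta docs: rewrite a single partition with replaceWhere and dataChange = false. A minimal sketch of what we run per eligible partition (partition value, file count, and path are illustrative):

// Only partitions untouched for more than 48 hours are selected.
val partition = "event_date = '2020-12-30'"

spark.read
  .format("delta")
  .load("s3://bucket/events")
  .where(partition)
  .repartition(16)                    // target file count for the partition
  .write
  .format("delta")
  .option("dataChange", "false")      // rearrangement of existing data, not new data
  .option("replaceWhere", partition)  // overwrite only this partition
  .mode("overwrite")
  .save("s3://bucket/events")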
The issue we are facing is that during the compaction and regular runs, occasionally a conflict occurs in the _delta_log folder: for some weird reason the JSON (metadata) commit file written by the regular append job gets overwritten by the compaction job, and because of that the appended partition data is removed during VACUUM as well. I am unable to understand this behavior and need some help and guidance on how to avoid it.
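For context, the cleanup is a standard DeltaTable vacuum, roughly as below (path and retention are illustrative). My reading of the data loss: once the compaction commit has clobbered the append commit, the appended files are no longer referenced by any table version, so vacuum treats them as orphans and deletes them.

import io.delta.tables.DeltaTable

val table = DeltaTable.forPath(spark, "s3://bucket/events")  // illustrative path
// Deletes files that are no longer referenced by the log and are older
// than the retention window -- including appended files whose commit
// was overwritten.
table.vacuum(168)  // retention in hours (7 days)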
Note: the same application has been running fine on an on-prem Cloudera cluster for over a year; this issue is only observed on AWS S3. Could this be because of S3's eventual-consistency behavior?
As per the docs (https://docs.delta.io/latest/delta-storage.html#amazon-s3), Delta Lake on S3 supports concurrent writes only when they originate from a single Spark driver. But we are using EMR on EKS, which uses EMRFS, and as per the docs (https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-file-systems.html), EMRFS is an implementation of the Hadoop file system used for reading and writing regular files from Amazon EMR directly to Amazon S3.
EMRFS consistent view tracks consistency using a DynamoDB table to track objects in Amazon S3 that have been synced with or created by EMRFS. The metadata is used to track all operations (read, write, update, and copy); no actual content is stored in it. This metadata is used to validate whether the objects or metadata received from Amazon S3 match what is expected. This gives EMRFS the ability to check list consistency and read-after-write consistency for new objects EMRFS writes to Amazon S3, or for objects synced with EMRFS.
So the real question is: why, on EMRFS, is Delta Lake not behaving as expected and not supporting concurrent writes?
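My current understanding of the failure mode, sketched with the Hadoop FileSystem API (the bucket and version number are made up): Delta's commit protocol needs an atomic "create this file only if it does not already exist" primitive on the commit file, which HDFS can provide but plain S3 cannot, and consistent view only adds read-after-write/list consistency, not mutual exclusion.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Both jobs read the latest table version N and race to commit N+1.
val commit = new Path("s3://bucket/events/_delta_log/00000000000000000012.json")
val fs     = FileSystem.get(commit.toUri, new Configuration())

// On HDFS, create(path, overwrite = false) fails atomically if the file
// already exists, so the losing writer gets an exception and can retry
// as N+2. On S3 (EMRFS/S3A) a PUT is last-writer-wins, so the second
// commit can silently replace the first -- which matches the overwritten
// JSON file we observe.
val out = fs.create(commit, false)  // overwrite = false
out.close()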
We are using the following version of Delta:
<dependency>
    <groupId>io.delta</groupId>
    <artifactId>delta-core_2.11</artifactId>
    <version>0.6.1</version>
</dependency>
Any help on this issue is much appreciated.
I have already tried to debug the issue over multiple days, but it only shows up when there is a race between the Append and Compact jobs' meta-file creation.