I have a Kafka topic from which multiple JSONs land in a Databricks table, `landing_table`. We don't transform anything in this table, to keep the data as a source of truth. We have another layer, `staging_table`, where we read the JSON from `landing_table`, flatten and explode its nested array elements, and ingest the result into `staging_table`.
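The flatten + explode step can be illustrated in plain Python (a conceptual sketch of what Spark's `explode` does to a nested array column; the record shape and column names here are hypothetical, not from my actual schema):

```python
# Each landing record carries a nested array; explode emits one output row
# per array element, and flattening lifts the struct fields to top level.
landing_rows = [
    {"order_id": 1, "items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]},
    {"order_id": 2, "items": [{"sku": "C", "qty": 5}]},
]

def flatten_explode(rows):
    out = []
    for row in rows:
        for item in row["items"]:              # "explode" the array column
            flat = {"order_id": row["order_id"]}
            flat.update(item)                  # "flatten" the struct fields
            out.append(flat)
    return out

staged = flatten_explode(landing_rows)
# staged now has 3 flat rows: one per array element
```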
```python
import sys
import traceback

def source_data(schema: str, table: str, checkpoint: str):
    try:
        print('In getData_ae')
        # spark is the SparkSession provided by the Databricks runtime
        df = spark.readStream.format('delta') \
            .option('startingOffsets', 'earliest') \
            .option('checkpointLocation', checkpoint) \
            .option('ignoreChanges', True) \
            .table(f'{schema}.{table}')
        return df
    except Exception:
        traceback.print_exc()
        sys.exit(1)

df = source_data('schema', 'table', 'checkpoint')
```
The dataframe `df` contains 5.6 million rows.
To test the throughput of Delta streaming, I ran lift-and-shift code that just reads data from `staging_table` and saves it to another S3 bucket, as below.
```python
def dummy_write(df, batchId):
    df.write.saveAsTable('some_schema.some_table', format='delta',
                         mode='overwrite', path='s3://some_s3_path')

df.writeStream.format("delta") \
    .option("checkpointLocation", 's3://some_location') \
    .foreachBatch(dummy_write) \
    .trigger(once=True) \
    .start() \
    .awaitTermination()
```
The problem I am facing: my first job, which reads data from `landing_table`, performs explode + flatten + multiple transformations, and outputs a dataframe of 5.6 million rows, completes in 3 minutes. But the simple lift-and-shift activity mentioned above takes 3 hours to complete.
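For scale, here is the back-of-the-envelope throughput gap between the two jobs (the 5.6 million row count and the 3-minute / 3-hour durations are the figures from my runs above):

```python
rows = 5_600_000

fast_secs = 3 * 60      # transformation job: 3 minutes
slow_secs = 3 * 3600    # lift-and-shift job: 3 hours

fast_rate = rows / fast_secs   # rows/sec for the transformation job
slow_rate = rows / slow_secs   # rows/sec for the lift-and-shift job

print(f"transformation job: {fast_rate:,.0f} rows/sec")
print(f"lift-and-shift job: {slow_rate:,.0f} rows/sec")
print(f"slowdown factor:    {fast_rate / slow_rate:.0f}x")
```

So the trivial copy is running roughly 60x slower than the much heavier transformation job, which is what makes this so confusing.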
I am running this application on the following configuration:

- Driver: 128 GB memory, 16 cores
- Workers: 8 workers, 976 GB memory, 128 cores
I am really confused by this behaviour. Could anyone please let me know if I am doing anything wrong here? Is there a config/parameter I missed in my job? I took the Delta streaming reference from here. Any help is massively appreciated.
Edit 1:
Adding the application master UI to show the job stages. In the screenshot, the duration of the marked stage says 1 hour. I left it as it was, and even after 3 hours it's still the same.