
I have a Kafka topic from which multiple JSONs land into a Databricks table: landing_table. We don't transform anything in this table, to keep the data as a source of truth.

We have another layer, staging_table, where we read the JSON from landing_table, flatten and explode its nested array elements, and ingest the result into staging_table.
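For context, that staging transformation is roughly of the following shape (only a sketch: landing_df and the payload/items column names are made up, since the actual schema isn't shown here).

from pyspark.sql.functions import col, explode

# Hypothetical flatten + explode of a nested array column in the landing JSON.
flattened_df = (
    landing_df
    .select(col('payload.id').alias('id'),
            explode(col('payload.items')).alias('item'))
    .select('id', 'item.*')  # spread the exploded struct into top-level columns
)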

import sys
import traceback

def source_data(schema: str, table: str, checkpoint: str):
    try:
        print('In getData_ae')
        # Streaming read of the Delta table
        df = spark.readStream.format('delta') \
            .option('startingOffsets', 'earliest') \
            .option('checkpointLocation', checkpoint) \
            .option('ignoreChanges', True) \
            .table(f'{schema}.{table}')
        return df
    except Exception:
        # Log the full stack trace and abort the job on any read failure
        traceback.print_exc()
        sys.exit(1)

df = source_data('schema', 'table', 'checkpoint')

The dataframe df contains 5.6 million rows. To test the throughput of Delta streaming, I ran a lift-and-shift job where I just read the data from staging_table and save it to another S3 bucket, as below.

def dummy_write(df, batchId):
    # Overwrite the target table with the contents of each micro-batch
    df.write.saveAsTable('some_schema.some_table', format='delta',
                         mode='overwrite', path='s3://some_s3_path')

df.writeStream.format("delta") \
      .option("checkpointLocation", 's3://some_location') \
      .foreachBatch(dummy_write) \
      .trigger(once=True) \
      .start() \
      .awaitTermination()

The problem I am facing is this: my first job, which reads data from landing_table, performs the explode + flatten + multiple transformations and outputs a dataframe of 5.6 million rows, completes in 3 minutes. But the simple lift-and-shift activity I mentioned above is taking 3 hours to complete.
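One way to see what the slow write is actually executing is to print the physical plan of the micro-batch DataFrame inside foreachBatch (only a sketch, reusing the placeholder table/path names from above; the plan appears in the driver logs).

def dummy_write(df, batchId):
    # Show which scans, exchanges and transformations the write actually triggers
    df.explain(True)
    df.write.saveAsTable('some_schema.some_table', format='delta',
                         mode='overwrite', path='s3://some_s3_path')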

I am running this application on the following configuration: driver with 128 GB memory and 16 cores; 8 workers with 976 GB memory and 128 cores in total.

I am really confused by this behaviour. Could anyone please let me know if I am doing anything wrong here? Is there any config/parameter I missed in my job? I took the Delta streaming reference from here. Any help is massively appreciated.

Edit1: Adding a screenshot of the application master UI to show the job stages. In the screenshot the duration of the marked stage says 1 hour. I left it as it is and even after 3 hours, it's still the same.

  • Could you share which jobs and stages take most of the time? Also could you share the DAG of the related parts? – Jonathan Lam Apr 25 '23 at 01:25
  • Can you do an explain() and share the physical plan here? Also, is it possible to post a screenshot of the long-running tasks? – Abdennacer Lachiheb Apr 26 '23 at 22:42
  • the culprit for that is usually the transformations that you are doing. The write itself should happen quickly, but the write looks slow because it's triggering the evaluation of your transformations – Alex Ott Apr 28 '23 at 13:39
  • @AlexOtt But should it really need more than an hour to evaluate the transformations? – Metadata May 03 '23 at 06:34
  • @Metadata can you go to the Stages tab, show the ones that take a lot of time, post the result here, and also add a screenshot of the summary metrics and the tasks table, something like this: https://stackoverflow.com/a/74857055/6325994. I'm trying to find out if there's data skew, spills, etc. Also, you added neither your physical plan nor a screenshot of the DAG from your SQL tab. – Abdennacer Lachiheb May 03 '23 at 06:42
  • Ok, I am going to reproduce the issue with the details you asked for and present it here. – Metadata May 03 '23 at 10:34

2 Answers


I suspect at least part of your problem is due to the fact that you're writing to S3.

Because Apache Spark is a distributed computing engine, it needs to be able to tell when a certain task/job is really done in some way or another (multiple cores could be running the same task at the same time). By default, this is done by renaming files/directories.

Renaming files is very slow in S3: since there is no direct method to rename a file, you're actually copying and then deleting it. This means that all of your files are actually written/deleted multiple times. What you'll need to do to work around this is turn on one of the S3 output committers.

The config I successfully used in the past with the magic committer (though that wasn't a streaming application) was the following:

spark.sql.sources.commitProtocolClass                     = org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class                  = org.apache.hadoop.mapreduce.lib.output.BindingPathOutputCommitter
spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a = org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
spark.hadoop.fs.s3a.committer.name                        = magic
spark.hadoop.fs.s3a.committer.magic.enabled               = true

I can't tell you exactly which versions you'll need (this depends on your setup), but you'll need the corresponding cloud committer dependencies on the classpath to get this set up.
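For illustration, here's one way those settings could be applied when the session is built (only a sketch: it assumes the matching cloud committer jars, typically spark-hadoop-cloud and hadoop-aws, are already on the classpath, and the settings must be in place before the SparkSession starts).

from pyspark.sql import SparkSession

# Sketch only: assumes the spark-hadoop-cloud and hadoop-aws jars matching your
# Spark/Hadoop versions are available on the classpath.
spark = (
    SparkSession.builder
    .config('spark.sql.sources.commitProtocolClass',
            'org.apache.spark.internal.io.cloud.PathOutputCommitProtocol')
    .config('spark.sql.parquet.output.committer.class',
            'org.apache.hadoop.mapreduce.lib.output.BindingPathOutputCommitter')
    .config('spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a',
            'org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory')
    .config('spark.hadoop.fs.s3a.committer.name', 'magic')
    .config('spark.hadoop.fs.s3a.committer.magic.enabled', 'true')
    .getOrCreate()
)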

Koedlt
  • Interesting, but my other applications are running fine without these parameters. – Metadata Apr 24 '23 at 10:15
  • Hmmm ok, are those other applications writing to S3 too then? Anyway, at this point I fear a bit more info might be needed. Like the other commenter said: it might be useful to share some stats about the longest-running stage(s) (maybe some screenshots from your web UI), how many tasks those stages have, what the summary metrics are for those tasks, ... – Koedlt Apr 25 '23 at 07:33

What version of Spark are you running?

It looks like others have reported similar issues with Spark Streaming using S3 for checkpointing.

In general, successive checkpointing to S3 degrades performance. S3 takes longer to look up files the more you write to the same directory, because S3 file paths are really just keys into a bucket, so the more files you write under the same path, the longer each successive lookup takes. There are things you can do to speed this up, but really the key is not to use S3, which was made for slow, cheap storage, for a high-performance engine like Spark Streaming.

It's unfortunate but you can't increase write speed for checkpoints using the usual S3 tricks.

This article goes into detail about how to see if the Spark commit step is the actual issue. You can also see this by looking at the Spark console: there are gaps where tasks should be running but, for some reason, aren't. Digging into the Spark logs you'll see why there are odd gaps. A great example from the article:

The 48-min break at the end of the Spark application is also clearly visible in the Spark driver logs:

21/11/08 20:52:11 INFO DAGScheduler: Job 7 finished: insertInto at NativeMethodAccessorImpl.java:0, took 3495.605049 s
21/11/08 21:40:13 INFO FileFormatWriter: Write Job 13ca8cb6-5fc0-4fe9-9fd0-bba5cf9e2f7f committed.

If this is your issue, great, you found the problem. In that case I suggest you use a committer as suggested by @Koedlt.

Matt Andruff
  • Spark streaming checkpoint code doesn't use the same commit algorithms that the S3A or EMR optimised committers handle. Streaming checkpoints are writes + renames, and renames are O(data) on S3. I doubt this is the issue, but logging driver output at DEBUG is a good place to start to identify whatever the problem is – stevel May 02 '23 at 12:18