
The image below shows the same Glue job run with three different configurations for writing to S3:

  1. We wrote to S3 using a Glue DynamicFrame
  2. We wrote to S3 using a plain Spark DataFrame
  3. Same as 1, but with the number of worker nodes reduced from 80 to 60 (a sketch of how this can be configured follows the list)
  • All things being equal, the DynamicFrame write took 75 minutes to do the job, while plain Spark took 10 minutes. The output was 100 GB of data.
  • The DynamicFrame write is also very sensitive to the number of worker nodes: after slightly reducing the worker count, it failed with memory errors after 2 hours of processing. This is surprising, as we would expect Glue, being an AWS service, to handle S3 write operations better.
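
A minimal sketch, assuming the runs are launched with boto3 (the job name and worker type below are placeholders, not taken from our actual setup):

import boto3

glue = boto3.client("glue")

# Runs 1 and 2: 80 workers (job name and worker type are hypothetical)
glue.start_job_run(JobName="my-glue-job", WorkerType="G.1X", NumberOfWorkers=80)

# Run 3: the same job with the worker count reduced to 60
glue.start_job_run(JobName="my-glue-job", WorkerType="G.1X", NumberOfWorkers=60)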

The code difference was this:

from awsglue.dynamicframe import DynamicFrame

if dynamic:
    # Convert the Spark DataFrame to a DynamicFrame and write through the Glue sink
    df_final_dyn = DynamicFrame.fromDF(df_final, glueContext, "df_final")
    glueContext.write_dynamic_frame.from_options(
        frame=df_final_dyn,
        connection_type="s3",
        format="glueparquet",
        transformation_ctx="DataSink0",
        connection_options={"path": "s3://...",
                            "partitionKeys": ["year", "month", "day"]})
else:
    # Write directly with the Spark DataFrame writer
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
    df_final.write.mode("overwrite").format("parquet").partitionBy("year", "month", "day") \
        .save("s3://.../")

[Image: run metrics for the same Glue job under the three configurations]

Why is the DynamicFrame write so much less efficient?
