
The image below shows the same Glue job run with three different configurations for writing to S3:

  1. We wrote to S3 using a Glue DynamicFrame
  2. We wrote to S3 using a plain Spark DataFrame
  3. Same as 1, but with the number of worker nodes reduced from 80 to 60 (a sketch of how this can be configured follows the list)
  • All things being equal, the DynamicFrame write took 75 minutes to do the job, while plain Spark took 10 minutes. The output was 100 GB of data.
  • The DynamicFrame write is also very sensitive to the number of worker nodes: after slightly reducing the worker count, it failed with memory errors after 2 hours of processing. This is surprising, as we would expect Glue, being an AWS service, to handle S3 write operations better.
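
A minimal sketch, assuming the runs are launched with boto3 (the job name and worker type below are placeholders, not taken from our actual setup):

import boto3

glue = boto3.client("glue")

# Runs 1 and 2: 80 workers (job name and worker type are hypothetical)
glue.start_job_run(JobName="my-glue-job", WorkerType="G.1X", NumberOfWorkers=80)

# Run 3: the same job with the worker count reduced to 60
glue.start_job_run(JobName="my-glue-job", WorkerType="G.1X", NumberOfWorkers=60)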

The code difference was this:

from awsglue.dynamicframe import DynamicFrame

if dynamic:
    # Convert the Spark DataFrame to a DynamicFrame and write through the Glue sink
    df_final_dyn = DynamicFrame.fromDF(df_final, glueContext, "df_final")
    glueContext.write_dynamic_frame.from_options(
        frame=df_final_dyn,
        connection_type="s3",
        format="glueparquet",
        transformation_ctx="DataSink0",
        connection_options={"path": "s3://...",
                            "partitionKeys": ["year", "month", "day"]})
else:
    # Write directly with the Spark DataFrame writer
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
    df_final.write.mode("overwrite").format("parquet").partitionBy("year", "month", "day") \
        .save("s3://.../")

[Image: run metrics for the same Glue job under the three configurations]

Why is the DynamicFrame write so much less efficient?
