Running AWS Glue ETL Job (Spark) for large data

Question

Currently, I have a GLUE ETL Script in Scala.

Following are my GLUE script settings:

Spark 2.4, Scala 2 (Glue Version 2.0)
Worker type: G.1X (recommended for memory-intensive jobs)
Number of workers : 10

I am reading about 60 GB of data from the database into a DataFrame like this:

val largeDF = glueContext.getCatalogSource(database = "", tableName = "").getDynamicFrame().toDF()

val smallDF = glueContext.getCatalogSource(database = "", tableName = "").getDynamicFrame().toDF()

val result = largeDF.join(broadcast(smallDF), smallDF("col") === largeDF("col"), "leftsemi")

result.show(false)

However, this runs for 6 hours and then fails with the following error:

Exception in task 0.0 in stage 3.0 (TID 77)
java.io.IOException: No space left on device

Do I need to increase the number of workers? What is the best way to calculate the ideal settings for reading large data in AWS Glue?

Increasing the number of workers will not help. Try changing the worker type or repartitioning the data before the join operation. — Prabhakar Reddy, Feb 04 '21 at 02:24
Could you please give some details about the data? Is it CSV/JSON/Parquet? How many rows are in each table? And most important: are the values in the join columns unique? If not, what is the approximate size of the resulting dataset? Once you have answers to all of these, it will be easier to tell whether scaling can help or whether you need to adjust the join conditions. — GSazheniuk, Feb 04 '21 at 22:41
Data is around 60 GB. Glue job uses jdbc connection to fetch data from SQL server. — 2shar, Aug 20 '21 at 14:02
@p-d Not really. Workaround was just to use a container. Still looking for a solution though in Spark. — 2shar, Dec 16 '21 at 19:04
@2shar not sure if it is a proper solution, but maybe if you need a workaround try setting this Glue Job parameter: Key: `--write-shuffle-spills-to-s3` Value: `true` https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-shuffle-manager.html — P D, Dec 17 '21 at 10:43
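For reference, here is a minimal sketch of the "repartition the data before join" suggestion from the comments above. It is not a confirmed fix. The helper name `semiJoinRepartitioned` and the `numPartitions` parameter are hypothetical, and the join column name "col" is taken from the question. Repartitioning is itself a shuffle, so whether it helps depends on how the JDBC read is partitioned, which the thread does not state.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.broadcast

// Hypothetical helper: spread the large table over more partitions
// before the broadcast left-semi join from the question.
def semiJoinRepartitioned(largeDF: DataFrame,
                          smallDF: DataFrame,
                          numPartitions: Int): DataFrame = {
  // Hash-partition the large side on the join column so that no single
  // task has to hold (and spill) a disproportionate share of the rows.
  val repartitioned = largeDF.repartition(numPartitions, largeDF("col"))

  // Same join as in the question: keep rows of the large table that
  // have a match in the broadcast small table.
  repartitioned.join(
    broadcast(smallDF),
    repartitioned("col") === smallDF("col"),
    "leftsemi"
  )
}
```

It would be called as `val result = semiJoinRepartitioned(largeDF, smallDF, 200)`, where 200 is only a placeholder value to tune. The `--write-shuffle-spills-to-s3` job parameter mentioned above is a separate option that can be combined with this.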
