I currently have an AWS Glue ETL script written in Scala.
These are my Glue job settings:
- Spark 2.4, Scala 2 (Glue Version 2.0)
- Worker type: G.1X (recommended for memory-intensive jobs)
- Number of workers: 10
I am reading roughly 60 GB of data from a Glue Data Catalog table into a DataFrame like this:
import org.apache.spark.sql.functions.broadcast  // needed for the broadcast hint

val largeDF = glueContext.getCatalogSource(database = "", tableName = "").getDynamicFrame().toDF()
val smallDF = glueContext.getCatalogSource(database = "", tableName = "").getDynamicFrame().toDF()
// left-semi join: keep only the rows of largeDF whose "col" has a match in smallDF
val result = largeDF.join(broadcast(smallDF), smallDF("col") === largeDF("col"), "leftsemi")
result.show(false)
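(For reference, one way to check whether the broadcast hint is actually applied, assuming smallDF genuinely fits in memory, is to print the physical plan; if the hint is honored, the plan shows a BroadcastHashJoin rather than a SortMergeJoin:)

// Print the physical plan; an honored hint appears as
// BroadcastHashJoin (LeftSemi) with a BroadcastExchange on smallDF's side
result.explain()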
However, the job runs for about 6 hours and then fails with the following error:
Exception in task 0.0 in stage 3.0 (TID 77)
java.io.IOException: No space left on device
Do I need to increase the number of workers? What is the best way to calculate the ideal settings for reading large data in AWS Glue?