AWS Glue takes a long time to finish

Question

I just ran a very simple job, as follows:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())
l_table = glueContext.create_dynamic_frame.from_catalog(
             database="gluecatalog",
             table_name="fctable") 
l_table = l_table.drop_fields(['seq','partition_0','partition_1','partition_2','partition_3']).rename_field('tbl_code','table_code')
print "Count: ", l_table.count()
l_table.printSchema()
l_table.select_fields(['trans_time']).toDF().distinct().show()
dfc = l_table.relationalize("table_root", "s3://my-bucket/temp/")
print "Before keys() call "
dfc.keys()
print "After keys() call "
l_table.select_fields('table').printSchema()
dfc.select('table_root_table').toDF().where("id = 1 or id = 2").orderBy(['id','index']).show()
dfc.select('table_root').toDF().where("table = 1 or table = 2").show()

The data structure is simple too:

root
|-- table: array
| |-- element: struct
| | |-- trans_time: string
| | |-- seq: null
| | |-- operation: string
| | |-- order_date: string
| | |-- order_code: string
| | |-- tbl_code: string
| | |-- ship_plant_code: string
|-- partition_0
|-- partition_1
|-- partition_2
|-- partition_3

When I ran the job as a test, it took anywhere from 12 to 16 minutes to finish, but the CloudWatch log showed that the job took only 2 seconds to display all my data.

So my question is: where does an AWS Glue job spend its time beyond what the logs show, and what is it doing outside the logged period?

Rick Coleman · Accepted Answer · 2017-10-25T17:35:30.910

It's taking the time to set up the environment that allows your code to run. I had the same issue and contacted the AWS Glue team, who were helpful. The reason it takes a long time is that Glue builds an environment when you run the first job, and that environment stays alive for 1 hour. If you run the same script again, or any other script, within that hour, the next job takes significantly less time. They call the first run a cold start. My first job took 17 minutes; I ran the same job again right after it finished, and it took only 3 minutes.
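
To see how much of a run is environment setup rather than your own code, one option is to timestamp the script itself and compare that with the total job duration shown in the console. The sketch below is only an illustration in Python 3; `timed` and `startup_overhead` are hypothetical helper names, not part of the Glue API.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, log=print):
    """Log how long the wrapped block takes, in seconds."""
    start = time.time()
    try:
        yield
    finally:
        log("%s took %.1f s" % (label, time.time() - start))


def startup_overhead(job_duration_s, script_duration_s):
    """Seconds of a job run spent outside the script (never negative).

    job_duration_s is the total run time reported for the job;
    script_duration_s is what the script measured for itself.
    """
    return max(0.0, float(job_duration_s) - float(script_duration_s))
```

Inside the job you would wrap each step, for example `with timed("count"): l_table.count()`. With the numbers from the question, `startup_overhead(12 * 60, 2)` leaves 718 seconds that the script never saw.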

score 10 · Answer 2 · answered May 20 '19 at 23:09

Update as of May 2019:

Cold start time: 7-8 minutes
Warm pool maintained for: 10-15 minutes

human


Is it possible to extend the warm pool time? – pavel_orekhov May 28 '19 at 20:42
There is no way of extending the warm pool time. That's something AWS will definitely not publish to its tenants. You could, however, run a dummy warming job every 14 minutes to keep it warm (cost implications: Glue bills a minimum of 10 minutes per run) – human May 29 '19 at 23:04
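
A minimal sketch of the warming idea from the comment above, in Python 3. It assumes a boto3 Glue client is passed in (`start_job_run` with `JobName` is a real boto3 call); `keep_warm`, the job name, and the 14-minute default are illustrative choices, and whether this actually keeps the pool warm depends on AWS behaviour that is not documented.

```python
import time


def keep_warm(glue_client, job_name, runs, interval_s=14 * 60, sleep=time.sleep):
    """Start `job_name` `runs` times, pausing `interval_s` seconds between starts.

    glue_client is expected to behave like boto3.client("glue").
    Returns the list of job run ids that were started.
    """
    run_ids = []
    for i in range(runs):
        response = glue_client.start_job_run(JobName=job_name)
        run_ids.append(response["JobRunId"])
        if i < runs - 1:
            # No need to wait after the final start.
            sleep(interval_s)
    return run_ids
```

Usage would look like `keep_warm(boto3.client("glue"), "my-warmup-job", runs=4)`, where `my-warmup-job` is a hypothetical trivial job. Every warming run is billed, and a scheduled Glue trigger is probably a more practical way to fire it than a long-lived loop.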

score 1 · Answer 3 · answered Dec 05 '17 at 23:35

When editing a job, you can add more DPUs under the "Script libraries and job parameters (optional)" section. It helps somewhat, but in my experience you should not expect a major improvement.

Jie
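
The DPU setting can also be overridden for a single run from code instead of the console. The sketch below is in Python 3 and assumes a boto3 Glue client is passed in; `MaxCapacity` is a parameter of boto3's `start_job_run`, while `start_with_capacity` and the example values are hypothetical.

```python
def start_with_capacity(glue_client, job_name, dpus):
    """Start one run of `job_name` with `dpus` DPUs and return its run id.

    glue_client is expected to behave like boto3.client("glue").
    """
    if dpus <= 0:
        raise ValueError("dpus must be positive")
    response = glue_client.start_job_run(
        JobName=job_name,
        MaxCapacity=float(dpus),  # DPUs allocated to this run only
    )
    return response["JobRunId"]
```

For example, `start_with_capacity(boto3.client("glue"), "my-job", 10)`. As the answer above says, more DPUs may speed up the processing itself but should not be expected to shorten the cold start.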

