I am trying to automate an ETL pipeline that outputs data from AWS RDS MySQL to AWS S3, using AWS Glue to do the job. The initial load from RDS to S3 captures all the data in the file, which is exactly what I want. However, when I add new rows to the MySQL database and run the Glue job again, I get an empty file instead of the added rows. Any help would be MUCH appreciated.
- What is the primary key of your table in MySQL? And can you also share your Glue job configuration? – Balu Vyamajala Feb 11 '21 at 01:07
- The primary key is a unique user ID/string. Job properties: {Type: Spark, Glue version: Spark 2.4, This job runs: A proposed script generated by Glue, Job bookmark: Enable, Monitoring options: all unchecked}. Data source: just the custom table I made from my crawler (from the RDS connection). Transform type: Change schema. Data target: S3 bucket. – Andrew Chen Feb 11 '21 at 02:20
- If you have not configured a key to be used for bookmarking, it will use the primary key of the table by default. Are the new records you are adding to the table sequential, i.e. does each new record have a userId greater than the existing records? – Balu Vyamajala Feb 11 '21 at 02:29
- No, they're pretty much just randomly generated strings. I can try changing them to be sequential like 1, 2, 3, 4, 5, etc. to see if that changes anything. – Andrew Chen Feb 11 '21 at 02:32
- Yes, I think that may be the issue. Glue bookmarking doesn't keep track of all the IDs it has processed; it just knows the last record it processed, and during the next run it picks up all the new records greater than that last record. – Balu Vyamajala Feb 11 '21 at 02:45
- I changed the user ID to 1, 2, 3, etc. When I reran the job, it took ALL the entries, both new and old, rather than just the new ones. – Andrew Chen Feb 11 '21 at 03:22
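The checkpoint behavior described in the comments above can be sketched in plain Python. This is illustrative only: `incremental_rows` and the checkpoint handling are assumptions about the filter Glue applies for JDBC sources, not actual Glue code.

```python
def incremental_rows(rows, last_bookmark):
    """Return rows whose key is greater than the stored checkpoint,
    mimicking the bookmark filter Glue applies to a JDBC source."""
    return [r for r in rows if last_bookmark is None or r["id"] > last_bookmark]

# Sequential keys: only the genuinely new row comes back on the second run.
run1 = incremental_rows([{"id": 1}, {"id": 2}], last_bookmark=None)
checkpoint = max(r["id"] for r in run1)  # Glue remembers only the max key seen
run2 = incremental_rows([{"id": 1}, {"id": 2}, {"id": 3}], checkpoint)
print(run2)  # just the row with id 3

# Random string keys: a "new" row can sort below the checkpoint and be skipped,
# which produces the empty output file described in the question.
run3 = incremental_rows([{"id": "zx9"}], last_bookmark=None)
checkpoint = max(r["id"] for r in run3)
run4 = incremental_rows([{"id": "zx9"}, {"id": "ab2"}], checkpoint)
print(run4)  # empty: "ab2" sorts below "zx9", so the new row is never picked up
```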
2 Answers
Bookmarking rules for JDBC sources are documented here. The important point to remember for JDBC sources is that the bookmark key's values have to be in increasing or decreasing order, and Glue only processes new data from the last checkpoint.
Typically, either an auto-generated sequence number or a datetime column is used as the key for bookmarking.
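For reference, the shape of a Glue job script that reads with explicit bookmark keys looks roughly like the sketch below. The database, table, and S3 path are placeholders; `jobBookmarkKeys`, `jobBookmarkKeysSortOrder`, and a `transformation_ctx` on the read are the parts that matter for bookmarking. This needs the Glue runtime and will not run locally.

```python
# Sketch of a Glue (Spark) job script; names are placeholders.
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# transformation_ctx is required for bookmark state to be tracked;
# the bookmark key should be monotonically increasing (e.g. auto-increment id).
source = glue_context.create_dynamic_frame.from_catalog(
    database="my_rds_db",       # placeholder
    table_name="my_table",      # placeholder
    additional_options={
        "jobBookmarkKeys": ["id"],
        "jobBookmarkKeysSortOrder": "asc",
    },
    transformation_ctx="source",
)

glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/"},  # placeholder
    format="json",
)

job.commit()  # commits the bookmark state for the next run
```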

Balu Vyamajala
- That's a great point, thank you for the help. One issue I'm running into is that it's taking all the new and old entries in subsequent runs with bookmarks enabled. – Andrew Chen Feb 11 '21 at 03:23
- That is not the expected behavior with job bookmarking enabled. Sorry, I don't know why it would do that. – Balu Vyamajala Feb 11 '21 at 03:31
For anybody who is still struggling with this (it drove me mad, because I thought my Spark code was wrong): disable bookmarking in the job details.
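If you prefer the CLI to the console, the same thing can be done per run via the bookmark option argument (the job name below is a placeholder):

```shell
# Run the job once with bookmarking disabled, so every run exports the full table.
aws glue start-job-run \
  --job-name my-etl-job \
  --arguments '{"--job-bookmark-option": "job-bookmark-disable"}'
```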

jan biel