
I have my Job code like this:

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

sc = SparkContext()
glueContext = GlueContext(sc)
s3_paths = ['01', '02', '03']  # these paths are in the same folder and are partitioned under the source path
s3_source_path = 'bucket_name'
for sub_path in s3_paths:
    s3_path = s3_source_path + '/' + sub_path
    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)

    # read the JSON data from the s3 path
    job_DyF = glueContext.create_dynamic_frame.from_options('s3', {"paths": [s3_path], "recurse": True}, "json", format_options={"jsonPath": "$[*]"}, transformation_ctx="job_DyF")

    # write the dataset to s3 as avro
    data_sink = glueContext.write_dynamic_frame.from_options(frame=job_DyF, connection_type="s3", connection_options={"path": "s3://target", "partitionKeys": ["partition_0", "partition_1", "partition_2"]}, format="avro", transformation_ctx="data_sink")

    job.commit()

After the job succeeded, there were missing records from some of the sub_paths.

When I tried to run the job again, it says no new files were detected.

So I tried to run the code for a specific sub_path, without the for sub_path in s3_paths loop, and strangely the problem occurs when the job is run for sub_path #2:

it says no new files were detected for sub_path '02',

even though the job had only run for the 1st sub_path '01', and only the data from the 1st sub_path got ingested to S3 as Avro.

I cannot figure out what is wrong with the way I set up this bookmark, so your insight would be really appreciated. Thanks in advance.

phoebe
  • It will help people to understand your problem if you post the full script that you wrote, as you need to set a few params correctly for Glue bookmarks to work – Prabhakar Reddy Jul 30 '20 at 15:44
  • @PrabhakarReddy thanks for your response. I have updated the code – phoebe Jul 30 '20 at 16:02
  • Can you try removing the for loop and passing the three paths sequentially? Create three separate dynamic frames? – Prabhakar Reddy Aug 04 '20 at 07:52
  • @PrabhakarReddy removing the loop, it's working fine, but the thing is I need the loop, as more files will be added to S3 automatically, so I use the loop to automate the process. – phoebe Aug 04 '20 at 10:24
  • I don't think that works, as Glue stores the bookmarking context at the job level, so each run will have one bookmark status per job, not one per s3 path. You might need to create multiple jobs on the fly and call them via boto3 in your for loop (see the sketch after these comments). – Prabhakar Reddy Aug 04 '20 at 11:04
  • @phoebe were you able to bookmark multiple s3 paths in a single bucket? I am trying to do this as well, and the bookmark isn't working. – stochasticcrap Nov 20 '20 at 02:26
  • @stochasticcrap in the end, due to the complicated daily load process, I control the incremental load myself, without using the bookmark feature from Glue – phoebe Nov 28 '20 at 16:39
  • @stochasticcrap could you solve the problem? – filip stepniak May 19 '21 at 13:56
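
For reference, a minimal sketch of the approach Prabhakar Reddy suggests above: one Glue job (and therefore one bookmark state) per sub_path, created and started on the fly via boto3. The job names, IAM role, script location and the --source_path argument below are hypothetical placeholders, not taken from the original post:

# Hypothetical driver script (run outside Glue, e.g. on a schedule): it creates a
# dedicated Glue job per sub_path if one does not exist yet, then starts a run for it.
# Because bookmark state is stored per job, each sub_path gets its own bookmark.
import boto3

glue = boto3.client('glue')

s3_paths = ['01', '02', '03']
s3_source_path = 's3://bucket_name'            # assumed source location
script_location = 's3://my-scripts/ingest.py'  # assumed ETL script that reads a single path
glue_role = 'MyGlueServiceRole'                # assumed IAM role for the jobs

for sub_path in s3_paths:
    job_name = 'ingest-' + sub_path            # one job name per path => separate bookmark state

    # create the job on the fly if it does not exist yet
    try:
        glue.get_job(JobName=job_name)
    except glue.exceptions.EntityNotFoundException:
        glue.create_job(
            Name=job_name,
            Role=glue_role,
            Command={'Name': 'glueetl', 'ScriptLocation': script_location, 'PythonVersion': '3'},
            DefaultArguments={'--job-bookmark-option': 'job-bookmark-enable'},
        )

    # start a run and tell the ETL script which path to ingest
    glue.start_job_run(
        JobName=job_name,
        Arguments={'--source_path': s3_source_path + '/' + sub_path},
    )

The per-path ETL script would then pick up its own path via getResolvedOptions(sys.argv, ['JOB_NAME', 'source_path']) and keep its own transformation_ctx, so each job's bookmark only tracks the files under that one sub_path.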

0 Answers