I have my Glue job code like this:
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from awsglue.job import Job

sc = SparkContext()
glueContext = GlueContext(sc)
s3_paths = ['01', '02', '03']  # these sub-paths are in the same folder and are partitioned under the source path
s3_source_path = 'bucket_name/'
for sub_path in s3_paths:
    s3_path = s3_source_path + '/' + sub_path
    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)
    # read the JSON data from the s3 path
    job_DyF = glueContext.create_dynamic_frame.from_options('s3', {"paths": [s3_path], "recurse": True}, "json", format_options={"jsonPath": "$[*]"}, transformation_ctx="job_DyF")
    # write the dataset to s3 as avro
    data_sink = glueContext.write_dynamic_frame.from_options(frame=job_DyF, connection_type="s3", connection_options={"path": "s3://target", "partitionKeys": ["partition_0", "partition_1", "partition_2"]}, format="avro", transformation_ctx="data_sink")
    job.commit()
After the job succeeded, records were missing from some of the sub_paths. When I tried to run the job again, it said "no new file detected".

So I tried running the code for one specific sub_path at a time, without the for sub_path in s3_paths loop (roughly the variant sketched below). Strangely, the problem occurred when the job was run for sub_path #2: it said "no new file detected" for sub_path '02', even though the job had only ever run for the first sub_path '01', and only the data from '01' had been ingested to S3 as Avro.
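For reference, the single-sub_path variant I ran looked roughly like this (the names and paths mirror the loop version above, with '02' hard-coded in place of the loop variable):

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from awsglue.job import Job

sc = SparkContext()
glueContext = GlueContext(sc)
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# hard-coded sub_path instead of the loop; '02' is the run that reported "no new file detected"
s3_source_path = 'bucket_name/'
s3_path = s3_source_path + '/' + '02'
# read the JSON data from the s3 path
job_DyF = glueContext.create_dynamic_frame.from_options('s3', {"paths": [s3_path], "recurse": True}, "json", format_options={"jsonPath": "$[*]"}, transformation_ctx="job_DyF")
# write the dataset to s3 as avro
data_sink = glueContext.write_dynamic_frame.from_options(frame=job_DyF, connection_type="s3", connection_options={"path": "s3://target", "partitionKeys": ["partition_0", "partition_1", "partition_2"]}, format="avro", transformation_ctx="data_sink")
job.commit()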
I cannot figure out what is wrong with the way I set up this bookmark, so your insight would be really appreciated! Thanks in advance.