
I have a Glue job that transfers data from S3 to Redshift. I want to schedule it so that it runs every time the data in S3 is re-uploaded or updated. How can I do this? I tried the code solution here and made a Lambda function: How to Trigger Glue ETL Pyspark job through S3 Events or AWS Lambda?

import boto3
print('Loading function')

def lambda_handler(event, context):
    source_bucket = event['Records'][0]['s3']['bucket']['name']
    s3 = boto3.client('s3')
    glue = boto3.client('glue')
    gluejobname = "YOUR GLUE JOB NAME"

    try:
        # Start the Glue job and print its initial run state
        runId = glue.start_job_run(JobName=gluejobname)
        status = glue.get_job_run(JobName=gluejobname, RunId=runId['JobRunId'])
        print("Job Status : ", status['JobRun']['JobRunState'])
    except Exception as e:
        print(e)
        print('Error starting Glue job for bucket {}. Make sure the job exists '
              'and your bucket is in the same region as this '
              'function.'.format(source_bucket))
        raise e

I replaced the job name. However, running this gives me:

Response
{
  "errorMessage": "'Records'",
  "errorType": "KeyError",
  "stackTrace": [
    "  File \"/var/task/lambda_function.py\", line 5, in lambda_handler\n    source_bucket = event['Records'][0]['s3']['bucket']['name']\n"
  ]
}

Function Logs
START RequestId: 9d063917-958a-494c-8ef9-f1f58e866562 Version: $LATEST
[ERROR] KeyError: 'Records'
Traceback (most recent call last):
  File "/var/task/lambda_function.py", line 5, in lambda_handler
    source_bucket = event['Records'][0]['s3']['bucket']['name']
END RequestId: 9d063917-958a-494c-8ef9-f1f58e866562
REPORT RequestId: 9d063917-958a-494c-8ef9-f1f58e866562  Duration: 9.41 ms   Billed Duration: 10 ms  Memory Size: 128 MB Max Memory Used: 65 MB  Init Duration: 305.81 ms

  • How are you triggering this Lambda job? I don't see the print statements in the response log. If you have configured local Lambda unit testing, please make sure you are running the correct test script from the drop-down. – Yuva Mar 04 '21 at 03:09
  • "Hello from Lambda"? It looks like the template functions... Does your rules trigger the right Lambda Function? Maybe some "deploy" button to be clicked?... I can't see any Hello in your code... – fernolimits Mar 04 '21 at 08:06
  • Sorry, I updated the error. I am supposed to write my bucket name in here, right? ```['bucket']['name']``` @Yuva @fernolimits –  Mar 04 '21 at 10:24

1 Answer


You don't have to update anything except the GLUE JOB NAME at line # 8. The source bucket info is retrieved from the EVENT object. The KeyError: 'Records' means the event passed to the handler has no Records key; the default console test event does not include one, while a real S3 notification does. Upload a file to the S3 location configured in the Lambda trigger, then check the CloudWatch logs.
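
If you want to test from the console or locally before wiring up the trigger, here is a minimal sketch that calls the handler with a synthetic S3 put event; the bucket and key names are placeholders, and a real S3 trigger supplies this structure automatically:

# Minimal sketch for local testing: call the handler with a synthetic
# S3 put event. The bucket and key names are placeholders; a real
# S3 trigger fills in this structure automatically.
from lambda_function import lambda_handler

test_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "my-example-bucket"},
                "object": {"key": "path/to/uploaded-file.csv"}
            }
        }
    ]
}

lambda_handler(test_event, None)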

Yuva
  • What action should I use (https://docs.aws.amazon.com/AmazonS3/latest/API/API_UploadPart.html) if I want to check for an update to an S3 file? For the CloudWatch event –  Mar 08 '21 at 15:42
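
Regarding the last comment: S3 does not emit a separate "update" event, because overwriting an existing key is just another PUT. Subscribing the function to s3:ObjectCreated:* therefore covers both first-time uploads and re-uploads. Below is a sketch of configuring that notification with boto3; the bucket name and Lambda ARN are placeholders, and it assumes the function's resource policy already allows s3.amazonaws.com to invoke it:

import boto3

# Sketch: subscribe a Lambda function to object-created events on a bucket.
# s3:ObjectCreated:* fires for PUT, POST, COPY and completed multipart
# uploads, so it covers both new uploads and overwrites of existing keys.
# The bucket name and Lambda ARN below are placeholders; this also assumes
# the Lambda resource policy already permits s3.amazonaws.com to invoke it.
s3 = boto3.client('s3')
s3.put_bucket_notification_configuration(
    Bucket='my-example-bucket',
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [
            {
                'Id': 'trigger-glue-loader',
                'LambdaFunctionArn': 'arn:aws:lambda:us-east-1:123456789012:function:my-function',
                'Events': ['s3:ObjectCreated:*'],
            }
        ]
    }
)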