Lambda automatically deletes transcribe job upon completion

Question

I am looking to edit my Lambda function so that it deletes the transcription job when its job status reads "COMPLETED". I have the following code:

    import json
    import time
    import boto3
    from urllib.request import urlopen

    def lambda_handler(event, context):
        transcribe = boto3.client("transcribe")
        s3 = boto3.client("s3")

        if event:
            file_obj = event["Records"][0]
            bucket_name = str(file_obj["s3"]["bucket"]["name"])
            file_name = str(file_obj["s3"]["object"]["key"])
            s3_uri = create_uri(bucket_name, file_name)
            file_type = file_name.split("2019.")[1]
            job_name = file_name
            transcribe.start_transcription_job(TranscriptionJobName=job_name,
                                                Media ={"MediaFileUri": s3_uri},
                                                MediaFormat = file_type,
                                                LanguageCode = "en-US",
                                                Settings={
                                                    "VocabularyName": "Custom_Vocabulary_by_Brand_Other_Brands",
                                                    "ShowSpeakerLabels": True,
                                                    "MaxSpeakerLabels": 4
                                                })


            while True:
                status = transcribe.get_transcription_job(TranscriptionJobName=job_name)
                if status["TranscriptionJob"]["TranscriptionJobStatus"] in ["FAILED"]:
                    break
                print("It's in progress")
            while True:
                status = transcribe.get_transcription_job(TranscriptionJobName=job_name)
                if status["TranscriptionJob"]["TranscriptionJobStatus"] in ["COMPLETED"]:
                    transcribe.delete_transcription_job(TranscriptionJobName=job_name
                )

                time.sleep(5)

            load_url = urlopen(status["TranscriptionJob"]["Transcript"]["TranscriptFileUri"])
            load_json = json.dumps(json.load(load_url))

            s3.put_object(Bucket = bucket_name, Key = "transcribeFile/{}.json".format(job_name), Body=load_json)


        # TODO implement
        return {
            'statusCode': 200,
            'body': json.dumps('Hello from Lambda!')
        }

    def create_uri(bucket_name, file_name):
        return "s3://"+bucket_name+"/"+file_name

The section that handles the job is:

    while True:
        status = transcribe.get_transcription_job(TranscriptionJobName=job_name)
        if status["TranscriptionJob"]["TranscriptionJobStatus"] in ["FAILED"]:
            break
        print("It's in progress")
    while True:
        status = transcribe.get_transcription_job(TranscriptionJobName=job_name)
        if status["TranscriptionJob"]["TranscriptionJobStatus"] in ["COMPLETED"]:
            transcribe.delete_transcription_job(TranscriptionJobName=job_name
        )

The intent is that while the job is in progress it prints "It's in progress", and once the status reads "COMPLETED" it deletes the job.

Any ideas why my current code would not be working? It completes the transcribe job but does not delete it.

Check [this](https://stackoverflow.com/help/minimal-reproducible-example); it will help you get an answer. – Olvin Roght, Oct 18 '19 at 13:21
@jarmod Apologies, I will amend my original question. The current code does not delete the jobs. It simply completes it as normal and does nothing else. — Owen Murray, Oct 18 '19 at 13:41

jarmod · Answer 1 · 2019-11-20T03:22:17.397

2

You should not poll for information if you can avoid it, especially in Lambda.

The correct way to respond to changes in transcription job status is to use CloudWatch Events. You can, for example, configure a rule to route an event to an AWS Lambda function when a transcription job has completed successfully.
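As a rough sketch of what such a rule could look like when created with boto3: the event pattern below matches the `source` and `detail-type` values Transcribe emits and filters on `TranscriptionJobStatus`. The rule name, the target `Id`, and the `create_rule` helper are hypothetical names chosen for illustration, and the events client is passed in so the helper stays easy to test.

```python
import json

# Hypothetical rule name, used only for this illustration.
RULE_NAME = "transcribe-job-finished"


def build_event_pattern(statuses=("COMPLETED", "FAILED")):
    """Return an event pattern matching Transcribe job state changes."""
    return {
        "source": ["aws.transcribe"],
        "detail-type": ["Transcribe Job State Change"],
        "detail": {"TranscriptionJobStatus": list(statuses)},
    }


def create_rule(events_client, lambda_arn, rule_name=RULE_NAME):
    """Create the rule and register the Lambda function as its target.

    events_client is expected to be boto3.client("events").
    """
    events_client.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(build_event_pattern()),
        State="ENABLED",
    )
    events_client.put_targets(
        Rule=rule_name,
        # The target Id is an arbitrary label (hypothetical here).
        Targets=[{"Id": "delete-transcribe-job", "Arn": lambda_arn}],
    )
```

Calling `create_rule(boto3.client("events"), lambda_arn)` with your own function's ARN would set this up. The function also needs a resource-based permission that lets `events.amazonaws.com` invoke it, which this sketch does not add.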

When your Lambda function is invoked as a result of a status change in the transcription job, the Lambda function will receive event data, for example:

{
    "version": "0",
    "id": "1a234567-1a6d-3ab4-1234-abf8b19be1234",
    "detail-type": "Transcribe Job State Change",
    "source": "aws.transcribe",
    "account": "123456789012",
    "time": "2019-11-19T10:00:05Z",
    "region": "us-east-1",
    "resources": [],
    "detail": {
        "TranscriptionJobName": "my-transcribe-test",
        "TranscriptionJobStatus": "COMPLETED"
    }
}

Use the TranscriptionJobName to correlate the state change back to the original job.
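A minimal sketch of a handler for that event, assuming the rule delivers the event shape shown above. It reads the job name and status from `detail` and deletes the job once it has reached a terminal state. The optional `transcribe` parameter is an addition for illustration so a stub client can be injected in tests; Lambda itself only passes `event` and `context`.

```python
def lambda_handler(event, context, transcribe=None):
    """Delete a Transcribe job when its state-change event arrives."""
    # Guard against a missing or null "detail" field.
    detail = event.get("detail") or {}
    job_name = detail.get("TranscriptionJobName")
    status = detail.get("TranscriptionJobStatus")

    # Ignore events that do not describe a finished job.
    if not job_name or status not in ("COMPLETED", "FAILED"):
        return {"deleted": False, "job": job_name, "status": status}

    if transcribe is None:
        import boto3  # imported lazily so the handler can be tested with a stub
        transcribe = boto3.client("transcribe")

    transcribe.delete_transcription_job(TranscriptionJobName=job_name)
    return {"deleted": True, "job": job_name, "status": status}
```

If you still need the transcript, fetch or copy it before the delete call, as the code in the question does with `get_transcription_job` and `put_object`.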

edited Nov 20 '19 at 03:22

answered Oct 18 '19 at 14:07

jarmod

except that it doesn't give you the name of the job that's completed, just the id that doesn't even come back when you request a list of jobs. – Steven Grant Nov 19 '19 at 16:40
@StevenGrant the docs indicate that the Transcribe Event has a `detail` attribute with both `TranscriptionJobName` and `TranscriptionJobStatus`, per https://docs.aws.amazon.com/transcribe/latest/dg/cloud-watch-events.html#events. The `id` is just a CloudWatch identifier. – jarmod Nov 19 '19 at 17:42
yeah, unfortunately `detail` comes back as null, not even the `TranscriptionJobName` or `TranscriptionJobStatus` exists in there – Steven Grant Nov 19 '19 at 17:49
@StevenGrant Did you test this and found `detail` to be empty? It worked correctly for me (multiple times) in us-east-1. Have included an example of the event data in my answer. – jarmod Nov 20 '19 at 03:21
Yeah I did - the docs suggest detail should contain job status and job name but it's just null. Someone did suggest input transformer but no dice on that https://stackoverflow.com/questions/58938814/cloudwatch-events-trigger-on-amazon-transcribe-event – Steven Grant Nov 20 '19 at 09:24
@StevenGrant I’d consider raising a support case with the details of the transcribe job name, the CW event ID, and the full CW event data so AWS can look into it. The docs are pretty clear and it does work fine, at least in my experience. – jarmod Nov 20 '19 at 11:52

score 1 · Accepted Answer · answered Oct 18 '19 at 14:08

Sorry guys, I had another look and found I had made a very silly mistake: the transcribe.delete_transcription_job(TranscriptionJobName=job_name) call was in completely the wrong place.

Please find the correct and working code below:

import json
import time
import boto3
from urllib.request import urlopen

def lambda_handler(event, context):
    transcribe = boto3.client("transcribe")
    s3 = boto3.client("s3")

    if event:
        file_obj = event["Records"][0]
        bucket_name = str(file_obj["s3"]["bucket"]["name"])
        file_name = str(file_obj["s3"]["object"]["key"])
        s3_uri = create_uri(bucket_name, file_name)
        file_type = file_name.split("2019.")[1]
        job_name = file_name
        transcribe.start_transcription_job(TranscriptionJobName=job_name,
                                            Media ={"MediaFileUri": s3_uri},
                                            MediaFormat = file_type,
                                            LanguageCode = "en-US",
                                            Settings={
                                                "VocabularyName": "Custom_Vocabulary_by_Brand_Other_Brands",
                                                "ShowSpeakerLabels": True,
                                                "MaxSpeakerLabels": 4
                                            })


        # Poll until the job reaches a terminal state, then delete it.
        while True:
            status = transcribe.get_transcription_job(TranscriptionJobName=job_name)
            job_status = status["TranscriptionJob"]["TranscriptionJobStatus"]
            if job_status in ["COMPLETED", "FAILED"]:
                transcribe.delete_transcription_job(TranscriptionJobName=job_name)
                break
            print("It's in progress")

            time.sleep(5)

        # A failed job has no transcript, so only copy the result on success.
        if job_status == "COMPLETED":
            load_url = urlopen(status["TranscriptionJob"]["Transcript"]["TranscriptFileUri"])
            load_json = json.dumps(json.load(load_url))

            s3.put_object(Bucket = bucket_name, Key = "transcribeFile/{}.json".format(job_name), Body=load_json)


    # TODO implement
    return {
        'statusCode': 200,
        'body': json.dumps('Hello from Lambda!')
    }

def create_uri(bucket_name, file_name):
    return "s3://"+bucket_name+"/"+file_name

While this might work, it's not optimal. Specifically, it's fragile (it will fail if Lambda times out before a job reaches the desired state) and wasteful (it's mostly sleeping rather than doing anything useful, so it costs more than it should). You can completely automate the transitions you want in a simple, optimal fashion using an event-based solution built on CloudWatch Events and Lambda. – jarmod, Oct 19 '19 at 15:43
