I have a couple hundred transcription results in AWS Transcribe and I would like to get all the transcribed text and store it in one file. Is there any way to do this without clicking on each result and copying and pasting the text?

John Rotenstein
waterbear

2 Answers

2

You can do this via the AWS APIs.

For example, if you are using Python, you can use the boto3 SDK:

  • list_transcription_jobs() returns a list of transcription job summaries, each of which includes the job name
  • For each job, you can then call get_transcription_job(), which provides the TranscriptFileUri, the location where the transcript is stored
  • You can then use get_object() to download the file from Amazon S3
  • Your program then needs to combine the content from each file into one file
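The last three steps can be sketched as follows. This is a minimal sketch, not a definitive implementation: the helper names (`fetch_json`, `transcript_text`, `combine_transcripts`), the `=== job name ===` separator, and the output file name are all made up for illustration. It assumes the transcripts were written to the service-managed bucket, in which case `TranscriptFileUri` is a temporary pre-signed URL that can be fetched over plain HTTPS. If you directed output to your own bucket, swap `fetch_json` for a function that calls S3 `get_object()` instead. Note that `main()` reads only the first page of jobs (at most 100), so more jobs than that require following `NextToken`.

```python
import json
import urllib.request


def fetch_json(uri):
    """Download and parse a transcript file from a pre-signed HTTPS URI."""
    with urllib.request.urlopen(uri) as response:
        return json.load(response)


def transcript_text(transcribe_client, job_name, fetch=fetch_json):
    """Return the transcript text for one completed job."""
    job = transcribe_client.get_transcription_job(TranscriptionJobName=job_name)
    uri = job["TranscriptionJob"]["Transcript"]["TranscriptFileUri"]
    data = fetch(uri)
    # The text lives under results.transcripts in the transcript JSON.
    return " ".join(t["transcript"] for t in data["results"]["transcripts"])


def combine_transcripts(transcribe_client, job_names, out_path, fetch=fetch_json):
    """Write every job's transcript to out_path, each under a header line."""
    with open(out_path, "w", encoding="utf-8") as out:
        for name in job_names:
            out.write(f"=== {name} ===\n")
            out.write(transcript_text(transcribe_client, name, fetch) + "\n\n")


def main():
    import boto3  # imported here so the helpers above can be tested without AWS

    client = boto3.client("transcribe")
    # First page only (at most 100 jobs); follow NextToken to get the rest.
    page = client.list_transcription_jobs(Status="COMPLETED", MaxResults=100)
    names = [j["TranscriptionJobName"] for j in page["TranscriptionJobSummaries"]]
    combine_transcripts(client, names, "all_transcripts.txt")
```

Calling `main()` with AWS credentials configured should leave the combined text in `all_transcripts.txt`. Passing the client and the download function in as parameters keeps the combining logic easy to test with stand-ins.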

See how you go with that. If you run into any specific difficulties, post a new Question with the code and an explanation of the problem.

John Rotenstein
  • But list_transcription_jobs() only lets you list the top 100 jobs. If I have 1000 jobs, I want to list all 1000 and save the job names to a list. How do I do that? – sachin kumar s Mar 12 '21 at 10:56
  • @sachinkumars Where did you see a mention that only 100 jobs are returned? It looks like subsequent job lists can be obtained by calling the function again and specifying the `NextToken` that was provided in the result set. This is referenced in the documentation as 'pagination'. – John Rotenstein Mar 12 '21 at 11:02
  • I get this error: `BadRequestException: An error occurred (BadRequestException) when calling the ListTranscriptionJobs operation: The next page token that you provided isn't valid. Check the token and try your request again.` I followed the documentation, but I'm not sure how to set the next page token. If I set NextToken=response['NextToken'], I get only the last 100 jobs, not all 1000. – sachin kumar s Mar 12 '21 at 11:09
  • I would like to collect all 1000 job names in a list. How can I achieve this? – sachin kumar s Mar 12 '21 at 11:10
  • @sachinkumars Please create a new question with full details, rather than asking this via a comment on an old question. – John Rotenstein Mar 12 '21 at 22:02
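For reference, the `NextToken` pagination discussed in these comments might look like the sketch below. The function name `list_all_job_names` is hypothetical, and the sketch does not claim to explain the `BadRequestException` above. It only shows the usual pattern: repeat the same request, pass along the token from the page just received, and append each page's names to one running list instead of replacing it.

```python
def list_all_job_names(transcribe_client, status="COMPLETED"):
    """Collect the names of every job with the given status, across all pages."""
    names = []
    kwargs = {"Status": status, "MaxResults": 100}
    while True:
        page = transcribe_client.list_transcription_jobs(**kwargs)
        # Append to the running list; do not overwrite it with each page.
        names.extend(
            job["TranscriptionJobName"] for job in page["TranscriptionJobSummaries"]
        )
        token = page.get("NextToken")
        if not token:
            break  # no token means this was the last page
        # Keep the same filters and pass the token from the page just received.
        kwargs["NextToken"] = token
    return names
```

With a real client this would be called as `list_all_job_names(boto3.client("transcribe"))`. The first request deliberately omits `NextToken`, and later requests reuse the same `Status` and `MaxResults` values.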
2

I put an example on GitHub that shows how to:

  • run an AWS Transcribe job,
  • use the Requests package to get the output,
  • write output to the console.

You ought to be able to refit it pretty easily for your purposes. Here's some of the code, but it'll make more sense if you check out the full example:

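import time

import requests

# start_job, get_job, and TranscribeCompleteWaiter are helpers defined in the
# full example; transcribe_client, bucket_name, and media_object_key are
# assumed to be set up there as well.

# Use a nanosecond timestamp so each job name is unique.
job_name_simple = f'Jabber-{time.time_ns()}'
print(f"Starting transcription job {job_name_simple}.")
start_job(
    job_name_simple, f's3://{bucket_name}/{media_object_key}', 'mp3', 'en-US',
    transcribe_client)
# Block until the job finishes.
transcribe_waiter = TranscribeCompleteWaiter(transcribe_client)
transcribe_waiter.wait(job_name_simple)
# Fetch the job details, then download the transcript JSON from its URI.
job_simple = get_job(job_name_simple, transcribe_client)
transcript_simple = requests.get(
    job_simple['Transcript']['TranscriptFileUri']).json()
print(f"Transcript for job {transcript_simple['jobName']}:")
print(transcript_simple['results']['transcripts'][0]['transcript'])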
Laren Crawford