I have a multi-page PDF on AWS S3 and am using Textract to extract all of its text. The response comes back in batches: the first response includes a 'NextToken' that I need to pass as an argument to the next call to the get_document_analysis method.
How do I avoid re-running get_document_analysis by hand each time, pasting in the NextToken value returned by the previous run?
Here's an attempt:
import boto3

client = boto3.client('textract')

# Start the asynchronous job and keep its JobId
my_job_id_ref = client.start_document_text_detection(
    DocumentLocation={'S3Object': {'Bucket': 'myawsbucket', 'Name': 'mymuli-page-pdf-file.pdf'}}
)['JobId']

def my_output():
    my_ls = []
    next_token = None
    # Repeat the call until a response arrives without a NextToken.
    # Note: results are only available once the job's JobStatus is 'SUCCEEDED'.
    while True:
        # A job started with start_document_text_detection is read back with
        # get_document_text_detection (get_document_analysis pairs with start_document_analysis)
        kwargs = {'JobId': my_job_id_ref}
        if next_token:
            # Pass the token from the previous response to fetch the next batch
            kwargs['NextToken'] = next_token
        x = client.get_document_text_detection(**kwargs)
        my_ls.append(x)
        # The last batch has no NextToken key, so use .get() rather than x['NextToken']
        next_token = x.get('NextToken')
        if not next_token:
            break
    return my_ls
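
For reference, the token-following loop can also be factored into a small generator, so that nothing has to be pasted by hand. This is only a minimal sketch under stated assumptions: iter_textract_pages and extract_lines are hypothetical helper names (not part of boto3), the client is assumed to be a boto3 Textract client (or any object exposing get_document_text_detection), and the job is assumed to have already reached JobStatus 'SUCCEEDED'.

```python
def iter_textract_pages(client, job_id):
    """Yield each response batch for a finished text-detection job.

    Hypothetical helper: `client` is assumed to be boto3.client('textract').
    """
    next_token = None
    while True:
        kwargs = {'JobId': job_id}
        if next_token:
            # Feed the token from the previous batch into the next request
            kwargs['NextToken'] = next_token
        response = client.get_document_text_detection(**kwargs)
        yield response
        # The final batch carries no NextToken, which ends the loop
        next_token = response.get('NextToken')
        if not next_token:
            break


def extract_lines(client, job_id):
    """Collect the text of every LINE block across all batches (hypothetical helper)."""
    return [
        block['Text']
        for response in iter_textract_pages(client, job_id)
        for block in response.get('Blocks', [])
        if block.get('BlockType') == 'LINE'
    ]
```

With a real client this would be called as extract_lines(boto3.client('textract'), my_job_id_ref). Passing the client in as an argument keeps the loop easy to exercise with a stub, without touching AWS.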