
I have a multi-page PDF on AWS S3 and am using Textract to extract all of its text. The response comes back in batches: the first response includes a 'NextToken' that I need to pass as an argument to the next call of the get_document_analysis method.

How do I avoid running get_document_analysis by hand each time and pasting in the NextToken value received from the previous run?

Here's an attempt:

import boto3

client = boto3.client('textract')

# Start the job and keep its JobId
my_job_id_ref = client.start_document_text_detection(
    DocumentLocation={'S3Object': {'Bucket': 'myawsbucket', 'Name': 'mymuli-page-pdf-file.pdf'}}
)['JobId']

def my_output():
    my_ls = []
    
    # I need to repeat the following call until the break condition further below
    while True: 
        
        # This returns a dictionary, with one key named NextToken, which value will need to be passed as an arg to the next iteration of the function
        x=client.get_document_analysis(JobId = my_job_id_ref) 
        
        # Assigning the value of NextToken to a variable
        next_token = x['NextToken'] 
        
        #Running the function again, this time with the next_token passed as an argument.
        x=client.get_document_analysis(JobId = my_job_id_ref, NextToken = next_token)
        
        # Need to repeat the running of the function until there is no token. The token is normally a string, hence
        if len(next_token) <1:
            break
        
        my_ls.append(x)
        
    return my_ls

Prolle

1 Answer

The trick is to use the while condition to check whether the response still contains a NextToken.

my_ls = []

# Get the analysis once to see if there is a need to loop in the first place
x = client.get_document_analysis(JobId=my_job_id_ref)
next_token = x.get('NextToken')
my_ls.append(x)

# Now repeat until we have the last page, passing on the token from the previous response
while next_token is not None:
    x = client.get_document_analysis(JobId=my_job_id_ref, NextToken=next_token)
    next_token = x.get('NextToken')
    my_ls.append(x)

The value of next_token will be continuously overwritten until it is None, at which point we break out of the loop.

Note that I'm using x.get(..) to check whether the response dictionary contains a NextToken. The key may be missing, for example on the last page or when the whole result fits in a single response, in which case .get(..) returns None. (x["NextToken"] would throw a KeyError if NextToken is not set.)
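
As a variation on the loop above, here is a minimal sketch that wraps the pagination in a function and builds the call arguments in a dictionary, so the call is written only once. The function name collect_pages and the kwargs dictionary are my own choices for illustration and are not part of boto3. The sketch also assumes the job has already finished; it does not poll the job status.

```python
def collect_pages(client, job_id):
    """Return every response page for job_id by following NextToken.

    `client` is expected to be a boto3 Textract client, e.g.
    boto3.client('textract'). This sketch does not wait for the job
    to finish or inspect its status.
    """
    pages = []
    kwargs = {"JobId": job_id}  # NextToken is added only once we have one
    while True:
        page = client.get_document_analysis(**kwargs)
        pages.append(page)

        # .get() returns None when the key is absent (last page)
        token = page.get("NextToken")
        if token is None:
            break
        kwargs["NextToken"] = token
    return pages
```

With the names from the question this would be called as my_ls = collect_pages(client, my_job_id_ref).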

Bert Blommers
  • Thanks, that works. The resulting list is extremely large and crashes my Jupyter notebook (running on my local machine). I have started to look into AWS Lambda, and also PySpark etc., but it looks too heavy for my occasional use case. What would be the fastest way to deal with large lists, or must I go the cloud route? – Prolle Mar 30 '22 at 20:01
  • Difficult to say without knowing the full use case. Ideally, the result of `get_document_analysis` is processed immediately, so either persisted somewhere or sent off to another service for further processing. – Bert Blommers Mar 31 '22 at 09:38
  • You could also look into trimming the result of the call. For example, you can drop the `NextToken` before adding `x` to the list, as that is meaningless at that point. There may be other attributes that are returned, but are not required for your use-case. https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/textract.html#Textract.Client.get_document_analysis – Bert Blommers Mar 31 '22 at 09:41
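
The two suggestions in the comments (process each response immediately, and keep only what you need) can be combined in a generator. The sketch below is one possible way to do it, not the only one. It assumes that only the text of LINE blocks is wanted, and the name iter_line_text is hypothetical. Each response is discarded as soon as its lines have been yielded, so the full result never sits in memory at once.

```python
def iter_line_text(client, job_id):
    """Yield the text of each LINE block, one response page at a time.

    `client` is expected to be a boto3 Textract client. Assumes the job
    has already finished; job status is not checked here.
    """
    kwargs = {"JobId": job_id}
    while True:
        page = client.get_document_analysis(**kwargs)

        # Keep only the text of LINE blocks; everything else is dropped
        for block in page.get("Blocks", []):
            if block.get("BlockType") == "LINE":
                yield block.get("Text", "")

        token = page.get("NextToken")
        if token is None:
            return
        kwargs["NextToken"] = token


# Possible usage (file name is a placeholder): stream lines to disk
# instead of collecting them in a list.
#
# with open("output.txt", "w", encoding="utf-8") as fh:
#     for line in iter_line_text(client, my_job_id_ref):
#         fh.write(line + "\n")
```

Whether LINE text is enough depends on the use case; tables and forms would need other block types from the same response.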