Async Textract in AWS Lambda

Question

How does this architecture handle a large backlog of PDFs to be processed by AWS Textract? If there's a large backlog of messages in the first queue, the first Lambda (scheduled to run every x minutes) would start picking up messages and calling the asynchronous StartDocumentAnalysis operation for each one.

[Image: AWS Textract architecture]

The shortcoming of running the Lambda on a schedule is this: what happens if the PDF document is large and Textract takes longer than x minutes to process it? In that scenario the Lambda would consume the next message in the queue and start another asynchronous StartDocumentAnalysis call. That could hit the Textract default concurrency limit of 2 StartDocumentAnalysis calls at a time.

I can make x minutes longer, but is there a way to make this pipeline smarter? For example, could logic within the Lambda check the number of Textract jobs currently running and, if there is enough headroom, consume the next message in the queue?

Ideally, my solution would need to handle thousands of PDF documents uploaded to the source bucket, which would exceed the maximum regional capacity of 600 concurrent jobs.

jarmod · Answer 1 · 2022-02-28T22:47:19.120


The quota you are referring to is not a concurrency limit of 2 StartDocumentAnalysis calls at a time, but a limit on the number of transactions per second for the start (asynchronous) operations:

StartDocumentAnalysis: 10 in us-east-1/us-west-2, 2 elsewhere
StartDocumentTextDetection: 10 in us-east-1/us-west-2, 1 elsewhere
StartExpenseAnalysis: 5 in us-east-1/us-west-2, 1 elsewhere

The maximum number of asynchronous jobs per account that can simultaneously exist is 600 in us-east-1 and us-west-2, and 100 in all other regions.
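
As far as I know there is no Textract call that reports how many jobs are currently running, so one option is to just try to start the job and treat a quota error as "leave this message on the queue for the next run". Below is a minimal sketch of that idea, not a definitive implementation. The names `start_jobs`, `error_code`, and `RETRYABLE_CODES` are hypothetical, as are the `max_tps` default and the `FeatureTypes` values; `textract` is assumed to be a `boto3.client("textract")`, passed in so the function can be exercised with a stub.

```python
import time

# Error codes Textract can return when a start call is throttled or the
# account already holds the maximum number of in-flight async jobs.
RETRYABLE_CODES = {
    "ProvisionedThroughputExceededException",
    "ThrottlingException",
    "LimitExceededException",
}


def error_code(exc):
    """Return the AWS error code of a botocore ClientError-style exception, or None."""
    response = getattr(exc, "response", None) or {}
    return response.get("Error", {}).get("Code")


def start_jobs(textract, documents, max_tps=2, sleep=time.sleep):
    """Start one analysis job per (bucket, key) pair, pacing calls to max_tps.

    Returns (started_job_ids, deferred_documents). Deferred documents hit a
    quota and should stay on the queue for the next scheduled run.
    """
    started, deferred = [], []
    for index, (bucket, key) in enumerate(documents):
        if deferred:
            # A quota was already hit; leave the rest for the next run.
            deferred.append((bucket, key))
            continue
        if index:
            sleep(1.0 / max_tps)  # crude pacing to stay under the TPS quota
        try:
            response = textract.start_document_analysis(
                DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
                FeatureTypes=["TABLES", "FORMS"],  # assumed; pick what you need
            )
        except Exception as exc:
            if error_code(exc) not in RETRYABLE_CODES:
                raise  # not a quota problem, so surface it
            deferred.append((bucket, key))
        else:
            started.append(response["JobId"])
    return started, deferred
```

In the Lambda you would then delete only the SQS messages whose documents ended up in `started`; the deferred ones become visible again after the visibility timeout and are retried on a later run, once earlier jobs have finished and freed capacity.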

edited Feb 28 '22 at 22:47

answered Feb 28 '22 at 19:53

jarmod


The link specifies 10 txns/second for StartDocumentAnalysis. Does this mean I can request an increase of my current limit from 2 up to 10? – sprint5 Feb 28 '22 at 21:12
I misread the numbers slightly and have updated my answer. Yes, up to 10 StartDocumentAnalysis transactions/sec, with up to 600 concurrent jobs (in us-east-1/us-west-2). – jarmod Feb 28 '22 at 22:50
