Textract async read PDF

Question

From the Textract documentation: "Documents for synchronous operations can be in PNG or JPEG format. Documents for asynchronous operations can also be in PDF format."

I have a Node.js application where I use async Textract to read a PDF file. My code looks like this:

import * as AWS from 'aws-sdk';

const textract = new AWS.Textract({ region: '<REGION>' });

export const callTextract = (file: File, uuid: string): Promise<any> => {
  return new Promise<any>((resolve, reject) => {
    const params = {
      Document: {
        Bytes: file,
      },
    };
    textract.detectDocumentText(params, (err, data) => {
      ....
      resolve(data);
    });
  })
}

The file here has already been read from disk and is a Buffer. I can confirm that it is a PDF file from its first 4 bytes (see "Detecting file type from buffer in node js?"):

 <Buffer 25 50 44 46 ... >

The error I receive is UnsupportedDocumentException.
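The first-4-bytes check mentioned above can be sketched as a small helper. This is only an illustration: `isPdf` and `PDF_MAGIC` are hypothetical names, not part of any SDK, and the check only confirms the `%PDF` signature, not that the rest of the file is a valid PDF.

```typescript
// "%PDF" in ASCII: the bytes 25 50 44 46 shown in the Buffer dump above.
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46];

// Hypothetical helper. Accepts any Uint8Array; a Node.js Buffer is a
// Uint8Array subclass, so a Buffer can be passed directly.
function isPdf(bytes: Uint8Array): boolean {
  return (
    bytes.length >= PDF_MAGIC.length &&
    PDF_MAGIC.every((value, index) => bytes[index] === value)
  );
}
```

Calling `isPdf(file)` on the buffer from the question should return `true`, which matches the observation that the input is a PDF.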

score 0 · Answer 1 · answered Jul 09 '20 at 03:46

detectDocumentText() is synchronous. The async version is startDocumentTextDetection().

See the documentation:

Detects text in the input document. Amazon Textract can detect lines of text and the words that make up a line of text. The input document must be an image in JPEG or PNG format.

...

DetectDocumentText is a synchronous operation. To analyze documents asynchronously, use StartDocumentTextDetection.

Note that the async mechanism of the language (callbacks, Promises) is not the same thing as an asynchronous API operation. Asynchronous Textract operations always involve at least two calls: one to start the job and one to fetch its result. In this case, the second call is getDocumentTextDetection().
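A minimal sketch of that two-call flow, under some assumptions: the PDF has already been uploaded to S3 (to my knowledge the start call takes a `DocumentLocation` pointing at an S3 object rather than raw bytes), and the job is polled rather than tracked through a notification channel. The matching result call in the SDK is `getDocumentTextDetection()`. `TextractLike`, `AwsRequest`, `detectPdfText`, the bucket and key parameters, and the polling defaults are all hypothetical names and values chosen for illustration. The interface only mirrors the `.promise()` style of aws-sdk v2, so that the sketch does not depend on a particular SDK version.

```typescript
// Minimal structural types (assumed shapes, modeled on aws-sdk v2 requests).
interface AwsRequest<T> {
  promise(): Promise<T>;
}

interface TextractLike {
  startDocumentTextDetection(params: {
    DocumentLocation: { S3Object: { Bucket: string; Name: string } };
  }): AwsRequest<{ JobId?: string }>;
  getDocumentTextDetection(params: { JobId: string; NextToken?: string }): AwsRequest<{
    JobStatus?: string;
    StatusMessage?: string;
    NextToken?: string;
    Blocks?: unknown[];
  }>;
}

async function detectPdfText(
  client: TextractLike,
  bucket: string,
  key: string,
  pollIntervalMs = 5000, // arbitrary default, tune for your workload
  maxPolls = 60, // arbitrary default
): Promise<unknown[]> {
  // Call 1: start the job. The PDF is referenced by its S3 location.
  const started = await client
    .startDocumentTextDetection({
      DocumentLocation: { S3Object: { Bucket: bucket, Name: key } },
    })
    .promise();
  const jobId = started.JobId;
  if (!jobId) throw new Error('Textract did not return a JobId');

  // Call 2 (repeated): poll until the job is no longer IN_PROGRESS.
  for (let attempt = 0; attempt < maxPolls; attempt++) {
    const first = await client.getDocumentTextDetection({ JobId: jobId }).promise();
    if (first.JobStatus === 'IN_PROGRESS') {
      await new Promise<void>((resolve) => setTimeout(resolve, pollIntervalMs));
      continue;
    }
    // This sketch treats every status other than SUCCEEDED as a failure.
    if (first.JobStatus !== 'SUCCEEDED') {
      throw new Error(`Job ${jobId} ended with ${first.JobStatus}: ${first.StatusMessage ?? ''}`);
    }
    // Results can be paginated; follow NextToken until there are no more pages.
    const blocks = [...(first.Blocks ?? [])];
    let token = first.NextToken;
    while (token) {
      const page = await client
        .getDocumentTextDetection({ JobId: jobId, NextToken: token })
        .promise();
      blocks.push(...(page.Blocks ?? []));
      token = page.NextToken;
    }
    return blocks;
  }
  throw new Error(`Job ${jobId} did not finish after ${maxPolls} polls`);
}
```

With the client from the question, usage would look roughly like `await detectPdfText(textract, '<BUCKET>', '<KEY>')`. Polling keeps the example short; whether it is the right choice depends on how long your documents take to process.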

...though I'd consider this as yet another example of bad AWS documentation.

score 0 · Answer 2 · answered Aug 22 '20 at 06:48

You can supply a Bytes field in both the synchronous and asynchronous APIs, but the definition of the Bytes field is the same in both:

A blob of base64-encoded document bytes. The maximum size of a document that's provided in a blob of bytes is 5 MB. The document bytes must be in PNG or JPEG format.

Therefore, you cannot pass a PDF as the value of the Bytes field.

From the documentation: https://docs.aws.amazon.com/textract/latest/dg/API_Document.html#API_Document_Contents
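The constraints quoted above (at most 5 MB, PNG or JPEG only) can be checked before calling the API. A rough sketch follows. `checkBytesPayload`, `BytesCheck`, and `hasSignature` are hypothetical names, "5 MB" is assumed to mean 5 * 1024 * 1024 bytes, and the format test only looks at the leading signature bytes.

```typescript
// Assumption: the documented "5 MB" limit is interpreted as 5 MiB.
const MAX_BYTES = 5 * 1024 * 1024;

type BytesCheck =
  | { ok: true; format: 'png' | 'jpeg' }
  | { ok: false; reason: string };

// True when `bytes` begins with the given signature.
function hasSignature(bytes: Uint8Array, signature: number[]): boolean {
  return (
    bytes.length >= signature.length &&
    signature.every((value, index) => bytes[index] === value)
  );
}

// Hypothetical pre-flight check for a Bytes payload.
function checkBytesPayload(bytes: Uint8Array): BytesCheck {
  if (bytes.length > MAX_BYTES) {
    return { ok: false, reason: 'larger than 5 MB' };
  }
  // PNG files start with 89 50 4E 47 0D 0A 1A 0A.
  if (hasSignature(bytes, [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])) {
    return { ok: true, format: 'png' };
  }
  // JPEG files start with FF D8 FF.
  if (hasSignature(bytes, [0xff, 0xd8, 0xff])) {
    return { ok: true, format: 'jpeg' };
  }
  return { ok: false, reason: 'not PNG or JPEG' };
}
```

A buffer that starts with `25 50 44 46` (a PDF) would come back as `{ ok: false, reason: 'not PNG or JPEG' }`, which mirrors the UnsupportedDocumentException in the question.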
