
I am using Google Document AI to process PDF documents. After sending a PDF document, Google sends a JSON reply containing the detected text and the exact location of each word. This is a sample JSON response:

Screenshot of JSON response

{
    "uri": "",
    "mimeType": "application/pdf",
    "text": "Suppose that life is absurd for the reasons that Camus claims. If that were the case, do you\nthink Camus's response is 
    appropriate? If you agree with Camus, discuss at least one\nobjection to his proposed response and reply to it. If you do not 
    agree, say why, and briefly\ndescribe what you think might be a more fitting response.\nIn the midst of all chaos in the world, no 

We see that the part of interest ("In the midst") contains a single space between each word.

Using this JSON response, I write every word at its exact location on the document to make a scanned PDF searchable. But in some locations, when I Ctrl + F the document, I need to put 2 spaces between words. So instead of querying "In the midst" I need to search for "In  the  midst".

Single space query:


Double space query:


The tokens I pass in to be written don't contain any spaces: I write "In", not "In " or " In".

This is what the code responsible for writing the text looks like:

for page in a:  # Loop through pages
  for token in page:  # Loop through words in page
    # Draw each word on its own at the position reported by the OCR
    can.drawString(token["x"], token["y"], token["text"])

Where token holds the data of the word to be written.

  • token["x"]: x position

  • token["y"]: y position

  • token["text"]: text to write

How is it possible for an extra space to be added when token["text"] doesn't contain any spaces?

Moreover, this issue only happens in certain places. The following screenshot shows a query that succeeds with single spaces.

Successful single spaced query:


double-beep
Aymane
  • I don't think the issue is with the OCR. The Google Document AI JSON response is pretty accurate. You can see in the first screenshot that the response is single spaced. The problem is unusual because there does not seem to be any reason why certain text sequences work with a single space while others need a double space. The example above shows how the sentence "In the midst" needs double spaces between words, while the last link shows a successful single-spaced query. Could you explain what you mean by "plain text without positional spaces"? Thanks! – Aymane Jun 06 '22 at 16:37
  • This is not single vs double spacing. This is just what happens with justified text. Reportlab places the words individually so that the spacing between words is equal across the line. It doesn't use spaces. PDFs were designed to be printed, not scraped, so the PDF readers have to GUESS what the input text was. The actual input text is not present in the PDF. Some documents even have the letters placed individually. You will have to adjust your matching. – Tim Roberts Aug 02 '22 at 21:22
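
One way to read "adjust your matching", when the search runs in your own code over text extracted from the PDF rather than in a viewer's Ctrl + F box, is to treat any run of whitespace between words as equivalent. The following is a minimal sketch under that assumption; the function name `whitespace_tolerant_pattern` and the sample string are hypothetical and not part of any library.

```python
import re


def whitespace_tolerant_pattern(query: str) -> re.Pattern:
    """Build a regex matching the query's words separated by any run of whitespace."""
    words = query.split()
    # Escape each word so punctuation in the query is matched literally,
    # then allow one or more whitespace characters between consecutive words.
    return re.compile(r"\s+".join(re.escape(word) for word in words))


# Hypothetical text as a PDF reader might reconstruct it, with doubled spaces.
extracted = "In  the  midst of all chaos in the world"

pattern = whitespace_tolerant_pattern("In the midst")
match = pattern.search(extracted)
print(match.group(0) if match else "no match")
```

The same pattern also matches the single-spaced form, so one query covers both cases. It does not change what the PDF viewer's built-in search does.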

1 Answer


FYI, Document AI has an actively monitored tag [cloud-document-ai]


Not 100% sure on this, but I recommend checking the Token.DetectedBreak field. Its Type enum describes the kind of break detected after a token and has separate values for a regular space (SPACE) and a wide space (WIDE_SPACE). It could be worth checking which type of break is being detected for the words that need two spaces.
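
As a rough sketch of that check, the snippet below walks a Document AI response that has already been loaded into a Python dict (for example with `json.load`) and pairs each token's text with its break type. It assumes the camelCase field names of the JSON form of the response (`pages`, `tokens`, `layout.textAnchor.textSegments`, `detectedBreak.type`), that indices may arrive as strings, and that `startIndex` may be omitted when it is 0. The helper name `token_breaks` and the sample document are made up for illustration.

```python
def token_breaks(document: dict) -> list:
    """Return (word, break_type) pairs for every token in the response dict."""
    text = document.get("text", "")
    result = []
    for page in document.get("pages", []):
        for token in page.get("tokens", []):
            anchor = token.get("layout", {}).get("textAnchor", {})
            # A token's text is the concatenation of its segments of document["text"].
            word = "".join(
                text[int(seg.get("startIndex", 0)):int(seg.get("endIndex", 0))]
                for seg in anchor.get("textSegments", [])
            )
            break_type = token.get("detectedBreak", {}).get("type", "TYPE_UNSPECIFIED")
            # Segments usually include the trailing break character, so strip it.
            result.append((word.strip(), break_type))
    return result


# Hand-built sample, NOT real Document AI output.
sample = {
    "text": "In the midst\n",
    "pages": [{"tokens": [
        {"layout": {"textAnchor": {"textSegments": [{"endIndex": "3"}]}},
         "detectedBreak": {"type": "SPACE"}},
        {"layout": {"textAnchor": {"textSegments": [{"startIndex": "3", "endIndex": "7"}]}},
         "detectedBreak": {"type": "WIDE_SPACE"}},
        {"layout": {"textAnchor": {"textSegments": [{"startIndex": "7", "endIndex": "13"}]}},
         "detectedBreak": {"type": "SPACE"}},
    ]}],
}

for word, break_type in token_breaks(sample):
    print(word, break_type)
```

If the words that need a double space in the search turn out to be followed by WIDE_SPACE breaks, that would point to the OCR output; if they are all SPACE, the spacing is more likely introduced when the words are placed on the page.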

The code samples have also been updated recently and show how to access all of the OCR data in the Document AI output.

https://cloud.google.com/document-ai/docs/handle-response#code_samples

Holt Skinner