I am using Google Document AI to process PDF documents. After sending a PDF document, Google sends a JSON reply containing the detected text and the exact location of each word. This is a sample JSON response:
{
"uri": "",
"mimeType": "application/pdf",
"text": "Suppose that life is absurd for the reasons that Camus claims. If that were the case, do you\nthink Camus's response is
appropriate? If you agree with Camus, discuss at least one\nobjection to his proposed response and reply to it. If you do not
agree, say why, and briefly\ndescribe what you think might be a more fitting response.\nIn the midst of all chaos in the world, no
We see that the part of interest ("In the midst") contains a single space between each word.
Now, using this JSON response, I try to write every single word at its exact location on the document to make a scanned PDF searchable. But in some locations, when I Ctrl + F the document, I need to type 2 spaces between words. So instead of querying "In the midst", I need to look for "In the midst" typed with two spaces between the words.
Single space query:
Double space query:
The tokens I pass in to be written don't contain any spaces. I write "In" and not "In " or " In".
This is what the code responsible for writing the text looks like:
    for i in range(len(a)):  # Loop through pages
        for j in range(len(a[i])):  # Loop through words in page
            token = a[i][j]
            can.drawString(token["x"], token["y"], token["text"])
Where token holds the data of the word to be written:
token["x"]: x position
token["y"]: y position
token["text"]: text to write
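To double-check the claim that no token carries whitespace, a small helper like the one below could be run over the same data before drawing. This is only a sketch: `find_whitespace_tokens` and the `sample` data are hypothetical names made up for illustration, and it assumes the pages are lists of dicts with a "text" key, as in `a` above. It relies on `str.isspace()`, which also catches less visible characters such as a non-breaking space or a trailing newline.

```python
def find_whitespace_tokens(pages):
    """Return (page_index, token_index, repr(text)) for every token
    whose text contains at least one whitespace character."""
    hits = []
    for i, page in enumerate(pages):
        for j, token in enumerate(page):
            text = token["text"]
            # str.isspace() is True for ' ', '\n', '\t', '\u00a0', etc.
            if any(ch.isspace() for ch in text):
                hits.append((i, j, repr(text)))
    return hits


# Hypothetical sample in the same shape as `a`: one page, three tokens.
sample = [[
    {"x": 72, "y": 700, "text": "In"},
    {"x": 90, "y": 700, "text": "the\u00a0"},  # trailing non-breaking space
    {"x": 115, "y": 700, "text": "midst"},
]]

for page_index, token_index, shown in find_whitespace_tokens(sample):
    print(page_index, token_index, shown)
```

If this reports nothing for the real data, the tokens themselves are clean and the extra space is not coming from token["text"].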
How is it possible for an extra space to be added when token["text"] doesn't contain any spaces?
Moreover, this issue only happens in certain instances. The following screenshot shows a query that succeeds with single spaces.
Successful single spaced query: