In the package pdftools, there are two functions: pdf_data() (which works on pre-OCR'd PDF files) and pdf_ocr_data() (which will OCR a PDF file regardless of whether it is already OCR'd or not).
pdf_data() results in a list of tibbles, each with 6 fields: width, height, x, y, space, and text. Example output:

    # A tibble: 2 x 6
      width height     x     y space text
      <int>  <int> <int> <int> <lgl> <chr>
    1    51     12    15    65 TRUE  Text1
    2    59     12    70    65 FALSE Text2
pdf_ocr_data() results in a list of tibbles with 3 fields: word, confidence, and bbox. Example output:

    # A tibble: 2 x 3
      word  confidence bbox
      <chr>      <dbl> <chr>
    1 Text1       96.8 136,546,551,647
    2 Text2       96.7 590,545,1078,625
Per the question "Using the pdf_data function from the pdftools package efficiently", I have confirmed that the x and y fields of pdf_data() are coordinates measured from the top-left corner of the page, which is (0, 0). However, I'm not sure what the units are.
The documentation for the pdf_ocr_data() function explains the function's arguments, but not its output. While word and confidence seem relatively self-explanatory, I'm stuck on figuring out what the elements of bbox are. They seem to have something to do with the word coordinates. However, the example outputs I provided above are the results for the first 2 words of the same PDF page, and as you can see the numbers are different. In my testing, the first two values of bbox appear to relate to x and y, respectively, with the ratios bbox[1]/x and bbox[2]/y ranging from 8.4 to just over 9.
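To make that comparison concrete, here is a minimal sketch of how the ratios can be computed. It uses only the two sample rows shown above, typed in by hand rather than produced by a live call to pdf_data() or pdf_ocr_data(), and the column names b1 through b4 are placeholders I made up, since what each bbox element means is exactly what I'm asking.

```r
# Sample rows copied by hand from the example output above (not a live call).
text_words <- data.frame(
  text = c("Text1", "Text2"),
  x = c(15L, 70L),
  y = c(65L, 65L),
  stringsAsFactors = FALSE
)
ocr_words <- data.frame(
  word = c("Text1", "Text2"),
  bbox = c("136,546,551,647", "590,545,1078,625"),
  stringsAsFactors = FALSE
)

# Split each comma-separated bbox string into four numeric values.
# b1..b4 are placeholder names; their meaning is the open question.
bbox_list <- lapply(strsplit(ocr_words$bbox, ",", fixed = TRUE), as.numeric)
bbox_parts <- do.call(rbind, bbox_list)
colnames(bbox_parts) <- c("b1", "b2", "b3", "b4")

# Ratio of the first two bbox values to the pdf_data() x and y,
# matching the words by row position.
ratios <- data.frame(
  word = ocr_words$word,
  ratio_x = round(bbox_parts[, "b1"] / text_words$x, 2),
  ratio_y = round(bbox_parts[, "b2"] / text_words$y, 2),
  stringsAsFactors = FALSE
)
print(ratios)
```

For these two rows the ratios fall between roughly 8.4 and 9.1, which is the spread described above. Matching words by row position is an assumption that only holds if both functions return the words in the same order.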
So my questions are as follows:
- What are the units of x and y as provided by pdf_data()?
- What are the values displayed in the bbox field produced by pdf_ocr_data()?
- How are the values related, precisely?
P.S. I have not provided reproducible code here, since these are questions about the nature of the output rather than about troubleshooting or solving a specific issue. There is a forum post asking similar questions here: https://discuss.ropensci.org/t/text-vs-word-xy-coordinate-differences-between-pdf-data-and-pdf-ocr-data/3518 but it has no replies.