
In the package pdftools, there are two functions pdf_data() (which works on pre-OCR'd PDF files) and pdf_ocr_data() (which will OCR a PDF file regardless of whether it is already OCR'd or not).

  • pdf_data() returns a list of tibbles, each with 6 fields: width, height, x, y, space, and text. Example output:

    # A tibble: 2 x 6
      width height     x     y space text
      <int>  <int> <int> <int> <lgl> <chr>
    1    51     12    15    65 TRUE  Text1
    2    59     12    70    65 FALSE Text2
    
  • pdf_ocr_data() returns a list of tibbles, each with 3 fields: word, confidence, and bbox. Example output:

    # A tibble: 2 x 3
      word  confidence bbox
      <chr>      <dbl> <chr>
    1 Text1       96.8 136,546,551,647
    2 Text2       96.7 590,545,1078,625
    

Per "Using the pdf_data function from the pdftools package efficiently", I have confirmed that the pdf_data() x and y fields are coordinates measured from the top-left corner of the page, which is (0, 0). However, I'm not sure what the units are.

The documentation for the pdf_ocr_data() function explains the function's arguments, but not its output. While word and confidence seem relatively self-explanatory, I'm stuck on figuring out what the elements of bbox are. They seem to have something to do with the word coordinates. However, the example outputs above are the results for the first 2 words of the same PDF page, and as you can see, the values differ between the two functions. In my testing, the first two values of bbox appear to relate to x and y, respectively, with ratios of bbox[1]/x and bbox[2]/y ranging from 8.4 to just over 9.

So my questions are as follows:

  1. What are the units of x and y as provided by pdf_data()?
  2. What are the values displayed in the bbox field produced by pdf_ocr_data?
  3. How are the values related, precisely?
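The ratio check described above can be sketched as follows, using only the two example rows. This is a Python sketch of the arithmetic rather than R code; `parse_bbox` is a hypothetical helper, and the bbox order x0, y0, x1, y1 is an assumption, not something the documentation confirms. The final line is also an assumption: if pdf_data() reports points (1/72 inch) and the OCR step rasterises the page at 600 dpi, the expected ratio would be 600/72.

```python
def parse_bbox(bbox):
    """Split a bbox string such as "136,546,551,647" into four integers.

    Assumed (undocumented) order: x0, y0, x1, y1.
    """
    x0, y0, x1, y1 = (int(v) for v in bbox.split(","))
    return x0, y0, x1, y1


# Example rows copied from the two outputs above:
# x and y come from pdf_data(), bbox comes from pdf_ocr_data().
rows = [
    {"text": "Text1", "x": 15, "y": 65, "bbox": "136,546,551,647"},
    {"text": "Text2", "x": 70, "y": 65, "bbox": "590,545,1078,625"},
]

ratios = []
for row in rows:
    x0, y0, _, _ = parse_bbox(row["bbox"])
    ratio_x = x0 / row["x"]
    ratio_y = y0 / row["y"]
    ratios.append((ratio_x, ratio_y))
    print(f'{row["text"]}: bbox[1]/x = {ratio_x:.2f}, bbox[2]/y = {ratio_y:.2f}')

# ASSUMPTION: points (72 per inch) versus pixels at 600 dpi.
ASSUMED_DPI = 600
expected_ratio = ASSUMED_DPI / 72
print(f"ratio expected under that assumption: {expected_ratio:.2f}")
```

Because pdf_data() reports integers, small x and y values carry large relative rounding error, which could account for some of the spread in the ratios.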

P.S. I have not provided reproducible code here, since these are questions about the nature of the output rather than about troubleshooting or solving an issue. There is a forum post here: https://discuss.ropensci.org/t/text-vs-word-xy-coordinate-differences-between-pdf-data-and-pdf-ocr-data/3518 asking similar questions, but there are no replies to that post.

zx485
pseudorandom
  • With neither code nor sample input, you're restricting yourself to answers from only those who know `pdftools` rather than the wider audience who may be able to figure out what's going on given enough information. At a guess, `bbox` might be "bounding box" with the four values representing x and y co-ordinates for top left and bottom right (or bottom left and top right) corners. The default measurement unit in PDF files appears to be 1/72 of an inch. Voting to close due to missing debugging information. – Limey Jun 13 '23 at 17:03
  • Again, this is not something that needs debugging. And yes, I was indeed hoping for responses from the audience that knows `pdftools`, since this is a `pdftools`-specific question. And yes, I am aware of PDF units being 1/72 of an inch, which can clearly apply to at most one of these outputs. – pseudorandom Jun 13 '23 at 17:05
  • Well, `pdftools` is a widely known utility for processing PDF files. I consider this question to be answerable. Would vote to ReOpen if closed. – zx485 Jun 13 '23 at 17:06
  • Thank you @zx485 – pseudorandom Jun 13 '23 at 17:08
  • @Limey not every question needs reproducible data. This is a good example of a clear, concise, non-subjective, answerable question about a programming tool - even though the question doesn't have data. A link or two to relevant source repositories might be nice. It may be possible to attempt to answer this question through trial and error by generating PDFs, running these functions on them, and making inferences, but a better answer would be a link to some documentation or underlying code for these utilities or their dependencies - which doesn't require an MCVE. – Gregor Thomas Jun 13 '23 at 18:16

1 Answer


If we take the example data set, we can see that the values are the top-left corners of the text tiles, so here pdf_data() reports x = 154, y = 139.


This implies the text has an em size of 8.
However, if we inspect the source PDF, we see that the text is really set at 9.9626 points, at a scale unknown to us (units are not simple constants in a PDF, so we cannot know what one unit is). We can surmise that the file is designed for a media box of [0 0 612 792], which means the origin defaults to the lower left and the page is intended to be used as American Letter size.

BT
/F8 9.9626 Tf 154.69 646.077 Td [(Mazda)-333(RX4)]TJ
ET
BT
/F8 9.9626 Tf 154.69 633.724 Td [(Mazda)-333(RX4)-334(W)84(ag)]TJ
ET
BT
/F8 9.9626 Tf 154.69 621.37 Td [(Datsun)-333(710)]TJ
ET
BT
/F8 9.9626 Tf 154.69 609.016 Td [(Hornet)-333(4)-334(Driv)28(e)]TJ
ET

If we inspect font /F8, we see it is Computer Modern 10 point, so we can say the letters are at 99.6% of their original true scale. If we multiply 154.69 by that ratio, we get the 154 units from the left reported above.

For the height, we can subtract 139 from 792, so the top left is at 653 above the origin. If we take the lower left as 646.077, there is an awkward difference of 6.923, which is not near the letters' height, so we can presume the scaling difference between the two unit bases is again at play.
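The arithmetic in the last two paragraphs can be reproduced with a short sketch. It is written in Python purely as a calculator; the 792-unit page height is the surmised media box from above, not a value read from the file, and all variable names are illustrative.

```python
# ASSUMPTION: page height taken from the surmised media box [0 0 612 792].
PAGE_HEIGHT = 792.0

# Font size set by the Tf operator relative to the 10-point design size.
FONT_SCALE = 9.9626 / 10

# Operands of the first Td operator in the content stream (origin at lower left).
td_x, td_y = 154.69, 646.077

# Values reported by pdf_data() for the same text (origin at top left).
pdf_data_x, pdf_data_y = 154, 139

# Horizontal check: scale the Td x position by the font ratio.
scaled_x = td_x * FONT_SCALE

# Vertical check: convert the top-left y into a height above the lower-left origin,
# then compare it with the baseline position given by Td.
top_above_origin = PAGE_HEIGHT - pdf_data_y
gap = top_above_origin - td_y

print(f"scaled x:         {scaled_x:.2f}")   # 154.11
print(f"top above origin: {top_above_origin:.0f}")  # 653
print(f"gap to baseline:  {gap:.3f}")        # 6.923
```

Note that simply truncating 154.69 to an integer also gives 154, so this check alone cannot distinguish font scaling from integer rounding.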

For OCR, the work areas for single letters, part words, or full words can vary considerably, since glyphless text will not be scaled as consistently as a font with glyphs.

The height of a theoretical bounding box may not reflect the true height: it could be 1 point high for a 10-point character. OCR coordinates should therefore always be compared with the expected source output within the resulting file.


If we now compare the apples above with the OCR pears, none of the scales will be the same, as they are approximations.


K J
  • @K J Thank you for your detailed response. It looks like you were assessing the units of the x and y coords resulting from `pdf_data()`? Attempting to summarise your response here, are you saying that (1) coords are relative to the top-left corner or the bottom-left? (2) the x-y coords returned by `pdf_data()` depend on the font sizes of the text itself and therefore are not consistently approached by the two different functions? (3) the x-y coords returned are also based on the actual page size, so must be taken relative to that? – pseudorandom Jun 14 '23 at 17:10