I have some text in a PDF that has been OCR'ed. The OCR returns the bounding boxes of the words to me. I'm able to draw the bounding boxes (wordRect) on the PDF and everything seems correct.

But when I set my font size to the height of these bounding boxes, it all goes wrong. The text appears way smaller than it should be and doesn't match the height.

There's some conversion I am missing. How can I make sure the text is as high as the bounding boxes?

pdftron.PDF.Font font = pdftron.PDF.Font.Create(convertedPdf.GetSDFDoc(), pdftron.PDF.Font.StandardType1Font.e_helvetica);
for (int j = 0; j < ocrStream.pr_WoordList.Count; j++)
{
    wordRect = (Rectangle) ocrStream.pr_Rectangles[j];

    Element textBegin = elementBuilder.CreateTextBegin();
    gStateTextRun = textBegin.GetGState();
    gStateTextRun.SetTextRenderMode(GState.TextRenderingMode.e_stroke_text);
    elementWriter.WriteElement(textBegin);

    fontSize = wordRect.Height;
    double descent;

    if (hasColorImg)
    {
        descent = (-1 * font.GetDescent() / 1000d) * fontSize;
        textRun = elementBuilder.CreateTextRun((string)ocrStream.pr_WoordList[j], font, fontSize);

        // translate the word to its correct position on the pdf

        // the bottom line of the word rectangle is the baseline for the font, that's why we need the descender
        textRun.SetTextMatrix(1, 0, 0, 1, wordRect.Left, wordRect.Bottom + descent);
DennisVA
  • Could you post a screenshot of what you see, and clearly indicate what you expected to see? Note that font size is a "scaling factor" and does not explicitly set the size of the text. That depends on the glyphs themselves. – Ryan Jan 11 '18 at 19:32
  • Is this still an issue for you? I would like to assist, but it is unclear what exactly you expect as the output. The exact height of a glyph depends not just on the font and font size, but also on the particular glyph (e.g. `a` versus `A`). Are your bounding boxes per glyph? Again, a screenshot would help a lot, showing what you got and what you expected to get. – Ryan Jan 15 '18 at 18:13
  • I found another solution. Thanks for the info Ryan, I really appreciate your help – DennisVA Jan 16 '18 at 09:38
  • If you can, please answer your own question; I am sure it would be useful to others. – Ryan Jan 16 '18 at 18:37

1 Answer

How can I make sure the text is as high as the bounding boxes?

The font size is just a scaling factor. In most cases one unit of it corresponds to 1/72 inch (1 pt) on the page, but not always.

The transformations are: GlyphSpace -> TextSpace -> UserSpace (where UserSpace is essentially the page space, with units of 1/72 inch).

The glyphs in the font are defined in GlyphSpace, and there is a font matrix that maps GlyphSpace to TextSpace. Typically, 1000 units in GlyphSpace map to 1 unit in TextSpace, but not always.

Then the text matrix (element.SetTextMatrix), the font size (the variable in question here), and some additional parameters transform TextSpace coordinates to UserSpace.
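As a rough worked example of that chain, assume the typical font matrix of 1/1000, a text matrix with no scaling (as in the question's SetTextMatrix(1, 0, 0, 1, ...)), and no scaling in the page's current transformation matrix. Under those assumptions the vertical size of a glyph feature in UserSpace is just its GlyphSpace height scaled by the font size. The numbers below use the Adobe AFM metrics for Helvetica (x-height 523, cap height 718), which are an assumption about the font actually embedded:

```latex
% h_glyph: height of a glyph feature in GlyphSpace units
% s:       font size passed to CreateTextRun
h_{\text{user}} = s \cdot \frac{h_{\text{glyph}}}{1000}

% With s = 20 (a bounding box 20 units high used directly as the font size):
h_{\text{x-height}}   = 20 \cdot \frac{523}{1000} = 10.46
\qquad
h_{\text{cap height}} = 20 \cdot \frac{718}{1000} = 14.36
```

So a lowercase word set at a font size equal to the box height would fill only about half of the box, which matches the "way smaller than it should be" symptom.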

In the end, though, the exact height also depends on the glyph.
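One possible way to act on this is to invert the scaling: pick the font size so that a chosen vertical extent of the font fills the box, instead of using the box height as the font size. The following is a minimal sketch, not PDFNet code. FontFit, FontSizeForBox, and BaselineOffset are hypothetical names, the 1000 units per em is the typical case mentioned above, and the metrics 718 and -207 are the Adobe AFM ascender and descender for Helvetica, assumed here for illustration. In the question's code the descent would come from font.GetDescent(); the matching top metric is assumed to be available from the SDK as well.

```csharp
using System;

static class FontFit
{
    // Font size at which the span from `bottom` to `top` (both in GlyphSpace
    // units, bottom usually negative) is exactly boxHeight in UserSpace.
    // Assumes an unscaled text matrix and unitsPerEm GlyphSpace units per em.
    public static double FontSizeForBox(double boxHeight, double top, double bottom,
                                        double unitsPerEm = 1000.0)
    {
        double extent = (top - bottom) / unitsPerEm;
        if (extent <= 0)
            throw new ArgumentException("top must be greater than bottom");
        return boxHeight / extent;
    }

    // Distance from the bottom of the box up to the baseline at this font size.
    public static double BaselineOffset(double fontSize, double bottom,
                                        double unitsPerEm = 1000.0)
    {
        return -bottom / unitsPerEm * fontSize;
    }

    static void Main()
    {
        // Hypothetical OCR box height, with the assumed Helvetica AFM metrics.
        double boxHeight = 18.5;
        double size = FontSizeForBox(boxHeight, 718, -207);
        double offset = BaselineOffset(size, -207);
        Console.WriteLine($"fontSize={size:F2}, baselineOffset={offset:F2}");
    }
}
```

This treats the box as running from the descender line to the ascender line. If the OCR boxes are tight around the ink of each word, the extent differs per word (e.g. "ace" versus "Apply"), and top and bottom would have to come from the actual glyphs in that word.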

This forum post shows how to go from the glyph data to UserSpace; see ProcessElements: https://groups.google.com/d/msg/pdfnet-sdk/eOATUHGFyqU/6tsUF0BHukkJ

Ryan
  • Thanks, this is a really good thing to know for later on. I didn't want to share my solution because it's rather a workaround – DennisVA Jan 17 '18 at 08:12