
I'm exploring options for semi-automated redaction of PDFs using various NLP techniques, and have been using PyMuPDF with Tesseract (via ocrmypdf) for OCR. This works pretty well overall, but management want to try Textract as an alternative. It's easy enough to call it against a single page of a PDF and read the resulting dictionary, but there's no simple way (that I've found yet) to map that back into the PDF as invisible text to create a searchable version of the page (all of which ocrmypdf does automatically).

For reference, here's an example of the dict that Textract produces. A given entry can be either a WORD or LINE.

             'Id': 'be018daa-02c9-47d2-903a-73b69bdaa181',
             'Text': "owners'",
             'TextType': 'PRINTED'},
            {'BlockType': 'WORD',
             'Confidence': 95.73345947265625,
             'Geometry': {'BoundingBox': {'Height': 0.014128071255981922,
                                          'Left': 0.7538964748382568,
                                          'Top': 0.7295616269111633,
                                          'Width': 0.08705723285675049},
                          'Polygon': [{'X': 0.7539187669754028,
                                       'Y': 0.7295616269111633},
                                      {'X': 0.8409537076950073,
                                       'Y': 0.7295762896537781},
                                      {'X': 0.8409309983253479,
                                       'Y': 0.7436897158622742},
                                      {'X': 0.7538964748382568,
                                       'Y': 0.7436745166778564}]},

Has anyone done this in Python, or have suggestions?

I'm working through various options. One mechanism I was thinking of was using the polygon coordinates provided for each LINE or WORD to create a new PyMuPDF Rect, then calling insertTextbox() against that rectangle.
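For illustration only, here's a minimal sketch of that idea. The file name, the abridged block, and the fixed fontsize are placeholder assumptions, and I'm scaling the normalized coordinates by the page size by hand:

    import fitz  # PyMuPDF

    doc = fitz.open("scanned.pdf")   # placeholder input file
    page = doc[0]

    # One abridged WORD block, roughly as Textract returns it
    block = {"Text": "owners'",
             "Geometry": {"BoundingBox": {"Left": 0.7539, "Top": 0.7296,
                                          "Width": 0.0871, "Height": 0.0141}}}

    bb = block["Geometry"]["BoundingBox"]
    # Scale the normalized (0..1) bounding box to page coordinates
    rect = fitz.Rect(bb["Left"] * page.rect.width,
                     bb["Top"] * page.rect.height,
                     (bb["Left"] + bb["Width"]) * page.rect.width,
                     (bb["Top"] + bb["Height"]) * page.rect.height)

    # insert_textbox() returns a negative value if the text does not fit,
    # which is exactly the font size / alignment problem described below.
    leftover = page.insert_textbox(rect, block["Text"], fontname="helv", fontsize=8)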

But then there's the problem of font size/face and making sure it all aligns, which means identifying what font was detected and its size.

We also have the problem that our PDFs come from a variety of uncontrolled sources, and can variously contain 100% searchable, 100% image-only, or a mix of page types. And they can be produced by a whole range of applications, so there's no single option that will likely cover everything.

REJ

1 Answer


I have done that many times using PyMuPDF. There are a few things to watch out for:

  1. Textract does not recognize fonts, so you have to decide which one to use for your insertions.
  2. Textract delivers bboxes for lines and words, but no font size. You have to compute one that makes the text fit the (recomputed) bbox on output.
  3. Textract coordinates are all normalized to values between 0 and 1. You need the original page dimensions to transform Textract coordinates into output coordinates.

Once you have solutions for the above (PyMuPDF makes this fairly simple), insert the text into your output page using page.insert_text() with render mode 3: this makes the text invisible.
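As a minimal, hedged sketch of that call (the file name and insertion point are placeholders):

    import fitz  # PyMuPDF

    doc = fitz.open("scanned.pdf")          # placeholder input file
    page = doc[0]

    # render_mode=3 writes the glyphs with neither fill nor stroke, so the
    # text is searchable and selectable but does not show on the page.
    page.insert_text(fitz.Point(72, 72), "invisible but searchable",
                     fontname="helv", fontsize=11, render_mode=3)

    doc.save("searchable.pdf")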

For point 3 above, use a PyMuPDF rectangle method: matrix = fitz.Rect(0, 0, 1, 1).torect(page.rect). If you then take a Textract bounding box and make a PyMuPDF-compatible rectangle of it, with top-left coordinates (x0, y0) and bottom-right coordinates (x1, y1): textract_rect = fitz.Rect(x0, y0, x1, y1), then the following gives you the corresponding bbox on your output page: bbox = textract_rect * matrix.
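A sketch of that transformation, assuming a page is already open and using made-up BoundingBox values:

    import fitz  # PyMuPDF

    doc = fitz.open("scanned.pdf")          # placeholder input file
    page = doc[0]

    # Matrix that maps the unit square (Textract's coordinate space) to the page
    matrix = fitz.Rect(0, 0, 1, 1).torect(page.rect)

    # Build a PyMuPDF rectangle from a Textract BoundingBox (still 0..1 values)
    bb = {"Left": 0.7539, "Top": 0.7296, "Width": 0.0871, "Height": 0.0141}
    textract_rect = fitz.Rect(bb["Left"], bb["Top"],
                              bb["Left"] + bb["Width"], bb["Top"] + bb["Height"])

    # The same rectangle expressed in page coordinates
    bbox = textract_rect * matrix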

I suggest using the font Helvetica for output: font = fitz.Font("helv").

If you have your text and its output bbox, compute the font size like this: textlen = font.text_length(text, fontsize=1) gives the output length if the fontsize were 1. Then bbox.width / textlen should give you a good value for the fontsize.
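For example (the word text and the bbox are placeholder values carried over from the previous step):

    import fitz  # PyMuPDF

    font = fitz.Font("helv")                       # Helvetica, as suggested above
    text = "owners'"                               # placeholder word text
    bbox = fitz.Rect(387.0, 577.8, 431.7, 589.0)   # placeholder bbox in page coordinates

    textlen = font.text_length(text, fontsize=1)   # text width at fontsize 1
    fontsize = bbox.width / textlen                # size that spans the bbox width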

The next problem is the insertion point (needed for page.insert_text()).

bbox.bl (the bottom-left point) is a good start, but if your text contains characters that descend below the baseline (e.g. g, y, etc.), you need to adjust the insertion point upwards a little. Use font.descender and the computed fontsize for this.
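Putting the pieces together, a hedged end-to-end sketch for one word (file name, word text and bbox values are placeholders):

    import fitz  # PyMuPDF

    doc = fitz.open("scanned.pdf")                   # placeholder input file
    page = doc[0]
    font = fitz.Font("helv")

    text = "owners'"                                 # placeholder word text
    bbox = fitz.Rect(387.0, 577.8, 431.7, 589.0)     # placeholder bbox in page coordinates
    fontsize = bbox.width / font.text_length(text, fontsize=1)

    # Start at the bottom-left corner of the bbox and lift the baseline by the
    # descender (a negative fraction of the fontsize) so that g, y, etc. fit.
    point = bbox.bl + (0, font.descender * fontsize)

    page.insert_text(point, text, fontname="helv",
                     fontsize=fontsize, render_mode=3)
    doc.save("searchable.pdf")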

Jorj McKie
  • Thanks Jorj... I'd gotten most of this sorted out, but was also working through how to map the font, since image-only pages won't have any font data; using Helv as the default should work fine. I'm actually using a similar strategy already for redacting signature blocks identified by Textract, so this should be pretty trivial to implement. – REJ Jun 08 '23 at 18:20
  • Follow-on -- does anyone know of a way to check whether a PDF has been previously OCR'd and/or "natively" contains text? We have a lot of docs containing a mix of text-native (i.e. exported from Word etc., and thus with non-OCR text), OCR'd, and image-only pages, and I want to OCR only the pages in the latter two categories if possible. Checking the font catalog and the length of text on a given page are two indicators, but previously OCR'd docs will pass both of those tests. I know ocrmypdf can sometimes identify prior OCR, but it doesn't seem consistent. – REJ Jun 08 '23 at 20:25
  • Using PyMuPDF, you can at least check for the presence of Tesseract's `GlyphlessFont`. In addition, you can build a list of all bounding boxes on a page: `bboxes = page.get_bboxlog()`. Every item of the returned list is `(btype, bbox)`, where the btypes are low-level terms like "fill-text", "stroke-path", etc. If you find `btype == "ignore-text"` in that list, then OCR text is present (a sufficient condition, not a necessary one). If there is "fill-text" or "stroke-text", then normal text is present. – Jorj McKie Jun 09 '23 at 19:06
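A minimal sketch of the checks described in the comment above (the file name is a placeholder, and the glyphless-font test is matched case-insensitively as a precaution):

    import fitz  # PyMuPDF

    doc = fitz.open("mixed.pdf")             # placeholder input file
    for page in doc:
        # Font names referenced by the page; Tesseract's OCR layer uses a glyphless font
        fontnames = [f[3] for f in page.get_fonts()]
        has_tesseract_font = any("glyphless" in name.lower() for name in fontnames)

        # Classify the drawing commands on the page
        btypes = {btype for btype, _ in page.get_bboxlog()}
        has_ocr_text = "ignore-text" in btypes                       # invisible (OCR) text
        has_normal_text = bool(btypes & {"fill-text", "stroke-text"})

        if not (has_normal_text or has_ocr_text or has_tesseract_font):
            print(f"page {page.number}: image-only, candidate for OCR")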