I'm exploring options for semi-automated redaction of PDFs using various NLP techniques, and have been using PyMuPDF with Tesseract via ocrmypdf for OCR. This works pretty well overall, but management want to try Textract as an alternative. It's easy enough to call Textract against a single page of a PDF and read the resulting dictionary, but there's no simple way (that I've found yet) of mapping that back into the PDF as invisible text to create a searchable version of the page (all of which ocrmypdf does automatically).
For reference, here's an example of the dict that Textract produces. A given block can be either a WORD or a LINE.
'Id': 'be018daa-02c9-47d2-903a-73b69bdaa181',
'Text': "owners'",
'TextType': 'PRINTED'},
{'BlockType': 'WORD',
'Confidence': 95.73345947265625,
'Geometry': {'BoundingBox': {'Height': 0.014128071255981922,
'Left': 0.7538964748382568,
'Top': 0.7295616269111633,
'Width': 0.08705723285675049},
'Polygon': [{'X': 0.7539187669754028,
'Y': 0.7295616269111633},
{'X': 0.8409537076950073,
'Y': 0.7295762896537781},
{'X': 0.8409309983253479,
'Y': 0.7436897158622742},
{'X': 0.7538964748382568,
'Y': 0.7436745166778564}]},
Has anyone done this in Python, or have suggestions?
I'm working through various options. One mechanism I was thinking of was using the polygon coordinates provided for each LINE or WORD to create a new PyMuPDF Rect, then calling insertTextbox() against that rectangle.
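To make that mechanism concrete, here's a minimal sketch of the idea. The function names (`bbox_to_rect`, `overlay_words`) are my own, and it assumes the Textract coordinates are normalized 0-1 against the same page the PDF renders (i.e. Textract was run on an image of that exact page). It uses the simpler `BoundingBox` rather than the full `Polygon`, and `render_mode=3` in PyMuPDF's `insert_textbox()` to draw the glyphs invisibly, which is the usual trick for an OCR text layer:

```python
def bbox_to_rect(bbox, page_width, page_height):
    """Convert a Textract BoundingBox (normalized 0-1, top-left origin)
    to absolute PDF points. PyMuPDF also uses a top-left origin, so no
    y-axis flip is needed."""
    x0 = bbox["Left"] * page_width
    y0 = bbox["Top"] * page_height
    x1 = x0 + bbox["Width"] * page_width
    y1 = y0 + bbox["Height"] * page_height
    return (x0, y0, x1, y1)

def overlay_words(page, blocks):
    """Write each Textract WORD block onto a PyMuPDF page as invisible text."""
    import fitz  # PyMuPDF
    for block in blocks:
        if block.get("BlockType") != "WORD":
            continue
        bb = block["Geometry"]["BoundingBox"]
        rect = fitz.Rect(*bbox_to_rect(bb, page.rect.width, page.rect.height))
        # render_mode=3 makes the text invisible but still selectable/searchable.
        # The fontsize here is a rough guess from the box height; see below.
        leftover = page.insert_textbox(rect, block["Text"],
                                       fontsize=rect.height * 0.8,
                                       render_mode=3)
        # insert_textbox returns a negative value if the text did not fit
        # in the rectangle, so you may need to shrink the fontsize and retry.
```

One design note: overlaying individual WORD blocks rather than LINE blocks keeps each word's position tight to the image, which matters if you later want redaction rectangles to line up with the underlying pixels.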
But then there's the problem of font size and face, and of making the overlaid text align with the image — and Textract doesn't report what font it saw, so the size has to be estimated from the geometry.
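Since glyph width scales linearly with fontsize, the bounding box alone constrains the size fairly well even without knowing the face. A hedged sketch (the helper name `fit_fontsize` is my own): measure the word's width at fontsize 1 — e.g. with PyMuPDF's `fitz.Font("helv").text_length(text, fontsize=1)` — then take the largest size that satisfies both the width and height of the box:

```python
def fit_fontsize(text, box_width, box_height, width_at_size_1):
    """Largest fontsize that keeps `text` inside a box.

    `width_at_size_1` is the rendered width of `text` at fontsize 1
    for whatever substitute font you draw with. Width scales linearly
    with fontsize, so the width constraint is box_width / width_at_size_1;
    the height constraint caps it at roughly the box height.
    """
    if width_at_size_1 <= 0:
        return box_height
    return min(box_height, box_width / width_at_size_1)
```

Exact alignment with the original glyphs is unlikely with a substitute font, but for an invisible OCR layer it only has to be close enough that selection and search highlight the right region.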
We also have the problem that our PDFs come from a variety of uncontrolled sources, and can variously contain 100% searchable pages, 100% image-only pages, or a mix of page types. And they can be produced by a whole range of applications, so no single option is likely to cover everything.
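For the mixed-source problem, one approach (and roughly what ocrmypdf's `--skip-text` option does) is to triage each page before deciding whether to OCR it. In PyMuPDF, `page.get_text()` and `page.get_images(full=True)` give enough signal; a sketch, with `classify_page` being a hypothetical helper of mine:

```python
def classify_page(text, image_count):
    """Classify a page for an OCR pipeline.

    `text` would come from page.get_text() and `image_count` from
    len(page.get_images(full=True)) in PyMuPDF.
    """
    has_text = bool(text.strip())
    if has_text and image_count:
        return "mixed"       # keep existing text, consider OCR for the images
    if has_text:
        return "searchable"  # already has a text layer; skip OCR
    if image_count:
        return "image-only"  # full-page OCR needed
    return "empty"
```

Only the "image-only" (and possibly "mixed") pages then need to go through Textract, which also keeps the per-page API cost down.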