Categorize the pages of a PDF file

Question

I have thousands of scanned personnel files in searchable PDF format. I want to check each page of every PDF and extract it into one of 50 categories. For example, all pages related to health insurance should be exported into a PDF file named Health Insurance, all pages related to pensions into a corresponding file, and so on.

I am still thinking about how best to automate this. My approach so far is a Python script in which I specify certain keywords for each category; if they are found on a page, that page is exported accordingly. But I wonder whether this is the best way, or whether solutions already exist, before I build everything myself.

I'm afraid I would have to create an enormous number of rules to recognize all the form data for thousands of employees. Also, keywords sometimes appear not only on the pages I am looking for but also on completely unrelated pages.

Should the problem be approached differently? Are there any ready-made solutions? The goal is to automate as much as possible. I would be grateful for any tips.
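To make the keyword approach concrete, here is a minimal sketch of what I mean. It is not a finished solution. The rule table, the weights, the `min_score` threshold and the "Unsorted" bucket are all assumptions chosen for illustration, and `split_pdf` assumes the third-party pypdf package is installed. Each page is scored per category and assigned only to the best-scoring category, and only if the score is high enough, so a single stray keyword does not pull a page into the wrong file.

```python
import re
from collections import defaultdict

# Hypothetical rules: category -> {phrase: weight}. Real rules would need tuning.
RULES = {
    "Health Insurance": {"health insurance": 3, "insurer": 1, "premium": 1},
    "Pension": {"pension": 3, "retirement": 2},
}


def score_page(text, rules):
    """Return {category: score}; each whole-word phrase hit adds its weight."""
    lowered = text.lower()
    scores = {}
    for category, phrases in rules.items():
        total = 0
        for phrase, weight in phrases.items():
            pattern = r"\b" + re.escape(phrase) + r"\b"
            total += weight * len(re.findall(pattern, lowered))
        scores[category] = total
    return scores


def categorize_pages(page_texts, rules, min_score=3):
    """Map category -> page indices; pages below min_score go to 'Unsorted'.

    Assumes rules is non-empty.
    """
    result = defaultdict(list)
    for index, text in enumerate(page_texts):
        scores = score_page(text, rules)
        best = max(scores, key=scores.get)
        if scores[best] >= min_score:
            result[best].append(index)
        else:
            result["Unsorted"].append(index)
    return dict(result)


def split_pdf(path, rules, out_dir="."):
    """Write one PDF per category (requires the third-party pypdf package)."""
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(path)
    # extract_text() can yield nothing for pages without a text layer.
    texts = [page.extract_text() or "" for page in reader.pages]
    for category, indices in categorize_pages(texts, rules).items():
        writer = PdfWriter()
        for i in indices:
            writer.add_page(reader.pages[i])
        writer.write(f"{out_dir}/{category}.pdf")
```

Calling `split_pdf("employee_0001.pdf", RULES)` (the file name is made up) would write one PDF per category found, plus an `Unsorted.pdf` for pages that need manual review. The threshold helps with keywords that show up in unrelated places, but it does not remove the need to maintain rules for 50 categories, which is the part I would like to avoid.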

A word-based or phrase-based search for categories in Python would probably work in principle. I have already tried that, but many problems show up in the details.

I'm looking for ready-made solutions or, if necessary, other approaches to solving the problem.

The LEADTOOLS [Document Analyzer](https://www.leadtools.com/help/sdk/tutorials/dotnet-console-parse-data-with-the-document-analyzer.html) is able to extract the text from the PDFs either through direct parsing or through OCR recognition. It then can extract specific words through pattern recognition into a collection of [ElementResult](https://www.leadtools.com/help/sdk/dh/doxa/elementresult.html) objects for each recognized result. You can also check the *ListOfBounds* property to see if the found word is in the expected location. (Disclaimer: I am an employee of the vendor) — Hussam Barouqa, Aug 17 '23 at 15:19
