I have thousands of scanned personnel files in searchable PDF format. The goal is to check each page of every PDF and extract it into one of 50 categories. For example, all pages that have something to do with health insurance should be exported into a PDF file named Health Insurance, all pages related to pensions into a corresponding file, and so on.

I am still thinking about the smartest way to automate this. My current approach is a Python script in which I specify certain keywords for each category; if they are found on a page, that page is exported accordingly. But I wonder whether this is the best way to do the task, or whether solutions already exist, before I reinvent the wheel. I'm afraid I would have to create an enormous number of rules to recognize all the form data for thousands of employees. In addition, keywords sometimes appear not only on the pages I am looking for but also in completely unrelated contexts.

Should the problem perhaps be approached differently? Are there any ready-made solutions? The goal is to automate as much as possible. I would be grateful for any tips.
A word-based or phrase-based search for categories in Python would probably work in principle. I have already tried that, but it runs into many problems in the details.
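To make the keyword idea concrete, here is a minimal sketch of one way it could be set up so that a single stray keyword does not decide the category: each phrase carries a weight, and a page is only assigned when the best category scores high enough and clearly beats the runner-up. The rule table, the thresholds, the "Unsorted" fallback and all function names are hypothetical and only for illustration. The `split_pdf` helper assumes the third-party `pypdf` package is installed; the scoring itself uses only the standard library.

```python
import re
from collections import defaultdict

# Hypothetical rules: category -> {phrase: weight}. Negative weights
# penalize phrases that hint at a different kind of page.
RULES = {
    "Health Insurance": {"health insurance": 3, "insurance number": 2, "pension": -2},
    "Pension": {"pension": 3, "retirement": 2, "health insurance": -2},
}


def score_page(text, rules=RULES):
    """Return {category: score} for the extracted text of one page."""
    lowered = text.lower()
    scores = {}
    for category, phrases in rules.items():
        total = 0
        for phrase, weight in phrases.items():
            # Whole-word match, so "pension" does not hit "suspension".
            pattern = r"\b" + re.escape(phrase) + r"\b"
            total += weight * len(re.findall(pattern, lowered))
        scores[category] = total
    return scores


def classify_page(text, rules=RULES, min_score=3, min_margin=2):
    """Pick the best category, or "Unsorted" if the evidence is weak or ambiguous."""
    ranked = sorted(score_page(text, rules).items(), key=lambda kv: kv[1], reverse=True)
    best, best_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0
    if best_score < min_score or best_score - runner_up < min_margin:
        return "Unsorted"
    return best


def split_pdf(src_path, out_dir, rules=RULES):
    """Write one PDF per category found in src_path (assumes pypdf is installed)."""
    from pathlib import Path
    from pypdf import PdfReader, PdfWriter

    writers = defaultdict(PdfWriter)
    for page in PdfReader(src_path).pages:
        category = classify_page(page.extract_text() or "", rules)
        writers[category].add_page(page)
    for category, writer in writers.items():
        writer.write(str(Path(out_dir) / f"{category}.pdf"))
```

Pages that land in "Unsorted" can be reviewed by hand and used to refine the rules. The sketch handles one source file at a time; for thousands of files the writers would have to be kept open across files, or the per-file outputs merged afterwards.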
I'm looking for ready-made solutions or, if necessary, other approaches to solving the problem.
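One possible "other approach" is to stop writing rules by hand and instead learn the categories from a few pages that have already been sorted manually. The following is only a sketch of that idea, using a small naive Bayes text classifier written with the standard library. The class name, the tokenizer and the training sentences are all hypothetical, and whether this works well on real scanned forms would have to be tested on actual data.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesPageClassifier:
    """Multinomial naive Bayes over page words, with Laplace smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word frequencies
        self.page_counts = Counter()             # category -> training pages seen
        self.vocabulary = set()

    def train(self, text, category):
        """Add one manually labeled page to the model."""
        words = tokenize(text)
        self.word_counts[category].update(words)
        self.page_counts[category] += 1
        self.vocabulary.update(words)

    def predict(self, text):
        """Return the most probable category, or None if nothing was trained."""
        words = tokenize(text)
        total_pages = sum(self.page_counts.values())
        vocab_size = len(self.vocabulary)
        best_category, best_log_prob = None, -math.inf
        for category, counts in self.word_counts.items():
            # Prior: share of training pages that belong to this category.
            log_prob = math.log(self.page_counts[category] / total_pages)
            total_words = sum(counts.values())
            for word in words:
                # Add-one smoothing keeps unseen words from zeroing the score.
                log_prob += math.log((counts[word] + 1) / (total_words + vocab_size))
            if log_prob > best_log_prob:
                best_category, best_log_prob = category, log_prob
        return best_category


# Hypothetical training pages; real ones would come from hand-sorted scans.
model = NaiveBayesPageClassifier()
model.train("Your health insurance contribution and insurance number", "Health Insurance")
model.train("Statement of pension entitlement and retirement age", "Pension")
print(model.predict("change of insurance number"))
```

The trade-off compared with keyword rules is that the effort moves from writing rules to labeling example pages for each of the 50 categories, but the model weighs all words on a page together instead of reacting to a single keyword.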