I've been interested in parsing PDFs for some time now, with varying degrees of success. Often, however, the useful data in a PDF is contained in the running text, i.e. outside tables etc. If you are to get data out of the sentences, it is vital that the sentences are not broken. The best way (only way, to be honest) I have found is opening the PDF in Word, but this is a seemingly sloppy solution which doesn't always recognise the PDF correctly.
I appreciate that parsing PDFs is not a trivial thing to do, but I'm surprised that there doesn't appear to be some library/tool that, like Word, can detect whole sentences and formatting, i.e. whether the text is bold and what font size it is.
Other command-line tools, like the Xpdf reader, are great at instantly creating text files from PDFs and can even maintain the layout, but again they are unable to detect whether a sentence has been broken. I understand that there is nothing to actually detect, as a PDF is just words positioned on a page with no relationship between them.
Clearly this must be somewhat difficult, but that raises the question: how does Word do it so well? (Not 100%, but it's the best I have come across.)
And if Word can do this, surely the same functionality has been implemented in a Python library or similar? Perhaps I am being naive, since increasingly there are various AI APIs that try to achieve this. Still, rather than having an AI try to parse an entire document, I would like to:
- Take any PDF and detect whole sentences (in a similar way to opening a PDF file in Word)
- Detect formatting features such as bold and font size (even if just as a tag next to the sentence)
- Export the result as a text file
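To make what I mean concrete, here is a rough sketch of the kind of heuristic I have in mind. It is my own assumption, not an established solution: lines are merged into sentences based on end-of-line punctuation and hyphenation, and the formatting part leans on PyMuPDF (`fitz`), whose `get_text("dict")` output exposes per-span font size and a bold flag (bit 16):

```python
import re


def merge_broken_lines(lines):
    """Heuristically rejoin extracted text lines into whole sentences.

    Assumption: a line continues onto the next unless it ends in
    sentence-final punctuation; a trailing hyphen marks a split word.
    """
    sentences = []
    buffer = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if buffer.endswith("-"):
            buffer = buffer[:-1] + line  # re-join a hyphenated word
        else:
            buffer = (buffer + " " + line).strip()
        if re.search(r'[.!?]["\')\]]?$', buffer):
            sentences.append(buffer)
            buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences


def pdf_lines_with_format(path):
    """Yield (text, font_size, is_bold) per line via PyMuPDF.

    Requires `pip install pymupdf`; flag bit 16 marks bold spans.
    """
    import fitz  # PyMuPDF

    results = []
    for page in fitz.open(path):
        for block in page.get_text("dict")["blocks"]:
            for line in block.get("lines", []):  # image blocks have no lines
                text = "".join(span["text"] for span in line["spans"])
                size = max(span["size"] for span in line["spans"])
                bold = any(span["flags"] & 16 for span in line["spans"])
                results.append((text, size, bold))
    return results
```

The idea would be to feed the line texts from `pdf_lines_with_format` into `merge_broken_lines`, writing something like `[BOLD] Heading` to the text file whenever the bold flag is set.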
From this it seems to me that you could then use the text file to search for a heading (with a bold tag), capture the text between two headings, and then, within that section, search for keywords and use regex to extract the associated content.
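The "capture text between two headings" step could be sketched like this; the `[BOLD]` tag format and the function name are hypothetical, assuming the export writes one tagged heading per line:

```python
import re


def section_between(tagged_text, start_heading, end_heading):
    """Return the text between two [BOLD]-tagged headings, or None.

    Assumes an export format where headings appear one per line
    with a leading tag, e.g. "[BOLD] Results".
    """
    pattern = re.compile(
        r"^\[BOLD\]\s*" + re.escape(start_heading) + r"\s*$"
        r"(.*?)"
        r"^\[BOLD\]\s*" + re.escape(end_heading) + r"\s*$",
        re.DOTALL | re.MULTILINE,
    )
    match = pattern.search(tagged_text)
    return match.group(1).strip() if match else None
```

Once a section is isolated, further keyword/regex extraction can run on just that slice rather than the whole document.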