How is Word able to detect PDF structure so well where others fail? Is there a library that can achieve this?

Question

I've been interested in parsing PDFs for some time now, with varying degrees of success. Often, however, the useful data in a PDF is contained in the running text, i.e. outside tables etc. If you are to get data out of the sentences, it is vital that the sentences are not broken. The best way (the only way, to be honest) I have found is to use Word, but this is a seemingly sloppy solution which doesn't always recognise the PDF correctly.

I appreciate that parsing PDFs is not a trivial thing to do; however, I'm surprised that there doesn't appear to be some library/tool that, like Word, can detect whole sentences and formatting, i.e. whether the text is bold, or its font size.

Other command-line tools, like the Xpdf reader, are great at instantly creating text files from PDFs and can even maintain the layout, but again they are unable to detect whether a sentence has been broken. I understand that there is nothing to actually detect, as the PDF is just words on a page with no relationship between them.
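To illustrate the kind of post-processing a plain text extractor leaves you to do, here is a minimal sketch of a line-rejoining heuristic. The function names `unwrap_paragraphs` and `split_sentences` are hypothetical, and the rules (a blank line ends a paragraph, a trailing hyphen followed by a lowercase letter is a split word, a sentence ends at `.`, `!` or `?` followed by a capital) are assumptions that will misfire on abbreviations, genuinely hyphenated words and multi-column layouts.

```python
import re


def unwrap_paragraphs(text: str) -> list[str]:
    """Rejoin lines split by the page layout into one string per paragraph.

    Assumption: a blank line marks a paragraph break; every other line
    break is just layout and can be replaced by a space.
    """
    paragraphs = []
    current = ""
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # Blank line: close the paragraph collected so far.
            if current:
                paragraphs.append(current)
                current = ""
            continue
        if current.endswith("-") and line[0].islower():
            # Word hyphenated across the line break: glue the halves.
            current = current[:-1] + line
        elif current:
            current += " " + line
        else:
            current = line
    if current:
        paragraphs.append(current)
    return paragraphs


def split_sentences(paragraph: str) -> list[str]:
    """Naive split after ., ! or ? when followed by a capital letter."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", paragraph)


sample = (
    "The level of Lead\n in the water was\n 100 ppm. It was re-\n"
    "measured later.\n\nNext paragraph."
)
paragraphs = unwrap_paragraphs(sample)
sentences = split_sentences(paragraphs[0])
```

This only works on text where blank lines survive extraction; it has no access to fonts or positions, which is presumably part of what Word uses.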

Clearly this must be a somewhat difficult thing, but this then raises the question: how does Word do it so well? (Not 100%, but the best I have come across.)

And if Word can do this, surely the same functionality has been implemented in a Python library or similar? Perhaps I am being naive, since increasingly there are various AI APIs that try to achieve this. Still, rather than have an AI try to parse an entire document, I would like to:

Take any PDF and detect whole sentences (in a similar way to when opening a PDF file in Word)
Detect formatting features such as bold, font size etc. (even if just as a tag next to the sentence)
Export as a text file
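The three steps above could be sketched roughly as follows, assuming PyMuPDF (imported as `fitz`) is available. The sketch relies on the dictionary returned by `page.get_text("dict")`, where each text block has `lines`, each line has `spans`, and each span carries `text`, `size` and `flags` (bit 4, value 16, marks bold). Treating one block as one paragraph, the `[bold size=...]` tag format, and the names `tag_page` and `pdf_to_tagged_text` are my own assumptions, not anything Word or PyMuPDF prescribes.

```python
BOLD_FLAG = 16  # bit 4 of a span's "flags" marks bold text in PyMuPDF


def tag_page(page_dict: dict) -> list[str]:
    """Turn one page dict into lines such as '[bold size=14.0] Heading'.

    Assumption: each text block is one paragraph, so all of its lines are
    joined with spaces. Hyphenated line ends are not repaired here.
    """
    out = []
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # 0 = text block; skip image blocks
            continue
        spans = [
            span
            for line in block["lines"]
            for span in line["spans"]
            if span["text"].strip()
        ]
        if not spans:
            continue
        text = " ".join(span["text"].strip() for span in spans)
        # Report the largest font size; call the block bold only if
        # every span in it is bold.
        size = max(span["size"] for span in spans)
        bold = all(span["flags"] & BOLD_FLAG for span in spans)
        tag = f"[{'bold ' if bold else ''}size={size:.1f}]"
        out.append(f"{tag} {text}")
    return out


def pdf_to_tagged_text(pdf_path: str) -> str:
    """Read a PDF with PyMuPDF and return the tagged text of all pages."""
    import fitz  # PyMuPDF; imported lazily so tag_page works without it

    doc = fitz.open(pdf_path)
    try:
        lines = []
        for page in doc:
            lines.extend(tag_page(page.get_text("dict")))
        return "\n".join(lines)
    finally:
        doc.close()


# Hand-made page dict in the same shape, so the tagging can be tried
# without a PDF file.
example_page = {
    "blocks": [
        {"type": 0, "lines": [
            {"spans": [{"text": "Section 1", "size": 14.0, "flags": 16}]},
        ]},
        {"type": 0, "lines": [
            {"spans": [{"text": "The level of Lead", "size": 10.0, "flags": 0}]},
            {"spans": [{"text": "in the water was 100 ppm.", "size": 10.0, "flags": 0}]},
        ]},
        {"type": 1},  # image block, ignored
    ]
}
tagged = tag_page(example_page)
```

Writing the result of `pdf_to_tagged_text("some.pdf")` to a file would cover the export step. How well blocks line up with real paragraphs depends on the PDF, so this is a starting point rather than a match for what Word does.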

From this, it seems to me that you can then use the text file to search for headings (with a bold tag), capture the text between two headings, and then within that text search for keywords and use regex to extract the associated content.
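A minimal sketch of that idea, assuming each line of the exported text file starts with a hypothetical tag such as `[bold size=12.0]` or `[size=10.0]`, and that any bold line is a heading. The names `split_sections` and `find_ppm`, the tag format, and the choice of "ppm" as the unit to look for are illustrative assumptions only.

```python
import re

# Hypothetical tag format: "[bold size=12.0] text" or "[size=10.0] text".
TAG = re.compile(r"^\[(bold )?size=([\d.]+)\] (.*)$")


def split_sections(tagged_text: str) -> dict[str, str]:
    """Map each bold heading to the plain text that follows it."""
    sections = {}
    heading = None
    for line in tagged_text.splitlines():
        match = TAG.match(line)
        if not match:
            continue  # ignore lines without a tag
        is_bold, _size, text = match.groups()
        if is_bold:
            # A bold line starts a new section.
            heading = text
            sections[heading] = ""
        elif heading is not None:
            sections[heading] = (sections[heading] + " " + text).strip()
    return sections


def find_ppm(section_text: str, keyword: str) -> list[float]:
    """Return ppm values that follow `keyword` within the same sentence."""
    pattern = re.compile(
        rf"\b{re.escape(keyword)}\b[^.]*?(\d+(?:\.\d+)?)\s*ppm",
        re.IGNORECASE,
    )
    return [float(value) for value in pattern.findall(section_text)]


tagged_text = (
    "[bold size=12.0] Composition\n"
    "[size=10.0] The level of Lead in the water was 100 ppm.\n"
    "[bold size=12.0] Handling\n"
    "[size=10.0] Store in a cool place."
)
sections = split_sections(tagged_text)
lead_values = find_ppm(sections["Composition"], "Lead")
```

The regex only finds the value when keyword and number sit in the same unbroken sentence, which is exactly why the sentences need to be kept whole first.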

I haven't tested the import of PDFs into Word yet. Out of interest, therefore: You say that it works quite well. Have you checked your test files for a bias? E.g. there might be a large percentage of your test files being also _exported from Word_ which might make the re-import fairly easy. — mkl, Mar 10 '23 at 15:18
Please see: https://stackoverflow.com/questions/73628493/extracting-whole-sentences-from-pdfs-as-best-as-possible-plain-text-from-pdf/73666895#73666895 — Nick, Mar 10 '23 at 18:14
Word is the only tool I have come across (limited, as I am typically using a work computer) that recognises sentences as a whole. Especially if saving from Acrobat as a word doc and opening this. That being said, I am getting more into Python recently on my personal PC and can see this becoming a major project of mine. — Nick, Mar 10 '23 at 18:17
Also see: https://stackoverflow.com/questions/73416591/power-querys-data-from-pdf-not-always-reliable-possible-to-iterate-over-url-li @mkl — Nick, Mar 10 '23 at 18:21
*"Especially if saving from Acrobat as a word doc and opening this."* - ah, but that sounds like the good recognition is not a property of word but instead of Acrobat. — mkl, Mar 11 '23 at 09:41
Although true, even if I do not use Acrobat, I would say Word works well 90% of the time without it. The issue is that if you copy and paste text from a PDF document, the sentences become broken at each line break. This is true if I use any tool for extracting text from PDFs. Word, however, seems to be smarter than this (yes, without using Acrobat beforehand) and keeps whole sentences together (viewable by turning on paragraph marks in a document). — Nick, Mar 11 '23 at 09:49
When I then export a text file from Word, whole sentences are kept together 99% of the time, which ensures that I get all of the necessary information when searching the text for keywords. E.g. searching for the keyword Lead gives: The level of Lead in the water was 100 ppm. And not: The level of Lead\n in the water was\n 100 ppm — Nick, Mar 11 '23 at 09:52
Have you checked the creator of the pdfs you observed that for? — mkl, Mar 12 '23 at 07:17
I have done it with various PDFs and doc types. Chemical SDS files from various suppliers can vary in structure; standard PDFs with paragraphs etc. are straightforward. Most PDFs I have come across Word handles pretty well. Occasionally it fails and recognises pages as images, but generally it's good, hence wondering how it does it. — Nick, Mar 12 '23 at 21:27

