I know there are thousands of ways to extract text from a .pdf file: there are online converters, libraries and packages, and it can be done in any programming language. For my thesis I am looking for a source that explains how it works. I found a presentation saying that text is basically anything between parentheses, but when I opened the .pdf file in a plain text editor I couldn't find it (actually there are no real words at all). Is there any work or article that describes how a .pdf file works? What language is used? What are its layers? Can we create a .pdf file from scratch in a text editor, just save it as .pdf and see it displayed properly? How do pdf_to_text tools (e.g. in R or even JavaScript) work on the inside? I will be very grateful for any answers, help, links or explanations!
-
Every library will be different. Some just give you the text in the order of the drawing instructions in the PDF, some will give you the text based on the structure tags in the document. Still others will attempt to infer the correct reading order based on the location of the words on the page. And yes, you can create a PDF file from scratch using Notepad (which I've done). The language is... here's a shocker... PDF. See this link: https://www.adobe.com/technology/pdfs/presentations/KingPDFTutorial.pdf – joelgeraci Oct 21 '19 at 21:45
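To make the "from scratch" point concrete, here is a minimal sketch in Python that writes the same kind of bytes one could type into Notepad. It is not taken from the linked tutorial; the function name `build_minimal_pdf`, the object layout (catalog, page tree, one page, one content stream, one Helvetica font) and the page size are illustrative assumptions. The text to show sits between parentheses in front of the `Tj` operator, and the cross-reference table at the end records the byte offset of each object.

```python
def build_minimal_pdf(text: str = "Hello, PDF!") -> bytes:
    """Assemble a one-page PDF by hand (hypothetical helper, illustrative only)."""
    # Backslash and parentheses are special inside a PDF literal string.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    # Drawing instructions: begin text, pick font F1 at 24 pt, move, show, end text.
    content = f"BT /F1 24 Tf 72 720 Td ({escaped}) Tj ET".encode("latin-1")

    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of this object, needed by xref
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 is always the free-list head
    for off in offsets:
        out += b"%010d 00000 n \n" % off  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)
```

Writing the result to disk, for example with `open("hello.pdf", "wb").write(build_minimal_pdf())`, should give a file that a PDF viewer can open and that is fully readable in a text editor, because this sketch leaves the content stream uncompressed.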
-
You can find starting points in [this answer](https://stackoverflow.com/a/55491402/1729265) and the answers referenced there. In essence, though: for a thesis on this topic you should really read the PDF specification, in particular the chapter on text. – mkl Oct 22 '19 at 04:59
-
@joelgeraci Thank you, I looked at everything. But basically, when saving to PDF, it is somehow compiled to PDF, right? – heisenberg7584 Nov 23 '19 at 14:13
-
"Compiled"... In the computing sense? No. In the generic sense of the word (assembling information collected from other sources), yes. The layout of the document is interpreted and then written as drawing instructions in the PDF language such that the PDF, when displayed, will look as if the document had been printed to a piece of electronic paper. Other information can be captured as well, like the document structure, navigational aids, etc. – joelgeraci Nov 25 '19 at 15:18
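As a rough illustration of what the simplest extraction tools do with those drawing instructions, and of why a text editor often shows no readable words, here is a minimal Python sketch. Content streams are commonly compressed with the Flate filter, so the parenthesised strings only become visible after inflating them. The function name `extract_text` and the regular expressions are illustrative assumptions, not how any particular library works: the sketch only handles the `Tj` and `TJ` operators, simple backslash escapes, and Latin-1 bytes, and it returns text in drawing order. Real PDFs frequently use font-specific encodings, which is a large part of why real tools are far more involved.

```python
import re
import zlib

# Everything between "stream" and "endstream" (a real parser would use /Length).
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\n?endstream", re.DOTALL)
# "(string) Tj" shows one string; "[(a) -20 (b)] TJ" shows several with kerning.
SHOW_RE = re.compile(rb"\((?:\\.|[^\\()])*\)\s*Tj|\[[^\]]*\]\s*TJ", re.DOTALL)
STRING_RE = re.compile(rb"\(((?:\\.|[^\\()])*)\)", re.DOTALL)
UNESCAPE_RE = re.compile(rb"\\([()\\])")


def extract_text(pdf_bytes: bytes) -> list:
    """Return the shown strings in drawing order (naive, illustrative only)."""
    found = []
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # Assume Flate (zlib) compression first ...
            data = zlib.decompressobj().decompress(raw)
        except zlib.error:
            # ... and fall back to treating the stream as uncompressed.
            data = raw
        for show in SHOW_RE.finditer(data):
            parts = STRING_RE.findall(show.group(0))
            text = b"".join(UNESCAPE_RE.sub(rb"\1", p) for p in parts)
            # Latin-1 is an assumption; real fonts may map bytes differently.
            found.append(text.decode("latin-1"))
    return found
```

Calling `extract_text(open("some.pdf", "rb").read())` on a simple file should list the strings passed to the text-showing operators. Inferring reading order from positions, or using structure tags, would be additional layers on top of this.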