Error in Text format while parsing PDF using Smalot PDF parser

Question

I'm trying to parse a PDF using Smalot PDF Parser, but the extracted text is not formatted well: it shows spaces between the letters of words.
For example, the word "Letter" comes out as "L e tt e r". How can I correct this?
Moreover, the documentation provided for Smalot PDF Parser is not enough. I need more detailed documentation on how to use the parser. If anybody has some, kindly share it. Thanks!

score 0 · Answer 1 · answered Sep 20 '17 at 08:23


Extracting text from a PDF is always hard. This is because PDF is not a WYSIWYG format; you should think of a PDF document more as a container of drawing instructions.

Extracting text means 'replaying' those instructions to find out what letters are being drawn at what positions, and then applying some heuristics to determine things like "these letters are close to each other, they should be concatenated".
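To make the idea concrete, here is a minimal, language-agnostic sketch of such a heuristic, written in Python only for brevity. It is not Smalot's API or its actual algorithm: the `Fragment` type, the `join_fragments` function, the `gap_ratio` threshold, and the sample coordinates are all hypothetical, and it assumes the fragments already sit on one line with known positions and widths.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """One run of text drawn by a single instruction (hypothetical model)."""
    text: str     # characters drawn
    x: float      # left edge on the page
    width: float  # horizontal space the run occupies


def join_fragments(fragments, gap_ratio=0.3):
    """Join fragments on one line, adding a space only where the gap is wide.

    gap_ratio is an assumed tuning knob: a gap larger than this fraction of
    the previous fragment's average character width counts as a word break.
    """
    out = []
    prev = None
    for frag in sorted(fragments, key=lambda f: f.x):
        if prev is not None:
            gap = frag.x - (prev.x + prev.width)
            avg_char = prev.width / max(len(prev.text), 1)
            if gap > gap_ratio * avg_char:
                out.append(" ")
        out.append(frag.text)
        prev = frag
    return "".join(out)


# Made-up coordinates: tiny gaps inside "Letter", a wide gap before "box".
fragments = [
    Fragment("L", 0.0, 6.0),
    Fragment("e", 6.2, 5.0),
    Fragment("tt", 11.3, 6.0),
    Fragment("e", 17.4, 5.0),
    Fragment("r", 22.5, 4.0),
    Fragment("box", 30.0, 15.0),
]
print(join_fragments(fragments))
```

With a threshold of zero, every tiny gap becomes a space, which is the kind of output described in the question. Choosing the threshold is the hard part, because it depends on the font and on how the PDF was produced.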

Does it have to be PHP?


Joris Schellekens


Yes sir. It should be in PHP. I don't know how to apply heuristics. Please send me code. – Saqib Javed Sep 28 '17 at 06:09
Stack Overflow is not an outsourcing company. You cannot just ask for code without showing us what you've done yourself. – Joris Schellekens Sep 28 '17 at 06:39
