
I'm trying to parse a PDF using Smalot PDF Parser, but the extracted text is not formatted well: it shows spaces between the letters of words.
For example, the word "Letter" comes out as "L e tt e r". How can I correct this?
Also, the documentation provided for Smalot PDF Parser is not enough. I need more detailed documentation on how to use the parser. If anybody has any, please share it. Thanks!

Ozair Kafray

1 Answer


Extracting text from a PDF is always hard. This is because PDF is not a WYSIWYG format; you should think of a PDF document more as a container of drawing instructions.

Extracting text means 'replaying' those instructions to find out what letters are being drawn at what positions, and then applying some heuristics to determine things like "these letters are close to each other, they should be concatenated".
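To make that heuristic concrete, here is a minimal sketch in Python. It is not how Smalot PDF Parser (or any particular library) works internally. It assumes you already have, for one line of text, each glyph's character, left x position and width; the `Glyph` tuple, the `join_glyphs` function and the `gap_ratio` threshold are all hypothetical names chosen for illustration.

```python
from typing import List, Tuple

# Hypothetical glyph record: (character, x position of left edge, width).
Glyph = Tuple[str, float, float]


def join_glyphs(glyphs: List[Glyph], gap_ratio: float = 0.3) -> str:
    """Concatenate the glyphs of one text line, inserting a space only
    where the horizontal gap between two neighbours is large compared
    to the average glyph width. gap_ratio is an assumed tuning knob."""
    if not glyphs:
        return ""
    # Drawing order is not guaranteed to be reading order, so sort by x.
    ordered = sorted(glyphs, key=lambda g: g[1])
    avg_width = sum(g[2] for g in ordered) / len(ordered)
    parts = [ordered[0][0]]
    for prev, cur in zip(ordered, ordered[1:]):
        # Distance from the right edge of prev to the left edge of cur.
        gap = cur[1] - (prev[1] + prev[2])
        if gap > gap_ratio * avg_width:
            parts.append(" ")
        parts.append(cur[0])
    return "".join(parts)


# Made-up positions: small kerning gaps inside words, one wide gap between them.
SAMPLE: List[Glyph] = [
    ("L", 0.0, 6.0), ("e", 6.5, 5.0), ("t", 12.0, 3.0), ("t", 15.2, 3.0),
    ("e", 18.5, 5.0), ("r", 24.0, 4.0),
    ("b", 32.0, 5.0), ("o", 37.2, 5.0), ("x", 42.5, 5.0),
]

print(join_glyphs(SAMPLE))  # prints "Letter box"
```

If the threshold is too low for the font in question, every small kerning gap is treated as a word break, which produces exactly the kind of "L e tt e r" output described in the question.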

Does it have to be PHP?

Joris Schellekens