
I spent some time trying to extract data using PyPDF2, but it turned out to be unreliable across PDFs, even when the PDFs looked (to the eye) like they had a similar structure and were probably computer generated.

What I liked about PyPDF2 is that its extractText function goes through the PDF file and pulls the text from the various objects, so (as far as I understand) you don't have to deal with the spacing between characters.

Camelot, on the other hand, uses pdfminer according to its docs. As far as I understand, pdfminer doesn't do the above; instead it groups characters into words and words into lines based on distance rules. The problem I experienced with Camelot is that you get results like "He l lo Wo rld".

Unfortunately, I can't share an example PDF online.

Let me know what other information would be helpful to share.

evan54

1 Answer


Not a perfect answer, but in case others end up here: one thing I found helpful when searching for text and matching it is removing all whitespace.

So if I'm looking for "Hello World" but get "He l lo Wo rld", the two are identical once the whitespace is removed.

This solved my problems.
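A minimal sketch of the whitespace-removal idea in plain Python, in case it helps. The helper names `strip_whitespace` and `contains_ignoring_whitespace` are made up for illustration and are not part of PyPDF2, Camelot, or pdfminer; the input is assumed to be whatever string the extraction library returned.

```python
import re


def strip_whitespace(text: str) -> str:
    """Remove every whitespace character (spaces, tabs, newlines)."""
    return re.sub(r"\s+", "", text)


def contains_ignoring_whitespace(haystack: str, needle: str) -> bool:
    """Check whether needle occurs in haystack once whitespace is ignored in both."""
    return strip_whitespace(needle) in strip_whitespace(haystack)


# Hypothetical extraction result with spurious spaces inside words
extracted = "Title: He l lo Wo rld\nPage 1"

print(strip_whitespace("He l lo Wo rld") == strip_whitespace("Hello World"))
print(contains_ignoring_whitespace(extracted, "Hello World"))
```

Note that this trades precision for robustness: because word boundaries are thrown away, a search for "notable" would also match "no table", so it works best for phrases that are long or distinctive enough.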

evan54