
I have two PDF documents with the same layout but different data. The problem is that I can read one perfectly, but the data extracted from the other is unrecognizable.

This is an example that I can read perfectly:

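import camelot

# Parse the PDF with the stream flavor and take the first detected table
from_pdf = camelot.read_pdf('2019_05_2.pdf', flavor='stream', strict=False)
df_pdf = from_pdf[0].df

# Plot the text found on the page and print the parsing metrics
camelot.plot(from_pdf[0], kind='text').show()
print(from_pdf[0].parsing_report)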


The resulting dataframe is as expected.

This is an example where the information is unrecognizable after reading:

# Same call as above, applied to the second file
from_pdf = camelot.read_pdf('2020_04_2.pdf', flavor='stream', strict=False)
df_pdf = from_pdf[0].df

camelot.plot(from_pdf[0], kind='text').show()
print(from_pdf[0].parsing_report)


The resulting dataframe contains unrecognizable information.

I don't understand what I have done wrong or why the same code doesn't work for both files. Any help would be appreciated, thanks.

  • cid is basically the character identity. Your PDF doesn't seem to have been constructed correctly, so the fonts it uses were not saved with it. Report it to the team that built it, or convert the PDF to an image and try OCR. – ExtractTable.com Sep 10 '21 at 12:12

1 Answer

The problem: malformed PDF


Simply put, your second PDF is malformed or corrupted. It doesn't contain correct font information, so the text cannot be extracted from the PDF as it stands. This is a known and difficult problem (see this question).
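One way to confirm this from the Camelot output itself is to look for glyph placeholders in the extracted cells. The sketch below assumes the unreadable cells contain pdfminer-style `(cid:N)` tokens, as the comment on the question suggests; `cid_ratio` and `table_cells` are hypothetical helper names, not part of Camelot.

```python
import re

# Placeholder emitted for a glyph that has no Unicode mapping, e.g. "(cid:42)".
# Assumption: the garbled cells in the second PDF look like this.
CID_TOKEN = re.compile(r"\(cid:\d+\)")


def cid_ratio(cells):
    """Return the fraction of non-empty cells that contain a (cid:N) token."""
    non_empty = [c for c in cells if c.strip()]
    if not non_empty:
        return 0.0
    flagged = sum(1 for c in non_empty if CID_TOKEN.search(c))
    return flagged / len(non_empty)


def table_cells(df):
    """Flatten a pandas DataFrame (such as from_pdf[0].df) into a list of strings."""
    return [str(value) for value in df.to_numpy().ravel()]
```

For example, `cid_ratio(table_cells(from_pdf[0].df))` close to 1.0 would indicate that almost every cell is made of unmapped glyphs, which points at the PDF's font data rather than at the Camelot call.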

You can check this by opening the PDF with Google Docs. Google Docs tries to extract the text, and the result is similarly unreadable.

Possible solutions


If you want to extract the text, you can print the document to an image-based PDF and run OCR text extraction on it. However, Camelot does not currently support image-based PDFs, so it cannot extract the table from such a file directly.

If you have no way to obtain a well-formed PDF, you could try this strategy:

  • print the PDF to an image-based PDF
  • add a good text layer to the image-based PDF (using OCRmyPDF)
  • try using Camelot to extract the tables
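The strategy above could be sketched roughly as follows. This is a minimal sketch, not a tested recipe: it assumes the `ocrmypdf` command-line tool is installed, and it uses OCRmyPDF's `--force-ocr` option (which rasterizes each page and replaces the existing text layer) in place of the manual "print to image" step. The function names and the output file name are hypothetical.

```python
import subprocess


def build_ocr_command(src, dst):
    """Build the OCRmyPDF command that writes an OCR'd copy of src to dst."""
    # --force-ocr rasterizes every page and discards the existing (broken) text layer
    return ["ocrmypdf", "--force-ocr", src, dst]


def ocr_then_extract(src, dst):
    """Run OCRmyPDF on src, then let Camelot parse the new text layer in dst."""
    # check=True raises CalledProcessError if OCRmyPDF exits with an error
    subprocess.run(build_ocr_command(src, dst), check=True)
    import camelot  # imported here so build_ocr_command works without Camelot
    return camelot.read_pdf(dst, flavor="stream")
```

Calling `ocr_then_extract('2020_04_2.pdf', '2020_04_2_ocr.pdf')` would return a table list to inspect as before, for example with `tables[0].parsing_report`. The quality of the result depends on how well the OCR recognizes the page, so the extracted table may still need manual checking.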