Camelot in python does not behave as expected

Question

I have two PDF documents with the same layout but different data. The problem is that I can read one perfectly, but the data extracted from the other is unrecognizable.

This is an example which I can read perfectly, download here:

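import camelot

# Parse the PDF with the stream flavor and take the first detected table
from_pdf = camelot.read_pdf('2019_05_2.pdf', flavor='stream', strict=False)
df_pdf = from_pdf[0].df

# Plot the text Camelot found on the page and print the parsing report
camelot.plot(from_pdf[0], kind='text').show()
print(from_pdf[0].parsing_report)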

This is the dataframe as expected:

This is an example where the extracted information is unrecognizable, download here:

# Same call as above, only the file name changes
from_pdf = camelot.read_pdf('2020_04_2.pdf', flavor='stream', strict=False)
df_pdf = from_pdf[0].df

camelot.plot(from_pdf[0], kind='text').show()
print(from_pdf[0].parsing_report)

This is the dataframe with unrecognizable information:

I don't understand what I have done wrong and why the same code doesn't work for both files. I need some help, thanks.

cid is basically the character ID. Your PDF doesn't seem to be constructed correctly: the fonts it uses were not saved in it. Report it to the team that built it, or convert the PDF to an image and try OCR-ing it. — ExtractTable.com, Sep 10 '21 at 12:12

score 1 · Accepted Answer · answered Sep 10 '21 at 08:44

The problem: malformed PDF

Put simply, your second PDF is malformed or corrupted. It doesn't contain correct font information, so text cannot be extracted from it as is. This is a known and difficult problem (see this question).

You can check this by trying to open the PDF with Google Docs.

Google Docs tries to extract the text, and this is the result:
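The same check can be done from Python on the table Camelot returned. This is only a minimal sketch: it assumes the unreadable cells contain placeholders such as (cid:72), which is how pdfminer (the parser Camelot relies on) writes a glyph that has no Unicode mapping. The helper name cid_ratio is made up for illustration and is not part of Camelot.

```python
import re

# pdfminer-style placeholder for a glyph without a Unicode mapping, e.g. "(cid:72)"
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def cid_ratio(cells):
    """Return the fraction of non-empty cells that contain a (cid:N) placeholder.

    `cells` is any iterable of strings, for example the flattened values
    of a Camelot table's DataFrame. (Hypothetical helper, not a Camelot API.)
    """
    non_empty = [c for c in cells if c and c.strip()]
    if not non_empty:
        return 0.0
    flagged = sum(1 for c in non_empty if CID_PATTERN.search(c))
    return flagged / len(non_empty)


# Possible usage with the question's variable:
#   ratio = cid_ratio(df_pdf.values.ravel().tolist())
# A ratio close to 1.0 suggests the font mapping is missing from the PDF.
```

If most cells are flagged, changing Camelot parameters is unlikely to help, because the problem is in the file and not in the table detection.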

Possible solutions

If you only want to extract the text, you can print the document to an image-based PDF and run OCR on it. However, Camelot does not currently support image-based PDFs, so the table cannot be extracted from that file directly.

If you have no way to recover a well-formed PDF, you could try this strategy:

1. print the PDF to an image-based PDF
2. add a good text layer to your image-based PDF (using OCRmyPDF)
3. try using Camelot to extract the tables
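The strategy above could be scripted roughly as follows. This is a sketch under assumptions, not a tested recipe for these files: it assumes the ocrmypdf command-line tool and Camelot are installed, and it uses OCRmyPDF's --force-ocr option, which rasterizes each page and then adds a fresh OCR text layer, in place of printing to an image-based PDF by hand. The function names and the output file name are made up for illustration.

```python
import subprocess


def build_ocr_command(src, dst):
    """Build the OCRmyPDF call that rasterizes `src` and writes an OCR'd `dst`."""
    # --force-ocr discards the existing (broken) text layer and re-OCRs every page
    return ["ocrmypdf", "--force-ocr", src, dst]


def extract_tables_via_ocr(src, dst):
    """Run OCR on `src`, then let Camelot read the new text layer in `dst`."""
    subprocess.run(build_ocr_command(src, dst), check=True)

    # Imported here so the command builder can be used without Camelot installed
    import camelot

    return camelot.read_pdf(dst, flavor="stream")


# Possible usage (output file name is an assumption):
#   tables = extract_tables_via_ocr("2020_04_2.pdf", "2020_04_2_ocr.pdf")
#   print(tables[0].parsing_report)
```

OCR output is rarely perfect, so the extracted numbers should be compared against the original document before being used.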

I'll try and tell you what worked for me. — Gizelly, Sep 10 '21 at 12:03
