I am using Camelot to parse a document. To keep it simple, I am now debugging with the most basic command:

import camelot

# file_path points to the PDF being parsed
all_pages = camelot.read_pdf(str(file_path))
for table_info in all_pages:
    df = table_info.df
    print(df)

I am applying this to two different PDFs, which look very much the same. Their metadata is identical:

  • Producer: Acrobat Distiller 17.0 (Windows)
  • Creator: PScript5.dll Version 5.2.2
  • Format: PDF-1.3
  • Size: A4, Portrait (210 × 297 mm)

Only the creation date and file size of the documents differ. Both contain a table with the same layout, which varies only slightly in size. Even the data within the cells is the same! (I can't attach a PDF, but here is a JPG version):

[Image: the table in question]

With the older PDF file things go well, and I get words, numbers, etc. But with the newer one I only get strings of undecoded character IDs like "(cid:12)(cid:13)(cid:14)".

I have looked through the documentation, but I can't find anything related to this problem or to encoding in general.

Pablo
  • CID stands for character ID. PDF creation preserves the font. Say I used the Lato font when creating the PDF and you opened it on your laptop, which does not have Lato installed; then it will throw this. The only way to extract the characters is through OCR. – ExtractTable.com Mar 04 '22 at 13:48
  • I may have to contact the people who created this document... thanks for the comment – Pablo Mar 04 '22 at 14:33
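
A minimal sketch of how the symptom discussed above could be detected automatically, so that affected files can be routed to OCR or sent back to their authors. The helper names `cid_fraction` and `needs_ocr` and the 0.5 threshold are hypothetical and not part of Camelot; the code only inspects strings that have already been extracted.

```python
import re

# Matches placeholders such as "(cid:12)", which show up when a glyph
# could not be mapped to a Unicode character during text extraction.
CID_TOKEN = re.compile(r"\(cid:\d+\)")


def cid_fraction(cells):
    """Return the fraction of non-empty cells containing a (cid:N) token."""
    non_empty = [str(c) for c in cells if str(c).strip()]
    if not non_empty:
        return 0.0
    hits = sum(1 for c in non_empty if CID_TOKEN.search(c))
    return hits / len(non_empty)


def needs_ocr(cells, threshold=0.5):
    """Hypothetical rule: flag a table when most of its cells are undecoded."""
    return cid_fraction(cells) >= threshold


# Example cell values: one readable table, one made of CID placeholders.
good = ["Name", "Total", "42"]
bad = ["(cid:12)(cid:13)(cid:14)", "(cid:3)", ""]

print(needs_ocr(good))
print(needs_ocr(bad))
```

With Camelot, the cell values could presumably be taken from each table's DataFrame, for example `table_info.df.to_numpy().ravel()`, and passed to `needs_ocr`.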
