I am using Camelot to parse a document. To keep it simple, I am now debugging with the most basic command:
    import camelot

    # file_path points at the PDF being parsed
    all_pages = camelot.read_pdf(str(file_path))
    for table_info in all_pages:
        df = table_info.df
        print(df)
I am applying this to two different PDFs that look almost identical. Their metadata is the same:
- Producer: Acrobat Distiller 17.0 (Windows)
- Creator: PScript5.dll Version 5.2.2
- Format: PDF-1.3
- Size: A4, Portrait (210 × 297 mm)
Only the date and size of the documents differ. Both contain a table with the same layout; only the table's size changes slightly. Even the data within the cells is the same! (I can't attach a PDF, but here is a jpg version):
With the older PDF, extraction works and I get the expected words, numbers, etc. With the newer one, I only get encoding placeholders like "(cid:12)(cid:13)(cid:14)".
I have looked through the documentation, but I can't find anything related to this problem or to encoding in general.
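
To tell programmatically which of the two files is affected, a minimal sketch like the one below could scan the extracted cell text for the "(cid:N)" placeholders described above. The helper names `count_cid_tokens` and `is_mostly_cid` and the 0.5 threshold are hypothetical, chosen only for illustration; they are not part of Camelot. My working assumption is that these placeholders appear when the text-extraction layer cannot map a glyph in an embedded font to a Unicode character, which would explain why two visually identical PDFs behave differently.

```python
import re

# Placeholder pattern seen in the newer PDF's output, e.g. "(cid:12)".
CID_TOKEN = re.compile(r"\(cid:\d+\)")


def count_cid_tokens(text: str) -> int:
    """Return how many "(cid:N)" placeholders occur in text."""
    return len(CID_TOKEN.findall(text))


def is_mostly_cid(cells, threshold=0.5):
    """Return True if at least `threshold` of the non-empty cells
    contain a "(cid:N)" placeholder. The threshold is an arbitrary
    illustrative value, not something Camelot defines."""
    non_empty = [c for c in cells if c.strip()]
    if not non_empty:
        return False
    flagged = sum(1 for c in non_empty if CID_TOKEN.search(c))
    return flagged / len(non_empty) >= threshold


# Possible usage with a table's DataFrame (assumes all cells are strings):
#     cells = table_info.df.to_numpy().ravel().tolist()
#     if is_mostly_cid(cells):
#         print("table text is not mapped to Unicode")
```

This only detects the symptom; it does not recover the missing text.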