How to parse this kind of PDF with python

Question

I am trying to parse the pdf found here: https://corporate.lowes.com/sites/lowes-corp/files/annual-report/lowes-2020ar.pdf with python. It seems to be text-based, according to the copy/paste test, and the first several pages parse just fine using, e.g. pymupdf.

However, after about page 12, there seems to be an internal change in the document encoding. For example, this section from page 18:

It looks like text, but when you copy and paste it, it becomes:

%A>&1;<81
FB9#4AH4EL

%BJ8XF8@C?BL874CCEBK<@4G8?L
9H??G<@84FFB6<4G8F4A7
C4EGG<@84FFB6<4G8F

CE<@4E<?L<AG;8.A<G87,G4G8F4A74A474"A9<F64?
J88KC4A787BHEJBE>9BE68
;<E<A:4FFB6<4G8F<AC4EGG<@8
F84FBA4?
4A79H??G<@8CBF<G<BAFGB9H?9<??G;8F84FBA4?78@4A7B9BHE,CE<A:F84FBA
<A6E84F8778@4A77HE<A:G;8(/"C4A78@<6
4F6HFGB@8EF9B6HF87BA;B@8<@CEBI8@8AGCEB=86GF
4A74A4G<BAJ<78899BEGGB@B7<9LBHEFGBE8?4LBHG

What is going on here? Will I need to use OCR to parse a file like this? Or is there some way of translating the stuff above back to text?

Ok - I'll just chalk it up to some peculiarity of the file then. I just wanted to make sure that this wasn't some kind of common pdf-encoding that I hadn't seen before. — Fortunato, Aug 20 '21 at 20:38

K J · Accepted Answer · 2021-08-20T22:28:23.460

Pages 13 to 100 have been imported, and there are other odd practices in the file. I suggest you will get 12 good pages, then need to OCR pages 13-100, and then probably get a good 3 pages from 101-104 again. See https://stackoverflow.com/a/68627207/10802527

The majority of pages 13-100 contain structured text that is described as Roman. Coincidentally, the Romans were fond of encoding messages by sliding the alphabet a few steps to the right or left, and that is exactly what is happening here. By character sliding we could extract much of the corrupted text using chars+n, so read

 A and replace with n
 B and replace with o
 C and replace with p

etc., but I will leave it there, as I have little time to do 90 pages of analysis on a bad font definition in the file.
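The sliding described above can be sketched in a few lines of Python. This is only a sketch under assumptions read off the sample pasted in the question: the lowercase shift of 45 follows from A -> n, B -> o, C -> p, while the uppercase shift of 39 is a guess based on `.A<G87,G4G8F` reading as "UnitedStates". The helper name `unshift` is made up for illustration. Spaces do not survive in the pasted text, and capitals A to F would land on non-printable codes under this shift, so neither can be recovered this way.

```python
# Offsets inferred from the pasted sample; treat both as assumptions.
LOWER_SHIFT = ord("n") - ord("A")  # 45: 'A' -> 'n', 'B' -> 'o', 'C' -> 'p'
UPPER_SHIFT = ord("U") - ord(".")  # 39: '.' -> 'U', ',' -> 'S' (guessed)


def unshift(garbled: str) -> str:
    """Undo the character shift seen in the sample (hypothetical helper)."""
    out = []
    for ch in garbled:
        code = ord(ch)
        if ord("4") <= code <= ord("M"):    # '4'..'M' -> 'a'..'z'
            out.append(chr(code + LOWER_SHIFT))
        elif ord(" ") <= code < ord("4"):   # ' '..'3' -> 'G'..'Z'
            out.append(chr(code + UPPER_SHIFT))
        else:
            out.append(ch)                  # mapping unknown: leave untouched
    return "".join(out)


print(unshift("C4EGG<@84FFB6<4G8F"))  # parttimeassociates
print(unshift(".A<G87,G4G8F"))        # UnitedStates
```

Word boundaries would still have to be restored separately, for example from the positions of the characters on the page.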

I tried Acrobat and Exchange plus others. All agreed the text was defined as a reasonable form of Times Roman, so there is nothing to fix, but the content is meaningless nevertheless. Selecting the characters for "We" (08) generally jumped to another instance, suggesting there could be some slight possibility of redemption, but then the same two characters stopped on occasion at "ai", which is what's needed, so I would say the file is borked.

In theory, the corruption should be recoverable in the PDF by remapping that font (at least for those pages). With a good character remapping, adding or subtracting accordingly, the plain text may be more easily converted.
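On the plain-text side, one way to apply such a remapping only to the affected pages is a translation table. The sketch below assumes the bad range is pages 13-100, as stated above, and reuses shifts read off the sample in the question (45 for lowercase, 39 for uppercase), which may not hold for every character. `remap_pages` is a hypothetical helper, not part of any library.

```python
# Assumed shifts from the pasted sample:
# '4'..'M' -> 'a'..'z' (+45) and ' '..'3' -> 'G'..'Z' (+39).
REMAP = {code: code + 45 for code in range(ord("4"), ord("M") + 1)}
REMAP.update({code: code + 39 for code in range(ord(" "), ord("4"))})


def remap_pages(page_texts, first_bad=13, last_bad=100):
    """Return page texts with the shift undone on pages first_bad..last_bad.

    Page numbers are 1-based; pages outside the range are returned unchanged.
    """
    fixed = []
    for number, text in enumerate(page_texts, start=1):
        if first_bad <= number <= last_bad:
            text = text.translate(REMAP)  # characters not in REMAP pass through
        fixed.append(text)
    return fixed


pages = ["Plain text"] * 12 + ["C4EGG<@8"]
print(remap_pages(pages)[12])  # parttime
```

The page texts could come from PyMuPDF, for instance `[page.get_text() for page in fitz.open(path)]`, with `path` pointing at the downloaded report.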
