
I have used Python's zlib library to decode streams that were compressed with FlateDecode. Until now, every PDF file I have worked with showed the correct values in the Tj and TJ operators, but with this PDF the decoded text does not match what is displayed in the document.

I can copy text from the PDF into Notepad without any issue, and pdftotext also produces the expected output with the correct words.

I have also used Adobe Preflight to inspect the document's internal structure and double-check the decoded text I get from zlib, but even that shows garbage values that do not match what is displayed in the PDF.

Why do I get these garbage values in the text operators, and how is pdftotext still able to produce the correct results?

Also, how do I get the correct results via Python/zlib?

PDF File

Fonts details using pdffonts

Decoded stream via zlib(python)


Pawan Sharma

1 Answer


The values in the TJ/Tj operators are PDF codepoints (normally one byte, sometimes two). To interpret them, you need to determine which font is currently selected and then read that font's encoding (there are many kinds). PDF text extraction is very hard. I wouldn't advise trying it yourself.

You have been lulled into a false sense of security by seeing PDF files in which the PDF codepoints happen to be exactly the same as the Unicode codepoints they represent, i.e. you have been looking at files which use simple font encodings.
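
To make the mapping step concrete, here is a minimal sketch of how codes from a Tj string could be turned into text through a font's ToUnicode CMap. Everything in it is illustrative: `SAMPLE_CMAP` is an invented fragment, `parse_bfchar` and `decode_shown_string` are hypothetical helper names, and the code assumes fixed two-byte codes and only `bfchar` entries (real CMaps can also use `bfrange` sections and other code lengths).

```python
import re

# Invented ToUnicode CMap fragment. A real one comes from the font's
# /ToUnicode stream (after FlateDecode) and may also contain bfrange sections.
SAMPLE_CMAP = """
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
3 beginbfchar
<0003> <0048>
<0011> <0069>
<0024> <0021>
endbfchar
"""

def parse_bfchar(cmap_text):
    """Return {code: text} built from the bfchar sections only."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            # Destination values are UTF-16BE encoded Unicode text.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping

def decode_shown_string(raw, mapping, code_len=2):
    """Split the bytes of a shown string into fixed-width codes and map them."""
    out = []
    for i in range(0, len(raw), code_len):
        code = int.from_bytes(raw[i:i + code_len], "big")
        out.append(mapping.get(code, "\ufffd"))  # U+FFFD for unmapped codes
    return "".join(out)

mapping = parse_bfchar(SAMPLE_CMAP)
raw = bytes.fromhex("000300110024")  # the bytes <000300110024> Tj would carry
print(decode_shown_string(raw, mapping))  # prints "Hi!"
```

Read as plain bytes, the same string is just control characters, which is the kind of "garbage" the question describes.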

johnwhitington
  • *"PDF codepoints (normally one byte, sometimes two)"* - are you sure? The CMap reference that the PDF spec points to (Adobe TN 5014) allows input codes to consist of one, two, three, or more bytes. Admittedly, PDFs normally use single-byte, double-byte, or mixed single- and double-byte encodings, but a general-purpose PDF processor should be prepared to deal with longer codes, too. – mkl Nov 15 '22 at 18:02
  • @mkl If that's the case, how should I extract text from a PDF then? Or should I ditch this plan altogether and do it via `pdftotext`? – Pawan Sharma Nov 16 '22 at 11:03
  • For text extraction it doesn't suffice to inspect only the text-showing instructions (**Tj**, **TJ**, ...). You also have to keep an eye on the text-state-setting instructions (like **Tf** for fonts) and keep track of the current text state. When you come across a text-showing instruction, you can look up the details of the current font resource, determine its **Encoding** and **ToUnicode** mapping, and use them to interpret the string shown. – mkl Nov 16 '22 at 16:12
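
The state tracking described in the last comment could be sketched roughly as follows. This is only an illustration under strong assumptions: the `FONTS` tables are invented stand-ins for what would be built from each font resource's Encoding or ToUnicode data, `extract_text` is a hypothetical helper, and the tokenizer handles only names, numbers, hex strings with an even number of digits, and bare operators (no literal `(...)` strings, no **TJ** arrays, no inline images).

```python
import re

# Invented per-font tables: (code length in bytes, {code: text}).
# In a real extractor these would come from the font resources of the page.
FONTS = {
    "F1": (1, {0x01: "P", 0x02: "D", 0x03: "F"}),
    "F2": (2, {0x0001: "o", 0x0002: "k"}),
}

# Simplified tokenizer: names (/F1), hex strings (<0102>), numbers, operators.
TOKEN = re.compile(r"/[^\s/<>\[\]()]+|<[0-9A-Fa-f\s]*>|[^\s/<>\[\]()]+")

def extract_text(content, fonts):
    current_font = None  # part of the text state, set by Tf
    operands = []
    pieces = []
    for tok in TOKEN.findall(content):
        if tok == "Tf":
            current_font = operands[0][1:]  # operands are: /Name size
            operands = []
        elif tok == "Tj":
            # Interpret the string with the font selected by the last Tf.
            code_len, table = fonts[current_font]
            raw = bytes.fromhex(operands[0][1:-1])
            for i in range(0, len(raw), code_len):
                code = int.from_bytes(raw[i:i + code_len], "big")
                pieces.append(table.get(code, "\ufffd"))
            operands = []
        elif tok[0] in "/<+-." or tok[0].isdigit():
            operands.append(tok)  # name, hex string, or number
        else:
            operands = []  # any other operator: discard its operands
    return "".join(pieces)

stream = "BT /F1 12 Tf <010203> Tj /F2 12 Tf <00010002> Tj ET"
print(extract_text(stream, FONTS))  # prints "PDFok"
```

The same bytes give different text depending on the font in effect: `<0001>` decodes to "o" under the hypothetical F2 but to an unmapped code followed by "P" under F1, which is why the **Tj** operands alone are not enough.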