3

In a project I'm working on we scrape legal documents from various government sites and then make them searchable online.

Every now and then we encounter a PDF that seems to be corrupt. Here's an example of one.

If you open it in a PDF reader, it looks fine, but:

  • If you try to copy and paste it, you get corrupted text
  • If you run it through any tools like pdftotext, you get corrupted text
  • If you do just about anything else to it -- you guessed it -- you get corrupted text

Yet, if you open it in a reader, it looks fine! So I know the text is there, but something is wrong, wrong, wrong! The result is that on my site it looks really bad.

Is there anything I can do?

Update: I did more research today. Thanks to @Andrew Cash's observation that this is essentially a Caesar cipher, I realized I could search for the documents. This link will show you about 200 of these in my system. Looking through the larger sample set, it looks like these are all created by the same software, pdffactory v. 3.51! So I blame a bug, not deliberate obfuscation.
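One way to find such documents in bulk can be sketched as follows. This is a hypothetical heuristic, not the search I actually ran: it assumes the extracted text of an affected PDF is dominated by characters in the upper Latin-1 range (as the garbled samples are), and the function names and the 0.5 threshold are made up for illustration.

```python
def high_latin1_ratio(text):
    """Fraction of non-whitespace characters in the range U+00A1..U+00FF."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    high = sum(1 for c in chars if 0xA1 <= ord(c) <= 0xFF)
    return high / len(chars)


def looks_remapped(text, threshold=0.5):
    """Hypothetical check: flag text made up mostly of high Latin-1 characters."""
    return high_latin1_ratio(text) > threshold


garbled_sample = "Ý±«®¬ ±° ß°°»¿´-"   # typical garbled extraction
clean_sample = "Court of Appeals"

print(looks_remapped(garbled_sample))
print(looks_remapped(clean_sample))
```

The threshold would need tuning against real documents, since legitimate text in some languages also uses many accented characters.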

Update 2: The link above won't provide any results anymore. These are purged from my system using my solution below.

mlissner
  • There are many ways to "obfuscate" PDFs, but if they render correctly on-screen, you should be able to google "PDF OCR" to find a product that will just render them and convert them back to text. Another option is to convert them to images with, for example, Ghostscript and then use pretty much any OCR software. – Joachim Isaksson Feb 10 '12 at 07:31
  • I would suggest using an OCR program. That way, you won't be trying to read a potentially corrupt document. – Cody Gray - on strike Feb 10 '12 at 07:32

2 Answers

3

The PDF uses subsetted fonts whose characters are remapped to other characters, in the same way as a simple World War II substitution cipher.

A = G, B = 1, C = #, D = W, ... and so on. Every character is remapped.

The font is mapped this way, so to get the correct characters displayed in the PDF you need to send "G1#W" for it to print ABCD. Normally PDFs have a ToUnicode table to help with text extraction, but I suspect this table has been left out on purpose.
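As a rough illustration of the remapping described above, here is a minimal sketch. The four pairs are the made-up ones from the example (A = G, B = 1, C = #, D = W), not the mapping of any real font, and the function names are hypothetical.

```python
# Hypothetical glyph remapping, using only the four example pairs above.
plain_to_code = {"A": "G", "B": "1", "C": "#", "D": "W"}

# The inverse table plays the role a ToUnicode table would normally play:
# it maps the stored character codes back to the text a reader sees.
code_to_plain = {code: plain for plain, code in plain_to_code.items()}


def encode(text):
    """What gets stored in the PDF content stream."""
    return "".join(plain_to_code.get(ch, ch) for ch in text)


def decode(stored):
    """What text extraction can recover only if the inverse table is known."""
    return "".join(code_to_plain.get(ch, ch) for ch in stored)


stored = encode("ABCD")
print(stored)          # G1#W
print(decode(stored))  # ABCD
```

Without the inverse table, an extraction tool only sees the stored codes, which is why copy and paste yields "G1#W" while the page still renders ABCD.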

I have seen a few of these documents myself where they are deliberately obfuscated to prevent text extraction. I have seen a document with about 5 different fonts and they were all mapped using a different sequence.

One sure way to tell if this is the problem is to load the PDF into Acrobat and copy / paste the text into a text editor. If Acrobat cannot decode the text back to English, then the only way to extract the text directly is to remap it manually, which requires knowing the translation mappings.

The only way to extract text easily from these types of documents is to OCR the full document and discard the original text. The OCR tool renders each page to an image (for example, TIFF) and recognizes the text from that image, so the original garbled text shouldn't affect the result.
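The render-then-OCR route could look roughly like the sketch below. It assumes Ghostscript (`gs`) and Tesseract (`tesseract`) are installed and on the PATH; the file names, the 300 dpi resolution, and the helper names are placeholders chosen for illustration.

```python
import subprocess


def ghostscript_command(pdf_path, out_pattern="page-%03d.tif", dpi=300):
    """Build a Ghostscript command that renders each page to a bilevel TIFF."""
    return [
        "gs", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=tiffg4",
        f"-r{dpi}",
        f"-sOutputFile={out_pattern}",
        pdf_path,
    ]


def tesseract_command(image_path, output_base):
    """Build a Tesseract command; it writes the text to <output_base>.txt."""
    return ["tesseract", image_path, output_base]


def run(command):
    """Run one of the commands above, raising if the tool reports failure."""
    subprocess.run(command, check=True)


print(" ".join(ghostscript_command("input.pdf")))
print(" ".join(tesseract_command("page-001.tif", "page-001")))
```

Calling `run(ghostscript_command("input.pdf"))` and then `run(tesseract_command(...))` for each rendered page would produce one text file per page; nothing is executed until `run` is called.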

Andrew Cash
  • Didn't expect a caesar cipher to be the heart of this. Doesn't look deliberate though - see my update. – mlissner Feb 10 '12 at 19:28
  • Hahaha.... Your explanation of the technical side is quite correct (and honest compliments: it's also easy to understand for PDF-noobs). But what makes me smile is that you are making up a conspiracy theory about "deliberate obfuscation" for what simply is the (well documented) "custom encoding" of a font that gets sub-setted (because one doesn't want to embed the full font because of license or space considerations). ;-) – Kurt Pfeifle Feb 11 '12 at 12:21
2

Weary of this issue and not wanting to deal with OCR, I manually sorted out the cipher. Here she be, as a Python dict along with some rudimentary code that I was using to test it. Entries keyed 'regex' are placeholders for letters whose ciphertext is a multi-character sequence; those are handled by the re.sub calls after the main loop. I'm sure this could be improved, but it does work for all letters except uppercase Q and uppercase X, which I haven't yet been able to find.

It's also missing a fair bit of punctuation, at least for now (all of these are missing, for example: <>?{}\|!~`@#$%^_=+).

# -*- coding: utf-8 -*-

import re
import sys

letter_map = {
 u'¿':'a',
 u'regex':'b',
 u'regex':'c',
 u'regex':'d',
 u'»':'e',
 u'o':'f',
 u'1':'g',
 u'regex':'h',
 u'·':'i',
 u'¶':'j',
 u'μ':'k',
 u'regex':'l',
 u'3':'m',
 u'2':'n',
 u'±':'o',
 u'°':'p',
 u'regex':'q',
 u'®':'r',
 u'-':'s',
 u'¬':'t',
 u'«':'u',
 u'a':'v',
 u'©':'w',
 u'regex':'x',
 u'§':'y',
 u'¦':'z',
 u'ß':'A',
 u'Þ':'B',
 u'Ý':'C',
 u'Ü':'D',
 u'Û':'E',
 u'Ú':'F',
 u'Ù':'G',
 u'Ø':'H',
 u'×':'I',
 u'Ö':'J',
 u'Õ':'K',
 u'Ô':'L',
 u'Ó':'M',
 u'Ò':'N',
 u'Ñ':'O',
 u'Ð':'P',
 u'':'Q', # Missing
 u'Î':'R',
 u'Í':'S',
 u'Ì':'T',
 u'Ë':'U',
 u'Ê':'V',
 u'É':'W',
 u'':'X', # Missing
 u'Ç':'Y',
 u'Æ':'Z',
 u'ð':'0',
 u'ï':'1',
 u'î':'2',
 u'í':'3',
 u'ì':'4',
 u'ë':'5',
 u'ê':'6',
 u'é':'7',
 u'è':'8',
 u'ç':'9',
 u'ò':'.',
 u'ô':',',
 u'æ':':',
 u'å':';',
 u'Ž':"'",
 u'•':"'",
 u'•':"'", # s/b double quote, but identical to single.
 u'Œ':"'", # s/b double quote, but identical to single.
 u'ó':'-', # dash
 u'Š':'-', # n-dash
 u'‰':'--', # em-dash
 u'ú':'&',
 u'ö':'*',
 u'ñ':'/',
 u'÷':')',
 u'ø':'(',
 u'Å':'[',
 u'Ã':']',
 u'‹':'•',
 }

ciphertext = u'''YOUR STUFF HERE'''

# Map each character through the table; anything not in it passes through unchanged.
plaintext = ''.join(letter_map.get(letter, letter) for letter in ciphertext)

# These are multi-length replacements
plaintext = re.sub(u'm⁄4', 'b', plaintext)
plaintext = re.sub(u'g⁄n', 'c', plaintext)
plaintext = re.sub(u'g⁄4', 'd', plaintext)
plaintext = re.sub(u' ́', 'l', plaintext)
plaintext = re.sub(u' ̧', 'h', plaintext)
plaintext = re.sub(u' ̈', 'x', plaintext)
plaintext = re.sub(u' ̄u', 'qu', plaintext)

for letter in plaintext:
    try:
        sys.stdout.write(letter)
    except UnicodeEncodeError:
        continue
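A side observation on the table above, offered as a sketch rather than a verified rule: for the single-character entries in the Latin-1 range, the code points of the ciphertext character and the plaintext character add up to 0x120. The sample pairs below are copied from the table; the `decode_char` rule and the range it is applied to are assumptions, and it does not cover the multi-character sequences or the quote and dash entries.

```python
# A few single-character pairs copied from the table above (cipher -> plain).
pairs = {
    "¿": "a", "»": "e", "·": "i",
    "ß": "A", "Æ": "Z",
    "ð": "0", "ç": "9",
    "ò": ".", "ø": "(", "÷": ")",
}

# Observation: for each of these pairs the two code points sum to 0x120.
pattern_holds = all(ord(c) + ord(p) == 0x120 for c, p in pairs.items())


def decode_char(ch):
    """Assumed rule: plain = 0x120 - cipher, applied only to U+00A2..U+00FF."""
    code = ord(ch)
    if 0xA2 <= code <= 0xFF:
        return chr(0x120 - code)
    return ch


def decode(text):
    return "".join(decode_char(ch) for ch in text)


print(pattern_holds)    # True
print(decode("Ý±«®¬"))  # Court
```

If the pattern holds across the whole font, it would also yield candidates for the uppercase Q and X that the table is missing, but that is untested here.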
mlissner
  • You will find that this might work for this particular PDF but in future you will see a different coding scheme and in some documents each font will use a different random encoding. – Andrew Cash Feb 11 '12 at 02:28
  • I don't know, I used this for 200 pdfs, and did some spot checking. Seemed to work perfectly for whatever reason. – mlissner Feb 11 '12 at 04:32
  • As you say, all your problem PDF's come from the same producer - good to hear that this solves your problem. – Andrew Cash Feb 12 '12 at 14:51