PDF: obfuscating text encoding to prevent automatic parsing and copy+paste

Question

I want to make a PDF available on my website, but want to prevent the automatic parsing by bots that might not respect the normal PDF "security". The reason is that this is also commercially published and I am allowed to share for "personal use", but must not make it widely available that way. I originally created the PDF from Word.

I have tried using Ghostscript with the -dNoOutputFonts option to convert text to glyphs, but the result is ridiculously big (from 2.5 MB to 180 MB). Scrambling the text encoding seems a good option, but I found barely any posts discussing this. There seems to be a commercial solution, but I was unable to find a way to do this using, e.g., Ghostscript or qpdf. Any suggestions on how to achieve this (or alternative solutions)?

Operating system: Windows 10 64bit
Available versions of Ghostscript: 9.18, 9.27

Simple example PDF

Export the PDF with a JPEG image as the only content on it. There are still OCR tools that may help, but they aren't 100% perfect and will probably have formatting issues. — Ismael Miguel, Aug 23 '19 at 17:17

KenS · Answer 1 · 2019-08-23T20:00:08.733

Well, that's the advantage of fonts, you only have to describe each character once. Convert to outlines and you need to describe it every time, so yeah, much bigger.

Ghostscript's pdfwrite device goes to considerable effort to try and make text searchable, because in general people shout at us when a 'searchable' file becomes 'non-searchable'. So (amongst other things) it preserves any ToUnicode CMaps in the input file. To prevent simple indexing you need to avoid that. You haven't linked to a PDF file so I can't test this, but....

There are three places you need to edit:

/ghostpdl/Resource/Init/gs_pdfwr.ps, line 642, change:

/WantsToUnicode /GetDeviceParam .special_op {
  exch pop
}{
  //true
}ifelse

To:

//false

In the same file, at line 982, change:

  /WantsToUnicode /GetDeviceParam .special_op {
    exch pop
  }{
    //false
  }ifelse

To:

//false

Then in /ghostpdl/Resource/Init/pdf_font.ps, line 614, change:

/WantsToUnicode /GetDeviceParam .special_op { exch pop }{ //false }ifelse

To:

//false

That should prevent any ToUnicode information in the input file making it through to the output file. Depending on the operating system you are using, and the way Ghostscript has been built (you haven't said), you may need to tell Ghostscript to include that directory in its search path, which you do with -I/ghostpdl/Resource/Init.
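As a rough sketch of how the edited resource files might be put to use, the Python helper below only assembles the Ghostscript argument list. The executable name gswin64c.exe is taken from the comments on this answer; the folder C:\gs-scramble\Resource\Init and the function names are hypothetical, so adjust them to wherever the modified copies of gs_pdfwr.ps and pdf_font.ps actually live.

```python
import subprocess


def build_scramble_command(src, dst, init_dir, gs_exe="gswin64c.exe"):
    """Return the argument list for a pdfwrite run that uses edited resource files.

    init_dir is the (hypothetical) folder holding the modified gs_pdfwr.ps and
    pdf_font.ps; -I adds it to Ghostscript's search path.
    """
    return [
        gs_exe,
        "-I" + init_dir,          # search the edited Init directory
        "-dBATCH",                # exit when the input has been processed
        "-dNOPAUSE",              # do not prompt between pages
        "-sDEVICE=pdfwrite",
        "-sOutputFile=" + dst,
        src,
    ]


def run_scramble(src, dst, init_dir):
    """Run the conversion; raises CalledProcessError if Ghostscript fails."""
    subprocess.run(build_scramble_command(src, dst, init_dir), check=True)


# Print the command instead of running it, so the sketch works without Ghostscript.
print(build_scramble_command("in.pdf", "out.pdf", r"C:\gs-scramble\Resource\Init"))
```

Because the arguments are passed as a list, a path containing spaces needs no extra quoting here; on a plain command line it would.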

You should also set -dSubsetFonts=true, which emits all fonts as subsets. I think that's the default, but I can't immediately recall, and it does no harm to set it. That means the first glyph that is encountered is encoded at index 1, the second at index 2, etc. So Hello World becomes 0x01, 0x02, 0x03, 0x03, 0x04, 0x05, 0x06, 0x04, 0x07, 0x03, 0x08. The ordering will be consistent throughout the file (obviously) but different for every font in the file and for every file. That should be adequately scrambled, I'd have thought. It certainly won't be possible to search/copy/paste trivially.
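To make the numbering concrete, here is a small Python illustration of the first-use ordering described above. It is not Ghostscript's code, only a model of the idea that each new glyph gets the next free index; the function name subset_encode is made up for this sketch.

```python
def subset_encode(text):
    """Assign each distinct character the next free index, in order of first use."""
    index_of = {}
    codes = []
    for ch in text:
        if ch not in index_of:
            # Numbering starts at 1, matching the example in the answer.
            index_of[ch] = len(index_of) + 1
        codes.append(index_of[ch])
    return codes, index_of


codes, table = subset_encode("Hello World")
print(", ".join(f"0x{c:02X}" for c in codes))
# prints: 0x01, 0x02, 0x03, 0x03, 0x04, 0x05, 0x06, 0x04, 0x07, 0x03, 0x08
```

In this model, a different text maps the same letters to different codes (for "World" alone, W would be 0x01), which is why the byte values by themselves say nothing about the characters once the ToUnicode information is gone.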

If you make an example file available I can test it.

Oh, it also just occurred to me that you might be able to get the same effect by using the ps2write device to create a PostScript file, then using the pdfwrite device to convert that back to PDF. The ps2write device can't embed ToUnicode CMaps, because there's no standard support in PostScript for that. Of course, it also means the content drops back to PostScript, which may result in other, unacceptable, quality/size changes.
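The round trip could be scripted roughly as follows. This Python sketch only builds the two command lines (PDF to PostScript with ps2write, then PostScript back to PDF with pdfwrite); the function name and the intermediate file name convert.ps are illustrative choices. Whether the text really ends up unsearchable depends on the input file, so treat it as something to try rather than a guaranteed result.

```python
def build_roundtrip_commands(src, dst, intermediate="convert.ps",
                             gs_exe="gswin64c.exe"):
    """Return two Ghostscript argument lists: PDF -> PostScript -> PDF."""
    common = ["-dBATCH", "-dNOPAUSE", "-q"]
    to_ps = [gs_exe, *common, "-sDEVICE=ps2write",
             "-sOutputFile=" + intermediate, src]
    to_pdf = [gs_exe, *common, "-sDEVICE=pdfwrite",
              "-sOutputFile=" + dst, intermediate]
    return [to_ps, to_pdf]


# Print the commands; each list could be passed to subprocess.run(..., check=True).
for cmd in build_roundtrip_commands("in.pdf", "out.pdf"):
    print(" ".join(cmd))
```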

Thank you @KenS for this comprehensive answer. It looks like this might require compiling from source, so it might take me a day or two to understand how to do what you are describing (I use Win10 64bit). Regarding your final paragraph: I just tried the following on Ghostscript 9.18 and it preserved the character encoding. `gswin64c.exe -sDEVICE=ps2write -sOutputFile=- -q -dbatch -dNOPAUSE -dQUIET in.pdf -c quit > convert.ps` and `ps2pdf.bat convert.ps out.pdf` — Rob Hall, Aug 23 '19 at 20:42
You don't have to recompile (unless you really want to); the PostScript resource files are read at startup, either from the ROM file system or from disk, depending on how GS is built. If it's using a ROM file system then you can tell it to use the disk-based ones by using -I. You would only need to recompile if you wanted to change the ROM file system. I'd recommend you don't do that; just keep two copies of the Resources, one for regular use and one for 'scrambling', and switch as required. The Windows binary **does** use a ROM file system, so you would need to use -I. — KenS, Aug 24 '19 at 08:50
You also need to use a more recent version; one that old doesn't ship with the Resource files on disk, as I recall. The current version is 9.27; 9.28 will be released 'soon'. I'd need to see an example file to figure out what's going on with the PostScript route. It's possible you are simply using an ASCII Encoding, which will work with Acrobat even when there's no ToUnicode CMap. NB you'll need to adjust the path for -I appropriately for your system, and if it includes spaces, surround it with "". — KenS, Aug 24 '19 at 08:51
