I'm using Ghostscript to convert a PDF document into an EPS file.

My goal is to remove the textual information from the PDF while keeping the vector outlines of the text intact. I am doing so by converting to EPS and then converting it back to PDF. (Of course, I don't expect to prevent people from running OCR to get the text.)

The command I used was:

gs -q -dNOCACHE -dNOPAUSE -dBATCH -dSAFER \
   -sDEVICE=epswrite -sOutputFile=output.eps input.pdf

But when I convert the resulting EPS back to PDF, the original margin is mostly lost, the page size shrinks, and text on even-numbered pages is cropped on the right.

Is there a way to keep the original page size and margin during the conversion?

Another tool I tried was ps2eps.

While it supports specifying a page size, it does not actually remove the textual information, so one could still select and copy text from the resulting PDF. This defeats my purpose.

Another drawback is that it only supports converting a single page, so I have to first convert my PDF to a set of single-page PS files using psselect.

Kurt Pfeifle
user31039

1 Answer

Firstly, don't use epswrite. In recent versions of Ghostscript you can't, so you must be using an old version: upgrade! You should be using the eps2write device instead.

Secondly, don't convert PDF->EPS->PDF.

Each conversion costs you accuracy. Doubly don't do this if you intend to maintain page-level information (like margins). EPS files are deliberately intended to have a tight bounding box, amongst other requirements, which probably makes them unsuitable for your purposes.

If you want to maintain the page level data, then convert to PostScript, not EPS, using the ps2write device.
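
As a rough sketch of that PostScript route, the wrapper below calls the ps2write device with the same switches as the command in the question. The function name pdf_to_ps, the GS variable, and the file names are illustrative assumptions, not part of Ghostscript.

```shell
# Sketch: convert a PDF to PostScript (not EPS) so page-level data is kept.
# GS and pdf_to_ps are hypothetical helper names chosen for this example.
GS="${GS:-gs}"

pdf_to_ps() {
    # $1 = input PDF, $2 = output PostScript file
    "$GS" -q -dNOPAUSE -dBATCH -dSAFER \
        -sDEVICE=ps2write \
        -sOutputFile="$2" "$1"
}

# Usage: pdf_to_ps input.pdf output.ps
```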

Note that when using the epswrite device, you are not 'removing the textual information (while keeping the vector outlines of the text intact)', but in the general case you are rendering the text to bitmaps. Ugly, and doesn't scale well!

To do this sensibly, use a current version of Ghostscript (9.16 as of this writing), use the pdfwrite device (with PDF in, PDF out) and select the -dNoOutputFonts switch.

This will do what you seem to want: it will draw the text as vectors, not text. The result will, of course, be a PDF file which is unsearchable and immune to copy/paste.
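
A minimal sketch of the PDF-in, PDF-out approach described above, assuming a Ghostscript build that supports -dNoOutputFonts. The function name outline_text, the GS variable, and the file names are made up for illustration; the other switches are carried over from the command in the question.

```shell
# Sketch: rewrite a PDF so that text is drawn as vector outlines, not fonts.
# GS and outline_text are hypothetical helper names chosen for this example.
GS="${GS:-gs}"

outline_text() {
    # $1 = input PDF, $2 = output PDF
    "$GS" -q -dNOPAUSE -dBATCH -dSAFER \
        -sDEVICE=pdfwrite -dNoOutputFonts \
        -sOutputFile="$2" "$1"
}

# Usage: outline_text input.pdf output.pdf
```

Because the conversion stays in PDF throughout, the page size and margins should not be subject to the tight EPS bounding box.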

Kurt Pfeifle
KenS
  • Thanks for your advice. I zoomed in on the PDF converted from EPS and it is definitely vector curves, not bitmaps. I upgraded my Ghostscript and tried your PDF-in, PDF-out method. It worked nicely, but produced a much larger PDF, presumably because glyphs were not reused. My original PDF is 50 KB, the PDF via EPS is 20 KB, and now the direct PDF is 800 KB. Curiously, the shapes of glyphs seem to be reused in EPS, which would explain the small file size. – user31039 Apr 06 '15 at 16:42
  • I now find myself unable to reproduce the PDF->EPS->PDF procedure, so this answer is the only working method left. – user31039 Apr 06 '15 at 16:54
  • The only way the shapes would be 'reused' is if they were stored as glyph descriptions in a type 3 font. This is possible: there are cases in which epswrite will produce vector glyph descriptions instead of bitmaps, but in general it produces bitmaps. Note that whether it produces bitmaps or vectors, it will do them as a type 3 font, so it still holds textual information. The pdfwrite method removes all textual information totally. epswrite doesn't, though it will (again, generally, not 100% of the time) re-encode the text so that it is useless for copy/paste/search. – KenS Apr 07 '15 at 06:59