I'm using "pdftops" (poppler-utils) to convert .pdf files to .ps files and then "ps2pdf" for the reverse process. The problem is that when I create the .pdf files from the .ps files, the text looks fine, but when I try to copy it, the characters come out garbled (as if they were corrupted). I have used these tools on other files for a long time and they worked fine. I also tried "pdftohtml -xml" to create an .xml file, and there the text is fine (the characters are extracted correctly).

  1. What could be going wrong in the conversion? Are there options for "pdftops" or "ps2pdf" that I need to change?
  2. If I create the .xml output, is there a way to create a .pdf file from the .xml file?

EDIT: Output of "pdffonts original.pdf": [image: pdffonts_output_originalpdf]

Output of "pdffonts roundtripped.pdf": [image: pdffonts_output_roundtrippedpdf]

Andrei F

1 Answer

I'm only covering the PS->PDF conversion here. (I'm assuming your 'vice versa' doesn't refer to a 'round-trip' conversion of the very same file [PDF->PS->PDF], but to the general direction of conversion for any PS file. Is that correct?)

First of all, your ps2pdf is most likely just a shell script that internally runs a Ghostscript command with some default parameters to do the real work. ps2pdf is much easier to use; Ghostscript has many more options, but it is harder to learn. The price is that ps2pdf takes away a lot of the control you would have if you called Ghostscript directly. (You can tweak a few parameters with ps2pdf, but by then you are already close to running the real Ghostscript command anyway.)
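One way to check this on your own machine is to look at what the ps2pdf on your PATH actually is. The sketch below assumes a POSIX shell; `is_script` is a hypothetical helper name, not something shipped with Ghostscript or Poppler.

```shell
# Hypothetical helper: succeed if the file starts with the "#!" marker
# that identifies an interpreted script.
is_script() {
  [ "$(head -c 2 "$1" 2>/dev/null)" = "#!" ]
}

# Usage, assuming ps2pdf is on your PATH:
#   if is_script "$(command -v ps2pdf)"; then
#     cat "$(command -v ps2pdf)"   # shows what the wrapper runs (possibly another wrapper script)
#   fi
```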

Second, without knowing exactly what your PS input file looks like, it is difficult to give you good advice: Does your PS file embed the fonts it uses? Which types of fonts are they? And so on.

Thirdly, over the last few years Ghostscript has gained a lot of additional power and control when it comes to outputting PDF, and has had a few bugs and weak spots removed. So which version of Ghostscript is installed on your system? (Remember, ps2pdf calls Ghostscript; it will not work without a locally installed gs executable.)
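To answer the version question, `gs --version` prints the installed version number. The comparison helper below is only a sketch: `version_older_than` is a hypothetical name, and the 9.0 threshold in the usage note is an arbitrary example, not a version known to fix this problem.

```shell
# Hypothetical helper: succeed if dotted version $1 is older than $2.
# Components are compared numerically, left to right.
version_older_than() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    na = split(a, x, "."); nb = split(b, y, ".")
    n = (na > nb) ? na : nb
    for (i = 1; i <= n; i++) {
      if ((x[i] + 0) < (y[i] + 0)) exit 0   # older
      if ((x[i] + 0) > (y[i] + 0)) exit 1   # newer
    }
    exit 1                                  # equal
  }'
}

# Usage, assuming gs is installed:
#   if version_older_than "$(gs --version)" 9.0; then
#     echo "Ghostscript is older than 9.0; consider testing a newer release"
#   fi
```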

One likely cause for your inability to copy text from the PDF is the font type (and encoding) that ended up being used and embedded in your PDF file. What font details can you tell us about your resulting PDFs? (Run pdffonts your.pdf to find out -- pdffonts is also part of the Poppler utils you mentioned.)
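pdffonts prints one row per font, with columns for embedding (emb), subsetting (sub) and the presence of a ToUnicode map (uni). Fonts that lack a ToUnicode map, especially ones with a custom encoding, are common suspects when copied text comes out garbled. The following is a minimal sketch for listing them; `fonts_without_unicode` is a hypothetical helper, and it assumes the usual pdffonts layout in which the last two fields of each row are the object ID.

```shell
# Hypothetical helper: read pdffonts output on stdin and print the names
# of fonts whose "uni" column is "no".
fonts_without_unicode() {
  # Skip the two header lines. Count fields from the right, because the
  # "type" column may contain spaces (e.g. "CID TrueType").
  awk 'NR > 2 && NF >= 8 && $(NF-2) == "no" { print $1 }'
}

# Usage, assuming pdffonts is installed and the file names match yours:
#   pdffonts original.pdf     | fonts_without_unicode
#   pdffonts roundtripped.pdf | fonts_without_unicode
```

If a font appears only in the second list, the round trip probably dropped its Unicode mapping, which would be consistent with text that displays correctly but copies as garbage.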

You may try this (full) Ghostscript command for PS->PDF conversion and check where it takes you:

gs \
  -o output.pdf \
  -sDEVICE=pdfwrite \
  -dPDFSETTINGS=/prepress \
  -dHaveTrueTypes=true \
  -dEmbedAllFonts=true \
  -dSubsetFonts=false \
  -c ".setpdfwrite <</NeverEmbed [ ]>> setdistillerparams" \
  -f input.ps
Kurt Pfeifle
  • For the first question: I tried a round-trip conversion of the same file (PDF->PS->PDF) and the result was bad: the text is displayed correctly, but when I try to copy it, I get corrupt characters. I also tried your "gs" command and got the same results. – Andrei F May 28 '12 at 14:37
  • @ice13ill: What about the other questions? Try `pdffonts original.pdf` + `pdffonts roundtripped.pdf` and report results. – Kurt Pfeifle May 28 '12 at 19:00
  • @ice13ill: *Which version of Ghostscript is installed on your system?* – Kurt Pfeifle May 29 '12 at 07:53
  • @ice13ill: v8.71 is rather old... Can you provide a sample of your original PDF (so I can try and find a better way)? – Kurt Pfeifle May 29 '12 at 08:38
  • A colleague of mine ran the same tests with version 9 of gs. Same results :(. I will try to send you part of a document that behaves like the ones I tested. – Andrei F May 29 '12 at 09:22
  • Can you provide a private email address please? – Andrei F May 30 '12 at 13:37
  • @ice13ill: you may use 'my nickname' AT 'gmail' DOT 'com'. – Kurt Pfeifle May 31 '12 at 04:17
  • I've sent you an email. Have you had a chance to look at it? – Andrei F Jun 05 '12 at 09:57