I'm trying to convert PDFs to text files. I use this command to perform the conversion:

```shell
gs -dBATCH -dNOPAUSE -sDEVICE=txtwrite -sOutputFile=output.txt input.pdf
```

Ghostscript version is 9.07.

I get all the text shown in the PDF, but I'd like to preserve the blank lines in the text file if possible.

Thanks

Will
  • Typically, a PDF *has* no blank lines. Even the notion of a 'line' is fairly loose: the specification allows a sequence of text to be emitted on a single line, but it also allows *any* x and y position for any text. To get what you want, you must compare the y positions of successive 'lines' and decide whether they are far enough apart to count as "blank". – Jongware Mar 21 '16 at 01:15
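
The y-position comparison described in this comment can be sketched in shell with awk. This is only an illustration under stated assumptions: the input format (one text line per row as `y<TAB>text`, ordered top to bottom, with y decreasing down the page) is hypothetical and is not something Ghostscript emits directly, and the function name `insert_blank_lines` and the 12-point default line spacing are made up for the example.

```shell
# Hypothetical input: one text line per row as "<y-position><TAB><text>",
# ordered top to bottom, with y decreasing down the page.
# Prints the text, adding blank lines wherever the vertical gap to the
# previous line is larger than one normal line spacing.
insert_blank_lines() {
    # $1: assumed normal line spacing in points (defaults to 12)
    awk -F '\t' -v leading="${1:-12}" '
        NR > 1 {
            gap = prev_y - $1                       # distance to the previous line
            # Gap in whole line spacings, rounded to nearest, minus the
            # one spacing that an ordinary line break already accounts for.
            missing = int(gap / leading + 0.5) - 1
            for (i = 0; i < missing; i++) print ""  # negative gaps print nothing
        }
        {
            text = $0
            sub(/^[^\t]*\t/, "", text)              # drop the y column, keep the text
            print text
            prev_y = $1
        }
    '
}
```

For example, `printf '700\tTitle\n688\tFirst\n652\tSecond\n' | insert_blank_lines 12` leaves `Title` and `First` adjacent and puts two blank lines before `Second`. A fixed line spacing is the weak point of this approach: real documents mix point sizes, so the threshold usually has to be derived from the text itself.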

1 Answer

You should upgrade: the current version of Ghostscript is 9.18, and 9.19 will be released very shortly. Each of the interim versions includes fixes to the txtwrite device.

Although it is true that PDF files do not include blank lines, the txtwrite device does have a mode whereby it will attempt to produce a reasonable representation of the original layout by using spaces and blank lines in a text file.

This is the default action in the current version of txtwrite, so you ought to be getting this already, unless you have selected a different TextFormat.
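
To rule out a different TextFormat, the format can be requested explicitly. The following is a minimal sketch, assuming a recent Ghostscript in which `-dTextFormat=3` selects the layout-approximating UTF-8 output; the behaviour of 9.07 may differ. The wrapper name `pdf_to_text` is hypothetical and only there to keep the command reusable.

```shell
# Hypothetical wrapper around the command from the question, with the
# txtwrite output format made explicit.
# TextFormat values as documented for recent Ghostscript releases:
#   0, 1  XML-style output with position information (developer-oriented)
#   2     Unicode (UCS2) text approximating the original layout
#   3     same as 2, but encoded as UTF-8 (the default)
pdf_to_text() {
    # $1: input PDF, $2: output text file
    gs -dBATCH -dNOPAUSE -sDEVICE=txtwrite -dTextFormat=3 \
       -sOutputFile="$2" "$1"
}
```

Usage would be `pdf_to_text input.pdf output.txt`. If the output still lacks blank lines after upgrading, the cause is more likely the layout heuristic than the selected format.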

This mode is highly heuristic and easily fooled; it doesn't cope well with superscripts, subscripts, significant point size changes, and possibly other attributes that make the layout difficult to reproduce. Without seeing your input file, there's nothing more I can tell you.

KenS