2

I need to convert PDFs to text and am currently using pdftotext.exe. It sometimes garbles the resulting text, so I can't rely on it.

Is there another free tool that I can call from another program? I'd prefer a command line tool.

franzlorenzon
EOB

3 Answers

3

PDF can be tricky to convert to text, depending on how the file is constructed, but you may get good results from iTextSharp or Ghostscript, or from a commercial component, e.g. from www.tallcomponents.com (not affiliated).

Mark Redman
1

PDF files generally do not contain any structural information, so the software has to guess it. I wrote a blog post on the issues at http://www.jpedal.org/PDFblog/2009/04/pdf-text/

You could also try PDFBox.

mark stephens
0

I find that Apache PDFBox is much better than pdftotext. It extracts text in a way that is much closer to the original formatting of the document. It can be run from the command line.
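
As a rough illustration of calling it from another program, here is a minimal Python sketch. It assumes Java is on the PATH and that you have downloaded the standalone PDFBox app jar; the jar path and the helper names `build_extract_command` and `pdf_to_text` are made up for this example. The `ExtractText` subcommand is the PDFBox 2.x form; as far as I know, PDFBox 3.x uses `export:text -i <input> -o <output>` instead, so adjust for your version.

```python
import subprocess
from pathlib import Path


def build_extract_command(jar_path, pdf_path, txt_path):
    """Build the argv list for the PDFBox 2.x ExtractText command-line tool.

    jar_path is whatever pdfbox-app jar you downloaded (hypothetical path).
    """
    return ["java", "-jar", str(jar_path), "ExtractText", str(pdf_path), str(txt_path)]


def pdf_to_text(jar_path, pdf_path, txt_path=None):
    """Run PDFBox on pdf_path and return the extracted text as a string."""
    pdf_path = Path(pdf_path)
    # Default to writing the text file next to the PDF.
    txt_path = Path(txt_path) if txt_path else pdf_path.with_suffix(".txt")
    # check=True raises CalledProcessError if PDFBox exits with an error.
    subprocess.run(build_extract_command(jar_path, pdf_path, txt_path), check=True)
    # errors="replace" avoids a crash if the output encoding is not UTF-8.
    return txt_path.read_text(encoding="utf-8", errors="replace")
```

Usage would look something like `text = pdf_to_text("pdfbox-app.jar", "report.pdf")`, where both file names are placeholders.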

bcoughlan