I have to convert PDFs to text and am currently using pdftotext.exe. It sometimes messes up the resulting text, so I can't use it.
Is there another free tool that I can call from another program? I'd prefer a command line tool.
PDF can be tricky to convert to text depending on how it's constructed, but you may get good results from iTextSharp or Ghostscript, or from a commercial component, e.g. from www.tallcomponents.com (not affiliated).
PDF files do not generally contain any structure so the software needs to guess it. I wrote a blog post on the issues at http://www.jpedal.org/PDFblog/2009/04/pdf-text/
You could also try PDFBox.
I find that Apache PDFBox is much better than pdftotext. It extracts text in a way that is much closer to the original formatting of the document. It can be run from the command line.
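Since the question was about calling the tool from another program, here is a rough sketch of how that might look from Python. It assumes Java is on the PATH and that you have downloaded a PDFBox 2.x standalone "app" jar, whose command line takes the form `ExtractText <input> <output>`. The jar path and the helper names `build_extract_command` and `pdf_to_text` are made up for illustration, and the output is assumed to be UTF-8.

```python
import subprocess
from pathlib import Path


def build_extract_command(jar_path, pdf_path, txt_path):
    """Build the argument list for PDFBox's ExtractText tool.

    Assumes the PDFBox 2.x app-jar syntax:
        java -jar pdfbox-app-2.x.y.jar ExtractText <input.pdf> <output.txt>
    """
    return [
        "java", "-jar", str(jar_path),
        "ExtractText",
        str(pdf_path),
        str(txt_path),
    ]


def pdf_to_text(jar_path, pdf_path, txt_path):
    """Run PDFBox on pdf_path and return the extracted text (hypothetical helper)."""
    cmd = build_extract_command(jar_path, pdf_path, txt_path)
    # check=True raises CalledProcessError if Java or PDFBox exits with an error.
    subprocess.run(cmd, check=True)
    # Assumption: the output file is UTF-8; undecodable bytes are replaced.
    return Path(txt_path).read_text(encoding="utf-8", errors="replace")


if __name__ == "__main__":
    # Placeholder file names; adjust them to your own setup.
    print(pdf_to_text("pdfbox-app.jar", "input.pdf", "output.txt"))
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems with file names that contain spaces. Note that the 3.x releases of PDFBox changed the command-line syntax, so check the documentation for whichever version you download.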