24

Last year, I made an application in Java using PDFBox to get the raw text in some PDF files and I need to port that application to C++ now.

I wanted to know what was the best C++ alternative to accomplish what I need.

I'll give an example in case it helps:

Most files will look like this: http://www.jumbala.net/backup/league.pdf

With PDFBox, each line read from page 2 and most of page 3 of that file is output with all of its data on a single line, separated by spaces, instead of being kept in the grid layout the PDF shows.

So the first relevant line in page 2 would look like this:

FB 847 - Tremblay, Gérard 179,63 56 16167 90 268 s27 p3 669 s14 199 223 193 615

or something close to it, since the order in which the values appear changes slightly. I don't care about the order as long as similar lines are output the same way, because I just parse them and store the values I need in different variables.
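To illustrate the parsing step on the C++ side, here is a minimal sketch that splits one extracted line into whitespace-separated tokens. The function name `splitFields` is hypothetical, and the sketch assumes every line has the same shape as the sample line above; which token maps to which variable is left to the caller.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split one extracted line into whitespace-separated tokens.
// Hypothetical helper: it does no validation of the number of fields.
std::vector<std::string> splitFields(const std::string& line) {
    std::istringstream in(line);
    std::vector<std::string> fields;
    std::string token;
    // operator>> skips runs of whitespace, so repeated spaces are harmless.
    while (in >> token) {
        fields.push_back(token);
    }
    return fields;
}
```

Called on the sample line above, this yields 18 tokens, starting with "FB" and "847" and ending with "615". Values such as "179,63" use a decimal comma and would still need converting before being stored as numbers.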

So, knowing all of that, is there a library that I can use in a C++ program to get similar results?

Edit: After trying the code from sacredFaith's link at http://www.codeproject.com/Articles/7056/Code-to-extract-plain-text-from-a-PDF-file, I'm getting weird output like this for the example file I mentioned earlier:

http://www.jumbala.net/backup/league.pdf.txt

The parts I actually need come out as the weird characters at the beginning. Using Save As... Text (accessible) in Adobe Acrobat Reader X, I get the following result:

http://www.jumbala.net/backup/league_good.pdf.txt

Which is approximately what I get in Java using PDFBox and what I want to get as output in C++.

Jumbala
  • Maybe this can help http://stackoverflow.com/questions/3784554/creating-a-pdf-reader-in-c – grifos Mar 30 '12 at 23:08
  • @grifos I looked at it and I might look at it a little more in detail later, but I'd rather have an already made library since I'd prefer not having to read through the whole PDF specifications document. Great link you posted, though, it might come in handy later, thanks! – Jumbala Mar 31 '12 at 14:27
  • In the link they also talk about a C++ library, PoDoFo, that allows you to parse PDFs and extract info. – grifos Mar 31 '12 at 15:14
  • @grifos I hadn't noticed, thanks! – Jumbala Mar 31 '12 at 15:14

3 Answers

11

Xpdf is a C++ application/library which includes tools to extract plain text from a PDF file.

Charles Salvia
  • I just downloaded the precompiled version of Xpdf and the .exe from the command line works great, I get the output I want (and even better than using PDFBox if I use the -layout option). I have a question, though... Is there a place where I can see how to call the methods in code instead of using the .exe? I'll look on my own, but since you seem to be familiar with the library it would be even better if you could tell me where to start looking. Thanks a lot! – Jumbala Mar 31 '12 at 14:37
  • XPDF team provides commercial versions of their libraries along with optional support at http://www.glyphandcog.com/XpdfText.html – Eugene Feb 24 '15 at 11:47
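As a stopgap for the question in the comment above, the same pdftotext executable can be launched from a C++ program instead of calling the library directly. This is only a sketch: it assumes `pdftotext` is on the PATH, that the paths contain no double quotes, and the helper names (`quoted`, `buildPdfToTextCommand`, `extractText`) are hypothetical. It does not use Xpdf's internal classes.

```cpp
#include <cstdlib>
#include <string>

// Wrap a path in double quotes so paths containing spaces survive the shell.
// Assumption: the path itself contains no double quotes.
std::string quoted(const std::string& path) {
    return "\"" + path + "\"";
}

// Build the command line for Xpdf's pdftotext tool with the -layout option,
// following its usage: pdftotext [options] PDF-file text-file
std::string buildPdfToTextCommand(const std::string& pdfPath,
                                  const std::string& txtPath) {
    return "pdftotext -layout " + quoted(pdfPath) + " " + quoted(txtPath);
}

// Run the tool through the system shell.
// Returns true when the command reports a zero exit status.
bool extractText(const std::string& pdfPath, const std::string& txtPath) {
    const std::string command = buildPdfToTextCommand(pdfPath, txtPath);
    return std::system(command.c_str()) == 0;
}
```

After a call such as `extractText("league.pdf", "league.txt")`, the text file can be read line by line with `std::ifstream` and `std::getline`. Spawning a process per file is slower than an in-process call, but it keeps the program independent of the library's internals.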
3

Since that's what you're looking for: PoDoFo is a C++ library to parse, read, modify, or create PDF files. The library is cross-platform.

grifos
2

I've never used the following, but after some Googling I found this:

http://www.codeproject.com/Articles/7056/Code-to-extract-plain-text-from-a-PDF-file

Uli Köhler
sacredfaith
  • I'll take a look at it, thanks! I'll mark your answer as accepted if I can get it working the way I want! – Jumbala Mar 30 '12 at 23:21
  • Unfortunately, I just tried it and it doesn't work the way I want it to (some parts of the text extract fine, but most of the document is made of weird symbols) – Jumbala Mar 31 '12 at 13:39
  • Sorry about that man! Looks like you found what you were looking for thanks to Charles! – sacredfaith Apr 02 '12 at 19:12