7

What is the easiest way to get the text (words) of a PDF file as one long String or an array of Strings?

I have tried PDFBox, but it is not working for me.

Qantas 94 Heavy
Ankur
  • 1
    What about pdfbox didn't work? Are you looking for alternatives or a fix for your existing problem? – Catchwa Nov 05 '09 at 05:11
  • Well, I didn't like the way the API was designed either. I have had a quick look at iText and I think that is a better option; the API seems more logical for my needs. – Ankur Nov 05 '09 at 06:24

4 Answers

4

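Use iText. The following snippet, for example, extracts the text of page 3.

// Open the PDF and wrap the reader in a text extractor.
PdfTextExtractor parser = new PdfTextExtractor(new PdfReader("C:/Text.pdf"));
// Keep the returned text; page numbers start at 1.
String pageText = parser.getTextFromPage(3);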

Kushal Paudyal
3

PDFBox barfs on many newer PDFs, especially those with embedded PNG images.

I was very impressed with PDFTextStream.

Sam Barnum
1

JPedal and Multivalent also offer text extraction in Java, or you could call xpdf using Runtime.exec.
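
A minimal sketch of the xpdf route, assuming the pdftotext command-line tool that ships with xpdf is installed and on the PATH. It uses ProcessBuilder rather than calling Runtime.exec directly, because that makes it easier to pass the arguments as a list and to forward the tool's error output. The class and method names (XpdfTextExtractor, buildCommand, extractText) are made up for illustration and are not part of any library.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; the names here are illustrative only.
final class XpdfTextExtractor {

    // Builds the command line: pdftotext -enc UTF-8 <pdf> -
    // The trailing "-" asks pdftotext to write the text to standard output.
    static List<String> buildCommand(String pdfPath) {
        return Arrays.asList("pdftotext", "-enc", "UTF-8", pdfPath, "-");
    }

    // Runs pdftotext and returns everything it printed as one String.
    static String extractText(String pdfPath) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(buildCommand(pdfPath));
        // Forward the tool's error messages to our own stderr.
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = builder.start();
        String text;
        try (InputStream out = process.getInputStream()) {
            text = new String(out.readAllBytes(), StandardCharsets.UTF_8);
        }
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("pdftotext exited with code " + exitCode);
        }
        return text;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java XpdfTextExtractor <file.pdf>");
            return;
        }
        System.out.println(extractText(args[0]));
    }
}
```

This needs Java 9 or later for InputStream.readAllBytes; on older JVMs the output stream would have to be read in a loop instead.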

RAS
mark stephens
0

Well, I have used Tika to extract raw text from PDFs (it is based on PDFBox), but I think Tika is mainly useful when you have to extract text from several different file formats (its auto-detection helps a lot).

If you only want to parse PDFs into text, I would suggest PDFTextStream, because I found it to be a much better parser than other APIs (such as iText and PDFBox).

With PDFTextStream you can easily get structured text (pages -> blocks -> lines -> text units), and it also lets you extract related information such as the character encoding, the height, and the location of a character on the page.

Example:

import java.io.IOException;
import com.snowtide.pdf.OutputTarget;
import com.snowtide.pdf.PDFTextStream;

public class ExtractTextAllPages {
    public static void main(String[] args) throws IOException {
        String pdfFilePath = args[0];
        PDFTextStream pdfts = new PDFTextStream(pdfFilePath);
        // Collect the text of every page into a single buffer.
        StringBuilder text = new StringBuilder(1024);
        pdfts.pipe(new OutputTarget(text));
        pdfts.close();
        System.out.printf("The text extracted from %s is:%n", pdfFilePath);
        System.out.println(text);
    }
}
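
Since the question also asks for an array of Strings, here is a small sketch of how the extracted text could be split into words afterwards. It uses only the standard library and works on the output of any of the extractors mentioned here; WordSplitter and toWords are hypothetical names, and splitting on whitespace is an assumption about what counts as a word.

```java
import java.util.Arrays;

// Hypothetical helper, not part of PDFTextStream or any other PDF library.
final class WordSplitter {

    // Splits extracted text on runs of whitespace; blank input gives an empty array.
    static String[] toWords(CharSequence extracted) {
        String trimmed = extracted.toString().trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        // Stand-in for the StringBuilder filled by a PDF text extractor.
        StringBuilder text = new StringBuilder("Hello  PDF\nworld ");
        System.out.println(Arrays.toString(toWords(text))); // prints [Hello, PDF, world]
    }
}
```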
yeaaaahhhh..hamf hamf