From PDF to String

Question

What is the easiest way to get the text (words) of a PDF file as one long String or an array of Strings?

I have tried pdfbox but that is not working for me.

What about pdfbox didn't work? Are you looking for alternatives or a fix for your existing problem? — Catchwa, Nov 05 '09 at 05:11
Well, I didn't like the way the API was designed either. I have had a quick look at iText and I think that is a better option; its API seems more logical for my needs. — Ankur, Nov 05 '09 at 06:24

score 4 · Answer 1 · answered Nov 05 '09 at 16:29

Use iText. The following snippet, for example, will extract the text of page 3.

PdfTextExtractor parser = new PdfTextExtractor(new PdfReader("C:/Text.pdf"));
String text = parser.getTextFromPage(3);
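
Since the question asks for one long String or an array of Strings, here is a minimal sketch of that last step. It assumes the per-page text has already been extracted (for example, one getTextFromPage call per page) and uses only the standard library. The class PageTextJoiner and its methods are hypothetical helpers, not part of iText or any other PDF library.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of any PDF library.
class PageTextJoiner {

    // Join per-page text into one long String, one page per line break.
    static String joinPages(List<String> pageTexts) {
        return String.join("\n", pageTexts);
    }

    // Split text into words on runs of whitespace; blank input yields an empty array.
    static String[] toWords(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new String[0];
        }
        return trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        // Stand-in page texts; in practice each entry would come from the extractor.
        List<String> pages = new ArrayList<>();
        pages.add("First page text");
        pages.add("Second page,  with extra   spaces");

        String all = joinPages(pages);
        String[] words = toWords(all);
        System.out.println(all);
        System.out.println(words.length + " words");
    }
}
```

Splitting on whitespace keeps punctuation attached to words ("page," stays as is); stripping it would need a further step.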

answered Nov 05 '09 at 16:29

Kushal Paudyal

Part of [OpenPDF](https://github.com/LibrePDF/OpenPDF). – Sander Verhagen Aug 02 '22 at 08:04

score 3 · Answer 2 · answered Nov 05 '09 at 15:53

PDFBox barfs on many newer PDFs, especially those with embedded PNG images.

I was very impressed with PDFTextStream.

answered Nov 05 '09 at 15:53

Sam Barnum

score 1 · Answer 3 · edited Dec 16 '13 at 08:29

JPedal and Multivalent also offer text extraction in Java, or you could access xpdf using Runtime.exec.
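
As a rough sketch of the xpdf route: the code below assumes the pdftotext command-line tool that ships with xpdf is installed and on the PATH, and that it accepts "-enc UTF-8" and "-" (write to stdout) as its documentation describes. It uses ProcessBuilder rather than Runtime.exec because arguments are passed as a list, which avoids quoting problems with paths. The class name PdfToTextRunner and its methods are made up for illustration, and readAllBytes needs Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper around the external pdftotext tool (assumed to be on the PATH).
class PdfToTextRunner {

    // "-enc UTF-8" requests UTF-8 output; the trailing "-" sends the text to stdout.
    static List<String> buildCommand(String pdfPath) {
        return Arrays.asList("pdftotext", "-enc", "UTF-8", pdfPath, "-");
    }

    // Run pdftotext and return everything it wrote to stdout as one String.
    static String extractText(String pdfPath) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(buildCommand(pdfPath));
        // Let error messages go to our own stderr so they do not mix with the text.
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = builder.start();

        String text;
        try (InputStream stdout = process.getInputStream()) {
            text = new String(stdout.readAllBytes(), StandardCharsets.UTF_8);
        }
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("pdftotext exited with code " + exitCode);
        }
        return text;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java PdfToTextRunner <file.pdf>");
            return;
        }
        System.out.println(extractText(args[0]));
    }
}
```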

edited Dec 16 '13 at 08:29

RAS

answered Nov 05 '09 at 07:44

mark stephens

score 0 · Answer 4 · answered Feb 24 '14 at 12:12

Well, I have used Tika to extract raw text from PDFs (it is based on PDFBox), but I think Tika is only useful when you have to extract text from different file formats (its auto-detection helps a lot).

If you only want to parse PDFs into text, I would suggest PDFTextStream because it's a much better parser than other APIs (such as iText and PDFBox).

With PDFTextStream you can easily get structured text (pages -> blocks -> lines -> textUnits), and it lets you extract related information such as the character encoding, height, and location of a character on the page.

Example:

import java.io.IOException;

import com.snowtide.pdf.OutputTarget;
import com.snowtide.pdf.PDFTextStream;

public class ExtractTextAllPages {
    public static void main(String[] args) throws IOException {
        String pdfFilePath = args[0];
        PDFTextStream pdfts = new PDFTextStream(pdfFilePath);
        StringBuilder text = new StringBuilder(1024);
        // Pipe the text of every page into the StringBuilder
        pdfts.pipe(new OutputTarget(text));
        pdfts.close();
        System.out.printf("The text extracted from %s is:%n", pdfFilePath);
        System.out.println(text);
    }
}
