Extract all text with string positions from a PDF

Question

This may seem an old question, but I didn't find an exhaustive answer after spending half an hour searching all over SO.

I am using PDFBox and I would like to extract all of the text from a PDF file along with the coordinates of each string. I am using their PrintTextLocations example (http://pdfbox.apache.org/apidocs/org/apache/pdfbox/examples/util/PrintTextLocations.html), but with the kind of PDF I am working with (e-tickets) the program fails to recognize strings and prints each character separately. The output is a list of lines (each representing a TextPosition object) like this:

String[414.93896,637.2442 fs=1.0 xscale=8.0 height=4.94 space=2.2240002 width=4.0] s
String[418.93896,637.2442 fs=1.0 xscale=8.0 height=4.94 space=2.2240002 width=4.447998] a
String[423.38696,637.2442 fs=1.0 xscale=8.0 height=4.94 space=2.2240002 width=1.776001] l
String[425.16296,637.2442 fs=1.0 xscale=8.0 height=4.94 space=2.2240002 width=4.447998] e

Instead, I would like the program to recognize the string "sale" as a single TextPosition and give me its position. I also tried adjusting the PDFTextStripper methods setSpacingTolerance() and setAverageCharTolerance(), setting values above and below the defaults (which, FYI, are 0.5 and 0.3 respectively), but the output didn't change at all. Where am I going wrong? Thanks in advance.

Ah, the joys of PDF. Depending on what created it, it could well be that »text« is just a collection of glyphs at certain positions, so you'd have to do guesswork based on the positions to figure out where words and spaces are. – Joey Apr 02 '12 at 12:09

score 4 · Accepted Answer · answered Jun 04 '12 at 10:02

As Joey mentioned, PDF is just a collection of instructions telling you where a certain character should be printed.

In order to extract words or lines, you will have to perform some data segmentation: studying the bounding boxes of the characters should let you recognize which characters are on the same line, and then which of those form words.
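The segmentation described above could be sketched roughly as follows. This is a minimal illustration, not PDFBox code: Glyph, Word, GlyphGrouper, lineTolerance and gapTolerance are hypothetical names, and Glyph only stands in for the x, y and width values that a TextPosition reports. It assumes horizontal, left-to-right text. The "sale" glyphs are taken from the question's output; the "ok" glyphs are made up to show a word break.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class GlyphGrouper {
    /** Hypothetical stand-in for the TextPosition values used here. */
    static final class Glyph {
        final String text;
        final float x, y, width;
        Glyph(String text, float x, float y, float width) {
            this.text = text; this.x = x; this.y = y; this.width = width;
        }
    }

    /** A run of adjacent glyphs, positioned at its first glyph. */
    static final class Word {
        final String text;
        final float x, y, width;
        Word(String text, float x, float y, float width) {
            this.text = text; this.x = x; this.y = y; this.width = width;
        }
        @Override public String toString() {
            return "Word[" + x + "," + y + " width=" + width + "] " + text;
        }
    }

    /** Clusters glyphs into lines by y, then splits each line into words by horizontal gaps. */
    static List<Word> group(List<Glyph> glyphs, float lineTolerance, float gapTolerance) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.y));
        List<Word> words = new ArrayList<>();
        List<Glyph> line = new ArrayList<>();
        for (Glyph g : sorted) {
            // A jump in y larger than the tolerance starts a new line.
            if (!line.isEmpty() && g.y - line.get(0).y > lineTolerance) {
                splitIntoWords(line, gapTolerance, words);
                line.clear();
            }
            line.add(g);
        }
        splitIntoWords(line, gapTolerance, words);
        return words;
    }

    private static void splitIntoWords(List<Glyph> line, float gapTolerance, List<Word> out) {
        line.sort(Comparator.comparingDouble((Glyph g) -> g.x));
        StringBuilder text = new StringBuilder();
        Glyph first = null;
        float endX = 0f;
        for (Glyph g : line) {
            // A horizontal gap wider than the tolerance closes the current word.
            if (first != null && g.x - endX > gapTolerance) {
                out.add(new Word(text.toString(), first.x, first.y, endX - first.x));
                text.setLength(0);
                first = null;
            }
            if (first == null) first = g;
            text.append(g.text);
            endX = g.x + g.width;
        }
        if (first != null) out.add(new Word(text.toString(), first.x, first.y, endX - first.x));
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("s", 414.93896f, 637.2442f, 4.0f));
        glyphs.add(new Glyph("a", 418.93896f, 637.2442f, 4.447998f));
        glyphs.add(new Glyph("l", 423.38696f, 637.2442f, 1.776001f));
        glyphs.add(new Glyph("e", 425.16296f, 637.2442f, 4.447998f));
        glyphs.add(new Glyph("o", 440.0f, 637.2442f, 4.447998f)); // made-up glyph
        glyphs.add(new Glyph("k", 444.448f, 637.2442f, 4.0f));    // made-up glyph
        // Gap tolerance of half the reported space width (2.224) is an arbitrary choice.
        for (Word w : group(glyphs, 1.0f, 1.112f)) {
            System.out.println(w);
        }
    }
}
```

Both tolerances would need tuning per document: a gap tolerance tied to the reported space width is only one plausible heuristic, and rotated or right-to-left text would need a different ordering.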

Nicolas W.

Thanks for your response. This is what I ended up doing: creating a set of rectangles for each PDF "template" and applying them to extract portions of text based on position. This will require a lot of manual work to maintain, but it seems to be the only reliable approach. – Andrea Sprega Jun 04 '12 at 13:31
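A rough sketch of the rectangle-per-template idea from the comment above, written without PDFBox so it stands alone. TemplateRegions, Glyph, the region name "fare" and the rectangle coordinates are all hypothetical; the glyph positions come from the question's output. It assumes each region covers a single line of text, so glyphs inside a region are simply ordered by x. PDFBox also ships a PDFTextStripperByArea class aimed at region-based extraction, which may be a better fit in practice.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TemplateRegions {
    /** Hypothetical stand-in for a character and its position. */
    static final class Glyph {
        final String text;
        final float x, y;
        Glyph(String text, float x, float y) {
            this.text = text; this.x = x; this.y = y;
        }
    }

    /** Returns, for each named rectangle, the glyphs whose origin falls inside it, joined left to right. */
    static Map<String, String> extract(Map<String, Rectangle2D> template, List<Glyph> glyphs) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, Rectangle2D> region : template.entrySet()) {
            List<Glyph> inside = new ArrayList<>();
            for (Glyph g : glyphs) {
                if (region.getValue().contains(g.x, g.y)) {
                    inside.add(g);
                }
            }
            // Single-line assumption: order by x only.
            inside.sort((a, b) -> Float.compare(a.x, b.x));
            StringBuilder sb = new StringBuilder();
            for (Glyph g : inside) {
                sb.append(g.text);
            }
            result.put(region.getKey(), sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // One hypothetical template with a single region; a real one would be measured per ticket layout.
        Map<String, Rectangle2D> template = new LinkedHashMap<>();
        template.put("fare", new Rectangle2D.Float(410f, 630f, 30f, 10f));

        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("s", 414.93896f, 637.2442f));
        glyphs.add(new Glyph("a", 418.93896f, 637.2442f));
        glyphs.add(new Glyph("l", 423.38696f, 637.2442f));
        glyphs.add(new Glyph("e", 425.16296f, 637.2442f));

        System.out.println(extract(template, glyphs));
    }
}
```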

Yaniv Levy · Answer 2 · 2020-02-27T12:15:41.640

Here is a solution:

1. Read the file.
2. Process each page with PDFParserTextStripper.
3. Print the position of each character of the text.

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

// Written against the PDFBox 2.x API (org.apache.pdfbox.text package, PDDocument.load).
class PDFParserTextStripper extends PDFTextStripper {
    public PDFParserTextStripper(PDDocument pdd) throws IOException {
        super();
        document = pdd;
    }

    // Processes a single page; pageNr is zero-based, while PDFBox page numbers are one-based.
    public void stripPage(int pageNr) throws IOException {
        setStartPage(pageNr + 1);
        setEndPage(pageNr + 1);
        // The extracted text is discarded; only the writeString callbacks matter here.
        Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
        writeText(document, dummy); // Starts the parsing process and calls writeString repeatedly.
    }

    @Override
    protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
        // Print one line per character with its position and metrics.
        for (TextPosition text : textPositions) {
            System.out.println("String[" + text.getXDirAdj() + "," + text.getYDirAdj()
                    + " fs=" + text.getFontSizeInPt() + " xscale=" + text.getXScale()
                    + " height=" + text.getHeightDir() + " space=" + text.getWidthOfSpace()
                    + " width=" + text.getWidthDirAdj() + "] " + text.getUnicode());
        }
    }

    public static void extractText(InputStream inputStream) throws IOException {
        // try-with-resources closes the document even if parsing fails.
        try (PDDocument pdd = PDDocument.load(inputStream)) {
            PDFParserTextStripper stripper = new PDFParserTextStripper(pdd);
            stripper.setSortByPosition(true);
            for (int i = 0; i < pdd.getNumberOfPages(); i++) {
                stripper.stripPage(i);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        File f = new File("C://PDFLOCATION//target.pdf");
        try (FileInputStream fis = new FileInputStream(f)) {
            extractText(fis);
        }
    }
}

Please add an explanation of how you came to this result. – Julian Feb 26 '20 at 11:27
That code looks similar to the PrintTextLocations example. – Tilman Hausherr Feb 26 '20 at 13:03
Exactly the same concept. – Yaniv Levy Apr 30 '20 at 16:12
