Get line coordinates of pdf using PDFBox java

Question

I want the coordinates of each line on a page of a PDF using PDFBox. I can get character-level information, but I am unable to get line coordinates.

Following is my code:

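import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.commons.io.output.NullWriter; // assumed: NullWriter from Apache Commons IO
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

public class PDFFontExtractor extends PDFTextStripper {

    public PDFFontExtractor() throws IOException {
        super();
    }

    @Override
    protected void writeString(String str, List<TextPosition> textPositions) throws IOException {
        // Print the text chunk, then the font name and size of each character in it.
        System.out.println(str);
        for (TextPosition text : textPositions) {
            System.out.println(text.getFont().getName());
            System.out.println(text.getFontSizeInPt());
        }
    }

    public static void main(String[] args) {
        File file = new File("/home/neha/Downloads/docs/General.pdf");

        // try-with-resources closes the document when extraction is done
        try (PDDocument document = PDDocument.load(file)) {
            PDFFontExtractor textStripper = new PDFFontExtractor();
            textStripper.setSortByPosition(true);
            // The extracted text itself is discarded; writeString does the printing.
            textStripper.writeText(document, NullWriter.NULL_WRITER);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}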

[This answer](https://stackoverflow.com/a/50514745/1729265) shows how to get the coordinates of words. If you don't split at word coordinates like in that answer but instead apply `printWord` to the whole `writeString` parameter `textPositions`, you should get coordinates of the text line. Beware, the coordinates are normalized in the PDFBox specific way... — mkl, Jun 16 '18 at 19:52
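The merging step this comment describes can be sketched without PDFBox at all. The class below is a hypothetical helper (the names `LineBoxSketch`, `Glyph` and `lineBox` are made up for illustration) that unions per-character boxes into one (top_left_x, top_left_y, bottom_right_x, bottom_right_y) rectangle. It assumes the PDFBox-style normalized coordinates, i.e. origin at the top left, y growing downward, and y referring to the baseline; in a real stripper the four values would presumably come from `TextPosition.getXDirAdj()`, `getYDirAdj()`, `getWidthDirAdj()` and `getHeightDir()`.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFBox: merges character boxes into one line box.
class LineBoxSketch {

    /** One character: left edge, baseline y (top-left origin, y grows downward), width, height. */
    static final class Glyph {
        final float x;
        final float baselineY;
        final float width;
        final float height;

        Glyph(float x, float baselineY, float width, float height) {
            this.x = x;
            this.baselineY = baselineY;
            this.width = width;
            this.height = height;
        }
    }

    /** Returns {topLeftX, topLeftY, bottomRightX, bottomRightY} enclosing all glyphs. */
    static float[] lineBox(List<Glyph> glyphs) {
        if (glyphs.isEmpty()) {
            throw new IllegalArgumentException("a line needs at least one glyph");
        }
        float minX = Float.MAX_VALUE;
        float minY = Float.MAX_VALUE;
        float maxX = -Float.MAX_VALUE;
        float maxY = -Float.MAX_VALUE;
        for (Glyph g : glyphs) {
            minX = Math.min(minX, g.x);
            // The top of a glyph lies above its baseline, so subtract the height.
            minY = Math.min(minY, g.baselineY - g.height);
            maxX = Math.max(maxX, g.x + g.width);
            maxY = Math.max(maxY, g.baselineY);
        }
        return new float[] {minX, minY, maxX, maxY};
    }

    public static void main(String[] args) {
        // Three made-up glyphs sharing the baseline y = 100.
        List<Glyph> line = Arrays.asList(
                new Glyph(72f, 100f, 6f, 8f),
                new Glyph(78f, 100f, 5f, 10f),
                new Glyph(83f, 100f, 7f, 8f));
        System.out.println(Arrays.toString(lineBox(line)));
    }
}
```

Descenders (letters such as "g" or "p") extend below the baseline, so this sketch slightly underestimates the bottom edge; whether that matters depends on what the rectangles are used for.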

score 0 · Answer 1 · answered Jun 16 '18 at 18:18

If you only need the text together with its page and line numbers (not geometric coordinates), you can get it this way:

import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PDFFontExtractor extends PDFTextStripper {

    public PDFFontExtractor() throws IOException {
        super();
    }

    public static void main(String[] args) {

        try (PDDocument document = PDDocument.load(new File("/home/neha/Downloads/docs/General.pdf"))) {
            PDFFontExtractor textStripper = new PDFFontExtractor();
            textStripper.setSortByPosition(true);
            // Extract one page at a time so the page number is known for every line.
            for (int page = 1; page <= document.getNumberOfPages(); page++) {
                textStripper.setStartPage(page);
                textStripper.setEndPage(page);
                String pdfFileText = textStripper.getText(document);
                // Split by line; "\\r?\\n" also handles Windows line endings.
                String[] lines = pdfFileText.split("\\r?\\n");
                // Note: the line index printed here is zero-based.
                for (int line = 0; line < lines.length; line++) {
                    System.out.println(String.format("Page: %s, Line: %s, Text: %s", page, line, lines[line]));
                }
            }

        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

I want the exact coordinates of each line (top_left_x,top_left_y,bottom_right_x,bottom_right_y) — Neha Pandey, Jun 16 '18 at 18:59
And what are you trying to do with that? It's easier to figure out a solution that way. — Amith Kumar, Jun 16 '18 at 19:39

score 0 · Answer 2 · answered Aug 05 '20 at 11:03

I'm not sure that's doable. I looked at the org.apache.pdfbox.text.PDFTextStripper implementation and found that org.apache.pdfbox.text.PDFTextStripper#writeLine is private:

 /**
 * Write a list of string containing a whole line of a document.
 * 
 * @param line a list with the words of the given line
 * @throws IOException if something went wrong
 */
private void writeLine(List<WordWithTextPositions> line)
        throws IOException
{
    int numberOfStrings = line.size();
    for (int i = 0; i < numberOfStrings; i++)
    {
        WordWithTextPositions word = line.get(i);
        writeString(word.getText(), word.getTextPositions());
        if (i < numberOfStrings - 1)
        {
            writeWordSeparator();
        }
    }
}

The example at https://svn.apache.org/viewvc/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/util/DrawPrintTextLocations.java?view=markup&sortby=date shows how to fetch the coordinates for a word. If you run the code, you will see that it draws a rectangle around every character. It would be a great addition if someone filed a ticket asking Apache to let us override that particular method.
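A possible workaround that does not need access to `writeLine` is to accumulate boxes across successive `writeString` calls and close the line whenever the stripper emits a line separator. The sketch below shows only that bookkeeping, using the JDK's `Rectangle2D` and no PDFBox types. `LineAccumulatorSketch`, `onWord` and `onLineEnd` are hypothetical names; the assumption (not verified here for every PDFBox version or layout) is that `onWord` would be called from an overridden `writeString`, and `onLineEnd` from an overridden protected `writeLineSeparator()` plus once more at the end of each page, since the last line of a page may not be followed by a separator.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the state a PDFTextStripper subclass could keep.
class LineAccumulatorSketch {
    private final List<Rectangle2D> finishedLines = new ArrayList<>();
    private Rectangle2D current; // null until the current line has received a box

    /** Assumed to be called once per writeString(...) with that chunk's character boxes. */
    void onWord(List<? extends Rectangle2D> glyphBoxes) {
        for (Rectangle2D box : glyphBoxes) {
            if (current == null) {
                current = (Rectangle2D) box.clone(); // copy so the caller's box is not modified
            } else {
                current.add(box); // Rectangle2D.add grows the rectangle to the union
            }
        }
    }

    /** Assumed to be called on each line separator and at the end of each page. */
    void onLineEnd() {
        if (current != null) {
            finishedLines.add(current);
            current = null;
        }
    }

    List<Rectangle2D> lines() {
        return finishedLines;
    }

    public static void main(String[] args) {
        LineAccumulatorSketch acc = new LineAccumulatorSketch();
        // Made-up boxes: two words on the first line, one word on the second.
        acc.onWord(Arrays.asList(
                new Rectangle2D.Float(10, 20, 5, 8),
                new Rectangle2D.Float(15, 18, 5, 10)));
        acc.onWord(Arrays.asList(new Rectangle2D.Float(25, 20, 6, 8)));
        acc.onLineEnd();
        acc.onWord(Arrays.asList(new Rectangle2D.Float(10, 40, 4, 8)));
        acc.onLineEnd();
        for (Rectangle2D r : acc.lines()) {
            System.out.println(r.getMinX() + ", " + r.getMinY() + ", " + r.getMaxX() + ", " + r.getMaxY());
        }
    }
}
```

Calling `onLineEnd()` when nothing has been collected is a no-op, so flushing both on a separator and at page end does not produce empty or duplicate lines.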
