I want to get the coordinates of each text line on a page of a PDF using PDFBox. I can get character-level information, but I am unable to get line coordinates.

Here is my code:

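import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.commons.io.output.NullWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

public class PDFFontExtractor extends PDFTextStripper {

    public PDFFontExtractor() throws IOException {
        super();
    }

    // Called by PDFTextStripper for each chunk of text, together with the
    // TextPosition objects of its characters.
    @Override
    protected void writeString(String str, List<TextPosition> textPositions) throws IOException {
        System.out.println(str);
        for (TextPosition text : textPositions) {
            System.out.println(text.getFont().getName());
            System.out.println(text.getFontSizeInPt());
        }
    }

    public static void main(String[] args) {
        File file = new File("/home/neha/Downloads/docs/General.pdf");

        // try-with-resources closes the document when processing is done
        try (PDDocument document = PDDocument.load(file)) {
            PDFFontExtractor textStripper = new PDFFontExtractor();
            textStripper.setSortByPosition(true);
            // The extracted text itself is discarded; writeString does the printing
            textStripper.writeText(document, NullWriter.NULL_WRITER);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}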
alexrnov
[This answer](https://stackoverflow.com/a/50514745/1729265) shows how to get the coordinates of words. If you don't split at word coordinates like in that answer but instead apply `printWord` to the whole `writeString` parameter `textPositions`, you should get coordinates of the text line. Beware, the coordinates are normalized in the PDFBox specific way... – mkl Jun 16 '18 at 19:52
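
A minimal sketch of the idea in the comment above, kept free of PDFBox so that it runs on its own: take one rectangle per glyph and union them into a single box for the line. The class name `LineBoundsSketch`, the method `unionOf`, and the sample numbers are illustrative assumptions, not part of PDFBox. In a real `writeString` override, each rectangle would presumably be built from `text.getXDirAdj()`, `text.getYDirAdj()`, `text.getWidthDirAdj()` and `text.getHeightDir()`, which return PDFBox's normalized coordinates rather than raw PDF user-space values.

```java
import java.awt.geom.Rectangle2D;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: merges per-glyph boxes into one bounding box.
class LineBoundsSketch {

    // Returns the smallest rectangle enclosing all glyph boxes,
    // or null if the list is empty. The input boxes are not modified.
    static Rectangle2D unionOf(List<? extends Rectangle2D> glyphBoxes) {
        Rectangle2D bounds = null;
        for (Rectangle2D box : glyphBoxes) {
            if (bounds == null) {
                bounds = (Rectangle2D) box.clone();
            } else {
                bounds.add(box); // grows bounds to include box
            }
        }
        return bounds;
    }

    public static void main(String[] args) {
        // Made-up glyph boxes (x, y, width, height) standing in for
        // values read from TextPosition objects.
        List<Rectangle2D> glyphs = Arrays.asList(
                new Rectangle2D.Float(72f, 100f, 6f, 9f),
                new Rectangle2D.Float(78f, 100f, 5f, 9f),
                new Rectangle2D.Float(83f, 98f, 7f, 11f));

        Rectangle2D line = unionOf(glyphs);
        System.out.println("x=" + line.getX() + ", y=" + line.getY()
                + ", width=" + line.getWidth() + ", height=" + line.getHeight());
    }
}
```

Whether a single `writeString` call covers a whole line or only part of it depends on how PDFBox segments the text, so the boxes from several calls may need to be merged in the same way before a line is considered complete.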

2 Answers

If you only need the text together with its page and line number (rather than geometric coordinates), you can get it this way:

import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PDFFontExtractor extends PDFTextStripper {

    public PDFFontExtractor() throws IOException {
        super();
    }

    public static void main(String[] args) {

        try (PDDocument document = PDDocument.load(new File("/home/neha/Downloads/docs/General.pdf"))) {
            PDFFontExtractor textStripper = new PDFFontExtractor();
            textStripper.setSortByPosition(true);
            // Extract one page at a time so the page number is known
            for (int page = 1; page <= document.getNumberOfPages(); page++) {
                textStripper.setStartPage(page);
                textStripper.setEndPage(page);
                String pdfFileText = textStripper.getText(document);
                // Split by line; \r?\n also handles Windows line separators
                String[] lines = pdfFileText.split("\\r?\\n");
                for (int line = 0; line < lines.length; line++) {
                    System.out.printf("Page: %s, Line: %s, Text: %s%n", page, line, lines[line]);
                }
            }

        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Amith Kumar

I'm not sure that's doable. I looked at the implementation of org.apache.pdfbox.text.PDFTextStripper and found that org.apache.pdfbox.text.PDFTextStripper#writeLine is private:

 /**
 * Write a list of string containing a whole line of a document.
 * 
 * @param line a list with the words of the given line
 * @throws IOException if something went wrong
 */
private void writeLine(List<WordWithTextPositions> line)
        throws IOException
{
    int numberOfStrings = line.size();
    for (int i = 0; i < numberOfStrings; i++)
    {
        WordWithTextPositions word = line.get(i);
        writeString(word.getText(), word.getTextPositions());
        if (i < numberOfStrings - 1)
        {
            writeWordSeparator();
        }
    }
}

The example in https://svn.apache.org/viewvc/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/util/DrawPrintTextLocations.java?view=markup&sortby=date shows how to fetch the coordinates for a word. If you run the code, you will see that it draws a rectangle around every character. It would be a great addition if someone filed a ticket asking Apache to let us override that particular method.

Ahmad AlMughrabi