Could someone give me an example of how to extract coordinates for a 'word' using PDFBox

Question

Could someone give me an example of how to extract coordinates for a 'word' with PDFBox

I am using this link to extract positions of individual characters: https://www.tutorialkart.com/pdfbox/how-to-extract-coordinates-or-position-of-characters-in-pdf/

I am using this link to extract words: https://www.tutorialkart.com/pdfbox/extract-words-from-pdf-document/

I am stuck getting coordinates for whole words.

*"I am stuck getting coordinates for whole words."* - What have you tried? — mkl, May 14 '18 at 21:41
Thanks for your insightful help @mkl... I have extracted words by themselves and characters individually with coordinates.. I am simply, as I stated, asking for an example of how to extract 'words with their coordinates'. — GoodJuJu, May 15 '18 at 21:56
Which kinds of word coordinates do you want? Those of the combined bounding box of the individual character bounding boxes the word consists of? Or something else? And in which coordinate system? The same coordinate system returned by your tutorial? — mkl, May 17 '18 at 11:50
Thanks @mkl, I am after the 'bounding box' for the word. I don't really mind which coordinate system as long as it is uniform. The coordinate system in the tutorial would work for me. — GoodJuJu, May 17 '18 at 21:47
@mkl Thank you for taking so much time to give a great working example. I can't test it just yet as I am setting up my Eclipse environment (I was originally using Visual Studio and converting to .NET). I am new to Eclipse and Java programming, but will respond as soon as I can. — GoodJuJu, Jun 01 '18 at 09:27

score 6 · Accepted Answer · answered May 24 '18 at 17:09

You can extract the coordinates of words by collecting all the TextPosition objects building a word and combining their bounding boxes.
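The box-combining step can be tried on its own, without PDFBox. The following is only a minimal sketch: the class name BoundingBoxUnion, the helper union, and the sample rectangles are made up for illustration and simply stand in for the per-character boxes you would build from TextPosition objects.

```java
import java.awt.geom.Rectangle2D;
import java.util.Arrays;
import java.util.List;

public class BoundingBoxUnion {

    /** Returns the smallest rectangle enclosing all given boxes, or null for an empty list. */
    public static Rectangle2D union(List<Rectangle2D> boxes) {
        Rectangle2D result = null;
        for (Rectangle2D box : boxes) {
            if (result == null) {
                // Copy the first box so the caller's rectangle is not modified
                result = (Rectangle2D) box.clone();
            } else {
                // Rectangle2D.add grows the rectangle to also enclose the other one
                result.add(box);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical character boxes: (x, y, width, height)
        List<Rectangle2D> chars = Arrays.asList(
                new Rectangle2D.Float(10f, 20f, 5f, 8f),
                new Rectangle2D.Float(15f, 20f, 6f, 8f),
                new Rectangle2D.Float(21f, 18f, 4f, 10f));
        Rectangle2D word = union(chars);
        // prints x=10.0 y=18.0 width=15.0 height=10.0
        System.out.println("x=" + word.getX() + " y=" + word.getY()
                + " width=" + word.getWidth() + " height=" + word.getHeight());
    }
}
```

The third box starts higher and ends at the same bottom edge, so the combined box takes its y and height from it while x comes from the first box.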

Implementing this along the lines of the two tutorials you referenced, you can extend PDFTextStripper like this:

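import java.awt.geom.Rectangle2D;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

public class GetWordLocationAndSize extends PDFTextStripper {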
    public GetWordLocationAndSize() throws IOException {
    }

    @Override
    protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
        String wordSeparator = getWordSeparator();
        List<TextPosition> word = new ArrayList<>();
        for (TextPosition text : textPositions) {
            String thisChar = text.getUnicode();
            if (thisChar != null) {
                if (thisChar.length() >= 1) {
                    if (!thisChar.equals(wordSeparator)) {
                        word.add(text);
                    } else if (!word.isEmpty()) {
                        printWord(word);
                        word.clear();
                    }
                }
            }
        }
        if (!word.isEmpty()) {
            printWord(word);
            word.clear();
        }
    }

    void printWord(List<TextPosition> word) {
        Rectangle2D boundingBox = null;
        StringBuilder builder = new StringBuilder();
        for (TextPosition text : word) {
            Rectangle2D box = new Rectangle2D.Float(text.getXDirAdj(), text.getYDirAdj(), text.getWidthDirAdj(), text.getHeightDir());
            if (boundingBox == null)
                boundingBox = box;
            else
                boundingBox.add(box);
            builder.append(text.getUnicode());
        }
        System.out.println(builder.toString() + " [(X=" + boundingBox.getX() + ",Y=" + boundingBox.getY()
                 + ") height=" + boundingBox.getHeight() + " width=" + boundingBox.getWidth() + "]");
    }
}

(ExtractWordCoordinates inner class)

and run it like this:

// "resource" is the PDF input to load
try (PDDocument document = PDDocument.load(resource)) {
    PDFTextStripper stripper = new GetWordLocationAndSize();
    stripper.setSortByPosition( true );
    stripper.setStartPage( 1 ); // page numbers are 1-based
    stripper.setEndPage( document.getNumberOfPages() );

    // The extracted text itself is not needed here, so it goes to a throwaway writer
    Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
    stripper.writeText(document, dummy);
}

(ExtractWordCoordinates test testExtractWordsForGoodJuJu)

Applied to the apache.pdf example the tutorials use you get:

2017-8-6 [(X=26.004425048828125,Y=22.00372314453125) height=5.833024024963379 width=36.31868362426758]
Welcome [(X=226.44479370117188,Y=22.00372314453125) height=5.833024024963379 width=36.5999755859375]
to [(X=265.5881652832031,Y=22.00372314453125) height=5.833024024963379 width=8.032623291015625]
The [(X=276.1641845703125,Y=22.00372314453125) height=5.833024024963379 width=14.881439208984375]
Apache [(X=293.5890197753906,Y=22.00372314453125) height=5.833024024963379 width=29.848846435546875]
Software [(X=325.98126220703125,Y=22.00372314453125) height=5.833024024963379 width=35.271636962890625]
Foundation! [(X=363.7962951660156,Y=22.00372314453125) height=5.833024024963379 width=47.871429443359375]
Custom [(X=334.0334777832031,Y=157.6195068359375) height=4.546705722808838 width=25.03936767578125]
Search [(X=360.8929138183594,Y=157.6195068359375) height=4.546705722808838 width=22.702728271484375]

Milan Hlinák · Answer 2 · 2020-07-18T09:07:45.607

You can create a CustomPDFTextStripper that extends PDFTextStripper and overrides protected void writeString(String text, List<TextPosition> textPositions). In the overridden method, split textPositions at the word separator to get a List<TextPosition> for each word. Then join the characters of each word and compute its bounding box.
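The splitting step can be sketched independently of PDFBox. In this minimal sketch, plain strings are hypothetical stand-ins for the values returned by TextPosition.getUnicode(), and the names WordSplitSketch and splitWords are invented for illustration. Note the flush after the loop: without it, the last word is lost whenever the input does not end with a separator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WordSplitSketch {

    /** Groups consecutive non-separator items into words. */
    public static List<List<String>> splitWords(List<String> chars, String separator) {
        List<List<String>> words = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String c : chars) {
            if (separator.equals(c)) {
                // A separator ends the current word (if there is one)
                if (!current.isEmpty()) {
                    words.add(current);
                    current = new ArrayList<>();
                }
            } else {
                current.add(c);
            }
        }
        // Flush the trailing word that is not followed by a separator
        if (!current.isEmpty()) {
            words.add(current);
        }
        return words;
    }

    public static void main(String[] args) {
        List<String> chars = Arrays.asList("t", "o", " ", "T", "h", "e");
        System.out.println(splitWords(chars, " ")); // prints [[t, o], [T, h, e]]
    }
}
```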

The full example below also draws the resulting bounding boxes onto a rendered image of the page.

package com.example;

import lombok.Value;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
import org.junit.Ignore;
import org.junit.Test;

import javax.imageio.ImageIO;
import java.awt.*;
import java.awt.image.BufferedImage;
import java.io.*;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class PdfBoxTest {

    private static final String BASE_DIR_PATH = "C:\\Users\\Milan\\50330484";
    private static final String INPUT_FILE_PATH = "input.pdf";
    private static final String OUTPUT_IMAGE_PATH = "output.jpg";
    private static final String OUTPUT_BBOX_IMAGE_PATH = "output-bbox.jpg";

    private static final float FROM_72_TO_300_DPI = 300.0f / 72.0f;

    @Test
    public void run() throws Exception {
        pdfToImage();
        drawBoundingBoxes();
    }

    @Ignore
    @Test
    public void pdfToImage() throws IOException {
        PDDocument document = PDDocument.load(new File(BASE_DIR_PATH, INPUT_FILE_PATH));
        PDFRenderer renderer = new PDFRenderer(document);
        BufferedImage image = renderer.renderImageWithDPI(0, 300);
        ImageIO.write(image, "JPEG", new File(BASE_DIR_PATH, OUTPUT_IMAGE_PATH));
    }

    @Ignore
    @Test
    public void drawBoundingBoxes() throws IOException {

        PDDocument document = PDDocument.load(new File(BASE_DIR_PATH, INPUT_FILE_PATH));

        List<WordWithBBox> words = getWords(document);

        draw(words);
    }

    private List<WordWithBBox> getWords(PDDocument document) throws IOException {

        CustomPDFTextStripper customPDFTextStripper = new CustomPDFTextStripper();
        customPDFTextStripper.setSortByPosition(true);
        // Page numbers are 1-based here; only the first page is processed
        customPDFTextStripper.setStartPage(1);
        customPDFTextStripper.setEndPage(1);

        Writer writer = new OutputStreamWriter(new ByteArrayOutputStream());
        customPDFTextStripper.writeText(document, writer);

        List<WordWithBBox> words = customPDFTextStripper.getWords();

        return words;
    }

    private void draw(List<WordWithBBox> words) throws IOException {

        BufferedImage bufferedImage = ImageIO.read(new File(BASE_DIR_PATH, OUTPUT_IMAGE_PATH));

        Graphics2D graphics = bufferedImage.createGraphics();

        graphics.setColor(Color.GREEN);

        List<Rectangle> rectangles = words.stream()
                .map(word -> new Rectangle(word.getX(), word.getY(), word.getWidth(), word.getHeight()))
                .collect(Collectors.toList());
        rectangles.forEach(graphics::draw);

        graphics.dispose();

        ImageIO.write(bufferedImage, "JPEG", new File(BASE_DIR_PATH, OUTPUT_BBOX_IMAGE_PATH));
    }

    private class CustomPDFTextStripper extends PDFTextStripper {

        private final List<WordWithBBox> words;

        public CustomPDFTextStripper() throws IOException {
            this.words = new ArrayList<>();
        }

        public List<WordWithBBox> getWords() {
            return new ArrayList<>(words);
        }

        @Override
        protected void writeString(String text, List<TextPosition> textPositions) throws IOException {

            String wordSeparator = getWordSeparator();
            List<TextPosition> wordTextPositions = new ArrayList<>();

            for (TextPosition textPosition : textPositions) {
                String str = textPosition.getUnicode();
                if (wordSeparator.equals(str)) {
                    if (!wordTextPositions.isEmpty()) {
                        this.words.add(createWord(wordTextPositions));
                        wordTextPositions.clear();
                    }
                } else {
                    wordTextPositions.add(textPosition);
                }
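            }

            // Flush the last word; the text usually does not end with a separator
            if (!wordTextPositions.isEmpty()) {
                this.words.add(createWord(wordTextPositions));
                wordTextPositions.clear();
            }

            super.writeString(text, textPositions);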
        }

        private WordWithBBox createWord(List<TextPosition> wordTextPositions) {

            String word = wordTextPositions.stream()
                    .map(TextPosition::getUnicode)
                    .collect(Collectors.joining());

            int minX = Integer.MAX_VALUE;
            int minY = Integer.MAX_VALUE;
            int maxX = Integer.MIN_VALUE;
            int maxY = Integer.MIN_VALUE;

            for (TextPosition wordTextPosition : wordTextPositions) {

                minX = Math.min(minX, from72To300Dpi(wordTextPosition.getXDirAdj()));
                minY = Math.min(minY, from72To300Dpi(wordTextPosition.getYDirAdj() - wordTextPosition.getHeightDir()));
                maxX = Math.max(maxX, from72To300Dpi(wordTextPosition.getXDirAdj() + wordTextPosition.getWidthDirAdj()));
                maxY = Math.max(maxY, from72To300Dpi(wordTextPosition.getYDirAdj()));
            }

            return new WordWithBBox(word, minX, minY, maxX - minX, maxY - minY);
        }
    }

    private int from72To300Dpi(float f) {
        return Math.round(f * FROM_72_TO_300_DPI);
    }

    @Value
    private class WordWithBBox {
        private final String word;
        private final int x;
        private final int y;
        private final int width;
        private final int height;
    }
}
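The scaling in from72To300Dpi relies on the default PDF user space unit being 1/72 inch, so a page rendered at 300 DPI has 300/72 pixels per PDF unit. Here is a minimal sketch of that conversion for an arbitrary DPI. The names DpiScaleSketch and toPixels are made up for illustration, and the sketch assumes the extracted coordinates and the rendered image share the same origin and orientation (no page rotation or crop offset).

```java
public class DpiScaleSketch {

    // Default PDF user space: 72 units per inch
    private static final float PDF_UNITS_PER_INCH = 72.0f;

    /** Converts a length in PDF units to pixels of an image rendered at the given DPI. */
    public static int toPixels(float pdfUnits, float dpi) {
        return Math.round(pdfUnits * dpi / PDF_UNITS_PER_INCH);
    }

    public static void main(String[] args) {
        System.out.println(toPixels(72.0f, 300.0f)); // prints 300 (one inch)
        System.out.println(toPixels(36.0f, 300.0f)); // prints 150 (half an inch)
    }
}
```

If you render with a different resolution in renderImageWithDPI, pass the same DPI value to the conversion so the boxes line up with the image.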

Note:

If you are open to other tools, Poppler offers command-line alternatives.

Render the PDF to an image:

pdftoppm -r 300 -jpeg input.pdf output

Generate an XHTML file containing bounding box information for each word in the file:

pdftotext -r 300 -bbox input.pdf
