The y-coordinates I get back for the lines in a table seem to be stretched beyond the coordinates of the text. There seems to be some transformation going on, but I cannot find it. If possible I would like to fix the problem within the scope of the PDFGraphicsStreamEngine as extended below, and not have to go back to the drawing board with the other input streams available in PDFBox.

I have extended PDFTextStripper to acquire the location of every text glyph on the page:

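import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

public class MyPDFTextStripper extends PDFTextStripper {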

    private List<TextPosition> tps;

    public MyPDFTextStripper() throws IOException {
        tps = new ArrayList<>();
    }

    @Override
    protected void writeString
            (String text,
             List<TextPosition> textPositions)
            throws IOException {
        tps.addAll(textPositions);
    }

    List<TextPosition> getTps() {
        return tps;
    }
}

and I have extended PDFGraphicsStreamEngine to extract every line on the page as a Line2D:

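import java.awt.geom.GeneralPath;
import java.awt.geom.Line2D;
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.pdfbox.contentstream.PDFGraphicsStreamEngine;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.graphics.image.PDImage;

public class LineCatcher extends PDFGraphicsStreamEngine
{
    private final GeneralPath linePath = new GeneralPath();
    private final List<Line2D> lines = new ArrayList<>();

    LineCatcher(PDPage page)
    {
        super(page);
    }

    List<Line2D> getLines()
    {
        return lines;
    }

    // Records the diagonal of each stroked path's bounding box as one line.
    @Override
    public void strokePath() throws IOException
    {
        Rectangle2D rect = linePath.getBounds2D();
        Line2D line = new Line2D.Double(rect.getX(), rect.getY(),
                rect.getX() + rect.getWidth(),
                rect.getY() + rect.getHeight());
        lines.add(line);
        linePath.reset();
    }

    @Override
    public void moveTo(float x, float y) throws IOException { linePath.moveTo(x, y); }

    @Override
    public void lineTo(float x, float y) throws IOException { linePath.lineTo(x, y); }

    @Override
    public Point2D getCurrentPoint() throws IOException { return linePath.getCurrentPoint(); }

    // All other abstract methods are left empty for the purposes of this problem
    // (signatures as in PDFBox 2.x, which the use of PDDocument.load suggests).
    @Override public void appendRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) throws IOException {}
    @Override public void drawImage(PDImage pdImage) throws IOException {}
    @Override public void clip(int windingRule) throws IOException {}
    @Override public void curveTo(float x1, float y1, float x2, float y2, float x3, float y3) throws IOException {}
    @Override public void closePath() throws IOException {}
    @Override public void endPath() throws IOException {}
    @Override public void fillPath(int windingRule) throws IOException {}
    @Override public void fillAndStrokePath(int windingRule) throws IOException {}
    @Override public void shadingFill(COSName shadingName) throws IOException {}
}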

I have written a simple program to demonstrate the problem:

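import java.awt.geom.Line2D;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.text.TextPosition;

public class PageAnalysis {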
    public static void main(String[] args) {
        try (PDDocument doc = PDDocument.load(new File("onePage.pdf"))) {
            PDPage page = doc.getPage(0);

            MyPDFTextStripper ts = new MyPDFTextStripper();
            ts.getText(doc);
            List<TextPosition> tps = ts.getTps();

            System.out.println("Y coordinates in text:");
            Set<Integer> ySet = new HashSet<>();
            for (TextPosition tp: tps) {
                ySet.add((int)tp.getY());
            }
            List<Integer> yList = new ArrayList<>(ySet);
            Collections.sort(yList);
            for (int y: yList){
                System.out.print(y + "\t");
            }
            System.out.println();


            System.out.println("Y coordinates in lines:");
            LineCatcher lineCatcher = new LineCatcher(page);
            lineCatcher.processPage(page);
            List<Line2D> lines = lineCatcher.getLines();
            ySet = new HashSet<>();
            for (Line2D line: lines) {
                ySet.add((int)line.getY2());
            }
            yList = new ArrayList<>(ySet);
            Collections.sort(yList);
            for (int y: yList){
                System.out.print(y + "\t");
            }
            System.out.println();

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

The output from this is:

Y coordinates in text:
66  79  106 118 141 153 171 189 207 225 243 261 279 297 315 333 351 370 388 406 424 442 460 478 496 514 780 
Y coordinates in lines:
322 340 358 376 394 412 430 448 466 484 502 520 538 556 574 593 611 629 647 665 683 713

The last number in the text list is the y-coordinate of the page number at the bottom of the page. I cannot work out what is happening to the y-coordinates of the lines, though they appear to be the ones that have been transformed (the media box is the same as it was for the text, and the text positions fit within it). The yScaling of the current transformation matrix is also 1.0.

1 Answer

Indeed, the PDFTextStripper has the bad habit of transforming coordinates into a very un-PDF'ish coordinate system, one with the origin in the upper left of the page and y coordinates increasing downwards.

For a TextPosition tp, therefore, you should not use

tp.getY()

but instead

tp.getTextMatrix().getTranslateY()

Unfortunately, these coordinates may still be translated, even though they are closer to the actual PDF default coordinate system (cf. this answer): they are still transformed so that the origin lies in the lower left corner of the crop box.

Thus, you really need something like this:

tp.getTextMatrix().getTranslateY() + cropBox.getLowerLeftY()

where cropBox is the PDRectangle retrieved as

PDRectangle cropBox = doc.getPage(n).getCropBox();

where in turn n is the zero-based index of the page with that content.
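As a minimal sketch of that arithmetic, independent of PDFBox: the helper below takes the two values named above as plain numbers. `UserSpaceY` and `toUserSpaceY` are hypothetical names used only for illustration, and the numbers in `main` are made up. In a real program the arguments would come from `tp.getTextMatrix().getTranslateY()` and `cropBox.getLowerLeftY()`.

```java
/** Hypothetical helper, not part of PDFBox: undoes the crop-box translation. */
final class UserSpaceY {
    private UserSpaceY() {}

    /**
     * @param textMatrixTranslateY value of tp.getTextMatrix().getTranslateY()
     * @param cropBoxLowerLeftY    value of cropBox.getLowerLeftY()
     * @return the y coordinate in the PDF default user space
     */
    static double toUserSpaceY(double textMatrixTranslateY, double cropBoxLowerLeftY) {
        return textMatrixTranslateY + cropBoxLowerLeftY;
    }

    public static void main(String[] args) {
        // Made-up numbers: a glyph 100 units above the bottom edge of a crop box
        // whose lower edge sits at y = 36 in default user space.
        System.out.println(toUserSpaceY(100.0, 36.0)); // prints 136.0
    }
}
```

When the crop box starts at the origin, the offset is 0 and the text matrix value can be compared with the line coordinates directly.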

mkl
  • Thanks for that. I had been working with `TextPosition` for so long I forgot that it was not the default coordinate system. Indeed, it is so ingrained in the application, I'll be seeing if I can reverse the transformation you recommend to steer me away from it. It may be un-PDF, but it is possibly the easiest solution for now. Thanks very much for the answer. –  Oct 02 '17 at 08:15
  • Well, it is easier to reverse the transformation in the text positions as the nearly original coordinates already are accessible from therein. It also is possible, though, to retrieve the transformation details and transform the line coordinates. If your application relies on the specific text position coordinate system, you might try that instead. – mkl Oct 02 '17 at 12:18
  • Job done. I took your original advice and in this little corner of the application where line coordinates come into play, I transformed the TextPosition coordinates to make comparisons. Can't thank you enough for your thorough answer. –  Oct 06 '17 at 12:28
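
The alternative mentioned in the comments, transforming the line coordinates into the text-position system instead, might be sketched as follows. This is only a sketch under stated assumptions: it assumes an unrotated page and that the text-position y is measured downward from the top edge of the crop box. `LineYFlipper` and `toTopDownY` are hypothetical names, not PDFBox API, and the crop box values in `main` (lower-left y of 0, height of 842, i.e. A4) are assumed, since the question does not state the page size.

```java
/** Hypothetical helper, not part of PDFBox. */
final class LineYFlipper {
    private LineYFlipper() {}

    /**
     * Maps a y coordinate from default user space (y increasing upward) to a
     * top-down system whose origin is the upper-left corner of the crop box.
     * ASSUMPTION: unrotated page; mirrors the behaviour described in the answer.
     */
    static double toTopDownY(double userSpaceY, double cropLowerLeftY, double cropHeight) {
        return cropHeight - (userSpaceY - cropLowerLeftY);
    }

    public static void main(String[] args) {
        // Smallest and largest line y from the question's output.
        // The crop box (lower-left y = 0, height = 842) is an ASSUMPTION.
        System.out.println(toTopDownY(322.0, 0.0, 842.0)); // prints 520.0
        System.out.println(toTopDownY(713.0, 0.0, 842.0)); // prints 129.0
    }
}
```

Under those assumptions the lines would span 129 to 520 in the text-position system, which would bracket the text y values 141 to 514 reported in the question.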