PDFBox getting wrong TextPositions in a specific pdf document

Question

The Context

I've been working on a program that takes a PDF, highlights some words (via PDFBox text markup annotations), and saves the new PDF.

For this I extend the PDFTextStripper class and override the writeString() method to get the TextPositions of each word box, so that I know exactly where the text is in the PDF document (each TextPosition object gives me the coordinates of its character). Based on that, I draw a PDRectangle highlighting the word I want.

The Problem

It works perfectly for all the documents I've tried so far, except for one, where the positions I'm getting from the TextPositions seem to be wrong, leading to wrong highlights.

This is the original document:
https://pdfhost.io/v/b1Mcpoy~s_Thomson.pdf

This is the document with a highlight on the very first word box writeString() gives me with setSortByPosition(false), which is MicroRNA:
https://pdfhost.io/v/V6INb4Xet_Thomson.pdf
It should highlight MicroRNA, but it is highlighting a blank space above it (pink HL rectangle).

This is the document with a highlight on the very first word box writeString() gives me with setSortByPosition(true), which is Original:
https://pdfhost.io/v/Lndh.j6ji_Thomson.pdf
It should highlight Original, but it is highlighting a blank space at the very beginning of the PDF document (pink HL rectangle).

I suppose this PDF contains something that makes PDFBox struggle to get the right positions, or this may be some sort of bug in PDFBox.

Technical Specification:

PDFBox 2.0.17
Java 11.0.6+10, AdoptOpenJDK
MacOS Catalina 10.15.4, 16gb, x86_64

Coordinates Values

So for instance for the start and end of the MicroRNA word box, the TextPosition coordinates writeString() gives me are:

M letter

endX = 59.533783
endY = 682.696
maxHeight = 13.688589
rotation = 0
x = 35.886597
y = 99.26935
pageHeight = 781.96533
pageWidth = 586.97034
widthOfSpace = 11.9551
font = PDType1CFont JCFHGD+AdvT108
fontSize = 1.0
unicode = M
direction = -1.0

A letter

endX = 146.34933
endY = 682.696
maxHeight = 13.688589
rotation = 0
x = 129.18181
y = 99.26935
pageHeight = 781.96533
pageWidth = 586.97034
widthOfSpace = 11.9551
font = PDType1CFont JCFHGD+AdvT108
fontSize = 1.0
fontSizePt = 23
unicode = A
direction = -1.0

These values result in the wrong highlight annotation I shared above, while for all the other PDF documents I've tested (many different ones) the highlight is very precise. I'm clueless here, and I'm not an expert on PDF positioning. I've tried the PDFBox debugger tool, but I can't read its output properly. Any help would be much appreciated. Let me know if I can provide more evidence. Thanks.

EDIT

Note that text extraction is working just fine.

My Code

First I build an array of coordinates from the TextPosition objects of the first and last characters I want to highlight:

private void extractHLCoordinates(TextPosition firstPosition, TextPosition lastPosition, int pageNumber) {
    double firstPositionX = firstPosition.getX();
    double firstPositionY = firstPosition.getY();
    double lastPositionEndX = lastPosition.getEndX();
    double lastPositionY = lastPosition.getY();

    double height = firstPosition.getHeight();
    double width = firstPosition.getWidth();
    int rotation = firstPosition.getRotation();

    double[] wordCoordinates = {firstPositionX, firstPositionY, lastPositionEndX, lastPositionY, pageNumber, 
    height, width, rotation};

    
    ...
}

Now it's drawing time based on the extracted coordinates:

for (int pageIndex = 0; pageIndex < pdDocument.getNumberOfPages(); pageIndex++) {

    PDPage page = pdDocument.getPage(pageIndex);
    List<PDAnnotation> annotations = page.getAnnotations();

    int rotation;
    double pageHeight = page.getMediaBox().getHeight();
    double pageWidth  = page.getMediaBox().getWidth();
    
    // each CoordinatePoint obj holds the double array with the 
    // coordinates of each word I want to HL - see the previous method
    for (CoordinatePoint coordinate : coordinates) {
        double[] wordCoordinates = coordinate.getCoordinates();
        
        int pageNumber = (int) wordCoordinates[4];

        // skip coordinates that do not belong to the current page
        if (pageNumber == pageIndex + 1) {
            // getting rotation of the page: portrait, landscape...
            rotation = (int) wordCoordinates[7];

            double firstPositionX = wordCoordinates[0];
            double firstPositionY = wordCoordinates[1];
            double lastPositionEndX = wordCoordinates[2];
            double lastPositionY = wordCoordinates[3];
            double height = wordCoordinates[5];

            double minX;
            double maxX;
            double minY;
            double maxY;
            
            if (rotation == 90) {

                double width = wordCoordinates[6];
                width = (pageHeight * width) / pageWidth;

                //defining coordinates of a rectangle
                maxX = firstPositionY;
                minX = firstPositionY - height;
                minY = firstPositionX;
                maxY = firstPositionX + width;
            } else {
                minX = firstPositionX;
                maxX = lastPositionEndX;
                minY = pageHeight - firstPositionY;
                maxY = pageHeight - lastPositionY + height;
            }
                    
            // Finally I draw the Rectangle
            PDAnnotationTextMarkup txtMark = new PDAnnotationTextMarkup(PDAnnotationTextMarkup.SUB_TYPE_HIGHLIGHT);

            PDRectangle pdRectangle = new PDRectangle();
            pdRectangle.setLowerLeftX((float) minX);
            pdRectangle.setLowerLeftY((float) minY);
            pdRectangle.setUpperRightX((float) maxX);
            pdRectangle.setUpperRightY((float) ((float) maxY + height));

            txtMark.setRectangle(pdRectangle);

            // And the QuadPoints
            float[] quads = new float[8];
            quads[0] = pdRectangle.getLowerLeftX();  // x1
            quads[1] = pdRectangle.getUpperRightY() - 2; // y1
            quads[2] = pdRectangle.getUpperRightX(); // x2
            quads[3] = quads[1]; // y2
            quads[4] = quads[0];  // x3
            quads[5] = pdRectangle.getLowerLeftY() - 2; // y3
            quads[6] = quads[2]; // x4
            quads[7] = quads[5]; // y4

            txtMark.setQuadPoints(quads);
            ...
        }
    }
}

If the pdf was made of images, you shouldn't be able to use the text extraction. I'm not sure though if that's your issue. — Richard Barker, Oct 14 '20 at 23:32
Unfortunately you don't show your pivotal code, so it is unclear which pdfbox coordinate normalizations you have considered and which not. Have you for example considered the crop box normalization, cf. [this answer](https://stackoverflow.com/a/46113333/1729265)? — mkl, Oct 15 '20 at 06:45
@RichardBarker yeah, actually the text extraction works pretty well on this document, so I'm afraid that's not the issue. — Thales Valias, Oct 15 '20 at 08:45
@TilmanHausherr good point, thanks. I've updated it, but the issue is still happening the same way. — Thales Valias, Oct 15 '20 at 08:49
@mkl you're right, I'll edit my question adding my code. No, I'm not using crop box normalization here and I'm looking into that right now. I'll feedback later, thanks! — Thales Valias, Oct 15 '20 at 08:50
Your QuadPoints coordinates are computed relative to CropBox but they need to be relative to MediaBox. For this document the CropBox is smaller than the MediaBox so the highlight is not in the correct position. Adjust the x with CropBox.LLX - MediaBox.LLX and y with MediaBox.URY - CropBox.URY and the highlight will be in the right position. — iPDFdev, Oct 15 '20 at 11:56
@mkl I edited my question adding my drawing logic. I'm afraid I'm not using any pdfBox coordinate normalization? — Thales Valias, Oct 15 '20 at 13:15
@iPDFdev hey, I updated my question adding my drawing logic. Interesting, so some documents need to be relative to CropBox and others to MediaBox? In that case, I'd have to implement a way of differentiating the documents and routing them to the appropriate Quadpoints normalization, is that right? — Thales Valias, Oct 15 '20 at 13:21
No, always relative to MediaBox. But most of the documents have MediaBox=CropBox, so the difference I mentioned is 0. — iPDFdev, Oct 15 '20 at 13:39
@iPDFdev well, it worked, as you said, without breaking anything that was already working! Tomorrow I'll do a regression test and see how it goes, but for me this is already solved and a very good learning. Thank you very much and feel free to post an answer so that I can accept it, if you want to. Cheers! — Thales Valias, Oct 15 '20 at 17:42

score 2 · Accepted Answer · edited Oct 16 '20 at 09:26

Your QuadPoints coordinates are computed relative to the CropBox, but they need to be relative to the MediaBox. For this document the CropBox is smaller than the MediaBox, so the highlight is not in the correct position. Adjust x by CropBox.LLX - MediaBox.LLX and y by MediaBox.URY - CropBox.URY, and the highlight will be in the right position.
The adjustment above works for pages with Rotate = 0. If Rotate != 0, further adjustments might be needed depending on how the coordinates are returned by PDFBox (I'm not very familiar with the PDFBox API).
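
To make the arithmetic concrete, here is a minimal sketch in plain Java with no PDFBox dependency. The `CropBoxOffset` class, its `Box` helper, and both box definitions are hypothetical and only for illustration; the glyph coordinates (x = 35.886597, y = 99.26935) and the 586.97034 x 781.96533 crop size are taken from the question. It assumes Rotate = 0 and that the y coordinate is first flipped using the MediaBox height, as the question's code does.

```java
final class CropBoxOffset {

    /** Hypothetical stand-in for a PDF page box, given by its lower-left and upper-right corners. */
    static final class Box {
        final double llx, lly, urx, ury;

        Box(double llx, double lly, double urx, double ury) {
            this.llx = llx;
            this.lly = lly;
            this.urx = urx;
            this.ury = ury;
        }

        double height() {
            return ury - lly;
        }
    }

    /** Shifts an x value measured from the CropBox left edge by the gap between the two left edges. */
    static double toPageX(double textX, Box mediaBox, Box cropBox) {
        return textX + (cropBox.llx - mediaBox.llx);
    }

    /**
     * Flips a top-down y value to bottom-up using the MediaBox height,
     * then subtracts the gap between the top edges of the two boxes.
     */
    static double toPageY(double textY, Box mediaBox, Box cropBox) {
        double flipped = mediaBox.height() - textY;
        return flipped - (mediaBox.ury - cropBox.ury);
    }

    public static void main(String[] args) {
        // Assumed boxes: a Letter-sized MediaBox and a smaller CropBox whose
        // size matches the pageWidth/pageHeight reported in the question.
        Box mediaBox = new Box(0, 0, 612, 792);
        Box cropBox = new Box(12.5, 5, 599.47034, 786.96533);

        double x = 35.886597; // x of the "M" glyph from the question
        double y = 99.26935;  // y of the "M" glyph from the question

        System.out.println("x = " + toPageX(x, mediaBox, cropBox));
        System.out.println("y = " + toPageY(y, mediaBox, cropBox));
    }
}
```

With these assumed boxes the glyph origin moves to roughly (48.387, 687.696). When the two boxes are identical, both corrections are zero, which is why the unadjusted code works for most documents.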

OP EDIT

I'm posting the changes I made to my code here in case they help others. Note that I haven't tried anything for rotation == 90 yet. I'll update this once I have that piece.

Before

...
if (rotation == 90) {

    double width = wordCoordinates[6];
    width = (pageHeight * width) / pageWidth;

    //defining coordinates of a rectangle
    maxX = firstPositionY;
    minX = firstPositionY - height;
    minY = firstPositionX;
    maxY = firstPositionX + width;
} else {
    minX = firstPositionX;
    maxX = lastPositionEndX;
    minY = pageHeight - firstPositionY;
    maxY = pageHeight - lastPositionY + height;
}
...

After

...

PDRectangle mediaBox = page.getMediaBox();
PDRectangle cropBox = page.getCropBox();

if (rotation == 90) {

    double width = wordCoordinates[6];
    width = (pageHeight * width) / pageWidth;

    //defining coordinates of a rectangle
    maxX = firstPositionY;
    minX = firstPositionY - height;
    minY = firstPositionX;
    maxY = firstPositionX + width;
} else {
    minX = firstPositionX + cropBox.getLowerLeftX() - mediaBox.getLowerLeftX();
    maxX = lastPositionEndX + cropBox.getLowerLeftX() - mediaBox.getLowerLeftX();
    minY = pageHeight - firstPositionY - (mediaBox.getUpperRightY() - cropBox.getUpperRightY());
    maxY = pageHeight - lastPositionY + height - (mediaBox.getUpperRightY() - cropBox.getUpperRightY());
}
...
