The Context
I've been working on a program that takes a PDF, highlights some words (via PDFBox text markup annotations) and saves the new PDF.
For this I extend the PDFTextStripper class and override the writeString() method to get the TextPositions of each word box, so that I know exactly where the text sits in the PDF document (the TextPosition objects provide the coordinates of each word box). Based on that, I draw a PDRectangle highlighting the word I want.
The Problem
It works perfectly for all the documents I've tried so far, except for one where the positions I'm getting from the TextPositions seem to be wrong, leading to wrong highlights.
This is the original document:
https://pdfhost.io/v/b1Mcpoy~s_Thomson.pdf
This is the document with a highlight on the very first word box writeString() gives me with setSortByPosition(false), which is MicroRNA:
https://pdfhost.io/v/V6INb4Xet_Thomson.pdf
It should highlight MicroRNA, but it is highlighting a blank space above it (pink HL rectangle).
This is the document with a highlight on the very first word box writeString() gives me with setSortByPosition(true), which is Original:
https://pdfhost.io/v/Lndh.j6ji_Thomson.pdf
It should highlight Original, but it is highlighting a blank space at the very beginning of the PDF document (pink HL rectangle).
I suppose this PDF contains something that keeps PDFBox from getting the right positions, or this may be a bug in PDFBox.
Technical Specification:
PDFBox 2.0.17
Java 11.0.6+10, AdoptOpenJDK
macOS Catalina 10.15.4, 16 GB, x86_64
Coordinates Values
For instance, for the first and last letters of the MicroRNA word box, the TextPosition values writeString() gives me are:
M letter
endX = 59.533783
endY = 682.696
maxHeight = 13.688589
rotation = 0
x = 35.886597
y = 99.26935
pageHeight = 781.96533
pageWidth = 586.97034
widthOfSpace = 11.9551
font = PDType1CFont JCFHGD+AdvT108
fontSize = 1.0
unicode = M
direction = -1.0
A letter
endX = 146.34933
endY = 682.696
maxHeight = 13.688589
rotation = 0
x = 129.18181
y = 99.26935
pageHeight = 781.96533
pageWidth = 586.97034
widthOfSpace = 11.9551
font = PDType1CFont JCFHGD+AdvT108
fontSize = 1.0
fontSizePt = 23
unicode = A
direction = -1.0
These values result in the wrong highlight annotation I shared above, while for all other PDF documents the highlight is very precise, and I've tested many different ones. I'm clueless here and I'm not an expert on PDF positioning. I've tried to use the PDFBox debugger tool, but I can't read its output properly. Any help here would be much appreciated. Let me know if I can provide more evidence. Thanks.
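As a quick sanity check on the numbers above, the dumped y and endY of the M letter appear to be the same vertical position measured from opposite edges of the page: pageHeight minus y lands almost exactly on endY. The sketch below only redoes that arithmetic with the dumped values; the class and method names are hypothetical and it does not call PDFBox at all.

```java
public class TextPositionCheck {
    // Values copied from the dump of the M letter above.
    static final float PAGE_HEIGHT = 781.96533f;
    static final float Y = 99.26935f;
    static final float END_Y = 682.696f;

    /** Converts a y measured from the top of the page into one measured from the bottom. */
    static float flipY(float pageHeight, float yFromTop) {
        return pageHeight - yFromTop;
    }

    public static void main(String[] args) {
        float flipped = flipY(PAGE_HEIGHT, Y);
        // The two printed values should agree to within rounding.
        System.out.printf("pageHeight - y = %.3f, endY = %.3f%n", flipped, END_Y);
    }
}
```

So the values look internally consistent, which suggests the mismatch comes from how they relate to the page rather than from a garbled TextPosition.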
EDIT
Note that text extraction is working just fine.
My Code
First I create an array of coordinates with a few values from the TextPosition objects of the first and last characters I want to highlight:
private void extractHLCoordinates(TextPosition firstPosition, TextPosition lastPosition, int pageNumber) {
    double firstPositionX = firstPosition.getX();
    double firstPositionY = firstPosition.getY();
    double lastPositionEndX = lastPosition.getEndX();
    double lastPositionY = lastPosition.getY();
    double height = firstPosition.getHeight();
    double width = firstPosition.getWidth();
    int rotation = firstPosition.getRotation();
    // index layout: 0 = first x, 1 = first y, 2 = last end x, 3 = last y,
    // 4 = page number, 5 = height, 6 = width, 7 = rotation
    double[] wordCoordinates = {firstPositionX, firstPositionY, lastPositionEndX, lastPositionY, pageNumber,
            height, width, rotation};
    ...
}
Now it's drawing time based on the extracted coordinates:
for (int pageIndex = 0; pageIndex < pdDocument.getNumberOfPages(); pageIndex++) {
    PDPage page = pdDocument.getPage(pageIndex);
    List<PDAnnotation> annotations = page.getAnnotations();
    int rotation;
    double pageHeight = page.getMediaBox().getHeight();
    double pageWidth = page.getMediaBox().getWidth();
    // each CoordinatePoint obj holds the double array with the
    // coordinates of each word I want to HL - see the previous method
    for (CoordinatePoint coordinate : coordinates) {
        double[] wordCoordinates = coordinate.getCoordinates();
        int pageNumber = (int) wordCoordinates[4];
        // if the current coordinates are not related to the current page,
        // ignore them
        if (pageNumber == (pageIndex + 1)) {
            // getting rotation of the page: portrait, landscape...
            rotation = (int) wordCoordinates[7];
            double firstPositionX = wordCoordinates[0];
            double firstPositionY = wordCoordinates[1];
            double lastPositionEndX = wordCoordinates[2];
            double lastPositionY = wordCoordinates[3];
            double height = wordCoordinates[5];
            double minX;
            double maxX;
            double minY;
            double maxY;
            if (rotation == 90) {
                double width = wordCoordinates[6];
                width = (pageHeight * width) / pageWidth;
                // defining coordinates of a rectangle
                maxX = firstPositionY;
                minX = firstPositionY - height;
                minY = firstPositionX;
                maxY = firstPositionX + width;
            } else {
                minX = firstPositionX;
                maxX = lastPositionEndX;
                minY = pageHeight - firstPositionY;
                maxY = pageHeight - lastPositionY + height;
            }
            // Finally I draw the Rectangle
            PDAnnotationTextMarkup txtMark = new PDAnnotationTextMarkup(PDAnnotationTextMarkup.SUB_TYPE_HIGHLIGHT);
            PDRectangle pdRectangle = new PDRectangle();
            pdRectangle.setLowerLeftX((float) minX);
            pdRectangle.setLowerLeftY((float) minY);
            pdRectangle.setUpperRightX((float) maxX);
            pdRectangle.setUpperRightY((float) (maxY + height));
            txtMark.setRectangle(pdRectangle);
            // And the QuadPoints
            float[] quads = new float[8];
            quads[0] = pdRectangle.getLowerLeftX(); // x1
            quads[1] = pdRectangle.getUpperRightY() - 2; // y1
            quads[2] = pdRectangle.getUpperRightX(); // x2
            quads[3] = quads[1]; // y2
            quads[4] = quads[0]; // x3
            quads[5] = pdRectangle.getLowerLeftY() - 2; // y3
            quads[6] = quads[2]; // x4
            quads[7] = quads[5]; // y4
            txtMark.setQuadPoints(quads);
            ...
        }
    }
}
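To make the non-rotated branch easier to check by hand, here is a minimal sketch of the same arithmetic pulled out into a standalone function, so the resulting bounds can be printed and compared against what the viewer shows. The class and method names are hypothetical. The x, y and endX inputs come from the dump above; the height passed in is the dumped maxHeight used as a stand-in (getHeight() was not dumped), and the page height is the one reported by TextPosition, whereas the loop above reads it from the media box, so the real value may differ.

```java
import java.util.Arrays;

public class HighlightBounds {
    /**
     * Mirrors the non-rotated branch of the loop above.
     * Returns {minX, minY, maxX, maxY} before the extra height is added
     * when the PDRectangle is built.
     */
    static double[] unrotatedBounds(double firstX, double firstY, double lastEndX,
                                    double lastY, double height, double pageHeight) {
        double minX = firstX;
        double maxX = lastEndX;
        double minY = pageHeight - firstY;
        double maxY = pageHeight - lastY + height;
        return new double[] {minX, minY, maxX, maxY};
    }

    public static void main(String[] args) {
        // Assumed inputs: see the note above about height and pageHeight.
        double[] bounds = unrotatedBounds(
                35.886597, 99.26935, 146.34933, 99.26935, 13.688589, 781.96533);
        System.out.println(Arrays.toString(bounds));
    }
}
```

Printing these bounds next to the page's actual boxes might help narrow down whether the offset is introduced before or after this step.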