The result is that the image is not placed correctly over the text. Am I getting the text positions wrong?

This is an example of how I get the x/y coordinates and size of each character in the PDF:

public class MyClass extends PDFTextStripper {

    pdocument = PDDocument.load(new File(fileName));

    stripper = new GetCharLocationAndSize();
    stripper.setSortByPosition(true);
    stripper.setStartPage(0);
    stripper.setEndPage(pdocument.getNumberOfPages());
    Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
    stripper.writeText(pdocument, dummy);

    /*
     * Override the default functionality of PDFTextStripper.writeString()
     */
    @Override
    protected void writeString(String string, List<TextPosition> textPositions) throws IOException {

        String imagePath = "image.jpg";
        PDImageXObject pdImage = PDImageXObject.createFromFile(imagePath, pdocument);

        PDPageContentStream contentStream = new PDPageContentStream(pdocument,
                stripper.getCurrentPage(), true, true);

        for (TextPosition text : textPositions) {
            if (text.getUnicode().equals("a")) {
                contentStream.drawImage(pdImage, text.getXDirAdj(), text.getYDirAdj(),
                        text.getWidthDirAdj(), text.getHeightDir());
            }
        }
        contentStream.close();
        pdocument.save("newdoc.pdf");
    }
}
Lez

1 Answer

Retrieving sensible coordinates

You use text.getXDirAdj() and text.getYDirAdj() as x and y coordinates in the content stream. This won't work because the coordinates PDFBox exposes during text extraction have been transformed into a coordinate system it prefers for text extraction purposes, cf. the JavaDocs:

/**
 * This will get the text direction adjusted x position of the character.
 * This is adjusted based on text direction so that the first character
 * in that direction is in the upper left at 0,0.
 *
 * @return The x coordinate of the text.
 */
public float getXDirAdj()

/**
 * This will get the y position of the text, adjusted so that 0,0 is upper left and it is
 * adjusted based on the text direction.
 *
 * @return The adjusted y coordinate of the character.
 */
public float getYDirAdj()

For a TextPosition text you should instead use

text.getTextMatrix().getTranslateX()

and

text.getTextMatrix().getTranslateY()

But even these numbers may have to be corrected, cf. this answer, because PDFBox has multiplied the matrix by a translation making the lower left corner of the crop box the origin.

Thus, if PDRectangle cropBox is the crop box of the current page, use

text.getTextMatrix().getTranslateX() + cropBox.getLowerLeftX()

and

text.getTextMatrix().getTranslateY() + cropBox.getLowerLeftY()

(This coordinate normalization of PDFBox is a PITA for anyone who actually wants to work with the text coordinates...)
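
To make the correction above concrete, here is a minimal sketch of the arithmetic in plain Java. PageCoordinates and its method names are hypothetical helpers, not part of PDFBox; the parameters stand for values you would read from text.getTextMatrix() and from the page's crop box. The last method additionally assumes that, for unrotated text, getYDirAdj() is measured downwards from the top edge of the crop box, which is how the JavaDoc above reads; treat it as an illustration of why the original y values were off, not as a documented conversion.

```java
/** Hypothetical helper, not a PDFBox class: plain arithmetic on values read from PDFBox. */
final class PageCoordinates {
    private PageCoordinates() {}

    // textMatrixTranslateX stands for text.getTextMatrix().getTranslateX(),
    // cropBoxLowerLeftX for cropBox.getLowerLeftX().
    static float toPageX(float textMatrixTranslateX, float cropBoxLowerLeftX) {
        return textMatrixTranslateX + cropBoxLowerLeftX;
    }

    // textMatrixTranslateY stands for text.getTextMatrix().getTranslateY(),
    // cropBoxLowerLeftY for cropBox.getLowerLeftY().
    static float toPageY(float textMatrixTranslateY, float cropBoxLowerLeftY) {
        return textMatrixTranslateY + cropBoxLowerLeftY;
    }

    // ASSUMPTION (unrotated text only): yDirAdj counts down from the top of the
    // crop box, so it must be flipped before the crop box offset is added back.
    static float yDirAdjToPageY(float yDirAdj, float cropBoxHeight, float cropBoxLowerLeftY) {
        return (cropBoxHeight - yDirAdj) + cropBoxLowerLeftY;
    }
}
```

For example, with a crop box whose lower left corner is at (20, 30) and a glyph whose text matrix translation is (100, 700), the image would be drawn at (120, 730).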

Other issues

Your code has some other issues, one of them becoming clear with the document you shared: You append to the page content stream without resetting the graphics context:

PDPageContentStream contentStream = new PDPageContentStream(pdocument,
        stripper.getCurrentPage(), true, true);

The constructor with this signature assumes you don't want to reset the context. Use the one with an additional boolean parameter and set that to true to request context resets:

PDPageContentStream contentStream = new PDPageContentStream(pdocument,
        stripper.getCurrentPage(), true, true, true);

Now the context is reset and the position is ok again.

Both these constructors are deprecated, though, and shouldn't be used for that reason. In the development branch they have been removed already. Instead use

PDPageContentStream contentStream = new PDPageContentStream(pdocument,
        stripper.getCurrentPage(), AppendMode.APPEND, true, true);

This introduces another issue, though: You create a new PDPageContentStream for each writeString call. If that is done with context reset each time, the nesting of saveGraphicsState/restoreGraphicsState pairs may become pretty deep. Thus, you should only create one such content stream per page and use it in all writeString calls for that page.
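
The nesting problem can be illustrated with a toy model in plain Java. This is not PDFBox code: GraphicsStateNesting is a hypothetical class that represents page content as a string of operators and assumes that each append with context reset wraps the existing content in one q/Q (save/restore graphics state) pair before adding the new instructions.

```java
/** Toy model, not PDFBox code: page content as a whitespace-separated operator string. */
final class GraphicsStateNesting {
    private GraphicsStateNesting() {}

    // ASSUMPTION: a context reset wraps everything already on the page in q ... Q
    // and then appends the new instructions after the closing Q.
    static String appendWithReset(String existing, String added) {
        return "q " + existing + " Q " + added;
    }

    // Returns the deepest q/Q nesting level found in the content.
    static int maxDepth(String content) {
        int depth = 0;
        int max = 0;
        for (String op : content.trim().split("\\s+")) {
            if (op.equals("q")) {
                depth++;
                max = Math.max(max, depth);
            } else if (op.equals("Q")) {
                depth--;
            }
        }
        return max;
    }
}
```

In this model, three separate appends with reset leave the original content three levels deep, while a single stream that draws all three images leaves it one level deep.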

Thus, your text stripper sub-class might look like this:

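import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.PDPageContentStream.AppendMode;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

class CoverCharByImage extends PDFTextStripper {
    public CoverCharByImage(PDImageXObject pdImage) throws IOException {
        super();
        this.pdImage = pdImage;
    }

    final PDImageXObject pdImage;
    // One content stream per page, created lazily in writeString
    PDPageContentStream contentStream = null;

    @Override
    public void processPage(PDPage page) throws IOException {
        super.processPage(page);
        // The page is done: close its content stream so the next page gets a fresh one
        if (contentStream != null) {
            contentStream.close();
            contentStream = null;
        }
    }

    @Override
    protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
        // Append to the page content, resetting the graphics context once per page
        if (contentStream == null)
            contentStream = new PDPageContentStream(document, getCurrentPage(), AppendMode.APPEND, true, true);

        PDRectangle cropBox = getCurrentPage().getCropBox();

        for (TextPosition text : textPositions) {
            if (text.getUnicode().equals("a")) {
                // Undo the crop box normalization to get coordinates usable in the content stream
                contentStream.drawImage(pdImage, text.getTextMatrix().getTranslateX() + cropBox.getLowerLeftX(),
                        text.getTextMatrix().getTranslateY() + cropBox.getLowerLeftY(),
                        text.getWidthDirAdj(), text.getHeightDir());
            }
        }
    }
}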

(CoverCharacterByImage inner class)

and it may be used like this:

PDDocument pdocument = PDDocument.load(...);

String imagePath = ...;
PDImageXObject pdImage = PDImageXObject.createFromFile(imagePath, pdocument);

CoverCharByImage stripper = new CoverCharByImage(pdImage);
stripper.setSortByPosition(true);
Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
stripper.writeText(pdocument, dummy);
pdocument.save(...);

(CoverCharacterByImage test testCoverLikeLez)

resulting in

screenshot

etc.

mkl
  • Works for most PDFs. I found a set of PDFs for which the coordinates obtained from text.getTextMatrix().getTranslateY() + cropBox.getLowerLeftY() and text.getTextMatrix().getTranslateX() + cropBox.getLowerLeftX() are inaccurate. – Lez May 29 '18 at 14:34
  • @Lez Please share examples. I'm interested if there are still other *normalizations* making one's life hard... – mkl May 29 '18 at 16:41
  • Here is a sample file [https://drive.google.com/file/d/1SCoB1RyvQSNy3aVjj_KZ70IO2ksN6qfM/view?usp=sharing]. I have a feeling it has something to do with the software that encoded it, which is Skia/PDF m55. – Lez May 30 '18 at 14:19
  • @Lez Thanks! I'll have a look, but most probably not before the start of next week. – mkl May 30 '18 at 14:38
  • @Lez I had a deeper look at your code and added the changes to my answer which are required to make it run with PDFs like your EMPLOYMENTCONTRACTTEMPLATE.pdf, too. Another issue to solve would be support for rotated text, by the way... – mkl Jun 04 '18 at 15:14
  • Works like magic... Thanks. But the text height does not work well for characters like "p" and "g". The image does not fully cover these letters. – Lez Jun 06 '18 at 09:48
  • The coordinates you get are not the coordinates of the bottom of the glyph but of its base line, and some glyphs do have parts underneath the base line. Thus, one actually needs both the ascent (height above the base line) and the descent (depth below the base line). Unfortunately though, the PDFBox text stripper architecture is not really interested in providing that information in an adequate manner. – mkl Jun 06 '18 at 10:02
  • After creating the PDF, when I try to reload it I get the WARNING: "The end of the stream is out of range, using workaround to read the stream, stream start position: 21602, length: 1479, expected end position: 23081", followed by the ERROR: "Error reading stream, expected='endstream' actual='' at offset 21602". – Lez Jun 12 '18 at 12:06
  • With the code above and your sample file I did not have any such issues. Please make that a [stack overflow question](https://stackoverflow.com/questions/ask) in its own right and supply the required code and data. – mkl Jun 12 '18 at 13:47
  • Just realized PDFTextStripper reads text one word at a time. I am trying to redact a phrase but the stripper.writeString() function only passes one word at a time – Lez Nov 27 '18 at 15:21
  • @Lez "I am trying to redact a phrase but the stripper.writeString() function only passes one word at a time" - in that case implement `writeString` to collect the incoming data and scan the collected data for your phrase. – mkl Nov 27 '18 at 17:14
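
The suggestion in the last comment, collecting the incoming words and scanning the collected text for a phrase, might be sketched like this in plain Java. PhraseCollector is a hypothetical helper, not part of PDFBox. It assumes that each writeString call contributes one word and that words are separated by single spaces; in a real stripper you would store each word's List<TextPosition> under the same index so that the matching words can be covered afterwards.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not a PDFBox class: collects words and locates a phrase among them. */
final class PhraseCollector {
    private final StringBuilder text = new StringBuilder();
    // Offset in the collected text at which each word starts
    private final List<Integer> wordStarts = new ArrayList<>();

    // Call once per writeString invocation with the word it received.
    void addWord(String word) {
        if (text.length() > 0) {
            text.append(' '); // ASSUMPTION: words are separated by a single space
        }
        wordStarts.add(text.length());
        text.append(word);
    }

    // Returns the indices of the words overlapped by the first occurrence
    // of the phrase, or an empty list if the phrase does not occur.
    List<Integer> findPhrase(String phrase) {
        List<Integer> hits = new ArrayList<>();
        int start = text.indexOf(phrase);
        if (start < 0) {
            return hits;
        }
        int end = start + phrase.length();
        for (int i = 0; i < wordStarts.size(); i++) {
            int wordStart = wordStarts.get(i);
            int wordEnd = (i + 1 < wordStarts.size()) ? wordStarts.get(i + 1) - 1 : text.length();
            if (wordStart < end && wordEnd > start) {
                hits.add(i);
            }
        }
        return hits;
    }
}
```

Only the first occurrence is reported here; a phrase that spans a page break or a line break that the stripper renders differently would need extra handling.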