
I have a PDF file with colored text, and I need to remove the color (that is, turn the text black). I couldn't find much help anywhere, so I dug in and figured it out with the help of this post: PDFBox 2.0 RC3 -- Find and replace text

Since there isn't much written about this, I suspect few people care, but I thought I'd share anyway.

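import java.io.IOException;
import java.io.OutputStream;
import java.util.List;

import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.cos.COSFloat;
import org.apache.pdfbox.cos.COSInteger;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdfparser.PDFStreamParser;
import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;

// Rewrites every "rg" (RGB fill color) found inside a BT ... ET text object to 0 0 0 (black).
private void setTextBlack(PDDocument pdDocument) throws IOException {
    for (PDPage pdPage : pdDocument.getPages()) {
        // Parse the page's content stream into a flat list of operands and operators.
        PDFStreamParser parser = new PDFStreamParser(pdPage);
        parser.parse();
        List<Object> tokens = parser.getTokens();
        for (int i = 0; i < tokens.size(); i++) {
            Object next = tokens.get(i);
            if (next instanceof Operator && ((Operator) next).getName().equals("BT")) {
                // Scan the text object until its closing ET.
                for (int j = i + 1; j < tokens.size(); j++) {
                    Object btToken = tokens.get(j);
                    if (btToken instanceof Operator && ((Operator) btToken).getName().equals("rg")) {
                        // Walk back over the (at most three) numeric operands of "rg" and zero them.
                        int n = j - 1;
                        while (n >= Math.max(0, j - 3)
                                && (tokens.get(n) instanceof COSInteger || tokens.get(n) instanceof COSFloat)) {
                            tokens.set(n, new COSFloat(0f));
                            n--;
                        }
                    }
                    if (btToken instanceof Operator && ((Operator) btToken).getName().equals("ET")) {
                        break;
                    }
                }
            }
        }
        // Write the modified tokens to a new compressed stream and make it the page's contents.
        PDStream updatedStream = new PDStream(pdDocument);
        try (OutputStream out = updatedStream.createOutputStream(COSName.FLATE_DECODE)) {
            new ContentStreamWriter(out).writeTokens(tokens);
        }
        pdPage.setContents(updatedStream);
    }
}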
Tilman Hausherr
Ted M
  • 2
    You should add an answer to the existing question, rather than post this question if it doesn't ask a question. Alternatively, post an answer to this question and edit your post to phrase it as a question (assuming your question is different enough from the one you found). – Peter O. Apr 09 '21 at 10:58
  • 1
    This will work only on a few files because there are many ways to set a color. And there's no guarantee that the color is set between BT and ET, it can also be set before the BT. And you've not considered Q...q. – Tilman Hausherr Apr 09 '21 at 13:20
  • 2
    As @Tilman says, this works only for special cases. You should base your solution on the pdfbox parsing framework and at least consider all types of fill colors except probably pattern colors (which may turn out to be difficult to handle). Also don't forget that there are other content streams than the immediate page contents. Furthermore, colors may be changed afterwards by creative use of blend modes... – mkl Apr 09 '21 at 19:32
  • Can someone please help everyone with code that takes care of, if not all, then almost all cases? – Gentleman Feb 28 '22 at 11:30
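The comments above note that `rg` is only one of several ways to set a fill color and that the color may be set outside BT ... ET. As a rough illustration of the operand handling only, here is a minimal sketch in plain Java that does not use PDFBox at all: tokens are modeled as `Double` operands and `String` operator names, and `FillColorSketch`, `blackOperands`, `blackenFillColors` and `allNumbers` are made-up names. It covers the device fill-color operators `g`, `rg` and `k` wherever they occur, and it uses `0 0 0 1` for `k`, because zeroing all four CMYK operands would select white rather than black. It does not attempt `sc`/`scn`, patterns, stroking colors, form XObjects or blend modes, and because it ignores BT/ET it would also blacken filled graphics, so it is not a general solution.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model, not PDFBox API: a Double is an operand, a String is an operator name.
class FillColorSketch {

    // Operand values that select black for each device fill-color operator,
    // or null for every other operator.
    static double[] blackOperands(String operator) {
        switch (operator) {
            case "g":  return new double[] {0};          // DeviceGray: 0 is black
            case "rg": return new double[] {0, 0, 0};    // DeviceRGB: 0 0 0 is black
            case "k":  return new double[] {0, 0, 0, 1}; // DeviceCMYK: 0 0 0 1 is black
            default:   return null;
        }
    }

    // Returns a copy of the tokens in which every g, rg and k operator selects black,
    // whether or not it appears between BT and ET.
    static List<Object> blackenFillColors(List<Object> tokens) {
        List<Object> out = new ArrayList<>(tokens);
        for (int i = 0; i < out.size(); i++) {
            if (!(out.get(i) instanceof String)) {
                continue;
            }
            double[] black = blackOperands((String) out.get(i));
            int first = (black == null) ? -1 : i - black.length;
            if (first < 0 || !allNumbers(out, first, i)) {
                continue; // not a color operator, or operands missing: leave untouched
            }
            for (int k = 0; k < black.length; k++) {
                out.set(first + k, Double.valueOf(black[k]));
            }
        }
        return out;
    }

    // True if every token in [from, to) is a numeric operand.
    static boolean allNumbers(List<Object> tokens, int from, int to) {
        for (int i = from; i < to; i++) {
            if (!(tokens.get(i) instanceof Number)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Red fill set before BT, gray fill inside the text object, CMYK fill after it.
        List<Object> tokens = Arrays.<Object>asList(
                1.0, 0.0, 0.0, "rg", "BT", 0.5, "g", "ET", 0.2, 0.3, 0.4, 0.0, "k");
        System.out.println(blackenFillColors(tokens));
    }
}
```

In a real PDFBox version the same per-operator table would presumably be applied to the `Operator` and `COSNumber` tokens of each content stream, but that mapping is an assumption here and has not been tested against actual files.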
