Remove underlines from text in PDF file

Question

I have a bunch of PDF files with broken links. I need to remove those links and right now I can do the following:

Remove link actions
Change text color from blue to black

What I can't do is remove the blue underlines beneath the text that used to be a link.

I tried several PDF libraries for .NET (because this is my primary platform):

Aspose.PDF
PDFSharp
ceTe DynamicPDF
PDFBox

You are welcome to recommend a solution in any programming language, platform, and library. I just need to get this done.

Be aware that *text underlines* in PDFs are merely *graphical lines somewhere on the page*, not some attribute of the text. Thus, you must use the PDF library in question to access vector graphics, not text attributes. That being said, this should be possible with any decent general-purpose PDF library. — mkl, Jan 11 '16 at 12:04
@TilmanHausherr Here you go https://www.dropbox.com/s/tkfhkb9e25eby4a/original.pdf?dl=0 — Anubis, Jan 11 '16 at 12:36
@mkl Yes, I know that underlines are stored in a different way than text in PDF files. I'm not sure about vector images, but many libraries have functionality to extract bitmap images from PDF files. None of the underlines was captured by Aspose.PDF or PDFBox, but 2 bitmap images were (see the link in the previous comment). — Anubis, Jan 11 '16 at 12:43
The underlines are not implemented as bitmap images but instead as vector graphics rectangles (long, slim ones). — mkl, Jan 11 '16 at 14:06

mkl · Accepted Answer · 2016-01-13T12:46:19.803

In the case of the sample document, the underlines are drawn as blue (RGB 0,0,1) filled vector graphics rectangles (long, slim ones). As blue is used only for the links, we can use that criterion to find the rectangles in question.

Here is a sample implementation using PDFBox 1.8.10:

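// Imports for PDFBox 1.8.x
import java.io.IOException;
import java.io.OutputStream;
import java.util.List;
import java.util.Stack;

import org.apache.pdfbox.cos.COSNumber;
import org.apache.pdfbox.pdfparser.PDFStreamParser;
import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.util.PDFOperator;

// Rewrites each page's content stream so that blue filled rectangles are no longer painted.
void removeBlueRectangles(PDDocument document) throws IOException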
{
    List<?> pages = document.getDocumentCatalog().getAllPages();
    for (int i = 0; i < pages.size(); i++)
    {
        PDPage page = (PDPage) pages.get(i);
        PDStream contents = page.getContents();

        PDFStreamParser parser = new PDFStreamParser(contents.getStream()); 
        parser.parse();
        List<Object> tokens = parser.getTokens();  

        // Tracks whether the current fill color is blue; q/Q save and restore the graphics state.
        Stack<Boolean> blueState = new Stack<Boolean>();
        blueState.push(false);

        for (int j = 0; j < tokens.size(); j++)  
        {  
            Object next = tokens.get(j);
            if (next instanceof PDFOperator)
            {
                PDFOperator op = (PDFOperator) next;  
                if (op.getOperation().equals("q"))
                {
                    blueState.push(blueState.peek());
                }
                else if (op.getOperation().equals("Q"))
                {
                    blueState.pop();
                }
                else if (op.getOperation().equals("rg"))
                {
                    if (j > 2)
                    {
                        Object r = tokens.get(j-3);
                        Object g = tokens.get(j-2);
                        Object b = tokens.get(j-1);
                        if (r instanceof COSNumber && g instanceof COSNumber && b instanceof COSNumber)
                        {
                            blueState.pop();
                            blueState.push((
                                    Math.abs(((COSNumber)r).floatValue() - 0) < 0.001 &&
                                    Math.abs(((COSNumber)g).floatValue() - 0) < 0.001 &&
                                    Math.abs(((COSNumber)b).floatValue() - 1) < 0.001));
                        }
                    }
                }
                else if (op.getOperation().equals("f"))
                {
                    if (blueState.peek() && j > 0)
                    {
                        Object re = tokens.get(j-1);
                        if (re instanceof PDFOperator && ((PDFOperator)re).getOperation().equals("re"))
                        {
                            // A blue "re" directly followed by "f": end the path with "n" instead of filling it.
                            tokens.set(j, PDFOperator.getOperator("n"));
                        }
                    }
                }
            }
        }

        PDStream updatedStream = new PDStream(document);  
        OutputStream out = updatedStream.createOutputStream();  
        ContentStreamWriter tokenWriter = new ContentStreamWriter(out);  
        tokenWriter.writeTokens(tokens);  
        page.setContents(updatedStream);
    }
}

(RemoveUnderlines.java)
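
To make the effect of that rewrite easier to follow without PDFBox on the classpath, here is a minimal sketch of the same state machine applied to a toy content stream in which every token is a plain string. The class name `UnderlineFilterSketch` and the method `neutralizeBlueFills` are hypothetical names chosen for this illustration, not part of any library. Unlike the code above, the sketch also guards against an unbalanced `Q`.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical illustration only: tokens are plain strings, not PDFBox objects.
class UnderlineFilterSketch {

    // Parses a token as a number, or returns null if it is not numeric.
    static Float number(String token) {
        try {
            return Float.parseFloat(token);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    // Returns a copy of the tokens in which every "re f" painted in blue becomes "re n".
    static List<String> neutralizeBlueFills(List<String> tokens) {
        List<String> result = new ArrayList<>(tokens);
        Deque<Boolean> blueState = new ArrayDeque<>();
        blueState.push(false); // the fill color is assumed not to be blue initially

        for (int j = 0; j < result.size(); j++) {
            switch (result.get(j)) {
                case "q": // save graphics state: duplicate the current flag
                    blueState.push(blueState.peek());
                    break;
                case "Q": // restore graphics state (guarded against unbalanced streams)
                    if (blueState.size() > 1) {
                        blueState.pop();
                    }
                    break;
                case "rg": // set the RGB fill color from the three preceding operands
                    if (j > 2) {
                        Float r = number(result.get(j - 3));
                        Float g = number(result.get(j - 2));
                        Float b = number(result.get(j - 1));
                        if (r != null && g != null && b != null) {
                            blueState.pop();
                            blueState.push(Math.abs(r) < 0.001
                                    && Math.abs(g) < 0.001
                                    && Math.abs(b - 1) < 0.001);
                        }
                    }
                    break;
                case "f": // fill: neutralize it only for a blue rectangle
                    if (blueState.peek() && j > 0 && "re".equals(result.get(j - 1))) {
                        result.set(j, "n"); // "n" ends the path without painting it
                    }
                    break;
                default:
                    break; // operands and all other operators pass through unchanged
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> stream = Arrays.asList(
                "q", "0", "0", "1", "rg", "72", "700", "100", "0.5", "re", "f", "Q",
                "10", "10", "50", "50", "re", "f");
        System.out.println(String.join(" ", neutralizeBlueFills(stream)));
    }
}
```

In the toy stream, the first rectangle is filled while the color is blue, so its `f` becomes `n`; the second one is filled after `Q` has restored the previous color and is left alone. Like the PDFBox code, this sketch only reacts to `rg` and to an `re` immediately followed by `f`. Blue set through other color operators, other fill operators such as `f*`, or rectangles built from several path operators would go unnoticed.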

original.pdf

Applying this to your first sample file original.pdf

public void testOriginal() throws IOException, COSVisitorException
{
    try (   InputStream resourceStream = getClass().getResourceAsStream("original.pdf")   )
    {
        PDDocument document = PDDocument.loadNonSeq(resourceStream, null);

        removeBlueRectangles(document);
        document.save("original-noBlueRectangles.pdf");

        document.close();
    }
}

(RemoveUnderlines.java)

results in

1178.pdf

You commented

After testing this on many files I have to say this solution works incorrectly in some cases. For example, for this file (dropbox.com/s/23g54bvt781lb93/1178.pdf?dl=0) it removes the entire content of the page. Keep searching..

So I applied the code to your new sample file 1178.pdf

public void test1178() throws IOException, COSVisitorException
{
    try (   InputStream resourceStream = getClass().getResourceAsStream("1178.pdf")   )
    {
        PDDocument document = PDDocument.loadNonSeq(resourceStream, null);

        removeBlueRectangles(document);
        document.save(new File(RESULT_FOLDER, "1178-noBlueRectangles.pdf"));

        document.close();
    }
}

(RemoveUnderlines.java)

which resulted in

So I cannot confirm your claim that the solution works incorrectly; in particular I see that it does not remove the entire content of the page.

As I cannot reproduce your observation, I assume there are additional issues in your setup you have not yet mentioned.

After testing this on many files I have to say this solution works incorrectly in some cases. For example, for this file (https://www.dropbox.com/s/23g54bvt781lb93/1178.pdf?dl=0) it removes the entire content of the page. Keep searching... — Anubis, Jan 13 '16 at 11:46
I retrieved your new file, applied the code to it, and it worked properly. Thus, your setup seems to have additional issues. — mkl, Jan 13 '16 at 12:48
Ah, ok. I have to admit I don't know at which version that port is and whether there are substantial differences. — mkl, Jan 13 '16 at 15:35
