
I have a bunch of PDF files with broken links. I need to remove those links and right now I can do the following:

  1. Remove link actions
  2. Change text color from blue to black

What I can't do is remove the blue underlines below text that used to be a link.

I tried several PDF libraries for .NET (because this is my primary platform):

  • Aspose.PDF
  • PDFSharp
  • ceTe DynamicPDF
  • PDFBox

You are welcome to recommend a solution in any programming language, platform, or library. I just need to get this done.

mkl
Anubis
  • Be aware that *text underlines* in PDFs are merely *graphical lines somewhere on the page*, not some attribute of the text. Thus, you must use the PDF library in question to access vector graphics, not text attributes. That being said, this should be possible with any decent general-purpose PDF library. – mkl Jan 11 '16 at 12:04
  • Can you link to such a PDF file? – Tilman Hausherr Jan 11 '16 at 12:07
  • @TilmanHausherr Here you go https://www.dropbox.com/s/tkfhkb9e25eby4a/original.pdf?dl=0 – Anubis Jan 11 '16 at 12:36
  • @mkl Yes, I know that underlines are stored differently from text in PDF files. I'm not sure about vector images, but many libraries can extract bitmap images from PDF files. None of the underlines were captured by Aspose.PDF or PDFBox, but 2 bitmap images were (see the link in the previous comment). – Anubis Jan 11 '16 at 12:43
  • The underlines are not implemented as bitmap images but instead as vector graphics rectangles (long, slim ones). – mkl Jan 11 '16 at 14:06

1 Answer


In the case of the sample document, the underlines are drawn as blue (RGB 0,0,1) filled vector graphics rectangles (long, slim ones). As blue is used only for the links, we can use that criterion to find the rectangles in question.
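
The criterion can be illustrated without any PDF library. What follows is only a minimal sketch, not PDFBox code: it assumes the content stream has already been split into plain string tokens, and the class name `BlueRectangleSketch` and the methods `isBlue` and `neutralize` are made up for this illustration. It tracks whether the current fill color is blue across `q`/`Q` (save/restore graphics state) and turns the fill operator `f` into the no-op path end `n` when it directly follows `re` while blue is active.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

public class BlueRectangleSketch {

    // True if the three "rg" operands are (0, 0, 1) within a small tolerance.
    // Non-numeric operands are treated as "not blue" in this sketch.
    public static boolean isBlue(String r, String g, String b) {
        try {
            return Math.abs(Double.parseDouble(r)) < 0.001
                    && Math.abs(Double.parseDouble(g)) < 0.001
                    && Math.abs(Double.parseDouble(b) - 1) < 0.001;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    // Returns a copy of the tokens in which each "re f" painted in blue becomes "re n".
    public static List<String> neutralize(List<String> tokens) {
        List<String> result = new ArrayList<>(tokens);
        Deque<Boolean> blueState = new ArrayDeque<>();
        blueState.push(false); // initial fill color is not blue

        for (int j = 0; j < result.size(); j++) {
            String op = result.get(j);
            if (op.equals("q")) {
                // save graphics state: the saved copy inherits the current color
                blueState.push(blueState.peek());
            } else if (op.equals("Q")) {
                // restore graphics state; guard against unbalanced input
                if (blueState.size() > 1) {
                    blueState.pop();
                }
            } else if (op.equals("rg") && j > 2) {
                // set RGB fill color: replace the current state
                blueState.pop();
                blueState.push(isBlue(result.get(j - 3), result.get(j - 2), result.get(j - 1)));
            } else if (op.equals("f") && j > 0 && blueState.peek()
                    && result.get(j - 1).equals("re")) {
                // blue filled rectangle: end the path without painting it
                result.set(j, "n");
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "q", "0", "0", "1", "rg", "72", "700", "100", "0.5", "re", "f", "Q",
                "72", "600", "100", "0.5", "re", "f");
        // prints: q 0 0 1 rg 72 700 100 0.5 re n Q 72 600 100 0.5 re f
        System.out.println(String.join(" ", neutralize(tokens)));
    }
}
```

Only the first rectangle is neutralized: the second one is filled after `Q` has restored the non-blue state. Like the criterion itself, the sketch only looks at `rg` and at `f` directly after `re`; files that set the color or fill the path in other ways would need additional cases.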

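Here is a sample implementation using PDFBox 1.8.10:

// Imports needed by the method below (PDFBox 1.8.x package names)
import java.io.IOException;
import java.io.OutputStream;
import java.util.List;
import java.util.Stack;

import org.apache.pdfbox.cos.COSNumber;
import org.apache.pdfbox.pdfparser.PDFStreamParser;
import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.util.PDFOperator;

void removeBlueRectangles(PDDocument document) throws IOException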
{
    List<?> pages = document.getDocumentCatalog().getAllPages();
    for (int i = 0; i < pages.size(); i++)
    {
        PDPage page = (PDPage) pages.get(i);
        PDStream contents = page.getContents();

        PDFStreamParser parser = new PDFStreamParser(contents.getStream()); 
        parser.parse();
        List<Object> tokens = parser.getTokens();  

        Stack<Boolean> blueState = new Stack<Boolean>();
        blueState.push(false);

        for (int j = 0; j < tokens.size(); j++)  
        {  
            Object next = tokens.get(j);
            if (next instanceof PDFOperator)
            {
                PDFOperator op = (PDFOperator) next;  
                if (op.getOperation().equals("q"))
                {
                    blueState.push(blueState.peek());
                }
                else if (op.getOperation().equals("Q"))
                {
                    blueState.pop();
                }
                else if (op.getOperation().equals("rg"))
                {
                    if (j > 2)
                    {
                        Object r = tokens.get(j-3);
                        Object g = tokens.get(j-2);
                        Object b = tokens.get(j-1);
                        if (r instanceof COSNumber && g instanceof COSNumber && b instanceof COSNumber)
                        {
                            blueState.pop();
                            blueState.push((
                                    Math.abs(((COSNumber)r).floatValue() - 0) < 0.001 &&
                                    Math.abs(((COSNumber)g).floatValue() - 0) < 0.001 &&
                                    Math.abs(((COSNumber)b).floatValue() - 1) < 0.001));
                        }
                    }
                }
                else if (op.getOperation().equals("f"))
                {
                    if (blueState.peek() && j > 0)
                    {
                        Object re = tokens.get(j-1);
                        if (re instanceof PDFOperator && ((PDFOperator)re).getOperation().equals("re"))
                        {
                            tokens.set(j, PDFOperator.getOperator("n"));
                        }
                    }
                }
            }
        }

        PDStream updatedStream = new PDStream(document);  
        OutputStream out = updatedStream.createOutputStream();  
        ContentStreamWriter tokenWriter = new ContentStreamWriter(out);  
        tokenWriter.writeTokens(tokens);  
        page.setContents(updatedStream);
    }
}

(RemoveUnderlines.java)

original.pdf

Applying this to your first sample file original.pdf

public void testOriginal() throws IOException, COSVisitorException
{
    try (   InputStream resourceStream = getClass().getResourceAsStream("original.pdf")   )
    {
        PDDocument document = PDDocument.loadNonSeq(resourceStream, null);

        removeBlueRectangles(document);
        document.save("original-noBlueRectangles.pdf");

        document.close();
    }
}

(RemoveUnderlines.java)

results in

original-noBlueRectangles.pdf page 1

1178.pdf

You commented

After testing this on many files I have to say this solution works incorrectly in some cases. For example, for this file (dropbox.com/s/23g54bvt781lb93/1178.pdf?dl=0) it removes the entire content of the page. Keep searching...

So I applied the code to your new sample file 1178.pdf

public void test1178() throws IOException, COSVisitorException
{
    try (   InputStream resourceStream = getClass().getResourceAsStream("1178.pdf")   )
    {
        PDDocument document = PDDocument.loadNonSeq(resourceStream, null);

        removeBlueRectangles(document);
        document.save(new File(RESULT_FOLDER, "1178-noBlueRectangles.pdf"));

        document.close();
    }
}

(RemoveUnderlines.java)

which resulted in

1178-noBlueRectangles.pdf page 1

So I cannot confirm your claim that the solution works incorrectly; in particular I see that it does not remove the entire content of the page.

As I cannot reproduce your observation, I assume there are additional issues in your setup you have not yet mentioned.

mkl
  • After testing this on many files I have to say this solution works incorrectly in some cases. For example, for this file (https://www.dropbox.com/s/23g54bvt781lb93/1178.pdf?dl=0) it removes the entire content of the page. Keep searching... – Anubis Jan 13 '16 at 11:46
  • I retrieved your new file, applied the code to it, and it worked properly. Thus, your setup seems to have additional issues. – mkl Jan 13 '16 at 12:48
  • You are correct. My fault. I used the .NET port of PDFBox. – Anubis Jan 13 '16 at 14:25
  • Ah, ok. I have to admit I don't know which version that port is at or whether there are substantial differences. – mkl Jan 13 '16 at 15:35