Find All URLs That Are Not an HTML Attribute Value or the Content of a Hyperlink Tag

Question

I'm trying to figure out a regex that matches all URLs that are neither an attribute value of an element nor the content of a hyperlink.

Should match:

 1. This is a url http://www.google.com

Should not match:

 1. <a href="http://www.google.com">Google</a>
 2. <a href="http://www.google.com">http://www.google.com</a> 
 3. <img src="http://www.google.com/image.jpg">
 4. <div data-url="http://www.google.com"></div>

I'm currently using the regex below to match all URLs. I think I know what I have to detect, but I can't figure out how to express it as a regex.

\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]
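To make the problem concrete, here is a minimal sketch that runs the regex above over a shortened version of the sample input. The class and method names (`PlainUrlRegexDemo`, `findAll`) are made up for illustration. It shows that the plain regex also reports the URLs sitting inside `href` and `src` attributes, which is exactly what needs to be avoided.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlainUrlRegexDemo {
    // The URL regex from the question, written as a Java string literal.
    static final Pattern URL = Pattern.compile(
        "\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]");

    // Collects every match, with no awareness of tags or attributes.
    static List<String> findAll(String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = URL.matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String sample = "This is a url http://www.google.com "
            + "<a href=\"http://www.google.com\">Google</a>"
            + "<img src=\"http://www.google.com/image.jpg\">";
        // Prints three URLs: the bare one plus the two attribute values.
        for (String url : findAll(sample)) {
            System.out.println(url);
        }
    }
}
```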

EDITED

What I'm trying to achieve is the following. I want to convert this string.

This is a url http://www.google.com <a href="http://www.google.com" title="Go to Google">Google</a><a href="http://www.google.com">http://www.google.com</a><img src="http://www.google.com/image.jpg"><div data-url="http://www.google.com"></div>

To

This is a url <a href="http://www.google.com">http://www.google.com</a> <a href="http://www.google.com" title="Go to Google">Google</a><a href="http://www.google.com">http://www.google.com</a><img src="http://www.google.com/image.jpg"><div data-url="http://www.google.com"></div>

Preprocessing by removing the tags and then putting them back doesn't solve the problem, since it ends up removing all data attributes of the existing hyperlink elements. It also doesn't handle URLs used in attributes other than href.

So far, nobody has suggested a working solution, and I haven't found a way to do this with an HTML parser either. It actually seems more doable with a regex.

EDITED 2

After an attempt based on Dean's suggestion, I'm about ready to rule out an HTML parser for this, because it cannot process a string without turning it into a valid HTML document. Here's the code based on the suggested example, plus a fix to handle exclusion case 2.

    Document doc = Jsoup.parseBodyFragment(htmlText);

    final List<TextNode> nodesToChange = new ArrayList<TextNode>();

    NodeTraversor nd  = new NodeTraversor(new NodeVisitor() {

        @Override
        public void tail(Node node, int depth) {
            if (node instanceof TextNode) {
                TextNode textNode = (TextNode) node;
                Node parent = node.parent();
                if(parent.nodeName().equals("a")){
                    return;
                }

                String text = textNode.getWholeText();

                List<String> allMatches = new ArrayList<String>();
                Matcher m = Pattern.compile("\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]")
                        .matcher(text);
                while (m.find()) {
                    allMatches.add(m.group());
                }

                if(allMatches.size() > 0){
                    nodesToChange.add(textNode);
                }
            }
        }

        @Override
        public void head(Node node, int depth) {        
        }
    });

    nd.traverse(doc.body());

This code adds HTML, HEAD and BODY tags to the result. The only hack I can think of to get around this is to check whether HTML, HEAD and BODY tags exist in the input string and, if they don't, strip them out after processing.
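A rough sketch of that hack, using only the standard library. The names `BodyUnwrapSketch`, `looksLikeFullDocument` and `unwrapBody` are hypothetical, and the code assumes the serialized output contains a plain lowercase `<body>` tag without attributes; if the serializer emits something else, the string is returned unchanged.

```java
import java.util.regex.Pattern;

class BodyUnwrapSketch {
    // Case-insensitive checks for the three wrapper tags in the ORIGINAL input.
    private static final Pattern HTML_TAG = Pattern.compile("<html[\\s>]", Pattern.CASE_INSENSITIVE);
    private static final Pattern HEAD_TAG = Pattern.compile("<head[\\s>]", Pattern.CASE_INSENSITIVE);
    private static final Pattern BODY_TAG = Pattern.compile("<body[\\s>]", Pattern.CASE_INSENSITIVE);

    // True when the input already carries its own html, head and body tags.
    static boolean looksLikeFullDocument(String input) {
        return HTML_TAG.matcher(input).find()
            && HEAD_TAG.matcher(input).find()
            && BODY_TAG.matcher(input).find();
    }

    // Returns the markup between <body> and the last </body>.
    // Assumption: the serializer wrote exactly "<body>" (lowercase, no attributes).
    static String unwrapBody(String serialized) {
        int open = serialized.indexOf("<body>");
        int close = serialized.lastIndexOf("</body>");
        if (open < 0 || close < open) {
            return serialized; // nothing recognisable to strip
        }
        return serialized.substring(open + "<body>".length(), close);
    }
}
```

The idea is to call `looksLikeFullDocument` on the input before parsing and `unwrapBody` on the output only when it returned false. It may also be worth checking whether jsoup's `Element.html()`, called on `doc.body()`, already returns the inner markup, which would make the string surgery unnecessary.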

I hope someone else has a better suggestion than this hack. Using JSOUP is already very expensive in terms of processing time so I really don't want to add more overhead if I don't have to.

*"but I just can't figure out using regex."* RegEx was never meant to parse HTML. Use an HTML parser. http://stackoverflow.com/q/1732348/418556 — Andrew Thompson, Dec 14 '13 at 21:34
Regular expressions are a powerful formalism, but they are not well suited for extracting data from html or XML. You should be using an XML Query language such as XQuery, XPath or XSLT or an XML API such as SAX within a pre-processing step. In this preprocessing step you could get rid of all attributes and anchor tags. If your html is not well-formed you will have to use an HTML cleaner in another preprocessing step. — user152468, Dec 14 '13 at 21:37
@AndrewThompson I'm actually fine with using an HTML parser too. Then how would you approach it? Let's say I have this string: "This is a url http://www.google.com Google http://www.google.com". Everyone keeps suggesting an HTML/XML parser, but no one has suggested a way to solve this. An XML parser can't be used here since the input is not well-formed XML. With an HTML parser, I still need to find a way to process it. — juminoz, Dec 14 '13 at 22:36
@user152468 is surely better placed to advise on the specifics. My 'expertise' on the matter was gained via other people, and was pretty much exhausted in my initial comment. Sorry. — Andrew Thompson, Dec 14 '13 at 22:40
BTW - please don't try to put code, HTML etc. in comments.. Edit the question instead. — Andrew Thompson, Dec 14 '13 at 22:41
If you are dead against using a HTML parser, consider using negative lookbehind (for `\bsrc="` or `\bhref="`) and lookaheads for `"\b` - but you are likely to create yourself unwanted edge cases and problems. — Dean Taylor, Dec 14 '13 at 23:11
This does not look like an HTML parsing problem, since you're looking for things that LOOK like URL's in an arbitrary text file or string. Unless your input is HTML/XML or something and the sample string you're using is just the text of a particular element. — MxLDevs, Dec 15 '13 at 00:03
The input can be anything. Sometimes it may be well-formed HTML. In most cases, it will simply be some string containing HTML elements. Regex is still the most efficient option and, based on my understanding, it should be doable. I'm just not very good with regex. Traversing the HTML using JSOUP is just overkill. — juminoz, Dec 15 '13 at 09:43
@juminoz I don't know how Jsoup handles non-HTML documents since I never tried it myself (I probably should) but if possible I'd much rather use Jsoup or something else to separate HTML from non-HTML, and then run your regex on the non-HTML portions. You *really* don't want to have to write your regex to determine whether the string you're looking at is actually part of a valid HTML element or not, as you'll probably have a terrible regex to manage. — MxLDevs, Dec 16 '13 at 17:07

score 2 · Answer 1 · edited May 23 '17 at 10:32

Expecting Valid HTML Output

Here is a rough guide to get you started.

Use an HTML5 parsing engine such as jsoup (a Java HTML parser)
- The HTML5 specification deals with invalid HTML in a known, specified way for predictable results.
- This parsing engine also provides HTML modification methods.

Parse your HTML something like this:

String html = "This is a url http://www.google.com <a href=\"http://www.google.com\" title=\"Go to Google\">Google</a>";
Document doc = Jsoup.parseBodyFragment(html);
Element body = doc.body();

Find all your text nodes (non-HTML element bits)
- You can find an example of a jsoup text iterator in this answer.
Test to see if the text looks like a link (use your regex)
Replace the text as indicated in the same example.
Obtain the HTML of the complete modified document.
Sit back and enjoy.
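The "test" and "replace" steps for the content of a single text node could be sketched as follows. This uses only `java.util.regex`, so it works on whatever string the parser hands you for a text node. The names `TextNodeLinkifier`, `linkify` and `escapeHtml` are made up for this sketch, and the escaping is deliberately minimal. Only the URL is escaped; the surrounding text is copied through unchanged.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TextNodeLinkifier {
    // The URL regex from the question, compiled once and reused.
    static final Pattern URL = Pattern.compile(
        "\\b(?:https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]");

    // Minimal escaping for use in a double-quoted attribute value or element text.
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Wraps every URL found in plain text in an anchor element.
    static String linkify(String text) {
        Matcher m = URL.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(text, last, m.start());   // text before the URL, unchanged
            String url = escapeHtml(m.group());
            out.append("<a href=\"").append(url).append("\">").append(url).append("</a>");
            last = m.end();
        }
        out.append(text, last, text.length());   // text after the last URL
        return out.toString();
    }
}
```

Text nodes whose parent is already an `a` element should be skipped before calling `linkify`, otherwise a link whose visible text is a URL would get wrapped a second time.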

Edit 1 - The Crazy World of replacing in Invalid HTML

The author of this question has since indicated that the content is not valid HTML and that the invalid HTML must be preserved. In that case an HTML parser shouldn't be used, as any HTML parser would likely output valid HTML when saving.

As indicated in my comment on the original question, you can use negative look-behinds in a regex. But only a fool would parse HTML with RegEx - apparently we aren't, so here is one possible example.

I wouldn't use this in production code - but it answers OP's question

The RegEx

Unfortunately Java doesn't support unlimited look-behinds so I have included the following limits:

Tag name - max of 255 characters
Spaces - max of 30 characters
Attribute contents (including attributes and values) - max of 4098 characters
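Before the full expression, here is a much smaller sketch of the same idea: a negative look-behind in which every quantifier has an explicit upper bound. It is not the regex from this answer. It only rejects URLs that directly follow an `=` (with optional whitespace and an optional quote), so it handles attribute values but not link text, and it would also skip a URL in plain text such as `x = http://...`. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundedLookbehindSketch {
    // (?<! ... ) is the negative look-behind; \s{0,30} and ["']? are both bounded.
    static final Pattern BARE_URL = Pattern.compile(
        "(?<!=\\s{0,30}[\"']?)"
        + "\\b(?:https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]");

    // Returns the URLs that are NOT immediately preceded by an attribute-style "=".
    static List<String> find(String input) {
        List<String> urls = new ArrayList<>();
        Matcher m = BARE_URL.matcher(input);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        System.out.println(find("see http://a.example/x"));                     // one URL
        System.out.println(find("<img src=\"http://a.example/i.jpg\">"));       // none
        System.out.println(find("<div data-url = 'http://a.example'></div>")); // none
    }
}
```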

Negative Look-behind

(Image: regular expression visualization.) Note that this visualization is not exact: [\p{L}0-9_.-] was replaced with [A-Z0-9_.-] to get the visualisation to render, but \p{L} is technically more correct, since any Unicode letter is possible.

Complete Regex

# Negative look-behind
(?<!
## N1: Looks like an HTML attribute value inside a HTML tag
### N1: Tag name
<[A-Z0-9]{1,255}
### N1: Any HTML attributes and values
(?:\s{1,30}[^<>]{0,4098})?
### N1: The beginning of a HTML attribute with value
\s{1,30}
[\p{L}0-9_.-]{1,255}
\s{0,30}=\s{0,30}
### N1: Optional HTML attribute quotes
["']?
|
## N2: Looks like the start of an HTML tag text content
### N2: Tag name
<[A-Z0-9]{1,255}\s{1,30}
### N2: All HTML attributes and values
[^<>]{0,4098}
### N2: End of HTML opening tag
>
)
## Positive match: The URL value
((?:https?|ftp|file)://[-a-zA-Z0-9+&@\#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@\#/%=~_|])

The Java

import java.util.*;
import java.lang.*;
import java.io.*;
import java.util.regex.*;

class CrazyInvalidHtmlUrlTextFindAndReplacer
{
    public static final String EXAMPLE_TEST = "This is a url http://www.google.com <a href=\"http://www.google.com\" title=\"Go to Google\">Google</a><a href=\"http://www.google.com\">http://www.google.com</a><img src=\"http://www.google.com/image.jpg\"><div data-url=\"http://www.google.com\"></div>";
    public static final String EXPECTED_OUTPUT_TEST = "This is a url <a href=\"http://www.google.com\">http://www.google.com</a> <a href=\"http://www.google.com\" title=\"Go to Google\">Google</a><a href=\"http://www.google.com\">http://www.google.com</a><img src=\"http://www.google.com/image.jpg\"><div data-url=\"http://www.google.com\"></div>";

    public static void main (String[] args) throws java.lang.Exception
    {
        System.out.println("Starting our non-HTML search and replace...");
        StringBuffer resultString = new StringBuffer();
        String subjectString = new String(EXAMPLE_TEST);
        System.out.println(subjectString);
        try {
            Pattern regex = Pattern.compile(
    "# Negative lookbehind\n" +
        "(?<!\n" +
        "## N1: Looks like an HTML attribute value inside a HTML tag\n" +
        "### N1: Tag name\n" +
        "<[A-Z0-9]{1,255}\n" +
        "### N1: Any HTML attributes and values\n" +
        "(?:\\s{1,30}[^<>]{0,4098})?\n" +
        "### N1: The begining of a HTML attribute with value\n" +
        "\\s{1,30}\n" +
        "[\\p{L}0-9_.-]{1,255}\n" +
        "\\s{0,30}=\\s{0,30}\n" +
        "### N1: Optional HTML attribute quotes\n" +
        "[\"']?\n" +
        "|\n" +
        "## N2: Looks like the start of an HTML tag text content\n" +
        "### N2: Tag name\n" +
        "<[A-Z0-9]{1,255}\\s{1,30}\n" +
        "### N2: All HTML attributes and values\n" +
        "[^<>]{0,4098}\n" +
        "### N2: End of HTML opening tag\n" +
        ">\n" +
        ")\n" +
        "## Positive match: The URL value\n" +
        "((?:https?|ftp|file)://[-a-zA-Z0-9+&@\\#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@\\#/%=~_|])", 
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE | Pattern.COMMENTS);
            Matcher regexMatcher = regex.matcher(subjectString);
            while (regexMatcher.find()) {
                System.out.println("text");
                try {
                    // You can vary the replacement text for each match on-the-fly

                    // !!!!!!!!!
                    // @todo Escape the attribute values and content text.
                    // !!!!!!!!!


                    regexMatcher.appendReplacement(resultString, "<a href=\"$1\">$1</a>");
                } catch (IllegalStateException ex) {
                    // appendReplacement() called without a prior successful call to find()
                    System.out.println("IllegalStateException");
                } catch (IllegalArgumentException ex) {
                    // Syntax error in the replacement text (unescaped $ signs?)
                    System.out.println("IllegalArgumentException");
                } catch (IndexOutOfBoundsException ex) {
                    // Non-existent backreference used in the replacement text
                    System.out.println("IndexOutOfBoundsException");
                }
            }
            regexMatcher.appendTail(resultString);

        } catch (PatternSyntaxException ex) {
            // Syntax error in the regular expression
            System.out.println("PatternSyntaxException");
            System.out.println(ex.toString());
        }

        System.out.println("result:");
        System.out.println(resultString.toString());

        if (resultString.toString().equals(EXPECTED_OUTPUT_TEST)) {
            System.out.println("success!!!!");
        } else {
            System.out.println("failure - expected:");
            System.out.println(EXPECTED_OUTPUT_TEST);
        }
    }

}

I have no idea how this performs: look-behinds are expensive, and that is on top of RegEx matching generally being expensive.

+1. Good to know that Jsoup can be used to separate HTML from non-HTML elements. — MxLDevs, Dec 16 '13 at 17:11
There are 2 flaws to this answer (and the mentioned example). 1) It doesn't handle the exclusion case 2. You need to add a check whether TextNode has a parent of type HyperLink (that can be done using JSOUP). 2) It modifies my string to make it a well-formed HTML on parse. Because of this, I have no idea how to get back the original string + modified parts. I need to be able to do this on invalid HTML string as well as valid HTML document. It can't just give me a valid HTML out of my string. It doesn't seem like there is a way to do this with JSOUP from my experience. — juminoz, Dec 18 '13 at 01:50
In the original question there was (and is currently) **no mention** that the input content was **invalid HTML** and that **you wanted to keep it that way**. Effectively making it "non-HTML" content. If this is the case we cannot go by HTML rules. — Dean Taylor, Dec 18 '13 at 15:17
I added **The Crazy World of replacing in Invalid HTML** example - enjoy your RegEx. — Dean Taylor, Dec 18 '13 at 18:57

user152468 · Answer 2 · 2013-12-14T23:13:25.700

As discussed in the comments on the question, solving this with a regular expression alone is hard (maybe impossible?). Below is an XSLT stylesheet that performs a preprocessing step to remove all attributes and all anchor tags from the input HTML.

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="node()">
    <xsl:copy>
      <xsl:apply-templates select="node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="a">
  </xsl:template>

</xsl:stylesheet>

Then you can run your regex to extract the remaining URLs, which will be much simpler.

If your input HTML is not valid, then use jtidy, htmlcleaner or htmltidy as a further preprocessing step.

Hope this helps.
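For completeness, a minimal sketch of applying such a stylesheet from Java with the JDK's built-in `javax.xml.transform` API. It assumes the input is already well-formed XML (for example after one of the cleaners above has run). The class name `StripAnchorsAndAttributes` and the method `strip` are made up for this sketch, and the stylesheet is inlined as a string without the XML declaration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class StripAnchorsAndAttributes {
    // Same two templates as the stylesheet above: copy nodes without their
    // attributes, and emit nothing for <a> elements.
    static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:template match=\"node()\">"
        + "<xsl:copy><xsl:apply-templates select=\"node()\"/></xsl:copy>"
        + "</xsl:template>"
        + "<xsl:template match=\"a\"/>"
        + "</xsl:stylesheet>";

    // Transforms well-formed XML and returns the result without an XML declaration.
    static String strip(String wellFormedXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(wellFormedXml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException ex) {
            // Thrown for a broken stylesheet or for input that is not well-formed.
            throw new IllegalStateException("XSLT preprocessing failed", ex);
        }
    }
}
```

Note that this step only helps with finding the URLs. The output no longer contains the original attributes or links, so it cannot serve as the final converted string.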

juminoz · Accepted Answer · 2013-12-18T17:49:04.833

Based on Dean's suggestion and the example he mentioned, here's the "solution" to the problem. Keep in mind that it's very expensive, mainly because of parsing the HTML string (~160ms on a quad-core/16GB RAM MBPr). This solution handles both valid and invalid HTML. It includes a small hack around a JSOUP limitation so that the extra tags JSOUP adds to make the result a valid HTML document are not included in the output. I really hope someone can come up with a better solution, but here it is for now.

public static String makeHTML(String htmlText){
    boolean isValidDoc = false;
    if((htmlText.contains("<html") || htmlText.contains("<HTML")) && 
            (htmlText.contains("<head") || htmlText.contains("<HEAD")) &&
            (htmlText.contains("<body") || htmlText.contains("<BODY"))){
        isValidDoc = true;
    }

    Document doc = Jsoup.parseBodyFragment(htmlText);
    final String urlRegex = "\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]";

    final List<TextNode> nodesToChange = new ArrayList<>();
    final List<String> changedContent = new ArrayList<>();

    NodeTraversor nd  = new NodeTraversor(new NodeVisitor() {

        @Override
        public void tail(Node node, int depth) {
            if (node instanceof TextNode) {
                TextNode textNode = (TextNode) node;
                Node parent = node.parent();
                if(parent.nodeName().equals("a")){
                    return;
                }

                String text = textNode.getWholeText();

                // replaceAll rewrites each occurrence exactly once. Looping over the
                // matches with String.replace would wrap a URL twice when the same
                // URL appears more than once in one text node.
                Matcher m = Pattern.compile(urlRegex).matcher(text);
                if (m.find()) {
                    changedContent.add(m.replaceAll("<a href=\"$0\">$0</a>"));
                    nodesToChange.add(textNode);
                }
            }
        }

        @Override
        public void head(Node node, int depth) {        
        }
    });

    nd.traverse(doc.body());

    int count = 0;
    for (TextNode textNode : nodesToChange) {
        String result = changedContent.get(count++);
        Node newNode = new DataNode(result, textNode.baseUri());
        textNode.replaceWith(newNode);
    }

    String processed = doc.toString();
    if(!isValidDoc){
        int start = processed.indexOf("<body>") + 6;
        int end = processed.lastIndexOf("</body>");
        processed = processed.substring(start, end);
    }

    return processed;
}