I need a regex to find a url which is not inside any html tag or an attribute value of any html tag

Question

I have HTML content in the following text:

    "This is my text to be parsed which contains url 
    http://someurl.com?param1=foo&params2=bar 
    <a href="http://thisshouldnotbetampered.com">
    some text and a url http://someotherurl.com test 1q2w
    </a> <img src="http://someasseturl.com/abc.jpeg"/>
    <span>i have a link too http://someurlinsidespan.com?xyz=abc </span> 
    "

I need a regex that will convert plain URLs to hyperlinks (without tampering with existing hyperlinks).

Expected result:

    "This is my text to be parsed which contains url 
    <a href="http://someurl.com?param1=foo&params2=bar">
    http://someurl.com?param1=foo&params2=bar</a> 
    <a href="http://thisshouldnotbetampered.com">
    some text and a url http://someotherurl.com test 
    1q2w </a> <img src="http://someasseturl.com/abc.jpeg"/>
    <span>i have a link too <a href="http://someurlinsidespan.com?xyz=abc">http://someurlinsidespan.com?xyz=abc</a> </span> "

Regular expressions are probably not the right tool for this job. Consider alternatives: http://nokogiri.org/ — Ant P, Jun 11 '13 at 07:01
I don't know about Ruby's implementation of regexs but that works fine in www.regex101.com. And regexs are fine as long as you know the structure of your likely inputs. — ydaetskcoR, Jun 11 '13 at 07:29

score 3 · Answer 1 · answered Jun 11 '13 at 08:20

Disclaimer: You shouldn't use regex for this task; use an HTML parser. This is a POC to demonstrate that it's possible if you expect well-formed HTML (which you won't have anyway).

So here's what I came up with:
    (https?:\/\/(?:w{1,3}.)?[^\s]*?(?:\.[a-z]+)+)(?![^<]*?(?:<\/\w+>|\/?>))

What does this mean?

( : start of group 1
https? : match http or https
\/\/ : match //
(?:w{1,3}.)? : optionally match w., ww. or www. (the . is unescaped, so it actually matches any character)
[^\s]*? : match anything except whitespace zero or more times, ungreedy
(?:\.[a-z]+)+ : match a dot followed by one or more [a-z] characters; repeat this one or more times
) : end of group 1
(?! : negative lookahead
- [^<]*? : match anything except < zero or more times, ungreedy
- (?:<\/\w+>|\/?>) : match a closing tag, /> or >
- ) : end of lookahead

regex101 online demo rubular online demo
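A minimal Ruby sketch of how the pattern above could be applied. The method name `urls_outside_tags` and the shortened sample string are assumptions made for illustration; the regex itself is copied unchanged from the answer.

```ruby
# The pattern from the answer, unchanged.
URL_OUTSIDE_TAGS = /(https?:\/\/(?:w{1,3}.)?[^\s]*?(?:\.[a-z]+)+)(?![^<]*?(?:<\/\w+>|\/?>))/

# Hypothetical helper: returns the URLs the pattern accepts.
def urls_outside_tags(html)
  # scan yields one array per match (one entry per capture group), so flatten it.
  html.scan(URL_OUTSIDE_TAGS).flatten
end

# Hypothetical sample, shortened from the question's input.
sample = 'see http://someurl.com?param1=foo now ' \
         '<a href="http://thisshouldnotbetampered.com">a url http://someotherurl.com test</a>'

p urls_outside_tags(sample)
```

On this sample the pattern skips both the href value and the URL inside the link text. It also stops at `.com`, so the query string of the first URL is not part of the match.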

Ok. My bad. The answer works, but not for the real scenario, so let me put it straight: I want to make every URL in the text a hyperlink. (The regex should not tamper with existing hyperlinks.) Also, the above regexp does not capture the URL parameters. — krunal shah, Jun 11 '13 at 09:47
@krunalshah I tend to provide an answer to the existing question. You didn't mention anywhere that you wanted to match URL parameters, not even in your examples. So what does not work in your real scenario? — HamZa, Jun 11 '13 at 09:51
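For the requirement as restated in the comment (hyperlink every plain URL, leave existing links and attribute values alone), one possible approach is to split the input into tags and text runs instead of relying on a single pattern. This is only a sketch under assumptions: the helper name `linkify` is hypothetical, the URL pattern is deliberately loose, and the input is assumed to be reasonably well-formed HTML. It is not the approach from the answer above.

```ruby
# Hypothetical helper: wraps plain URLs in <a> tags, leaving tags themselves
# and any text already inside an <a> element untouched.
def linkify(html)
  inside_anchor = 0
  # Splitting on a capture group keeps the tags in the resulting array.
  html.split(/(<[^>]+>)/).map do |chunk|
    if chunk.start_with?('<')
      # Track whether we are inside a link; all tags pass through unchanged.
      inside_anchor += 1 if chunk =~ /\A<a\b/i
      inside_anchor -= 1 if chunk =~ /\A<\/a\s*>/i
      chunk
    elsif inside_anchor.positive?
      chunk # text of an existing link: do not tamper with it
    else
      # Loose URL pattern (an assumption): scheme, then everything up to whitespace or <.
      chunk.gsub(%r{https?://[^\s<]+}) { |url| %(<a href="#{url}">#{url}</a>) }
    end
  end.join
end

puts linkify('url http://someurl.com?param1=foo&params2=bar ' \
             '<a href="http://keep.com">a http://other.com b</a> ' \
             '<span>link http://inspan.com?xyz=abc </span>')
```

Because attribute values live inside the tag chunks, they are never rewritten. Trailing punctuation after a URL would be swallowed by the loose pattern, so a real implementation would likely want a stricter URL matcher or an HTML parser.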

score 2 · Answer 2 · answered Jun 11 '13 at 07:50

Maybe you could do a search-and-replace first to remove the HTML elements. I don't know Ruby, but the regex would be something like /<(\w+).*?>.*?<\/\1>/. But it might be tricky if you have nested elements of the same type.

David Knipe

+1 for remove tags, but that expression could be better if you want to do it with regex. For example it wouldn't remove image tags, or it would remove whole paragraphs (`<p>i want http://this.url</p>`). I'd suggest something simple like `</?\w+[^>]*>`. – Qtax Jun 11 '13 at 07:54
I disagree with your interpretation of the question, but it's a moot point as the question has now changed. – David Knipe Jun 11 '13 at 18:39
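A rough Ruby sketch of the strip-then-search idea from this answer. The method name `plain_text_urls`, the sample string, the follow-up tag pattern, and the loose URL pattern are all assumptions for illustration; the first pattern is the answer's, with the slash escaped so it is valid inside a Ruby regex literal.

```ruby
# Hypothetical helper: removes HTML first, then collects whatever URLs are left.
def plain_text_urls(html)
  # Step 1: drop whole paired elements (the answer's pattern; m lets . cross newlines).
  without_elements = html.gsub(/<(\w+).*?>.*?<\/\1>/m, ' ')
  # Step 2: drop leftover standalone tags such as <img .../> (assumed simple tag pattern).
  without_tags = without_elements.gsub(/<\/?\w+[^>]*>/, ' ')
  # Step 3: deliberately loose URL pattern (an assumption): scheme plus non-whitespace.
  without_tags.scan(%r{https?://\S+})
end

sample = 'see http://someurl.com?x=1 <a href="http://kept.com">a http://inner.com b</a> ' \
         '<img src="http://asset.com/abc.jpeg"/>'

p plain_text_urls(sample)
```

As the answer warns, nested elements of the same type can confuse step 1, and removing whole elements also discards URLs inside harmless wrappers such as a span or a paragraph.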

score 0 · Answer 3 · answered Jun 11 '13 at 08:23

Maybe try http://rubular.com/. It has some regex tips that can help you get the desired output.

tokhi

score 0 · Answer 4 · answered Jun 11 '13 at 09:43

I would do something like this:

    require 'nokogiri'
    require 'uri'

    doc = Nokogiri::HTML.fragment <<EOF
    This is my text to be parsed which contains url 
    http://someurl.com  <a href="http://thisshouldnotbetampered.com">
    some text and a url http://someotherurl.com test 1q2w </a> <img src="http://someasseturl.com/abc.jpeg"/>
    EOF

    # Replace every element (and everything inside it) with a newline,
    # so that only the top-level text remains.
    doc.search('*').each{|n| n.replace "\n"}

    # Pull the remaining URLs out of that text.
    URI.extract doc.text
    #=> ["http://someurl.com"]
