Regular expression to determine each and every attribute of an anchor tag inside HTML content

Question

I want to extract the value of every attribute of an anchor tag. Any of the attributes may be absent, and the href may use HTTP or HTTPS.

A sample anchor tag inside content is:

 <a class=\"direct_link\" rel=\"nofollow\" target=\"_blank\" href=\"http://google.com\">link text</a>

Sample HTML content is:

<p><br></p><h1>A beautiful <a class=\"f-link\" rel=\"nofollow\" target=\"_blank\" href=\"fake.com/abc.html\">jQuery</a>; a</h1><h3 class=\"text-light\">Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's.</h3><p><br></p><p><br></p>

Does the data you want to parse only include the anchor tag itself, or is it the whole HTML page? Does the anchor tag include the slashes as you've pasted them above? — simpleigh, Aug 01 '14 at 11:27
@simpleigh Yes, it will include the slashes, and it will contain HTML content coming from a WYSIWYG editor. — RailsEnthusiast, Aug 01 '14 at 11:29
@Unihedron Added Ruby as the preferred language; I am using the match/scan methods. — RailsEnthusiast, Aug 01 '14 at 11:34
In general it's not a good idea to parse HTML with regex: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — achingfingers, Aug 01 '14 at 11:34
@achingfingers Agreed, but I basically want to update all those links with some new values, so I am going to replace the old links with new ones. — RailsEnthusiast, Aug 01 '14 at 11:38
What matches do you want from the tag? What's your expected output? Do you want `class=\"...\"` or just the `\"...\"`? — skamazin, Aug 01 '14 at 11:44
@skamazin Yes, something like class = "class_name" href =".." etc. But these attributes are optional and their order is not predefined. — RailsEnthusiast, Aug 01 '14 at 11:50
I don't really care about the sequence; so long as the values are all word characters, my answer below should work. Let me know if it doesn't. — skamazin, Aug 01 '14 at 11:54
@skamazin Can we somehow update the regex below to search only anchor tags? — RailsEnthusiast, Aug 01 '14 at 11:56
Could you supply a bigger text snippet in your original question? I'll try to make it work, but I don't know the exception to my regex if I don't have the input — skamazin, Aug 01 '14 at 11:59
This one doesn't have escaped quotes though. So which is it, [`\"`] or just [`"`]? — skamazin, Aug 01 '14 at 12:05
@skamazin A beautiful jQuery; WYSIWYG text editor. Froala WYSIWYG Editor is built on the latest technologies and according to the latest industry trends. — RailsEnthusiast, Aug 01 '14 at 12:06
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/58487/discussion-between-skamazin-and-railsenthusiast). — skamazin, Aug 01 '14 at 12:07

the Tin Man · Accepted Answer · 2014-08-01T20:33:20.823

Don't use a regular expression to try to parse HTML. The same HTML can be written in many different ways and still be valid, and those variations will break your pattern and code.

The correct way to get the attribute values is to use a parser. Nokogiri is the de facto XML/HTML parser for Ruby:

require 'nokogiri'

doc = Nokogiri::HTML::DocumentFragment.parse(' <a class=\"direct_link\" rel=\"nofollow\" target=\"_blank\" href=\"http://google.com\">link text</a>')

That parses the document into a DOM and returns it.

link = doc.at('a')

`at` finds the first match for the CSS selector 'a'. (If you want to iterate over all of them, use `search`, which returns a NodeSet that behaves much like an Array.)

At this point link is a Node, which we can consider to be like a pointer to the <a> tag.

link.to_h # => {"class"=>"\\\"direct_link\\\"", "rel"=>"\\\"nofollow\\\"", "target"=>"\\\"_blank\\\"", "href"=>"\\\"http://google.com\\\""}

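Those are the link's attributes and their values turned into a hash. (The backslashes and quotes show up inside the values because the sample string contains literal \" sequences, which the parser treats as part of each unquoted value.) You can also access the attribute names or values directly, using `keys` or `values`: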

link.values # => ["\\\"direct_link\\\"", "\\\"nofollow\\\"", "\\\"_blank\\\"", "\\\"http://google.com\\\""]
link.keys # => ["class", "rel", "target", "href"]

Or treat it like a hash and iterate over the key/value pairs:

link.each do |k, v|
  puts 'parameter: "%s" value: "%s"' % [k, v]
end
# >> parameter: "class" value: "\"direct_link\""
# >> parameter: "rel" value: "\"nofollow\""
# >> parameter: "target" value: "\"_blank\""
# >> parameter: "href" value: "\"http://google.com\""

The advantage of using a parser is that the HTML format can change and the parser is still able to figure it out, so your code won't care. The following format works just as well as the tag used above:

doc = Nokogiri::HTML::DocumentFragment.parse(' <a 
  class=\"direct_link\" 
    rel=\"nofollow\" target=\"_blank\"
    href=\"http://google.com\">
    link text
    </a>')

Try doing that with a pattern.

skamazin · Answer 2 · 2014-08-01T12:06:57.783

If you only want the values inside the quotes, the pattern would be this:

"([\w:\/.]+)\\"

Test it here

Otherwise, if you also want the attribute name before the quotes, it would be this:

(\w+=\\"[\w:\/.]+\\")

Test it here

This one matches attributes whose quotes are not backslash-escaped:

(\w+="[\w:\/.-]+")

Test it here
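
For reference, here is a small sketch of how these patterns could be used with Ruby's `scan`. The variant with separate capture groups for name and value is an illustrative variation, not one of the patterns above, and `tricky_tag` is a made-up input that shows where the character class stops matching.

```ruby
# Tag copied from the question; single quotes keep the backslashes literal.
escaped_tag = '<a class=\"direct_link\" rel=\"nofollow\" target=\"_blank\" href=\"http://google.com\">link text</a>'

# Second pattern above: name=\"value\" pairs with backslash-escaped quotes.
escaped_pairs = escaped_tag.scan(/(\w+=\\"[\w:\/.]+\\")/).flatten
puts escaped_pairs

# Variation on the third pattern with two capture groups, so the
# name/value pairs can be turned into a hash.
plain_tag = '<a class="f-link" rel="nofollow" target="_blank" href="fake.com/abc.html">jQuery</a>'
attributes = plain_tag.scan(/(\w+)="([\w:\/.-]+)"/).to_h
puts attributes.inspect

# Limitation: characters outside the class (here ? and =) make the
# whole attribute drop out of the results.
tricky_tag = '<a href="http://example.com/?q=1" class="x">search</a>'
missed = tricky_tag.scan(/(\w+)="([\w:\/.-]+)"/).to_h
puts missed.inspect
```

Note that none of these patterns is restricted to anchor tags; they will match attributes on any element in the content.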

edited Aug 01 '14 at 12:06

answered Aug 01 '14 at 11:52
