Python extracting HTML tag attributes without regular expressions

Question

Is there any way to extract HTML tag attributes using urllib, urllib2 or BeautifulSoup?

For example:

<a href="xyz" title="xyz">xyz</a>

should give href=xyz, title=xyz

There is another thread that discusses using regular expressions, but I would like to avoid them.

Thanks

The BeautifulSoup docs, which you mention, cover this quite thoroughly. If there's some specific aspect you're having trouble with, then you need to be more specific in your question. — Ross Patterson, Aug 21 '11 at 22:04
possible duplicate of [How do I iterate over the HTML attributes of a Beautiful Soup element?](http://stackoverflow.com/questions/822571/how-do-i-iterate-over-the-html-attributes-of-a-beautiful-soup-element) — agf, Aug 21 '11 at 22:06

unutbu · Accepted Answer · 2011-08-22T00:32:06.150

9

You could use BeautifulSoup to parse the HTML, and for each <a> tag, use tag.attrs to read the attributes:

In [111]: soup = BeautifulSoup.BeautifulSoup('<a href="xyz" title="xyz">xyz</a>')

In [112]: [tag.attrs for tag in soup.findAll('a')]
Out[112]: [[(u'href', u'xyz'), (u'title', u'xyz')]]
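
The session above uses the older BeautifulSoup 3 API, where tag.attrs is a list of (name, value) tuples. As a rough sketch only, the same idea could look like this with the newer bs4 package under Python 3. This assumes bs4 is installed (it is not part of the standard library); there, tag.attrs is a dict and find_all is the current spelling of findAll:

```python
# Sketch assuming the third-party bs4 package (Beautiful Soup 4) is installed.
from bs4 import BeautifulSoup

html = '<a href="xyz" title="xyz">xyz</a>'

# "html.parser" selects Python's built-in parser, so no extra parser is needed.
soup = BeautifulSoup(html, "html.parser")

# In bs4, tag.attrs is a dict mapping attribute names to values.
attrs = [dict(tag.attrs) for tag in soup.find_all("a")]
print(attrs)
```

Individual attributes can then be read with tag.get('href'), which returns None when the attribute is missing.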

edited Aug 22 '11 at 00:32

answered Aug 21 '11 at 22:04


score 6 · Answer 2 · answered Aug 22 '11 at 19:25

Why don't you try the HTMLParser module?

Something like this:

import HTMLParser  # Python 2 module; in Python 3 it is html.parser
import urllib

class parseTitle(HTMLParser.HTMLParser):

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples for the tag's attributes
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    print value # or the code you need.
                if name == 'title':
                    print value # or the code you need.

aparser = parseTitle()
u = urllib.urlopen('http://stackoverflow.com') # change the address as you like
aparser.feed(u.read())
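
For Python 3, where the module is named html.parser, a minimal sketch of the same approach might look like the following. The class name AnchorAttrCollector and the anchors list are made up for illustration, and the parser is fed the example string from the question rather than a downloaded page:

```python
from html.parser import HTMLParser


class AnchorAttrCollector(HTMLParser):
    """Hypothetical helper: collect the attributes of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.anchors = []  # one dict of attributes per <a> tag seen

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            self.anchors.append(dict(attrs))


collector = AnchorAttrCollector()
collector.feed('<a href="xyz" title="xyz">xyz</a>')
print(collector.anchors)  # [{'href': 'xyz', 'title': 'xyz'}]
```

Collecting the attributes into a list instead of printing them inside the handler keeps the parsing step separate from whatever you do with the values afterwards.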
