Is there any way to extract HTML tag attributes using urllib, urllib2, or BeautifulSoup?

For example:

<a href="xyz" title="xyz">xyz</a>

should give href=xyz, title=xyz

There is another thread about using regular expressions.

Thanks

daydreamer
  • The docs of BeautifulSoup, which you mention, cover this quite thoroughly. If there's some specific aspect you're having trouble with, then you need to be more specific in your question. – Ross Patterson Aug 21 '11 at 22:04
  • Possible duplicate of [How do I iterate over the HTML attributes of a Beautiful Soup element?](http://stackoverflow.com/questions/822571/how-do-i-iterate-over-the-html-attributes-of-a-beautiful-soup-element) – agf Aug 21 '11 at 22:06

2 Answers

You could use BeautifulSoup to parse the HTML, and for each <a> tag, use tag.attrs to read the attributes:

In [111]: soup = BeautifulSoup.BeautifulSoup('<a href="xyz" title="xyz">xyz</a>')

In [112]: [tag.attrs for tag in soup.findAll('a')]
Out[112]: [[(u'href', u'xyz'), (u'title', u'xyz')]]
unutbu

Why don't you try the HTMLParser module?

Something like this:

# Python 2; in Python 3 these modules are html.parser and urllib.request
import HTMLParser
import urllib

class parseTitle(HTMLParser.HTMLParser):

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    print value # or the code you need.
                if name == 'title':
                    print value # or the code you need.


aparser = parseTitle()
u = urllib.urlopen('http://stackoverflow.com') # change the address as you like
aparser.feed(u.read())
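
On Python 3 the same idea can be sketched with the standard library's html.parser module. This is only a minimal sketch: the class name AnchorAttrCollector and its anchors list are illustrative names, not something from the answer above, and it parses the sample string from the question rather than fetching a page.

```python
from html.parser import HTMLParser


class AnchorAttrCollector(HTMLParser):
    """Hypothetical helper: collects the attributes of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.anchors = []  # one dict of attributes per <a> tag seen

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "a":
            self.anchors.append(dict(attrs))


collector = AnchorAttrCollector()
collector.feed('<a href="xyz" title="xyz">xyz</a>')
print(collector.anchors)
```

Storing each tag's attributes as a dict lets you look up href or title by name instead of looping over the pairs.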