
I'm new to Python, and I have been trying to search through HTML, parsed with BeautifulSoup, using regular expressions. I haven't had any success, and I think the reason is that I don't completely understand how to set the regular expressions up properly. I've looked at older questions about similar problems, but I still haven't figured it out. If somebody could extract "/torrent/32726/0/" and "Slackware Linux 13.0 [x86 DVD ISO]" from the snippet below, along with a detailed explanation of how the regular expression works, it would be really helpful.

<td class="name">
  <a href="/torrent/32726/0/">
   Slackware Linux 13.0 [x86 DVD ISO]
  </a>
 </td>

Edit: What I meant to say is that I am trying to extract "/torrent/32726/0/" and "Slackware Linux 13.0 [x86 DVD ISO]" using BeautifulSoup's functions to search the parse tree. I've been trying various things after searching and reading the documentation, but I'm still not sure how to go about it.

FlowofSoul
  • Now they use parsers and still want to use regexes o.O What do you want: to extract the contents of anchors whose href starts with `/torrent/`? You have to walk the parse tree. You can use regexes to check whether the current node is what you want, but you still have to walk the tree the parser built. –  Aug 26 '10 at 13:17
  • I guess I was using the wrong terminology. You are right, I want to take that parse tree that BeautifulSoup generates, and I want to extract "/torrent/32726/0/" and "Slackware Linux 13.0 [x86 DVD ISO]" and store them in their own dictionary. – FlowofSoul Aug 26 '10 at 13:24

2 Answers


BeautifulSoup can also extract node values from your HTML directly, without regular expressions.

from BeautifulSoup import BeautifulSoup

html = ('<html><head><title>Page title</title></head>'
        '<body>'
        '<table><tr>'
        '<td class="name"><a href="/torrent/32726/0/">Slackware Linux 13.0 [x86 DVD ISO]</a></td>'
        '<td class="name"><a href="/torrent/32727/0/">Slackware Linux 14.0 [x86 DVD ISO]</a></td>'
        '<td class="name"><a href="/torrent/32728/0/">Slackware Linux 15.0 [x86 DVD ISO]</a></td>'
        '</tr></table>'
        '</body>'
        '</html>')
soup = BeautifulSoup(html)
links = [td.find('a') for td in soup.findAll('td', { "class" : "name" })]
for link in links:
    print link.string

Output:

Slackware Linux 13.0 [x86 DVD ISO]  
Slackware Linux 14.0 [x86 DVD ISO]  
Slackware Linux 15.0 [x86 DVD ISO]  
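If BeautifulSoup isn't available, the same extraction (both the href and the link text, stored in a dictionary as the asker wants) can be sketched with the standard library's `html.parser`. The `TorrentLinkParser` class name and the `/torrent/` prefix check are illustrative assumptions, not part of any library API:

```python
from html.parser import HTMLParser

class TorrentLinkParser(HTMLParser):
    """Collects {href: link text} for anchors whose href starts with /torrent/."""

    def __init__(self):
        super().__init__()
        self.links = {}
        self._href = None  # href of the <a> we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            attrs = dict(attrs)
            if attrs.get('href', '').startswith('/torrent/'):
                self._href = attrs['href']

    def handle_data(self, data):
        # Record the text found between <a ...> and </a>.
        if self._href is not None and data.strip():
            self.links[self._href] = data.strip()

    def handle_endtag(self, tag):
        if tag == 'a':
            self._href = None

parser = TorrentLinkParser()
parser.feed('<td class="name"><a href="/torrent/32726/0/">'
            'Slackware Linux 13.0 [x86 DVD ISO]</a></td>')
print(parser.links)
# → {'/torrent/32726/0/': 'Slackware Linux 13.0 [x86 DVD ISO]'}
```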
systempuntoout

You could use lxml.html to parse the HTML document:

from lxml import html

doc = html.parse('http://example.com')

for a in doc.cssselect('td a'):
    print a.get('href')
    print a.text_content()

You will have to look at how the document is structured to find the best way of selecting the links you want (there might be other tables with links that you do not need, etc.); for instance, you might first want to find the right table element. There are also options besides CSS selectors (XPath, for example) for searching the document or an element.
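As a rough standard-library analogue of the XPath option (assuming the fragment is well-formed, which this one is), `xml.etree.ElementTree` supports simple XPath-style queries with attribute predicates:

```python
import xml.etree.ElementTree as ET

# Well-formed fragment taken from the question's markup.
fragment = ('<table><tr>'
            '<td class="name"><a href="/torrent/32726/0/">'
            'Slackware Linux 13.0 [x86 DVD ISO]</a></td>'
            '</tr></table>')
root = ET.fromstring(fragment)

# Select anchors inside <td class="name"> via an XPath-like expression.
for a in root.findall('.//td[@class="name"]/a'):
    print(a.get('href'), a.text)
# → /torrent/32726/0/ Slackware Linux 13.0 [x86 DVD ISO]
```

Note that ElementTree is an XML parser, so this only works on strictly well-formed markup; real-world HTML usually needs lxml or BeautifulSoup.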

If you need absolute links, you can convert them with the .make_links_absolute() method: call it on the document after parsing, and all the URLs will be absolute, which is very convenient.
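The effect of that absolutization can be sketched with the standard library's `urllib.parse.urljoin`; the base URL here is a hypothetical page address, not something from the question:

```python
from urllib.parse import urljoin

base = 'http://example.com/browse/'  # hypothetical URL of the page being scraped
hrefs = ['/torrent/32726/0/', '/torrent/32727/0/']

# Resolve each site-relative href against the page's base URL.
absolute = [urljoin(base, h) for h in hrefs]
print(absolute)
# → ['http://example.com/torrent/32726/0/', 'http://example.com/torrent/32727/0/']
```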

Steven