0

Hello! I have this script:

URL = "http://www.hitmeister.de/"

page = urllib2.urlopen(URL).read()
soup = BeautifulSoup(page)

links = soup.findAll('a')

for link in links:
    print link['href']

This should print the links from the web page, but it does not. What could be the problem? I have also tried sending a User-Agent header, with no result. The same script works for other web pages.

user873286
  • You may want to take a look at the scripts on this page: http://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautiful-soup –  May 07 '12 at 11:57
  • I tried your script, and it works for me after adding the relevant imports (`from bs4 import BeautifulSoup` and `import urllib2`). Which version of BS are you using? – ev-br May 07 '12 at 12:01
  • I am using BeautifulSoup 3.2.0-2build1. I tried installing bs4 and it did not work. – user873286 May 07 '12 at 12:07

2 Answers

3

There's a really nice error message from BeautifulSoup. Did you read it and follow its advice?

/Library/Python/2.7/site-packages/bs4/builder/_htmlparser.py:149: RuntimeWarning: Python's built-in HTMLParser cannot parse the given document. This is not a bug in Beautiful Soup. The best solution is to install an external parser (lxml or html5lib), and use Beautiful Soup with that parser. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for help.

"Python's built-in HTMLParser cannot parse the given document. This is not a bug in Beautiful Soup. The best solution is to install an external parser (lxml or html5lib), and use Beautiful Soup with that parser. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for help."))

Traceback (most recent call last):

File "", line 1, in

File "/Library/Python/2.7/site-packages/bs4/__init__.py", line 172, in __init__ self._feed()

File "/Library/Python/2.7/site-packages/bs4/__init__.py", line 185, in _feed self.builder.feed(self.markup)

File "/Library/Python/2.7/site-packages/bs4/builder/_htmlparser.py", line 150, in feed raise e

HTMLParser.HTMLParseError: malformed start tag, at line 57, column 872
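
In case it helps, here is a minimal sketch of what following that advice could look like with bs4. The helper names `make_soup` and `extract_links` are made up for illustration. It assumes bs4 is installed and tries lxml, then html5lib, then the built-in parser, using whichever is available first. It also asks only for anchors that actually have an `href`, so `link['href']` cannot raise a `KeyError` on anchors without one.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with the first parser that is installed (hypothetical helper)."""
    # lxml and html5lib are the external parsers named in the warning;
    # "html.parser" is the built-in fallback.
    for parser in ("lxml", "html5lib", "html.parser"):
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            # bs4 raises FeatureNotFound when the requested parser is missing.
            continue
    raise RuntimeError("no usable HTML parser found")


def extract_links(markup):
    """Return the href of every <a> tag that has one (hypothetical helper)."""
    soup = make_soup(markup)
    return [a["href"] for a in soup.find_all("a", href=True)]
```

You would pass it the page you already downloaded, e.g. `extract_links(page)` with `page` from the script in the question.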

jayeff
0
import urllib
import lxml.html
import urlparse

def get_dom(url):
    connection = urllib.urlopen(url)
    return lxml.html.fromstring(connection.read())

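def get_links(url):
    # Pass the list returned by xpath() directly. Wrapping it in a generator
    # lets guess_root() consume the leading links before resolve_links() sees them.
    return resolve_links(get_dom(url).xpath('//a/@href'))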

def guess_root(links):
    for link in links:
        if link.startswith('http'):
            parsed_link = urlparse.urlparse(link)
            scheme = parsed_link.scheme + '://'
            netloc = parsed_link.netloc
            return scheme + netloc

def resolve_links(links):
    root = guess_root(links)
    for link in links:
        if not link.startswith('http'):
            link = urlparse.urljoin(root, link)
        yield link


for link in get_links('http://www.google.com'):
    print link
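
A possible variation, sketched under the assumption that you know the URL the page was fetched from: instead of guessing the root from the first absolute link, resolve every href against the page URL itself with `urljoin`. The function name and signature here are hypothetical and differ from the `resolve_links` above. The import is written so that it works on both Python 2 and Python 3.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2, as used in the code above


def resolve_links_against(page_url, hrefs):
    """Resolve each href relative to the page it was found on (hypothetical helper)."""
    # urljoin leaves absolute URLs unchanged and resolves relative ones
    # against page_url, so no root guessing is needed.
    return [urljoin(page_url, href) for href in hrefs]
```

With this approach a page that contains only relative links still resolves correctly, since the base comes from the request rather than from the page content.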
Ricky Wilson