I am using Beautiful Soup to get the hyperlinks in the body of web pages. Here is the code I use:

import urllib2
from bs4 import BeautifulSoup

url = 'http://www.1914-1918.net/swb.htm'
element = 'body'
request = urllib2.Request(url)
page = urllib2.urlopen(request).read()
pageSoup = BeautifulSoup(page)
for elementSoup in pageSoup.find_all(element):
  for linkSoup in elementSoup.find_all('a'):
    print linkSoup['href']

I got an AttributeError when I tried to find hyperlinks for the swb.htm page.

AttributeError: 'NoneType' object has no attribute 'next_element'

I am sure that there is a body element and a couple of 'a' elements under it. Strangely, the same code works well for other pages (e.g. http://www.1914-1918.net/1div.htm).

This problem has been haunting me for days. Can anyone please point out what I did wrong?

Screenshot


WeimusT
  • I don't understand. Post-edit, your code reflects @Hal's answer. Exactly which is your code, this one post-edit or the one pre-edit? – WGS Apr 16 '14 at 16:19
  • I post-edited my code. The print problem @Hal pointed out was a typo. Sorry for the confusion. – WeimusT Apr 16 '14 at 16:22
  • Kindly check if you're using the latest BeautifulSoup release and Python 2.7.6. I am getting a boatload of links on this without problems. I can see in your screenshot that you have Python 2.7, but humor us and try checking if it's 2.7.5+. :) – WGS Apr 16 '14 at 16:47
  • I am using Ubuntu 12.04 and Python 2.7.3. I guess this could be the reason for this problem. Don't want to risk breaking dependencies to upgrade to 2.7.6. Any other solutions maybe? – WeimusT Apr 16 '14 at 22:08
  • I am trying this out in Python 3 and it worked very well. Cheers – WeimusT Apr 16 '14 at 22:22
  • You should be using virtualenv when running Python on Ubuntu btw. I use Ubuntu 13.10, and both with a virtual environment of 2.7.6 and the built-in 2.7.5+ system Python, this works well. Weird that it should work for 3.x for you. Oh well. Good luck. :) – WGS Apr 16 '14 at 22:57
  • I believe this is still an open bug in BeautifulSoup: https://bugs.launchpad.net/beautifulsoup/+bug/1270611 – Garrett Feb 29 '16 at 06:20

2 Answers

1

This happens when html5lib is installed.

Try uninstalling it and testing again.

More details: https://bugs.launchpad.net/beautifulsoup/+bug/1184417
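
If uninstalling html5lib is not an option, another thing to try is naming the parser explicitly. When no parser is given, Beautiful Soup picks the best one installed (lxml, then html5lib, then Python's built-in parser), so html5lib can be selected without you asking for it. The sketch below is written for Python 3 and assumes the error really does come from the html5lib tree builder, as the bug report suggests. The helper name extract_links and the sample HTML are made up for illustration.

```python
from bs4 import BeautifulSoup


def extract_links(html, element="body", parser="html.parser"):
    """Return the href of every <a> tag found under the given element.

    The parser is passed explicitly so that Beautiful Soup does not
    silently fall back to html5lib when it happens to be installed.
    """
    soup = BeautifulSoup(html, parser)
    links = []
    for element_soup in soup.find_all(element):
        # href=True skips anchors without an href attribute, which would
        # otherwise raise a KeyError on link["href"].
        for link in element_soup.find_all("a", href=True):
            links.append(link["href"])
    return links


if __name__ == "__main__":
    # Hypothetical sample markup, used instead of fetching the real page.
    sample = (
        '<html><body><a href="a.htm">A</a>'
        '<a name="x">no href</a><a href="b.htm">B</a></body></html>'
    )
    print(extract_links(sample))
```

To run it against the page from the question, pass the downloaded markup as html. If lxml is installed, parser="lxml" should also avoid html5lib.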

-1

Maybe your beautifulsoup4 version is not compatible with your Python version. Try uninstalling it with pip uninstall beautifulsoup4 and then installing an older version with pip install beautifulsoup4==<version>. I use version 4.1.3.
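
Before pinning a version, it may help to check what is actually installed. This is a minimal sketch assuming Python 3.8 or newer, where importlib.metadata is in the standard library. The function names and the default package list are illustrative and not part of Beautiful Soup.

```python
import sys
from importlib import metadata


def installed_version(package):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def environment_report(packages=("beautifulsoup4", "html5lib", "lxml")):
    """Collect the Python version and the versions of the given packages."""
    report = {"python": "%d.%d.%d" % sys.version_info[:3]}
    for name in packages:
        report[name] = installed_version(name)
    return report


if __name__ == "__main__":
    for name, version in environment_report().items():
        print(name, version if version is not None else "not installed")
```

If html5lib shows a version and lxml shows "not installed", Beautiful Soup will use html5lib whenever no parser is specified.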

LeonPak