
I was playing around with the BeautifulSoup and Requests APIs today, so I thought I would write a simple scraper that follows links to a depth of 2 (if that makes sense). All the links on the page I am scraping are relative (e.g. <a href="/free-man-aman-sethi/books/9788184001341.htm" title="A Free Man">), so to make them absolute I thought I would join the page URL with each relative link using urljoin.

To do this I had to first extract the href value from the <a> tags and for that I thought I would use split:

#!/bin/python
#crawl.py
import requests
from bs4 import BeautifulSoup
from urlparse import urljoin

html_source=requests.get("http://www.flipkart.com/books")
soup=BeautifulSoup(html_source.content)
links=soup.find_all("a")
temp=links[0].split('"')

This gives the following error:

Traceback (most recent call last):
  File "test.py", line 10, in <module>
    temp=links[0].split('"')
TypeError: 'NoneType' object is not callable

Having dived in before properly going through the documentation, I realize that this is probably not the best way to achieve my objective but why is there a TypeError?

Chaitanya Nettem

3 Answers


links[0] is not a string, it's a bs4.element.Tag. When you look up split on it, it does its magic and tries to find a child element named split. There is none, so the lookup returns None, and you end up calling None.

In [10]: l = links[0]

In [11]: type(l)
Out[11]: bs4.element.Tag

In [17]: print l.split
None

In [18]: None()   # :)

TypeError: 'NoneType' object is not callable

Use indexing to look up HTML attributes:

In [21]: links[0]['href']
Out[21]: '/?ref=1591d2c3-5613-4592-a245-ca34cbd29008&_pop=brdcrumb'

Or use get if the attribute might not exist:

In [24]: links[0].get('href')
Out[24]: '/?ref=1591d2c3-5613-4592-a245-ca34cbd29008&_pop=brdcrumb'


In [26]: print links[0].get('wharrgarbl')
None

In [27]: print links[0]['wharrgarbl']

KeyError: 'wharrgarbl'
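
Putting this together with the urljoin step from the question, a minimal sketch might look like the following. The HTML string and the helper name absolute_links are made up for illustration (they stand in for the downloaded page so the example runs without network access), and the import fallback is only there so it runs on both Python 2 and 3.

```python
from bs4 import BeautifulSoup

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2, as in the question

# Hypothetical stand-in for the downloaded page, so no request is made.
HTML = '''
<a href="/free-man-aman-sethi/books/9788184001341.htm" title="A Free Man">A Free Man</a>
<a name="no-href-here">anchor without an href</a>
'''


def absolute_links(html, page_url):
    """Return an absolute URL for every <a> tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for tag in soup.find_all("a"):
        href = tag.get("href")  # None when the attribute is missing
        if href is not None:
            urls.append(urljoin(page_url, href))
    return urls


print(absolute_links(HTML, "http://www.flipkart.com/books"))
```

Using get rather than indexing means the second anchor, which has no href, is skipped instead of raising a KeyError.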
Pavel Anossov

Because the Tag class proxies attribute access (as Pavel points out, this is how it gives you access to child elements by name), any attribute that isn't found returns the default of None.

A contrived example:

>>> print soup.find_all('a')[0].bob
None
>>> print soup.find_all('a')[0].foobar
None
>>> print soup.find_all('a')[0].split
None

You need to use:

soup.find_all('a')[0].get('href')

Where:

>>> print soup.find_all('a')[0].get
<bound method Tag.get of <a href="test"></a>>
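
To see the mechanism in isolation, here is a rough sketch that does not use BeautifulSoup at all. FakeTag is a hypothetical stand-in, not the real Tag class: it only imitates the relevant behaviour, namely that __getattr__ is consulted when normal lookup fails and answers None instead of raising AttributeError.

```python
class FakeTag(object):
    """Hypothetical stand-in for bs4's Tag, for illustration only."""

    def __init__(self, attrs):
        self.attrs = attrs

    def get(self, key, default=None):
        # A real method: found by normal lookup, so __getattr__ is not used.
        return self.attrs.get(key, default)

    def __getattr__(self, name):
        # Called only when normal lookup fails. The real Tag searches for a
        # child element with this name; this sketch has no children, so it
        # always reports "not found" by returning None.
        return None


tag = FakeTag({"href": "test"})

try:
    tag.split('"')  # tag.split is None, so this is really None('"')
except TypeError as exc:
    error_message = str(exc)

print(tag.split)        # None, no AttributeError
print(tag.get("href"))  # test
print(error_message)
```

Because get is defined on the class, it never reaches __getattr__, which is why it behaves like an ordinary method while split silently turns into None.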
Jon Clements

I just encountered the same error - so for what it's worth four years later: if you need to split up the soup element you can also use str() on it before you split it. In your case that would be:

    temp = str(links[0]).split('"')
Ollie