
I'm trying to scrape TV Tropes with BeautifulSoup, but the data I want is cut off, even when I print the entire "soup" for the page. The specific example is this page: http://tvtropes.org/pmwiki/pmwiki.php/Series/Firefly

I want to scrape all the tropes in the folders at the bottom. For some reason, after "I was aimin' in the A-D folder, under the Accidental Aiming Skills list item, it stops returning data from these folders. Then it prints out stuff in the . As far as I can tell I'm doing everything right, so I don't understand what the problem is. Does TV Tropes not allow you to scrape the entire page for some reason?

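# Python 2. The bs4 import is an assumption; the original post omitted its imports.
import urllib2
from bs4 import BeautifulSoup

def webcrawler(startingurl):
    # Fetch the page
    request = urllib2.Request(startingurl)
    url = urllib2.urlopen(request)
    # No parser is named here, so BeautifulSoup picks its default
    soup = BeautifulSoup(url)
    # Dump the whole parsed document; the folders are already cut off here
    print soup.prettify().encode('UTF-8')
    # Filtering for trope links shows the same truncation
    for item in soup.findAll('a', {'class':'twikilink'}):
        if 'Main' in str(item):
            print item, '\n'

webcrawler("http://tvtropes.org/pmwiki/pmwiki.php/Series/" + 'Firefly')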
  • So you're trying to scrape a site without knowing if scraping is allowed? – Anzel Nov 13 '14 at 19:05
  • I can't find any information on it – Austin Capobianco Nov 13 '14 at 19:06
  • Seems to work fine : `soup.find("div", id="folder0").findAll("a", {"class":"twikilink"})`, `[item for item in soup.find("div", id="folder0").findAll("a", {"class":"twikilink"}) if "Main" in item["href"]]`. –  Nov 13 '14 at 20:07

1 Answer


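Try a different parser. The page's markup is probably malformed, and html5lib is more lenient with broken HTML than the default parser. Install it first: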

pip install html5lib

Then change the line that builds the soup so it names the parser explicitly:

soup = BeautifulSoup(url, 'html5lib')


Output:
<a class="twikilink" href="http://tvtropes.org/pmwiki/pmwiki.php/Main/YouHaveToHaveJews" title="http://tvtropes.org/pmwiki/pmwiki.php/Main/YouHaveToHaveJews">You Have to Have Jews</a> 

<a class="twikilink" href="http://tvtropes.org/pmwiki/pmwiki.php/Main/YouMustBeCold" title="http://tvtropes.org/pmwiki/pmwiki.php/Main/YouMustBeCold">You Must Be Cold</a> 

<a class="twikilink" href="http://tvtropes.org/pmwiki/pmwiki.php/Main/YouRebelScum" title="http://tvtropes.org/pmwiki/pmwiki.php/Main/YouRebelScum">You Rebel Scum!</a> 
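
If you want to check which parser is responsible before changing your crawler, a rough sketch like the one below might help. It is written for Python 3 with bs4, unlike the Python 2 code in the question. The helper name `main_trope_links` and the `SAMPLE` snippet are made up for illustration; they are not from the real page. It runs the same filter through each parser that happens to be installed, so you can compare how many links each one finds.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def main_trope_links(html, parser="html5lib"):
    """Return hrefs of 'twikilink' anchors that point into the Main namespace.

    `parser` is the name handed to BeautifulSoup, for example
    "html.parser", "lxml" or "html5lib".
    """
    soup = BeautifulSoup(html, parser)
    return [
        a["href"]
        for a in soup.find_all("a", class_="twikilink")
        if "/Main/" in a.get("href", "")
    ]


# Hypothetical stand-in for the page; on the real site you would pass the
# downloaded HTML instead.
SAMPLE = """
<div id="folder0"><ul>
<li><a class="twikilink" href="http://tvtropes.org/pmwiki/pmwiki.php/Main/YouRebelScum">You Rebel Scum!</a></li>
<li><a class="twikilink" href="http://tvtropes.org/pmwiki/pmwiki.php/Series/Firefly">Firefly</a></li>
</ul></div>
"""

# Compare parsers; any that are not installed are reported and skipped.
for name in ("html.parser", "lxml", "html5lib"):
    try:
        print(name, len(main_trope_links(SAMPLE, name)))
    except FeatureNotFound:
        print(name, "not installed")
```

On a well-formed snippet like this every installed parser should agree. If the counts differ on the real page, the markup is the likely culprit and the more lenient parser is the safer choice.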
Md. Mohsin