I'm trying to parse a list of video game titles from a shopping site. However, the whole item list is stored inside a single div tag.

This section of the documentation supposedly explains how to parse only part of the document, but I can't work it out. My code:

from BeautifulSoup import BeautifulSoup
import urllib
import re

url = "Some Shopping Site"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
for a in soup.findAll('a',{'title':re.compile('.+') }):
    print a.string

At present it prints the string inside any a tag that has a non-empty title attribute, but it also prints the items in the sidebar that are the "specials". If I can take only the product list div, I will kill two birds with one stone.

Many thanks.

Scraper

2 Answers

Oh boy, am I silly: I was searching for tags with the attribute id = products, but it should have been products_list.

Here's the final code if anyone comes searching.

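from BeautifulSoup import BeautifulSoup, SoupStrainer
import urllib
import re
import time

start = time.clock()  # timing start; the elapsed time is not used below
url = "http://someplace.com"
html = urllib.urlopen(url).read()
# Parse only the div that holds the product list.
product = SoupStrainer('div', {'id': 'products_list'})
soup = BeautifulSoup(html, parseOnlyThese=product)
# Print the text of every link that has a non-empty title attribute.
for a in soup.findAll('a', {'title': re.compile('.+')}):
    print a.string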
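
For anyone on the newer bs4 package, the same idea might look roughly like the sketch below. It assumes Beautiful Soup 4 with Python's built-in html.parser. SAMPLE_HTML, the sidebar id, and the product_titles helper are hypothetical stand-ins so the snippet runs without a network call. Only the products_list id is taken from the code above.

```python
import re

from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical page fragment standing in for the real shopping site.
SAMPLE_HTML = """
<html><body>
  <div id="sidebar">
    <a href="/s1" title="Special offer">Special offer</a>
  </div>
  <div id="products_list">
    <a href="/g1" title="Game One">Game One</a>
    <a href="/g2" title="Game Two">Game Two</a>
    <a href="/about">No title here</a>
  </div>
</body></html>
"""


def product_titles(html, container_id="products_list"):
    """Return the text of titled links inside the div with the given id."""
    # The strainer makes the parser keep only the matching div and its children.
    only_products = SoupStrainer("div", {"id": container_id})
    soup = BeautifulSoup(html, "html.parser", parse_only=only_products)
    # Keep only links whose title attribute is present and non-empty.
    links = soup.find_all("a", title=re.compile(".+"))
    return [a.get_text(strip=True) for a in links]


if __name__ == "__main__":
    for title in product_titles(SAMPLE_HTML):
        print(title)
```

Because the sidebar div is never parsed, its "specials" link does not show up in the results. Note that parse_only is not supported by the html5lib parser, so this sketch sticks to html.parser.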
Scraper

Try searching first for the product list div and then for the a tags with title:

product = soup.find('div', {'id': 'products'})
for a in product.findAll('a', {'title': re.compile('.+')}):
    print a.string
dusan
  • Tried that, but it gave this error: `Traceback (most recent call last): File "~/start.py", line 11, in <module> for a in product.findAll('a',{'title':re.compile('.+') }): AttributeError: 'ResultSet' object has no attribute 'findAll'` – Scraper Oct 24 '10 at 00:24
  • Try calling `soup.find` instead of `soup.findAll`. – dusan Oct 24 '10 at 00:30
  • Now it's giving me this: `Traceback (most recent call last): File "~/src/start.py", line 13, in <module> for a in product.findAll('a',{'title':re.compile('.+') }): AttributeError: 'NoneType' object has no attribute 'findAll'` – Scraper Oct 24 '10 at 03:29
  • OK, I tried to implement the strainer and this is what I got, but it doesn't print anything (sorry, not sure how to do new lines in a comment): `url = "somelink" html = urllib.urlopen(url).read() product = SoupStrainer('div',{'id': 'products'}) soup = BeautifulSoup(html,parseOnlyThese=product) for a in soup.findAll('a',{'title':re.compile('.+') }): print a.string` – Scraper Oct 24 '10 at 03:46
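
The two tracebacks in the comments come from what each call returns. findAll gives back a list-like ResultSet, which has no findAll of its own, while find gives back a single tag, or None when nothing matches the id. A minimal sketch of a guarded lookup, assuming the newer bs4 package and its html.parser backend, is shown here. SAMPLE_HTML and the titled_links helper are hypothetical, since the real page markup was never posted in the thread.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup; the real page was not shown in the thread.
SAMPLE_HTML = """
<div id="products_list">
  <a href="/g1" title="Game One">Game One</a>
</div>
"""


def titled_links(html, container_id):
    """Return link texts inside the div with container_id, or [] if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    # find() returns one Tag or None; find_all() would return a list-like ResultSet.
    container = soup.find("div", id=container_id)
    if container is None:
        # Wrong or missing id: return nothing instead of raising AttributeError.
        return []
    links = container.find_all("a", title=re.compile(".+"))
    return [a.get_text(strip=True) for a in links]


if __name__ == "__main__":
    print(titled_links(SAMPLE_HTML, "products_list"))
    print(titled_links(SAMPLE_HTML, "products"))
```

With the guard in place, a mistyped id produces an empty list, which is easier to spot than an AttributeError raised on None.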