
Here is my code for a scraper that extracts the URL and the corresponding comments from each page:

import scraperwiki
import lxml.html
from BeautifulSoup import BeautifulSoup
import urllib2
import re

for num in range(1,2):
    # Fetch one page of search results for the keyword "error"
    html_page = urllib2.urlopen("https://success.salesforce.com/ideaSearch?keywords=error&pageNo="+str(num))
    soup = BeautifulSoup(html_page)
    for i in range(0,10):
        # Each result link carries a positional id, so look them up one by one
        for link in soup.findAll('a',{'id':'search:ForumLayout:searchForm:itemObj2:'+str(i)+':idea:recentIdeasComponent:profileIdeaTitle'}):
            pageurl = link.get('href')
            html = scraperwiki.scrape(pageurl)
            root = lxml.html.fromstring(html)

            # Comments on the idea page are also numbered by position
            for j in range(0,300):
                for table in root.cssselect("span[id='ideaView:ForumLayout:ideaViewForm:cmtComp:ideaComments:cmtLoop:"+str(j)+":commentBodyOutput'] table"):
                    divx = table.cssselect("div[class='htmlDetailElementDiv']")
                    if len(divx)==1:
                        data = {
                            'URL' : pageurl,
                            'Comment' : divx[0].text_content()
                        }
                        print data

            scraperwiki.sqlite.save(unique_keys=['URL'], data=data)
            scraperwiki.sqlite.save(unique_keys=['Comment'], data=data)

When the data is saved to the ScraperWiki datastore, only the last comment from each URL is put into the table. What I would like is for each URL in the table to have all of its comments saved: one column holding the URL and a second column holding all the comments from that URL, instead of just the last comment, which is what this code ends up with.


1 Answer


As I can see from your code, you build data in the innermost for loop and assign it a new value on every iteration. So when the loops finish and the save step runs, data contains only the last comment. I think you can use:

for i in range(0,10):
    for link in soup.findAll('a',{'id':'search:ForumLayout:searchForm:itemObj2:'+str(i)+':idea:recentIdeasComponent:profileIdeaTitle'}):
        pageurl = link.get('href')
        html = scraperwiki.scrape(pageurl)
        root = lxml.html.fromstring(html)
        # Collect every comment for this URL in a list instead of
        # overwriting data on each iteration
        data = {'URL': pageurl, 'Comment': []}

        for j in range(0,300):
            for table in root.cssselect("span[id='ideaView:ForumLayout:ideaViewForm:cmtComp:ideaComments:cmtLoop:"+str(j)+":commentBodyOutput'] table"):
                divx = table.cssselect("div[class='htmlDetailElementDiv']")
                if len(divx)==1:
                    data['Comment'].append(divx[0].text_content())

        # Save once per URL, after all of its comments have been collected
        scraperwiki.sqlite.save(unique_keys=['URL'], data=data)
        scraperwiki.sqlite.save(unique_keys=['Comment'], data=data)
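Note that an SQLite column holds a single scalar value, so the Comment list will likely need to be serialized into one string before saving. A minimal sketch, assuming a newline separator is acceptable for your data:

    # Flatten the collected comments into a single string so the value
    # fits in one SQLite column (the newline separator is an assumption)
    data['Comment'] = '\n'.join(data['Comment'])
    scraperwiki.sqlite.save(unique_keys=['URL'], data=data)

Joining the list also means the second save with unique_keys=['Comment'] operates on a plain string rather than a list.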
zhangyangyu
For some reason when I run this code in scraperwiki, I get this error: File "./code/scraper", line 25 scraperwiki.sqlite.save(unique_keys=['URL'], data=data) ^ IndentationError: unindent does not match any outer indentation level – user2662750 Aug 08 '13 at 14:17