I am using Scrapy together with Selenium to scrape data from dynamic pages. Here is my spider code:

class asbaiduSpider(CrawlSpider):
        name = 'apps_v3'
        start_urls = ["http://as.baidu.com/a/software?f=software_1012_1"]

        rules = (Rule(SgmlLinkExtractor(allow=("cid=(50[0-9]|510)&s=1&f=software_1012_1", )), callback='parse_item',follow=True),) 

        def __init__(self):
                CrawlSpider.__init__(self)
                chromedriver = "/usr/bin/chromedriver"
                os.environ["webdriver.chrome.driver"] = chromedriver
                self.driver = webdriver.Chrome(chromedriver)

        def __del__(self):
                self.driver.stop()
                CrawlSpider.__del__(self)

        def parse_item(self,response):
                hxs = Selector(response)
                #links= hxs.xpath('//span[@class="tit"]/text()').extract()
                links= hxs.xpath('//a[@class="hover-link"]/@href').extract()
                for link in links:
                        #print 'link:\t%s'%link
                        time.sleep(2)
                        return Request(link,callback=self.parse_page)

        def parse_page(self,response):
                self.driver.get(response.url)
                time.sleep(2.5)

                app_comments = ''
                num = len(self.driver.find_elements_by_xpath("//section[@class='s-index-page devidepage']/a"))
                print 'num:\t%s'%num
                if num == 8:
                        print 'num====8 ohohoh'
                        while True:
                                link = self.driver.find_element_by_link_text('下一页')
                                try:
                                        link.click()
                                except:
                                        break

The problem is that every time, after clicking through to page 2, it just quits the current page. But I need to crawl page 3, page 4, and so on. The pages I need to parse look like this: http://as.baidu.com/a/item?docid=5302381&pre=web_am_software&pos=software_1012_0&f=software_1012_0 (it's in Chinese, sorry for the inconvenience). I need to page through the links at the bottom and scrape the comment data. I have been stuck on this problem for two days and would really appreciate any help. Thank you.

Carl
talisa

1 Answer

If I have understood correctly, this is your case:

  1. Open a page
  2. Find some links on the page and visit them one by one
  3. While visiting each link, extract data.

If my understanding is correct, I think you can proceed with the logic below.

  1. Open the page
  2. Get all the links and save them to an array.
  3. Now open each page separately using the webdriver and do your job.
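
The steps above can be sketched roughly as follows. This is only an illustration in Python 3 using the standard library, not the asker's Scrapy spider: `HoverLinkParser`, `collect_links` and `visit_all` are hypothetical names, and the `hover-link` class is taken from the XPath in the question. The point it shows is that every link is stored in a list first and then each one is visited, whereas a `return` inside a `for` loop stops after the first link.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class HoverLinkParser(HTMLParser):
    """Collects href values of <a> tags whose class list contains 'hover-link'."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        href = attributes.get("href")
        if "hover-link" in classes and href:
            # Resolve relative links against the page they were found on.
            self.links.append(urljoin(self.base_url, href))


def collect_links(html, base_url):
    """Step 2: get all the links and save them to a list."""
    parser = HoverLinkParser(base_url)
    parser.feed(html)
    return parser.links


def visit_all(links, visit):
    """Step 3: call `visit` on every stored link, not only the first one."""
    return [visit(link) for link in links]
```

In a real spider, `visit` would be whatever opens the page and scrapes it, for example a function wrapping `driver.get(url)`. Inside a Scrapy callback the equivalent is to `yield` one request per link rather than `return` the first one.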
A Paul
  • Please mark the answer as accepted if it helped you. It is the way to show appreciation for others' efforts to help you here :) – A Paul Jan 08 '14 at 08:22
  • In fact, I need this: (1) open a page; (2) find some links (let's say link2) on the page and visit them; (3) while visiting each page from link2, find some links (let's say link3) and follow them; (4) while visiting pages from link3, click some buttons (while the URL remains the same) and scrape some data. @A Paul – talisa Jan 08 '14 at 08:24
  • When I tried to scrape the final page directly, it ran well. After two levels of link-following, while scraping the final page, it can't continue after clicking the second button. I think there may be an error in the parse_page function, but I don't know how to correct it. – talisa Jan 08 '14 at 08:31
  • @talisa - For your problem, just write the logic and use a proper Java collection to store the URLs so that you can get back to any page. It's all about the logic for you now. Regarding the function "parse_page": if possible, try to get the element by XPath, style, CSS selector, or name instead of "self.driver.find_element_by_link_text", since the link text is in Chinese. But this is just a thought; everything else looks fine. Otherwise, just debug your code. – A Paul Jan 08 '14 at 08:36
  • @A Paul Unfortunately, I am not familiar with Java. Is there a way to solve it in Python? – talisa Jan 08 '14 at 09:45
  • Please google "Java collection equivalent in Python"; I am sure you will get lots of help. I was just telling you the logic; how to implement it is something you have to find out. Google it and you will definitely get it :) – A Paul Jan 08 '14 at 10:36
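
For the "click the next-page button while the URL stays the same" step discussed in the comments, one possible shape for the loop is sketched below in Python 3. It is written against plain callables so it does not depend on any particular Selenium version. `scrape_all_pages`, `find_next` and `extract` are hypothetical names, and the Selenium wiring in the comments is an assumption, not something confirmed by the question. The idea is to scrape the current page, look up the next-page control again on every iteration (a reference found before a click may no longer be valid afterwards), and stop cleanly when the lookup or the click fails.

```python
import time


def scrape_all_pages(find_next, extract, max_pages=50, pause=0.0):
    """Scrape the current page, click "next", and repeat until no next link.

    find_next -- callable returning an object with a .click() method; it is
                 expected to raise if there is no next-page control.
    extract   -- callable returning a list of items from the current page.
    max_pages -- safety cap so a broken page cannot loop forever.
    pause     -- seconds to wait after each click for the page to update.
    """
    results = []
    for _ in range(max_pages):
        results.extend(extract())
        try:
            # Look the control up again each time instead of reusing an
            # element found before the previous click.
            find_next().click()
        except Exception:
            # No next-page control (or it is not clickable): last page reached.
            break
        time.sleep(pause)
    return results


# Possible wiring with Selenium (assumption; adjust the locators to the page):
#   find_next = lambda: driver.find_element(By.LINK_TEXT, u"下一页")
#   extract   = lambda: [e.text for e in driver.find_elements(By.XPATH, "...")]
#   comments  = scrape_all_pages(find_next, extract, pause=2.5)
```

Note that in the question's loop the lookup sits outside the `try`, so a missing next-page link raises before the `except` can catch it, and nothing is scraped between clicks.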