
Below is a Scrapy spider I have put together to pull some elements from a web page. I borrowed the approach from another Stack Overflow answer. It works, but I need more: after authenticating, I need to be able to walk the series of pages generated by the for loop inside the start_requests method.

Yes, I did locate the Scrapy documentation discussing this, along with a previous solution for something very similar. Neither one seems to make much sense. From what I can gather, I need to somehow create a request object and keep passing it along, but I cannot seem to figure out how to do this.

Thank you in advance for your help.

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
import re

class MyBasicSpider(BaseSpider):
    name = "awBasic"
    allowed_domains = ["americanwhitewater.org"]

    def start_requests(self):
        '''
        Override BaseSpider.start_requests to crawl all reaches in series
        '''
        # for every integer from one to 5000
        for i in xrange(1, 50): # 1 to 50 for testing

            # convert to string
            iStr = str(i)

            # add leading zeros to get to four digit length
            while len(iStr) < 4:
                iStr = '0{0}'.format(iStr)

            # call make requests
            yield self.make_requests_from_url('https://mycrawlsite.com/{0}/'.format(iStr))

    def parse(self, response):

        # create xpath selector object instance with response
        hxs = HtmlXPathSelector(response)

        # get part of url string
        url = response.url
        reachId = re.findall(r'/(\d{4})/', url)[0]

        # selector 01
        attribute01 = hxs.select('//div[@id="block_1"]/text()').re('([^,]*)')[0]

        # selector for river section
        attribute02 = hxs.select('//div[@id="block_1"]/div[1]/text()').extract()[0]

        # print results
        print('\tID: {0}\n\tAttr01: {1}\n\tAttr02: {2}'.format(reachId, attribute01, attribute02))
knu2xs

1 Answer


You may have to approach the problem from a different angle:

  • first of all, scrape the main page; it contains a login form, so you can use FormRequest to simulate a user login; your parse method will likely look something like this:

    def parse(self, response):
        return [FormRequest.from_response(response,
                    formdata={'username': 'john', 'password': 'secret'},
                    callback=self.after_login)]
    
  • in after_login you check if the authentication was successful, usually by scanning the response for error messages; if all went well and you're logged in, you can start generating requests for the pages you're after:

    def after_login(self, response):
        if "Login failed" in response.body:
            self.log("Login failed", level=log.ERROR)
        else:
            for i in xrange(1, 50): # 1 to 50 for testing
                # convert to string
                iStr = str(i)
    
                # add leading zeros to get to four digit length
                while len(iStr) < 4:
                    iStr = '0{0}'.format(iStr)
    
                # call make requests
                yield Request(url='https://mycrawlsite.com/{0}/'.format(iStr),
                              callback=self.scrape_page)
    
  • scrape_page will be called with each of the pages you created a request for; there you can finally extract the information you need using XPath, regex, etc. (a rough end-to-end sketch follows this list).
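
For reference, here is a rough sketch of how the pieces might be wired together once the imports and a start URL are added. The login URL, form field names and XPath below are placeholders taken from the examples above, not the real site, and the exact module paths may vary with your Scrapy version:

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from scrapy.http import Request, FormRequest
    from scrapy import log


    class MyAuthSpider(BaseSpider):
        name = "awAuth"
        allowed_domains = ["mycrawlsite.com"]
        start_urls = ["https://mycrawlsite.com/login/"]  # page that serves the login form

        def parse(self, response):
            # Scrapy fetches the start URL first; submit the login form it contains
            return [FormRequest.from_response(response,
                        formdata={'username': 'john', 'password': 'secret'},
                        callback=self.after_login)]

        def after_login(self, response):
            # crude check for a failed login before crawling anything else
            if "Login failed" in response.body:
                self.log("Login failed", level=log.ERROR)
                return
            # session cookies are kept automatically, so these requests are authenticated
            for i in xrange(1, 50):  # 1 to 50 for testing
                yield Request(url='https://mycrawlsite.com/{0:04d}/'.format(i),
                              callback=self.scrape_page)

        def scrape_page(self, response):
            # same idea as the original parse(): pull fields out with XPath
            hxs = HtmlXPathSelector(response)
            attribute01 = hxs.select('//div[@id="block_1"]/text()').extract()
            self.log('Scraped {0}: {1}'.format(response.url, attribute01))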

BTW, you shouldn't 0-pad numbers manually; format will do it for you if you use the right format specifier.
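
For instance, a width-4 format spec does the padding in one step:

    >>> '{0:04d}'.format(7)
    '0007'
    >>> 'https://mycrawlsite.com/{0:04d}/'.format(7)
    'https://mycrawlsite.com/0007/'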

amgaera
  • Thank you so much for the assistance. It took a bit of tinkering to get it all wired up and working, but now it is humming smoothly. The only real missing pieces were re-adding the start URL at the top and the import statements for `Request` and `FormRequest`. This is likely obvious enough, but it took me a few minutes to figure out. Also, thank you for the hint on using `format`. It did not take long to find the documentation and examples on its use. – knu2xs Nov 11 '13 at 20:18