I am following this topic to extract content from a website that requires authentication. I have two versions of the code; the first one looks like this:

class FoodCrawler(InitSpider):
    name = "theCrawler"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com"]
    login_page = 'http://example.com/login'

    def parse(self, response):
        pass

    def __init__(self, user, password, *args, **kwargs):
        super(FoodCrawler, self).__init__(*args, **kwargs)
        self.password = password
        self.user = user
        msg = 'The account will be used ' + user + ' ' + password
        self.log(msg, level=logging.INFO)

    def init_request(self):
        """This function is called before crawling starts."""
        msg = {'email': self.user, 'password': self.password,
               'reCaptchaResponse': '', 'rememberMe': 'true'}
        headers = {'X-Requested-With': 'XMLHttpRequest',
                   'Content-Type': 'application/json'}
        yield Request(self.login_page, method='POST', body=json.dumps(msg),
                      headers=headers, callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if json.loads(response.body)['isSuccess']:
            self.log("Successfully logged in!")
            self.initialized(response)
        else:
            self.log("Bad times :(")

    def initialized(self, response=None):
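        # the yield below makes this method a generator function, so its
        # body (including the log call) runs only when the returned
        # generator is iterated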
        self.log("initialized")
        for url in self.start_urls:
            yield self.make_requests_from_url(url)

In the second version I only change the initialized function; the rest is the same:

    def initialized(self, response=None):
        self.log("initialized")

The difference is that the first version's initialized yields more requests while the second one's does not; please see (*) below for details. To demonstrate that the first version does not work properly, look at self.log("initialized"): when I run the first version, the DEBUG: initialized message never shows up, while it does with the second version.

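To make the difference concrete, here is a minimal plain-Python sketch (no Scrapy; the function names are mine, for illustration only). A function that contains yield is a generator function: calling it only creates a generator object, and its body runs when that generator is iterated, not when the function is called.

def plain():
    print("body runs as soon as plain() is called")

def gen():
    print("body runs only when the generator is iterated")
    yield 1

plain()      # prints immediately
g = gen()    # nothing is printed yet; g is just a generator object
list(g)      # only now does the body of gen() run

The initialized of my first version contains a yield, so it behaves like gen() above; the second version has no yield and behaves like plain().
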
The full log produced by the first version was:

2016-01-05 16:05:38 [scrapy] INFO: Scrapy 1.0.3 started (bot: MySpider)
2016-01-05 16:05:38 [scrapy] INFO: Optional features available: ssl, http11, boto
2016-01-05 16:05:38 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'MySpider.spiders', 'SPIDER_MODULES': ['MySpider.spiders'], 'CONCURRENT_REQUESTS': 4, 'BOT_NAME': 'MySpider'}
2016-01-05 16:05:39 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-01-05 16:05:39 [theCrawler] INFO: The account will be used username@gmail.com 123456789
2016-01-05 16:05:39 [py.warnings] WARNING: /usr/lib/python2.7/site-packages/scrapy/utils/deprecate.py:155: ScrapyDeprecationWarning: `scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware` class is deprecated, use `scrapy.downloadermiddlewares.useragent.UserAgentMiddleware` instead
  ScrapyDeprecationWarning)

2016-01-05 16:05:39 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, RotateUserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-01-05 16:05:39 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-01-05 16:05:39 [scrapy] INFO: Enabled item pipelines: 
2016-01-05 16:05:39 [scrapy] INFO: Spider opened
2016-01-05 16:05:39 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-01-05 16:05:39 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-01-05 16:05:39 [scrapy] DEBUG: Crawled (200) <POST http://www.example.com/login> (referer: None)
2016-01-05 16:05:39 [theCrawler] DEBUG: Successfully logged in!
2016-01-05 16:05:39 [scrapy] INFO: Closing spider (finished)
2016-01-05 16:05:39 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 494,
 'downloader/request_count': 1,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 1187,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 1, 5, 9, 5, 39, 363402),
 'log_count/DEBUG': 3,
 'log_count/INFO': 8,
 'log_count/WARNING': 1,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2016, 1, 5, 9, 5, 39, 168955)}
2016-01-05 16:05:39 [scrapy] INFO: Spider closed (finished)

I would like to know why. Could you please give any advice? Thank you in advance.

[Updated]

import json, pdb, logging
from scrapy import Request
from scrapy.spiders.init import InitSpider

(*) The initialized function can contain further calls such as self.my_requests(), but this does not work: the script never enters self.my_requests().

    def initialized(self, response=None):
        self.log("initialized")
        self.my_requests()

    def my_requests(self):
        self.log("my_requests")
        pdb.set_trace()
        for url in self.start_urls:
            yield self.make_requests_from_url(url)