I am following this topic to extract content from a website that requires authentication. I have two versions of the code; the first one looks like this:
class FoodCrawler(InitSpider):
    def parse(self, response):
        pass

    name = "theCrawler"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com"]
    login_page = 'http://example.com/login'

    def __init__(self, user, password, *args, **kwargs):
        super(FoodCrawler, self).__init__(*args, **kwargs)
        self.password = password
        self.user = user
        msg = 'The account will be used ' + user + ' ' + password
        self.log(msg, level=logging.INFO)

    def init_request(self):
        """This function is called before crawling starts."""
        msg = {'email': self.user, 'password': self.password,
               'reCaptchaResponse': '', 'rememberMe': 'true'}
        headers = {'X-Requested-With': 'XMLHttpRequest',
                   'Content-Type': 'application/json'}
        yield Request(self.login_page, method='POST', body=json.dumps(msg),
                      headers=headers, callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if json.loads(response.body)['isSuccess']:
            self.log("Successfully logged in!")
            self.initialized(response)
        else:
            self.log("Bad times :(")

    def initialized(self, response=None):
        self.log("initialized")
        for url in self.start_urls:
            yield self.make_requests_from_url(url)
In the second version I change only the initialized function; the rest is the same:
def initialized(self, response=None):
    self.log("initialized")
The difference is that in the first version initialized may go on to do more work (yield requests, call other functions), while in the second version it does not; see (*) below for details. The problem shows up at self.log("initialized"): the first version does not work properly. When I run it, the first version never prints the DEBUG: initialized message from self.log("initialized"), whereas the second version does.
The full log produced by the first version is:
2016-01-05 16:05:38 [scrapy] INFO: Scrapy 1.0.3 started (bot: MySpider)
2016-01-05 16:05:38 [scrapy] INFO: Optional features available: ssl, http11, boto
2016-01-05 16:05:38 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'MySpider.spiders', 'SPIDER_MODULES': ['MySpider.spiders'], 'CONCURRENT_REQUESTS': 4, 'BOT_NAME': 'MySpider'}
2016-01-05 16:05:39 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-01-05 16:05:39 [theCrawler] INFO: The account will be used username@gmail.com 123456789
2016-01-05 16:05:39 [py.warnings] WARNING: /usr/lib/python2.7/site-packages/scrapy/utils/deprecate.py:155: ScrapyDeprecationWarning: `scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware` class is deprecated, use `scrapy.downloadermiddlewares.useragent.UserAgentMiddleware` instead
ScrapyDeprecationWarning)
2016-01-05 16:05:39 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, RotateUserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-01-05 16:05:39 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-01-05 16:05:39 [scrapy] INFO: Enabled item pipelines:
2016-01-05 16:05:39 [scrapy] INFO: Spider opened
2016-01-05 16:05:39 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-01-05 16:05:39 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-01-05 16:05:39 [scrapy] DEBUG: Crawled (200) <POST http://www.example.com/login> (referer: None)
2016-01-05 16:05:39 [theCrawler] DEBUG: Successfully logged in!
2016-01-05 16:05:39 [scrapy] INFO: Closing spider (finished)
2016-01-05 16:05:39 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 494,
'downloader/request_count': 1,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 1187,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 1, 5, 9, 5, 39, 363402),
'log_count/DEBUG': 3,
'log_count/INFO': 8,
'log_count/WARNING': 1,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2016, 1, 5, 9, 5, 39, 168955)}
2016-01-05 16:05:39 [scrapy] INFO: Spider closed (finished)
I would like to know why this happens. Could you please give any advice? Thank you in advance.
[Updated]
import json, pdb, logging
from scrapy import Request
from scrapy.spiders.init import InitSpider
(*) The initialized function may call further functions, such as self.my_requests(), but that doesn't work either: the script never enters self.my_requests(). (See also the small stand-alone sketch after the code below.)
def initialized(self, response=None):
    self.log("initialized")
    self.my_requests()

def my_requests(self):
    self.log("my_requests")
    pdb.set_trace()
    for url in self.start_urls:
        yield self.make_requests_from_url(url)
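To illustrate the behaviour I suspect might be involved (this is only a guess on my part, not something I have confirmed in Scrapy): because the first version's initialized contains a yield, calling it seems to only create a generator object, and its body would not run until that generator is iterated. A minimal stand-alone Python sketch of that idea:

    def initialized():
        print("initialized")   # not printed by the bare call below
        yield "a request"

    initialized()              # merely creates a generator; the body does not execute
    list(initialized())        # iterating the generator does execute the body and print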