I want to use my function parsePage
as the callback for the requests to the links I crawl from the page. However, a request is made only for the first link, and I get no response from it.
Here is my code:
import scrapy
from scrapy.spiders import CrawlSpider

from diploma.items import DiplomaItem  # items module of my project

class diploma(CrawlSpider):
    name = "diploma"
    allowed_domains = "pikabu.ru"
    start_urls = [
        "https://pikabu.ru/hot"
    ]

    def parse(self, response):
        for sel in response.xpath("//div[@class='stories-feed__container']/article[@class='story']"):
            item = DiplomaItem()
            item['MainPageUrl'] = "https://pikabu.ru" + sel.xpath('div[2]/header[@class="story__header"]/h2/a/@href').extract()[0]
            request = scrapy.Request(item['MainPageUrl'], callback=self.parsePage)
            request.meta['item'] = item
            yield request

    def parsePage(self, response):
        print("hHAHAHAHAH")
        item = response.meta['item']
        return item
Here are the logs:
2018-03-15 18:11:26 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: diploma)
2018-03-15 18:11:26 [scrapy.utils.log] INFO: Versions: lxml 4.1.1.0, libxml2 2.9.7, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 2.7.10 (default, Feb 7 2017, 00:08:15) - [GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)], pyOpenSSL 17.5.0 (OpenSSL 1.1.0g 2 Nov 2017), cryptography 2.1.4, Platform Darwin-16.7.0-x86_64-i386-64bit
2018-03-15 18:11:26 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'diploma.spiders', 'SPIDER_MODULES': ['diploma.spiders'], 'CONCURRENT_REQUESTS': 250, 'DOWNLOAD_DELAY': 5, 'BOT_NAME': 'diploma'}
2018-03-15 18:11:26 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2018-03-15 18:11:26 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-03-15 18:11:26 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-03-15 18:11:26 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-03-15 18:11:26 [scrapy.core.engine] INFO: Spider opened
2018-03-15 18:11:26 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-03-15 18:11:26 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-03-15 18:11:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pikabu.ru/hot> (referer: None)
~~~
/story/kak_pogoda_50_na_50_5777191
2018-03-15 18:11:27 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'pikabu.ru': <GET https://pikabu.ru/story/kak_pogoda_50_na_50_5777191>
~~~
/story/chto_mozhet_poyti_ne_tak_5773824
~~~
/story/strannyiy_chelovek_5777133
~~~
/story/kak_ya_zabiral_ayfon_s_pochtyi_rossii_ili_khitryie_kitaytsyi_5776835
~~~
/story/kopirayterskie_slozhnosti_ch14_5776220
~~~
/story/novyiy_televizor_samsung_mozhet_slivatsya_s_poverkhnostyu_5775567
~~~
/story/neobyichnyiy_vkhod_v_podezd_5767500
~~~
/story/muzhchina_khotel_brosit_rabotu_chtobyi_ukhazhivat_za_bolnyim_rakom_syinom_no_kollegi_otrabotali_za_nego_3300_chasov_5770070
~~~
/story/kak_ya_uchilsya_khodit_5776376
~~~
/story/zabavnoe_dialogi_s_zakazchikami_5_5777655
~~~
/story/pro_metallurga_iz_magnitogorska_snyali_yepichnuyu_korotkometrazhku_5774307
~~~
/story/lovkost_ruk_i_nikakogo_moshennichestva_5777007
~~~
/story/kogda_nashelsya_novyiy_sponsor_5769282
~~~
/story/nikto_ne_chitaet_kharakteristiki_5771821
~~~
/story/posmotrite_na_yeti_shedevryi_5777462
2018-03-15 18:11:27 [scrapy.core.engine] INFO: Closing spider (finished)
2018-03-15 18:11:27 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 211,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 39452,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 3, 15, 12, 11, 27, 860712),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'memusage/max': 46387200,
'memusage/startup': 46387200,
'offsite/domains': 1,
'offsite/filtered': 21,
'request_depth_max': 1,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2018, 3, 15, 12, 11, 26, 740826)}
2018-03-15 18:11:27 [scrapy.core.engine] INFO: Spider closed (finished)
As you can see, the callback function parsePage
is never invoked. The logs also show about 20 extracted links (the print statements producing them are not shown in the code above), yet only one request is made beyond the start URL, and the log says it was "Filtered offsite request" — the stats confirm 'offsite/filtered': 21. Why?
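While investigating, I wrote a toy check (plain Python, not Scrapy's actual middleware code) to see whether declaring allowed_domains as a bare string instead of a list could behave differently. The hosts_allowed helper below is my own hypothetical sketch: it simply iterates over whatever it is given, the way a middleware expecting a list of domains might.

```python
def hosts_allowed(url_host, allowed_domains):
    # Toy sketch (not Scrapy's real implementation): iterate over
    # allowed_domains and accept the host if it equals a domain or
    # is a subdomain of it. Iterating a bare string yields single
    # characters ('p', 'i', 'k', ...), so no real host ever matches.
    return any(url_host == d or url_host.endswith("." + d)
               for d in allowed_domains)

print(hosts_allowed("pikabu.ru", ["pikabu.ru"]))  # True  (list of domains)
print(hosts_allowed("pikabu.ru", "pikabu.ru"))    # False (string iterates per character)
```

Is this the kind of difference that could make OffsiteMiddleware filter my requests, or is the problem elsewhere?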