
I need to use my function parsePage as the callback for the requests to the links I crawl from the website. However, only one request is sent, to the first link, and I never get a response in parsePage.

Here is my code:

import scrapy
from scrapy.spiders import CrawlSpider

from diploma.items import DiplomaItem


class diploma(CrawlSpider):
    name = "diploma"
    allowed_domains = "pikabu.ru"
    start_urls = [
        "https://pikabu.ru/hot"
    ]

    def parse(self, response):
        for sel in response.xpath("//div[@class='stories-feed__container']/article[@class='story']"):
            item = DiplomaItem()
            item['MainPageUrl'] = "https://pikabu.ru" + sel.xpath('div[2]/header[@class="story__header"]/h2/a/@href').extract()[0]

            request = scrapy.Request(item['MainPageUrl'], callback=self.parsePage)
            request.meta['item'] = item
            yield request

    def parsePage(self, response):
        print("hHAHAHAHAH")
        item = response.meta['item']
        return item

Here are the logs:

2018-03-15 18:11:26 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: diploma)
2018-03-15 18:11:26 [scrapy.utils.log] INFO: Versions: lxml 4.1.1.0, libxml2 2.9.7, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 2.7.10 (default, Feb  7 2017, 00:08:15) - [GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)], pyOpenSSL 17.5.0 (OpenSSL 1.1.0g  2 Nov 2017), cryptography 2.1.4, Platform Darwin-16.7.0-x86_64-i386-64bit
2018-03-15 18:11:26 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'diploma.spiders', 'SPIDER_MODULES': ['diploma.spiders'], 'CONCURRENT_REQUESTS': 250, 'DOWNLOAD_DELAY': 5, 'BOT_NAME': 'diploma'}
2018-03-15 18:11:26 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2018-03-15 18:11:26 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-03-15 18:11:26 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-03-15 18:11:26 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-03-15 18:11:26 [scrapy.core.engine] INFO: Spider opened
2018-03-15 18:11:26 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-03-15 18:11:26 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-03-15 18:11:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pikabu.ru/hot> (referer: None)
~~~
/story/kak_pogoda_50_na_50_5777191
2018-03-15 18:11:27 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'pikabu.ru': <GET https://pikabu.ru/story/kak_pogoda_50_na_50_5777191>
~~~
/story/chto_mozhet_poyti_ne_tak_5773824
~~~
/story/strannyiy_chelovek_5777133
~~~
/story/kak_ya_zabiral_ayfon_s_pochtyi_rossii_ili_khitryie_kitaytsyi_5776835
~~~
/story/kopirayterskie_slozhnosti_ch14_5776220
~~~
/story/novyiy_televizor_samsung_mozhet_slivatsya_s_poverkhnostyu_5775567
~~~
/story/neobyichnyiy_vkhod_v_podezd_5767500
~~~
/story/muzhchina_khotel_brosit_rabotu_chtobyi_ukhazhivat_za_bolnyim_rakom_syinom_no_kollegi_otrabotali_za_nego_3300_chasov_5770070
~~~
/story/kak_ya_uchilsya_khodit_5776376
~~~
/story/zabavnoe_dialogi_s_zakazchikami_5_5777655
~~~
/story/pro_metallurga_iz_magnitogorska_snyali_yepichnuyu_korotkometrazhku_5774307
~~~
/story/lovkost_ruk_i_nikakogo_moshennichestva_5777007
~~~
/story/kogda_nashelsya_novyiy_sponsor_5769282
~~~
/story/nikto_ne_chitaet_kharakteristiki_5771821
~~~
/story/posmotrite_na_yeti_shedevryi_5777462
2018-03-15 18:11:27 [scrapy.core.engine] INFO: Closing spider (finished)
2018-03-15 18:11:27 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 211,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 39452,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 3, 15, 12, 11, 27, 860712),
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'memusage/max': 46387200,
 'memusage/startup': 46387200,
 'offsite/domains': 1,
 'offsite/filtered': 21,
 'request_depth_max': 1,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2018, 3, 15, 12, 11, 26, 740826)}
2018-03-15 18:11:27 [scrapy.core.engine] INFO: Spider closed (finished)

As you can see, the callback parsePage is never invoked. The logs also show around 20 extracted links (the print statements that output them are not shown in the code above), yet only the request to the first link appears, and it is immediately filtered as offsite; the stats confirm 'offsite/filtered': 21. Why?


1 Answer


Add this to your code:

allowed_domains = ["pikabu.ru"]

allowed_domains must be a list of domains, not a bare string. Scrapy's OffsiteMiddleware iterates over it, and iterating over a string yields single characters rather than domains, so nothing matches the real host and every request to pikabu.ru is filtered as offsite; that is the 'offsite/filtered': 21 in your stats. For more information, read the Scrapy documentation on spiders.

For your links, use urljoin instead of string concatenation; it handles relative paths correctly, but the base URL needs a scheme:

from urllib.parse import urljoin  # on Python 2: from urlparse import urljoin

link = urljoin('https://pikabu.ru', link)

For more information, read the urljoin documentation.
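Since you already have the response in hand, you can also let Scrapy resolve the relative href for you with response.urljoin(), which uses the URL of the page being parsed as the base:

# equivalent, using the crawled page's own URL as the base
item['MainPageUrl'] = response.urljoin(link)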

and add this to your Request:

dont_filter=True

From the Scrapy documentation:

dont_filter (boolean) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Defaults to False.

Besides the duplicates filter, dont_filter also bypasses the offsite filtering that is dropping your requests, which is why it helps here; fixing allowed_domains is still the proper solution.
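Putting the fixes together, a minimal corrected sketch of the spider could look like this (assuming DiplomaItem is defined in the project's diploma.items module; a plain scrapy.Spider is used in place of CrawlSpider because no crawl rules are defined):

import scrapy
from diploma.items import DiplomaItem  # assumed location of the item class


class DiplomaSpider(scrapy.Spider):
    name = "diploma"
    allowed_domains = ["pikabu.ru"]  # a list, not a bare string
    start_urls = ["https://pikabu.ru/hot"]

    def parse(self, response):
        for sel in response.xpath("//div[@class='stories-feed__container']/article[@class='story']"):
            item = DiplomaItem()
            link = sel.xpath('div[2]/header[@class="story__header"]/h2/a/@href').extract_first()
            if link is None:
                continue  # skip articles without a headline link
            item['MainPageUrl'] = response.urljoin(link)
            # pass the half-built item along to the next callback
            yield scrapy.Request(item['MainPageUrl'],
                                 callback=self.parsePage,
                                 meta={'item': item},
                                 dont_filter=True)

    def parsePage(self, response):
        item = response.meta['item']
        return item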
