23

I am trying to create a Reddit scraper using Python's Scrapy framework.

I have used CrawlSpider to crawl through Reddit and its subreddits. But when I come across pages that have adult content, the site asks for a cookie over18=1.

So I have been trying to send a cookie with every request that the spider makes, but it's not working out.

Here is my spider code. As you can see, I tried to add a cookie with every spider request using the start_requests() method.

Could anyone here tell me how to do this? Or what I have been doing wrong?

from scrapy import Spider
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from reddit.items import RedditItem
from scrapy.http import Request, FormRequest

class MySpider(CrawlSpider):
    name = 'redditscraper'
    allowed_domains = ['reddit.com', 'imgur.com']
    start_urls = ['https://www.reddit.com/r/nsfw']

    rules = (
        Rule(LinkExtractor(
            allow=['/r/nsfw/\?count=\d*&after=\w*']),
            callback='parse_item',
            follow=True),
    )

    def start_requests(self):
        for i,url in enumerate(self.start_urls):
            print(url)
            yield Request(url,cookies={'over18':'1'},callback=self.parse_item)

    def parse_item(self, response):
        titleList = response.css('a.title')

        for title in titleList:
            item = RedditItem()
            item['url'] = title.xpath('@href').extract()
            item['title'] = title.xpath('text()').extract()
            yield item
Parthapratim Neog

5 Answers

18

Okay. Try doing something like this.

def start_requests(self):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36'}
    for i,url in enumerate(self.start_urls):
        yield Request(url,cookies={'over18':'1'}, callback=self.parse_item, headers=headers)

It's the default User-Agent that gets you blocked.
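
If you'd rather not attach the header to every request by hand, the same User-Agent can also be set project-wide via Scrapy's USER_AGENT setting; a minimal sketch, reusing the UA string above:

# settings.py -- project-wide alternative to passing headers= on each Request
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36')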

Edit:

I don't know what's wrong with CrawlSpider, but a plain Spider works anyway:

#!/usr/bin/env python
# encoding: utf-8
import scrapy


class MySpider(scrapy.Spider):
    name = 'redditscraper'
    allowed_domains = ['reddit.com', 'imgur.com']
    start_urls = ['https://www.reddit.com/r/nsfw']

    def request(self, url, callback):
        """
         wrapper for scrapy.request
        """
        request = scrapy.Request(url=url, callback=callback)
        request.cookies['over18'] = 1
        request.headers['User-Agent'] = (
            'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, '
            'like Gecko) Chrome/45.0.2454.85 Safari/537.36')
        return request

    def start_requests(self):
        for i, url in enumerate(self.start_urls):
            yield self.request(url, self.parse_item)

    def parse_item(self, response):
        titleList = response.css('a.title')

        for title in titleList:
            item = {}
            item['url'] = title.xpath('@href').extract()
            item['title'] = title.xpath('text()').extract()
            yield item
        url = response.xpath('//a[@rel="nofollow next"]/@href').extract_first()
        if url:
            yield self.request(url, self.parse_item)
        # you may consider scrapy.pipelines.images.ImagesPipeline :D
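
On that last comment: the built-in ImagesPipeline is switched on through settings plus two extra item fields; a rough sketch, where IMAGES_STORE is whatever writable directory you choose:

# settings.py (sketch) -- enable Scrapy's built-in images pipeline
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = '/path/to/image/dir'  # assumption: any writable directory

# the items you yield would then carry an 'image_urls' list of links to
# download; the pipeline stores the files and fills in an 'images' field
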
esfy
  • It worked. But before I accept your answer, I think the cookie only works for the first request; it's not working for the pagination requests, i.e. it's only working for the `start_urls` but not for the paginated URLs that we get from the `LinkExtractor`. – Parthapratim Neog Sep 17 '15 at 06:59
  • Actually the problem is something else: if I use the `start_requests()` method, the crawling stops at one page, but if I remove it, it starts crawling the pagination. Wonder why! – Parthapratim Neog Sep 17 '15 at 07:17
  • Aw, the cookies set by the client won't keep themselves between requests like the ones sent by the server. Maybe `cookiejar` will do. – esfy Sep 17 '15 at 08:29
  • It's working, great, thanks for the help. I didn't realize you posted an edit; funny, I didn't get any notifications. Anyway, I have to do some research on what's wrong with `CrawlSpider`. – Parthapratim Neog Sep 19 '15 at 11:06
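
For reference, the `cookiejar` meta key mentioned in the last comments looks roughly like this; a sketch only, slotting into the Spider above, and note the key has to be passed along explicitly on every follow-up request:

def start_requests(self):
    # open a named cookie session and set the over18 cookie in it
    yield scrapy.Request(self.start_urls[0], cookies={'over18': '1'},
                         meta={'cookiejar': 1}, callback=self.parse_item)

def parse_item(self, response):
    # ... yield items as above ...
    next_url = response.xpath('//a[@rel="nofollow next"]/@href').extract_first()
    if next_url:
        # reuse the same cookiejar so the cookie travels with the pagination
        yield scrapy.Request(next_url, callback=self.parse_item,
                             meta={'cookiejar': response.meta['cookiejar']})
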
5

From the Scrapy docs, there are two ways to pass cookies to a Request:

1. Using a dict:

request_with_cookies = Request(url="http://www.example.com",
                               cookies={'currency': 'USD', 'country': 'UY'})

2. Using a list of dicts:

request_with_cookies = Request(url="http://www.example.com",
                               cookies=[{'name': 'currency',
                                        'value': 'USD',
                                        'domain': 'example.com',
                                        'path': '/currency'}])
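
Applied to the cookie from the question, the dict form would look something like this (a sketch):

request_with_cookies = Request(url='https://www.reddit.com/r/nsfw',
                               cookies={'over18': '1'})
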
CTD
4

You can also send it via the Cookie header:

scrapy.Request(url=url, callback=callback, headers={'Cookie':my_cookie})
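
Applied to this question, the header value would be the literal cookie string. A sketch, with one caveat that is my assumption to verify: Scrapy's cookies middleware may rebuild the Cookie header from its own jar, so meta={'dont_merge_cookies': True} (or COOKIES_ENABLED = False) can be needed for a hand-set header to survive:

scrapy.Request(url='https://www.reddit.com/r/nsfw',
               callback=self.parse_item,
               headers={'Cookie': 'over18=1'},
               # assumption: stops the cookies middleware from overwriting the header
               meta={'dont_merge_cookies': True})
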
Aminah Nuraini
4

You could use the process_request parameter in the rule, something like:

rules = (
    Rule(LinkExtractor(
        allow=['/r/nsfw/\?count=\d*&after=\w*']),
        callback='parse_item',
        process_request='ammend_req_header',
        follow=True),
)

def ammend_req_header(self, request):
    request.cookies['over18'] = 1
    return request
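
One design note on this approach (my reading of CrawlSpider, so treat it as an assumption): the Rule only processes links the extractor finds, not the initial request built from start_urls, so that first request still needs the cookie set separately, e.g.:

def start_requests(self):
    # cookie the very first request; the Rule's process_request handles the rest
    # (Request imported as in the question's spider: from scrapy.http import Request)
    yield Request(self.start_urls[0], cookies={'over18': '1'})
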
Aurélien B
tomstell
1

I found a solution for CrawlSpider:

def start_requests(self):
    yield Request(url=self.start_urls[0], callback=self._parse, cookies={'beget': 'begetok'})
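
Adapted to the cookie this question needs, that would be something like the sketch below; note that self._parse is CrawlSpider's internal entry point, so whether it exists depends on the Scrapy version:

def start_requests(self):
    # (Request imported as in the question's spider: from scrapy.http import Request)
    yield Request(url=self.start_urls[0], callback=self._parse,
                  cookies={'over18': '1'})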