
I am using Scrapy for a scraping project with this URL: https://www.walmart.ca/en/clothing-shoes-accessories/men/mens-tops/N-2566+11

I tried to play with the URL and open it in the shell, but it got a 430 error, so I added some settings to the headers like this:

scrapy shell -s COOKIES_ENABLED=1 -s USER_AGENT='Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:46.0) Gecko/20100101 Firefox/46.0' "https://www.walmart.ca/en/clothing-shoes-accessories/men/mens-tops/N-2566+11"

It fetched the page with a 200 status, but once I use view(response), it directs me to a page that says: "Sorry! Your web browser is not accepting cookies."

Here is a screenshot of the log: [screenshot]

Hat hout

4 Answers


You should have

COOKIES_ENABLED = True

in your settings.py file.

Also set

COOKIES_DEBUG = True

to debug cookies: you will see which cookies are incoming and outgoing with each response and request, respectively.
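
Put together, a minimal settings.py sketch (using only these two standard Scrapy settings) would be:

# settings.py -- minimal sketch of the two settings above
COOKIES_ENABLED = True   # let Scrapy store and resend cookies
COOKIES_DEBUG = True     # log Cookie/Set-Cookie headers for each request/response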

Umair Ayub
  • It did not solve the problem. Here is the log: `Set-Cookie: akaau_P1=1497629246~id=6be87a77f26506d101e24517432b9abc; path=/ 2017-06-16 17:37:26 [scrapy.core.engine] DEBUG: Crawled (403) (referer: None) 2017-06-16 17:37:26 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.walmart.ca/en/clothing-shoes-accessories/men/mens-tops/N-2566+11>: HTTP status code is not handled or not allowed` – Hat hout Jun 16 '17 at 15:40

If the web page requires you to click to accept cookies, you can use FormRequest.from_response.

Here is an example with Google's consent page:

from scrapy import FormRequest, Request

def start_requests(self):
    yield Request(
        "https://google.com/",
        callback=self.parse_consent,
    )

def parse_consent(self, response):
    # Submit the consent form by "clicking" the accept button
    yield FormRequest.from_response(
        response,
        clickdata={"value": "I agree"},
        callback=self.parse_query,
        dont_filter=True,
    )

def parse_query(self, response):
    for keyword in self.keywords:
        yield Request(
            <google_url_to_parse>,
            callback=<your_callback>,
            dont_filter=True,
        )

Note that the value of clickdata may differ based on your location/language; you should change "I agree" to the correct value.
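
If you are not sure what the correct button value is for your locale, one way to find it (a hypothetical inspection snippet, not part of the answer above) is to log the submit buttons present in the form:

def parse_consent(self, response):
    # List the submit buttons in the consent form so you can pick
    # the right clickdata value for your locale
    for button in response.xpath("//form//input[@type='submit']"):
        self.logger.info("button value: %s", button.attrib.get("value"))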

Larsen

Try sending all the required headers:

headers = {
    'dnt': '1',
    'accept-encoding': 'gzip, deflate, sdch, br',
    'accept-language': 'en-US,en;q=0.8',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'cache-control': 'max-age=0',
    'authority': 'www.walmart.ca',
    'cookie': 'JSESSIONID=E227789DA426B03664F0F5C80412C0BB.restapp-108799501-8-112264256; cookieLanguageType=en; deliveryCatchment=2000; marketCatchment=2001; zone=2; originalHttpReferer=; walmart.shippingPostalCode=V5M2G7; defaultNearestStoreId=1015; walmart.csrf=6f635f71ab4ae4479b8e959feb4f3e81d0ac9d91-1497631184063-441217ff1a8e4a311c2f9872; wmt.c=0; userSegment=50-percent; akaau_P1=1497632984~id=bb3add0313e0873cf64b5e0a73e3f5e3; wmt.breakpoint=d; TBV=7; ENV=ak-dal-prod; AMCV_C4C6370453309C960A490D44%40AdobeOrg=793872103%7CMCIDTS%7C17334',
    'referer': 'https://www.walmart.ca/en/clothing-shoes-accessories/men/mens-tops/N-2566+11',
}

yield Request(url = 'https://www.walmart.ca/en/clothing-shoes-accessories/men/mens-tops/N-2566+11', headers=headers)

You can implement it in your spider like this. Instead of using start_urls, I would recommend the start_requests() method; it is easier to read.

from scrapy import Request
from scrapy.spiders import CrawlSpider

# CravlingItem comes from the asker's items.py; adjust the import path
# for your own project
from myproject.items import CravlingItem


class EasySpider(CrawlSpider):
    name = 'easy'

    def start_requests(self):
        headers = {
            'dnt': '1',
            'accept-encoding': 'gzip, deflate, sdch, br',
            'accept-language': 'en-US,en;q=0.8',
            'upgrade-insecure-requests': '1',
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'cache-control': 'max-age=0',
            'authority': 'www.walmart.ca',
            'cookie': 'JSESSIONID=E227789DA426B03664F0F5C80412C0BB.restapp-108799501-8-112264256; cookieLanguageType=en; deliveryCatchment=2000; marketCatchment=2001; zone=2; originalHttpReferer=; walmart.shippingPostalCode=V5M2G7; defaultNearestStoreId=1015; walmart.csrf=6f635f71ab4ae4479b8e959feb4f3e81d0ac9d91-1497631184063-441217ff1a8e4a311c2f9872; wmt.c=0; userSegment=50-percent; akaau_P1=1497632984~id=bb3add0313e0873cf64b5e0a73e3f5e3; wmt.breakpoint=d; TBV=7; ENV=ak-dal-prod; AMCV_C4C6370453309C960A490D44%40AdobeOrg=793872103%7CMCIDTS%7C17334',
            'referer': 'https://www.walmart.ca/en/clothing-shoes-accessories/men/mens-tops/N-2566+11',
        }

        yield Request(
            url='https://www.walmart.ca/en/clothing-shoes-accessories/men/mens-tops/N-2566+11',
            callback=self.parse_item,
            headers=headers,
        )

    def parse_item(self, response):
        i = CravlingItem()
        i['title'] = " ".join(response.xpath('//a/text()').extract()).strip()
        yield i
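
As a side note (not part of the answer above): if you prefer not to repeat the headers on every request, Scrapy's standard USER_AGENT and DEFAULT_REQUEST_HEADERS settings can set them globally. A minimal sketch:

# settings.py -- sketch of setting the same headers globally
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'

DEFAULT_REQUEST_HEADERS = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'accept-language': 'en-US,en;q=0.8',
    'dnt': '1',
    'upgrade-insecure-requests': '1',
}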
Umair Ayub
  • Could you please explain how to implement this in the code? Here is my code: `class EasySpider(CrawlSpider): name = 'easy' start_urls = ['https://www.walmart.ca/en/clothing-shoes-accessories/men/mens-tops/N-2566+11'] def parse_item(self, response): i = CravlingItem() i['title'] = " ".join( response.xpath('//a/text()').extract()).strip() return i` – Hat hout Jun 18 '17 at 03:01

I can confirm that the COOKIES_ENABLED setting does not help in fixing the error. Instead, using the following Googlebot USER_AGENT made it work:

Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; http://www.google.com/bot.html) Chrome/W.X.Y.Z Safari/537.36

I figured this out thanks to the person who made this script, which uses that User Agent to make the requests: https://github.com/juansimon27/scrapy-walmart/blob/master/product_scraping/spiders/spider.py
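
For reference, here is a minimal sketch of applying that user agent to a single spider via Scrapy's custom_settings (the spider name and structure are illustrative, not taken from the linked script):

import scrapy

class WalmartSpider(scrapy.Spider):
    name = 'walmart'  # hypothetical spider name
    start_urls = ['https://www.walmart.ca/en/clothing-shoes-accessories/men/mens-tops/N-2566+11']

    # Per-spider override of the USER_AGENT setting; W.X.Y.Z is a
    # placeholder version, exactly as in the user-agent string above
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; http://www.google.com/bot.html) Chrome/W.X.Y.Z Safari/537.36',
    }

    def parse(self, response):
        pass  # your parsing logic here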

Leonardo Kuffo