
I'm working with the CrawlSpider class to crawl a website and I would like to modify the headers that are sent in each request. Specifically, I would like to add the referer to the request.

As per this question, I checked

response.request.headers.get('Referer', None)

in my response parsing function and the Referer header is not present. I assume that means the Referer is not being submitted in the request (unless the website doesn't return it, I'm not sure on that).

I haven't been able to figure out how to modify the headers of a request. Again, my spider is derived from CrawlSpider. Overriding CrawlSpider's _requests_to_follow or specifying a process_request callback for a rule will not work because the referer is not in scope at those times.

Does anyone know how to modify request headers dynamically?

Community
CatShoes

2 Answers


You can set the Referer header manually on each request using the headers argument:

yield Request(url=..., callback=..., headers={'Referer': ...})

RefererMiddleware does the same thing automatically, taking the referrer URL from the response that generated the request.

warvariuc
  • Great, I will keep that in mind for the future. In the current setup, I'm not creating requests manually (my rules are taking care of that job). – CatShoes Jan 09 '13 at 13:30

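You have to enable the spider middleware that populates the Referer header on outgoing requests. See the documentation for scrapy.contrib.spidermiddleware.referer.RefererMiddleware (in Scrapy 1.0 and later the path is scrapy.spidermiddlewares.referer.RefererMiddleware).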

In short, you need to add this middleware to your project's settings file.

SPIDER_MIDDLEWARES = {
    # The value is the middleware's order; 700 is the order Scrapy gives it by default.
    'scrapy.contrib.spidermiddleware.referer.RefererMiddleware': 700,
}

Then, in your response parsing method, you can use response.request.headers.get('Referer', None) to get the referer. Note the header's spelling: 'Referer', with a single "r" in the middle.

Smart Manoj
CatShoes
  • RefererMiddleware is active by default in the base settings, so there is no need to activate it in your spider settings. – akhter wahab Jan 10 '13 at 13:55
  • @akhterwahab Hmm. I didn't have referers in my request headers until I added the above to my project settings, which hadn't been previously modified. I do see that the default for the setting is true. Nevertheless they weren't working for me. – CatShoes Jan 10 '13 at 15:22
  • btw: response.request.headers.get('Referer', None) is the correct usage. 'Referrer' will not give correct results. – BgRva Feb 12 '14 at 00:36
  • Can you add a cookie to every request using this method? – Parthapratim Neog Sep 17 '15 at 05:50
  • @ParthapratimNeog yes you can add using `headers={"Cookie":"ubid-acbuk=253-3565238-8236411; session-id=256-4008452-7910856"}` – Umair Ayub Oct 09 '16 at 01:21