0

I am new to Scrapy framework & currently using it to extract articles from multiple 'Health & Wellness' websites. For some of the requests, scrapy is redirecting to homepage(this behavior is not observed in browser). Below is an example:

Command: scrapy shell "http://www.bornfitness.com/blog/page/10/" Result: 2015-06-19 21:32:15+0530 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080 2015-06-19 21:32:15+0530 [default] INFO: Spider opened 2015-06-19 21:32:15+0530 [default] DEBUG: Redirecting (301) to http://www.bornfitness.com/> from http://www.bornfitness.com/blog/page/10/> 2015-06-19 21:32:16+0530 [default] DEBUG: Crawled (200) http://www.bornfitness.com/> (referer: None)

Note that the page number in url(10) is a two-digit number. I don't see this issue with urls with single-sigit page number(8 for example). Result: 2015-06-19 21:43:15+0530 [default] INFO: Spider opened 2015-06-19 21:43:16+0530 [default] DEBUG: Crawled (200) http://www.bornfitness.com/blog/page/8/> (referer: None)

Aditya
  • 13
  • 2
  • Are you getting a 301 _before_ the 200 for the successful page load? i.e. -- does page 8 redirect you to page 8? – tegancp Jun 19 '15 at 19:05

1 Answers1

1

When you have trouble replicating browser behavior using scrapy, you generally want to look at what are those things which are being communicated differently when your browser is talking to the website compared with when your spider is talking to the website. Remember that a website is (almost always) not designed to be nice to webcrawlers, but to interact with web browsers.

For your situation, if you look at the headers being sent with your scrapy request, you should see something like:

In [1]: request.headers
Out[1]:
{'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
 'Accept-Encoding': 'gzip,deflate',
 'Accept-Language': 'en',
 'User-Agent': 'Scrapy/0.24.6 (+http://scrapy.org)'}

If you examine the headers sent by a request for the same page by your web browser, you might see something like:

**Request Headers**

GET /blog/page/10/ HTTP/1.1    
Host: www.bornfitness.com    
Connection: keep-alive    
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.124 Safari/537.36
DNT: 1    
Referer: http://www.bornfitness.com/blog/page/11/
Accept-Encoding: gzip, deflate, sdch    
Accept-Language: en-US,en;q=0.8
Cookie: fealty_segment_registeronce=1; ... ... ...

Try changing the User-Agent in your request. This should allow you to get around the redirect.

tegancp
  • 1,204
  • 6
  • 13
  • Thanks, changing USER_AGENT from default 'Scrapy/0.24.6 (+http://scrapy.org)' to 'born_fitness'(or anything) resolved the issue. Any idea why this is happening only for some urls(/page/10/ but not /page/8/) & why only for USER_AGENT 'Scrapy/0.24.6 (+http://scrapy.org)' ? – Aditya Jun 20 '15 at 20:53