
Still having trouble with this - can anybody help?

I am trying to scrape pages like this using Scrapy, but the website redirects this URL to here to get the search results. I don't think it's blocking me from scraping; I think this is just how the website works.

Because the redirect URL is always the same, the Scrapy spider thinks it is making duplicate requests to the same page:

2020-01-12 16:14:37 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://search.people.com.cn/cnpeople/news/getNewsResult.jsp> from <GET http://search.people.com.cn/cnpeople/search.do?pageNum=2&keyword=%C8%F0%B5%E4&siteName=news&facetFlag=true&nodeType=belongsId&nodeId=0>
2020-01-12 16:14:37 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://search.people.com.cn/cnpeople/news/getNewsResult.jsp> - no more duplicates will be shown (see DUPEFILTER_DEBUG)
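For context on why this happens: Scrapy's dupefilter fingerprints each request (essentially a hash over method, URL, and body), so two requests redirected to the same `getNewsResult.jsp` URL collapse to one fingerprint and the second is dropped. A simplified stdlib sketch of that idea (not Scrapy's actual implementation, which also canonicalizes the URL):

```python
import hashlib

def fingerprint(method: str, url: str, body: bytes = b"") -> str:
    """Simplified request fingerprint: identical method + URL + body
    always produce the same hash, so a dupefilter built on it drops
    any repeat of an already-seen request."""
    h = hashlib.sha1()
    for part in (method.encode(), url.encode(), body):
        h.update(part)
    return h.hexdigest()

# Two different search.do URLs both 302-redirect to one
# getNewsResult.jsp URL, so the second redirected request
# fingerprints identically and is filtered as a duplicate.
redirected = "http://search.people.com.cn/cnpeople/news/getNewsResult.jsp"
seen = set()
for request_url in (redirected, redirected):
    fp = fingerprint("GET", request_url)
    if fp in seen:
        print("Filtered duplicate request:", request_url)
    else:
        seen.add(fp)
        print("Crawling:", request_url)
```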

Any way I can follow the redirect and then scrape the links from the landing pages afterwards?

Thanks for any help anybody can offer. I would really appreciate it, as this is presenting a real barrier to me scraping news articles from this Chinese site.

  • Does [this](https://stackoverflow.com/a/27949956/9491733) answer your question? – Moein Kameli Jan 12 '20 at 17:30
  • Thanks for replying! I don't think so. Chad Casey's solution - modifying make_requests_from_url by inserting that code into the spider - doesn't seem to do anything. Using Request meta, I then get Scrapy saying: "Ignoring response <302 http://search.people.com.cn/cnpeople/search.do?pageNum=7&keyword=%C8%F0%B5%E4&siteName=news&facetFlag=true&nodeType=belongsId&nodeId=0>: HTTP status code is not handled or not allowed". – Nick Olczak Jan 12 '20 at 18:01
  • @Piron part of the problem is I don’t really want to stop the redirect, so much as follow it and then collect the data from the redirected page. – Nick Olczak Jan 12 '20 at 19:15
  • 2
    The website in question seems to be a bit broken in it's non-standard behaviour. Have you tried simply turning off duplicate filter in scrapy? add `dont_filter=True` keyword argument to your `Request` object. – Granitosaurus Jan 13 '20 at 05:35
  • @Granitosaurus thank you so much! That seems to work partly - and is the closest I got to a solution so far... Any idea why the link Scrapy grabs sometimes is the start url? E.g. `2020-01-13 17:35:34 [scrapy.core.scraper] DEBUG: Scraped from <200 http://search.people.com.cn/cnpeople/news/getNewsResult.jsp> {'link': '/cnpeople/search.do?pageNum=1&keyword=%C8%F0%B5%E4&siteName=news&dateFlag=false&facetFlag=true&nodeType=belongsId&nodeId=1002'}` – Nick Olczak Jan 13 '20 at 16:49
  • 1
    This seems related to https://github.com/scrapy/scrapy/issues/1225 – Gallaecio Feb 19 '20 at 10:55

0 Answers