
Still having trouble with this - can anybody help?

I am trying to scrape pages like this using Scrapy, but the website redirects this URL to here to get the search results. I don't think it's blocking me from scraping; I think this is just how the website works.

Because the redirect URL is always the same, the Scrapy spider thinks it is making duplicate requests to the same page:

2020-01-12 16:14:37 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://search.people.com.cn/cnpeople/news/getNewsResult.jsp> from <GET http://search.people.com.cn/cnpeople/search.do?pageNum=2&keyword=%C8%F0%B5%E4&siteName=news&facetFlag=true&nodeType=belongsId&nodeId=0>
2020-01-12 16:14:37 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://search.people.com.cn/cnpeople/news/getNewsResult.jsp> - no more duplicates will be shown (see DUPEFILTER_DEBUG)
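For context on why this happens: Scrapy's dupefilter fingerprints each request (essentially a hash over method, URL, and body), so two requests redirected to the same `getNewsResult.jsp` URL collapse to one fingerprint and the second is dropped. A simplified stdlib sketch of that idea (not Scrapy's actual implementation, which also canonicalizes the URL):

```python
import hashlib

def fingerprint(method: str, url: str, body: bytes = b"") -> str:
    """Simplified request fingerprint: identical method + URL + body
    always produce the same hash, so a dupefilter built on it drops
    any repeat of an already-seen request."""
    h = hashlib.sha1()
    for part in (method.encode(), url.encode(), body):
        h.update(part)
    return h.hexdigest()

# Two different search.do URLs both 302-redirect to one
# getNewsResult.jsp URL, so the second redirected request
# fingerprints identically and is filtered as a duplicate.
redirected = "http://search.people.com.cn/cnpeople/news/getNewsResult.jsp"
seen = set()
for request_url in (redirected, redirected):
    fp = fingerprint("GET", request_url)
    if fp in seen:
        print("Filtered duplicate request:", request_url)
    else:
        seen.add(fp)
        print("Crawling:", request_url)
```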

Any way I can follow the redirect and then scrape the links from the landing pages afterwards?

Thanks for any help anybody can offer. I would really appreciate it, as this is presenting a real barrier to me scraping news articles from this Chinese site.

  • Does [this](https://stackoverflow.com/a/27949956/9491733) answer your question? – Moein Kameli Jan 12 '20 at 17:30
  • Thanks for replying! I don't think so. Chad Casey's solution - modifying make_requests_from_url by inserting that code into the spider - doesn't seem to do anything. Using Request meta, I then get Scrapy saying: "Ignoring response <302 http://search.people.com.cn/cnpeople/search.do?pageNum=7&keyword=%C8%F0%B5%E4&siteName=news&facetFlag=true&nodeType=belongsId&nodeId=0>: HTTP status code is not handled or not allowed". – Nick Olczak Jan 12 '20 at 18:01
  • @Piron part of the problem is I don’t really want to stop the redirect, so much as follow it and then collect the data from the redirected page. – Nick Olczak Jan 12 '20 at 19:15
  • 2
    The website in question seems to be a bit broken in it's non-standard behaviour. Have you tried simply turning off duplicate filter in scrapy? add `dont_filter=True` keyword argument to your `Request` object. – Granitosaurus Jan 13 '20 at 05:35
  • @Granitosaurus thank you so much! That seems to work partly - and is the closest I got to a solution so far... Any idea why the link Scrapy grabs sometimes is the start url? E.g. `2020-01-13 17:35:34 [scrapy.core.scraper] DEBUG: Scraped from <200 http://search.people.com.cn/cnpeople/news/getNewsResult.jsp> {'link': '/cnpeople/search.do?pageNum=1&keyword=%C8%F0%B5%E4&siteName=news&dateFlag=false&facetFlag=true&nodeType=belongsId&nodeId=1002'}` – Nick Olczak Jan 13 '20 at 16:49
  • 1
    This seems related to https://github.com/scrapy/scrapy/issues/1225 – Gallaecio Feb 19 '20 at 10:55

0 Answers