Still having trouble with this - can anybody help?
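I am trying to scrape search result pages from search.people.com.cn using Scrapy, for example http://search.people.com.cn/cnpeople/search.do?pageNum=2&keyword=%C8%F0%B5%E4&siteName=news&facetFlag=true&nodeType=belongsId&nodeId=0. The website redirects (302) each of these URLs to http://search.people.com.cn/cnpeople/news/getNewsResult.jsp, which serves the search results. I don't think the site is blocking me from scraping; this seems to be just how the website works.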
Because the redirect target is always the same URL, Scrapy's duplicate filter treats every redirected request as a duplicate of the same page and drops it:
2020-01-12 16:14:37 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://search.people.com.cn/cnpeople/news/getNewsResult.jsp> from <GET http://search.people.com.cn/cnpeople/search.do?pageNum=2&keyword=%C8%F0%B5%E4&siteName=news&facetFlag=true&nodeType=belongsId&nodeId=0>
2020-01-12 16:14:37 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://search.people.com.cn/cnpeople/news/getNewsResult.jsp> - no more duplicates will be shown (see DUPEFILTER_DEBUG)
Is there any way I can follow the redirect and then scrape the links from the landing pages afterwards?
Thanks for any help anybody can offer. I would really appreciate it, as this is presenting a real barrier to me scraping news articles from this Chinese site.