I've rewritten this question to make clearer what I need help with. I'm trying to scrape a page of search results like this one:

http://search.people.com.cn/cnpeople/search.do?pageNum=1&keyword=%C8%F0%B5%E4&siteName=news&facetFlag=true&nodeType=belongsId&nodeId=0

But when I run it in Scrapy, the requests seem to be redirected:

2020-01-10 09:55:38 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://search.people.com.cn/cnpeople/news/getNewsResult.jsp> from <GET http://search.people.com.cn/cnpeople/search.do?pageNum=7&keyword=%C8%F0%B5%E4&siteName=news&facetFlag=true&nodeType=belongsId&nodeId=0>

And then nothing is scraped.

Is that just the way the website works to redirect me to a list of results, or is it trying to prevent me scraping it? Is there anything I can do?
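One way to tell the two cases apart might be to request a single search URL without following the redirect and look at what the server actually sends back. The following is only a minimal sketch using the Python standard library rather than Scrapy; `NoRedirectHandler` and `inspect_redirect` are hypothetical names introduced here for illustration, and the URL is the one from the question.

```python
import urllib.error
import urllib.request

# URL copied from the question; everything else in this sketch is illustrative.
SEARCH_URL = (
    "http://search.people.com.cn/cnpeople/search.do"
    "?pageNum=1&keyword=%C8%F0%B5%E4&siteName=news"
    "&facetFlag=true&nodeType=belongsId&nodeId=0"
)


class NoRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Hypothetical helper: refuse to follow redirects so the 302 stays visible."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None makes urllib raise HTTPError instead of following.
        return None


def inspect_redirect(url, timeout=10):
    """Return (status, Location header, Set-Cookie header) for one request."""
    opener = urllib.request.build_opener(NoRedirectHandler)
    try:
        with opener.open(url, timeout=timeout) as resp:
            return (resp.status,
                    resp.headers.get("Location"),
                    resp.headers.get("Set-Cookie"))
    except urllib.error.HTTPError as err:
        # 3xx responses land here because the handler declined to follow them.
        return (err.code,
                err.headers.get("Location"),
                err.headers.get("Set-Cookie"))
```

Calling `inspect_redirect(SEARCH_URL)` performs one real network request. If the 302 comes with a `Set-Cookie` header, that would hint that the results page depends on session state set by the first request, although that is a guess about this site and not something the log above confirms.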

Below is my spider code:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "RMW"

    def start_requests(self):
        # Request result pages 1 through 9 of the search.
        for num in range(1, 10):
            url = 'http://search.people.com.cn/cnpeople/search.do?pageNum='+str(num)+'&keyword=%C8%F0%B5%E4&siteName=news&facetFlag=true&nodeType=belongsId&nodeId=0'
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Yield the first link found inside each <ul> element on the page.
        for link in response.css("ul"):
            yield {
                'link': link.css("a::attr(href)").get()
            }
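The page URLs in `start_requests` are built by string concatenation. As a possible alternative, here is a small sketch that builds the same URLs from a parameter dictionary with `urllib.parse.urlencode`. The helper names `build_search_url` and `build_search_urls` are hypothetical, and treating the keyword bytes as GBK is an assumption about the site's encoding.

```python
from urllib.parse import unquote, urlencode

BASE_URL = "http://search.people.com.cn/cnpeople/search.do"

# Keyword bytes copied from the question's URL. Decoding them as GBK is an
# assumption about the site's encoding, not something the source confirms.
KEYWORD = unquote("%C8%F0%B5%E4", encoding="gbk")


def build_search_url(page):
    """Build the search URL for one results page (hypothetical helper)."""
    params = {
        "pageNum": page,
        "keyword": KEYWORD,
        "siteName": "news",
        "facetFlag": "true",
        "nodeType": "belongsId",
        "nodeId": 0,
    }
    # encoding="gbk" re-encodes the keyword to the same bytes as the original URL.
    return BASE_URL + "?" + urlencode(params, encoding="gbk")


def build_search_urls(first=1, last=9):
    """URLs for pages first..last inclusive, matching range(1, 10) in the spider."""
    return [build_search_url(page) for page in range(first, last + 1)]
```

Inside the spider, `start_requests` could then loop over `build_search_urls()` and yield one `scrapy.Request` per URL. This only tidies the URL construction and does not by itself change how the server responds.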

I'd really appreciate any help resolving this from somebody with more expertise in the area.

Nick Olczak
  • This may help you: https://stackoverflow.com/questions/22795416/how-to-handle-302-redirect-in-scrapy. It handles redirections. – Dawid Gacek Jan 10 '20 at 09:22
  • Thanks, I had been looking at that. I'm not sure whether in my case the redirect is the server blocking me from scraping or just part of the way the website delivers search results. I was hoping somebody might advise on this... – Nick Olczak Jan 10 '20 at 11:12

0 Answers