
I am trying to scrape articles from a news website that has an infinite-scroll layout, which works as follows:

example.com has first page of articles

example.com/page/2/ has second page

example.com/page/3/ has third page

And so on. As you scroll down, the URL changes. To account for that, I wanted to scrape the first x pages of articles and did the following:

start_urls = ['http://example.com/']
for page in range(1, x):
    new_url = 'http://www.example.com/page/' + str(page) + '/'
    start_urls.append(new_url)

It seems to work fine for the first 9 pages and I get something like the following:

Redirecting (301) to <GET http://example.com/page/4/> from <GET http://www.example.com/page/4/>
Redirecting (301) to <GET http://example.com/page/5/> from <GET http://www.example.com/page/5/>
Redirecting (301) to <GET http://example.com/page/6/> from <GET http://www.example.com/page/6/>
Redirecting (301) to <GET http://example.com/page/7/> from <GET http://www.example.com/page/7/>
2017-09-08 17:36:23 [scrapy.extensions.logstats] INFO: Crawled 3 pages (at 3 pages/min), scraped 0 items (at 0 items/min)
Redirecting (301) to <GET http://example.com/page/8/> from <GET http://www.example.com/page/8/>
Redirecting (301) to <GET http://example.com/page/9/> from <GET http://www.example.com/page/9/>
Redirecting (301) to <GET http://www.example.com/> from <GET http://www.example.com/page/10/>
Redirecting (301) to <GET http://www.example.com/> from <GET http://www.example.com/page/11/>
Redirecting (301) to <GET http://www.example.com/> from <GET http://www.example.com/page/12/>
Redirecting (301) to <GET http://www.example.com/> from <GET http://www.example.com/page/13/>

Starting from page 10, it redirects to example.com/ instead of the requested page, example.com/page/10/. What could be causing this behavior?

I looked into a couple of options like dont_redirect, but I just don't understand what is happening. What could be the reason for this redirection behavior, especially since no redirection happens when you type the link, like example.com/page/10, directly into the browser?

Any help would be greatly appreciated, thanks!!

[EDIT]

class spider(CrawlSpider):
    start_urls = ['http://example.com/']

    for x in range(startPage, endPage):
        new_url = 'http://www.example.com/page/' + str(x) + '/'
        start_urls.append(new_url)

    custom_settings = {'DEPTH_PRIORITY': 1, 'DEPTH_LIMIT': 1}

    rules = (
        Rule(LinkExtractor(allow=('some regex here',),
                           deny=('example\.com/page/.*', 'some other regex',)),
             callback='parse_article'),
    )

    def parse_article(self, response):
        # some parsing work here
        yield item

Is it because I include example\.com/page/.* in the LinkExtractor's deny pattern? Shouldn't that only apply to extracted links, not to the start_urls themselves?

ocean800

1 Answer


It looks like this site uses some kind of security check based only on the User-Agent request header.

So you only need to add a common User-Agent in the settings.py file:

USER_AGENT = 'Mozilla/5.0'

Also, the spider doesn't necessarily need the start_urls attribute to get its starting pages; you can use the start_requests method instead. Replace all the start_urls construction with:

from scrapy import Request

class spider(CrawlSpider):

    ...

    def start_requests(self):
        for x in range(1, 20):
            yield Request('http://www.example.com/page/' + str(x) + '/')

    ...
eLRuLL
  • Thanks! Are you saying that because of the user-agent it was re-directing? Why would my user-agent not be the same for all requests? Do requests not get re-directed when you issue them directly with a `Request()`? Sorry, just trying to understand more, thank you! – ocean800 Sep 23 '17 at 02:42
  • the `USER_AGENT` in `settings.py` is used by all the requests executed in the spider – eLRuLL Sep 23 '17 at 02:46
  • Thanks, just to clarify, I'm asking why the changes above would stop the redirect of `example.com/page/10/` --> `example.com` when the request is issued from the spider if that makes sense. – ocean800 Sep 23 '17 at 02:49
  • have you tried what I suggested? There isn't always an explanation for how a site you didn't write works; you just need to find out how to make it work, and setting the User-Agent for the requests solves this case. – eLRuLL Sep 23 '17 at 02:53