
To be clear, I'm trying to crawl casino forums. So far I have succeeded in doing so using the same scheme as below:

import scrapy
from urllib.parse import urlparse
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule


class test_spider(scrapy.Spider):
    count = 0

    name = "test_spyder"

    start_urls = [
        'https://casinogrounds.com/forum/search/?&q=Casino&search_and_or=or&sortby=relevancy',
    ]

    # Note: rules are only honoured by CrawlSpider, so this attribute is ignored on a
    # plain scrapy.Spider; pagination is handled manually in parse() below.
    rules = (Rule(LinkExtractor(restrict_css=('a:contains("Next")::attr(href)')), callback='parse'),)

    def parse(self, response):
        print(self.count)
        # follow every thread link on the current search-results page
        for href in response.css("span.ipsType_break.ipsContained a::attr(href)"):
            new_url = response.urljoin(href.extract())
            # print(new_url)
            yield scrapy.Request(new_url, callback=self.parse_review)

        # follow the "Next" link to the next page of search results
        next_page = response.css('a:contains("Next")::attr(href)').extract_first()
        print(next_page)
        if next_page is not None:
            yield scrapy.Request(next_page, callback=self.parse)

    def parse_review(self, response):
        parsed_uri = urlparse(response.url)
        domain = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
        # one item per post in the thread
        for review in response.css('article.cPost.ipsBox.ipsComment.ipsComment_parent.ipsClearfix.ipsClear.ipsColumns.ipsColumns_noSpacing.ipsColumns_collapsePhone'):
            yield {
                'name': review.css('strong a.ipsType_break::text').extract_first(),
                'date': review.css('time::attr(title)').extract_first(),
                'review': review.css('p::text').extract(),
                'url': response.url
            }

        # paginate within the thread itself
        next_page = response.css('li.ipsPagination_next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse_review)

So when I execute that spider from a Python script, it normally (I mean for other forums) crawls all the threads of all the pages from the start URL.
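For reference, this is roughly how I run it from the script (a simplified sketch of my runner; the LOG_LEVEL setting is just something I happen to use and isn't part of the problem):

from scrapy.crawler import CrawlerProcess

# test_spider from above is defined/imported in the same script
process = CrawlerProcess(settings={'LOG_LEVEL': 'INFO'})

print(">>>>> Starting the scraping..... <<<<<<")
process.crawl(test_spider)
process.start()  # blocks until the crawl is finished
print(">>>>> End of scraping ! <<<<<")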

But for this one it does not: it scrapes only the threads of the first page, it gets the right URL for the second page, but it never calls the parse function again.

And of course, if I put all the page URLs in the start_urls list, it scrapes all the pages...

Thank you for the help.

Siktime
  • When you say it gets the right URL, are you saying `print(next_page)` shows the right URL? – Aankhen Jul 10 '18 at 11:55
  • Yes, that's exactly right, it gives me: https://casinogrounds.com/forum/search/?q=Casino&sortby=relevancy&search_and_or=or&page=2 – Siktime Jul 10 '18 at 12:07
  • And it doesn’t `print(self.count)` for that page? – Aankhen Jul 10 '18 at 12:13
  • No, with the URL from start_urls it shows self.count, then shows the right next URL thanks to print(next_page), but then the scraping ends. I should also mention that sometimes it goes to the second page and then ends from there. – Siktime Jul 10 '18 at 12:20
  • The last run gives me this: >>>>> Starting the scraping..... <<<<<< 0 https://casinogrounds.com/forum/search/?q=Casino&sortby=relevancy&search_and_or=or&page=2 0 https://casinogrounds.com/forum/search/?q=Casino&page=3&sortby=relevancy&search_and_or=or >>>>> End of scraping ! <<<<< – Siktime Jul 10 '18 at 12:21
  • Your code looks good. Can you show debug output? – gangabass Jul 10 '18 at 12:34
  • Where do I find it? I am executing that spider from a script, and I am new to web crawling :/ – Siktime Jul 10 '18 at 12:40
  • I found this line in the debug output (I ran it from the console to get it): 2018-07-10 14:59:24 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <429 https://casinogrounds.com/forum/search/?q=Casino&sortby=relevancy&search_and_or=or&page=2>: HTTP status code is not handled or not allowed – Siktime Jul 10 '18 at 13:03
  • Ah, that means the site is throttling your requests. Use [the AutoThrottle extension](https://doc.scrapy.org/en/latest/topics/autothrottle.html) to keep the frequency of your requests manageable. – Aankhen Jul 10 '18 at 13:10
  • I'll try that :) – Siktime Jul 10 '18 at 13:10
  • Thank you very much @Aankhen, that was it. I tried manually increasing the download delay in settings.py and it worked. I'll now enable AutoThrottle to get the optimal delay. Do you want to post it as an answer so I can accept it? – Siktime Jul 10 '18 at 13:20

1 Answer


The HTTP 429 response you’re getting means that the site is throttling your requests to avoid being overwhelmed. You can use the AutoThrottle extension to limit the frequency of your requests to what the site will allow.
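A minimal sketch of the relevant settings.py entries (the delay values here are just starting points to tune, not anything the site publishes):

# settings.py — AutoThrottle adapts the download delay to server load/latency
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5           # initial delay in seconds (a guess; tune as needed)
AUTOTHROTTLE_MAX_DELAY = 60            # upper bound when the server is slow or throttling
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # aim for one request at a time to this site
AUTOTHROTTLE_DEBUG = True              # log throttling stats for every response

# optionally retry 429 responses instead of ignoring them
RETRY_HTTP_CODES = [429, 500, 502, 503, 504, 522, 524, 408]

A fixed DOWNLOAD_DELAY also works, as you found by raising it manually, but AutoThrottle adjusts the delay to the server's actual response times instead of using one hard-coded value.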

Aankhen