
To be clear, I'm trying to crawl casino forums. So far I have succeeded in doing so using the same scheme as below:

import scrapy
from urllib.parse import urlparse
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule


class test_spider(scrapy.Spider):
    count = 0

    name = "test_spyder"

    start_urls = [
        'https://casinogrounds.com/forum/search/?&q=Casino&search_and_or=or&sortby=relevancy',
    ]

    # Note: rules are only honoured by CrawlSpider, so this attribute is ignored on a
    # plain scrapy.Spider; pagination is handled manually in parse() below.
    rules = (Rule(LinkExtractor(restrict_css=('a:contains("Next")::attr(href)')), callback='parse'),)

    def parse(self, response):
        print(self.count)
        # follow every thread link on the current search-results page
        for href in response.css("span.ipsType_break.ipsContained a::attr(href)"):
            new_url = response.urljoin(href.extract())
            # print(new_url)
            yield scrapy.Request(new_url, callback=self.parse_review)

        # follow the "Next" link to the next page of search results
        next_page = response.css('a:contains("Next")::attr(href)').extract_first()
        print(next_page)
        if next_page is not None:
            yield scrapy.Request(next_page, callback=self.parse)

    def parse_review(self, response):
        parsed_uri = urlparse(response.url)
        domain = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
        # one item per post in the thread
        for review in response.css('article.cPost.ipsBox.ipsComment.ipsComment_parent.ipsClearfix.ipsClear.ipsColumns.ipsColumns_noSpacing.ipsColumns_collapsePhone'):
            yield {
                'name': review.css('strong a.ipsType_break::text').extract_first(),
                'date': review.css('time::attr(title)').extract_first(),
                'review': review.css('p::text').extract(),
                'url': response.url
            }

        # paginate within the thread itself
        next_page = response.css('li.ipsPagination_next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse_review)

So when I execute that spider from a Python script, it normally (I mean for other forums) crawls all the threads of all the pages from the start URL.
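For reference, this is roughly how I run it from the script (a simplified sketch of my runner; the LOG_LEVEL setting is just something I happen to use and isn't part of the problem):

from scrapy.crawler import CrawlerProcess

# test_spider from above is defined/imported in the same script
process = CrawlerProcess(settings={'LOG_LEVEL': 'INFO'})

print(">>>>> Starting the scraping..... <<<<<<")
process.crawl(test_spider)
process.start()  # blocks until the crawl is finished
print(">>>>> End of scraping ! <<<<<")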

But for this one it does not: it scrapes only the threads of the first page, it gets the right URL for the second page, but it never calls the parse function again.

And of course, if I put all the page URLs in the start_urls list, it scrapes all the pages...

Thank you for the help.

Siktime
  • When you say it gets the right URL, are you saying `print(next_page)` shows the right URL? – Aankhen Jul 10 '18 at 11:55
  • Yes, that's exactly right, it gives me: https://casinogrounds.com/forum/search/?q=Casino&sortby=relevancy&search_and_or=or&page=2 – Siktime Jul 10 '18 at 12:07
  • And it doesn’t `print(self.count)` for that page? – Aankhen Jul 10 '18 at 12:13
  • No, with the URL from start_urls it shows self.count, then shows the right next URL thanks to print(next_page), but then the scraping ends. I should also mention that sometimes it goes to the second page and then ends from there. – Siktime Jul 10 '18 at 12:20
  • The last run gives me this: >>>>> Starting the scraping..... <<<<<< 0 https://casinogrounds.com/forum/search/?q=Casino&sortby=relevancy&search_and_or=or&page=2 0 https://casinogrounds.com/forum/search/?q=Casino&page=3&sortby=relevancy&search_and_or=or >>>>> End of scraping ! <<<<< – Siktime Jul 10 '18 at 12:21
  • Your code looks good. Can you show debug output? – gangabass Jul 10 '18 at 12:34
  • Where do I find it? I am executing that spider from a script, and I am new to web crawling :/ – Siktime Jul 10 '18 at 12:40
  • I found this line in the debug output (I ran it from the console to get it): 2018-07-10 14:59:24 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <429 https://casinogrounds.com/forum/search/?q=Casino&sortby=relevancy&search_and_or=or&page=2>: HTTP status code is not handled or not allowed – Siktime Jul 10 '18 at 13:03
  • Ah, that means the site is throttling your requests. Use [the AutoThrottle extension](https://doc.scrapy.org/en/latest/topics/autothrottle.html) to keep the frequency of your requests manageable. – Aankhen Jul 10 '18 at 13:10
  • I'll try that :) – Siktime Jul 10 '18 at 13:10
  • Thank you very much @Aankhen, that was it. I tried manually increasing the download delay in settings.py and it worked. I'll now enable AutoThrottle to get the optimal delay. Do you want to post it as an answer so I can accept it? – Siktime Jul 10 '18 at 13:20

1 Answer


The HTTP 429 response you’re getting means that the site is throttling your requests to avoid being overwhelmed. You can use the AutoThrottle extension to limit the frequency of your requests to what the site will allow.
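A minimal sketch of the relevant settings.py entries (the delay values here are just starting points to tune, not anything the site publishes):

# settings.py — AutoThrottle adapts the download delay to server load/latency
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5           # initial delay in seconds (a guess; tune as needed)
AUTOTHROTTLE_MAX_DELAY = 60            # upper bound when the server is slow or throttling
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # aim for one request at a time to this site
AUTOTHROTTLE_DEBUG = True              # log throttling stats for every response

# optionally retry 429 responses instead of ignoring them
RETRY_HTTP_CODES = [429, 500, 502, 503, 504, 522, 524, 408]

A fixed DOWNLOAD_DELAY also works, as you found by raising it manually, but AutoThrottle adjusts the delay to the server's actual response times instead of using one hard-coded value.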

Aankhen