
For each of several Disqus users, whose profile URLs are known in advance, I want to scrape their names and the usernames of their followers. I'm using Scrapy and Splash to do so. However, when I parse the responses, it seems that the spider always scrapes the page of the first user. I tried setting wait to 10 and dont_filter to True, but it isn't working. What should I do now?

Here is my spider:

import scrapy
from disqus.items import DisqusItem

class DisqusSpider(scrapy.Spider):
    name = "disqusSpider"
    start_urls = ["https://disqus.com/by/disqus_sAggacVY39/", "https://disqus.com/by/VladimirUlayanov/", "https://disqus.com/by/Beasleyhillman/", "https://disqus.com/by/Slick312/"]
    splash_def = {"endpoint" : "render.html", "args" : {"wait" : 10}}

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url = url, callback = self.parse_basic, dont_filter = True, meta = {
                "splash" : self.splash_def,
                "base_profile_url" : url
            })

    def parse_basic(self, response):
        name = response.css("h1.cover-profile-name.text-largest.truncate-line::text").extract_first()
        disqusItem = DisqusItem(name = name)
        request = scrapy.Request(url = response.meta["base_profile_url"] + "followers/", callback = self.parse_followers, dont_filter = True, meta = {
            "item" : disqusItem,
            "base_profile_url" : response.meta["base_profile_url"],
            "splash": self.splash_def
        })
        print "parse_basic", response.url, request.url
        yield request

    def parse_followers(self, response):
        print "parse_followers", response.meta["base_profile_url"], response.meta["item"]
        followers = response.css("div.user-info a::attr(href)").extract()

DisqusItem is defined as follows:

class DisqusItem(scrapy.Item):
    name = scrapy.Field()
    followers = scrapy.Field()

Here are the results:

2017-08-07 23:09:12 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://localhost:8050/render.html> (referer: None)
parse_followers https://disqus.com/by/disqus_sAggacVY39/ {'name': u'Trailer Trash'}
2017-08-07 23:09:14 [scrapy.extensions.logstats] INFO: Crawled 5 pages (at 5 pages/min), scraped 0 items (at 0 items/min)
2017-08-07 23:09:18 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://localhost:8050/render.html> (referer: None)
parse_followers https://disqus.com/by/VladimirUlayanov/ {'name': u'Trailer Trash'}
2017-08-07 23:09:27 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://localhost:8050/render.html> (referer: None)
parse_followers https://disqus.com/by/Beasleyhillman/ {'name': u'Trailer Trash'}
2017-08-07 23:09:40 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://localhost:8050/render.html> (referer: None)
parse_followers https://disqus.com/by/Slick312/ {'name': u'Trailer Trash'}

Here is the file settings.py:

# -*- coding: utf-8 -*-

# Scrapy settings for disqus project
#

BOT_NAME = 'disqus'

SPIDER_MODULES = ['disqus.spiders']
NEWSPIDER_MODULE = 'disqus.spiders'

ROBOTSTXT_OBEY = False

SPLASH_URL = 'http://localhost:8050' 

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'
DUPEFILTER_DEBUG = True

DOWNLOAD_DELAY = 10
Gallaecio
Milos
    Looks like splash is giving scrapy the same initially rendered html regardless of the url. Does adding a wait help: `'args':{'wait': 2}` (2 seconds)? – alecxe Aug 07 '17 at 21:34
  • I have `"args" : {"wait" : 10}` already. Take a look at the attribute `splash_def`. However, it doesn't seem to be working. – Milos Aug 07 '17 at 22:03
  • Ah, yes, missed that, thanks. Don't have a way to debug at this point, will check back later if nobody will help. – alecxe Aug 07 '17 at 22:05
  • Ok. Thanks for taking a look at this post. – Milos Aug 07 '17 at 22:07
  • Could you try replacing `"splash" : self.splash_def` with `"splash": self.spash_def.copy()` everywhere? – Mikhail Korobov Aug 09 '17 at 09:50
  • I tried it, but it didn't work, unfortunately. – Milos Aug 09 '17 at 13:30
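
Mikhail's `.copy()` suggestion targets a real hazard in the original spider: every request's `meta` holds a reference to the *same* `splash_def` dict, so a mutation made on behalf of one request is visible to all of them. A minimal, self-contained illustration (the URLs and the `"tag"` key are placeholders, not part of the spider):

```python
import copy

# The spider's class-level splash settings, shared by every request.
splash_def = {"endpoint": "render.html", "args": {"wait": 10}}

# Each meta dict references the SAME splash_def object.
metas = [{"splash": splash_def, "base_profile_url": url}
         for url in ["https://disqus.com/by/a/", "https://disqus.com/by/b/"]]

# Mutating the args for "one" request leaks into the other:
metas[0]["splash"]["args"]["tag"] = "first"
print(metas[1]["splash"]["args"]["tag"])  # first (shared dict)

# Note that dict.copy() is shallow: the nested "args" dict stays shared,
# which is why a per-request copy.deepcopy() is the safer variant.
shallow = splash_def.copy()
print(shallow["args"] is splash_def["args"])  # True

deep = copy.deepcopy(splash_def)
print(deep["args"] is splash_def["args"])  # False
```

This alone didn't fix the crawl (per the comment above), but it explains why sharing one mutable settings dict across requests is worth avoiding.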

1 Answer


I was able to get it to work using SplashRequest instead of scrapy.Request.

For example:

import scrapy
from disqus.items import DisqusItem
from scrapy_splash import SplashRequest


class DisqusSpider(scrapy.Spider):
    name = "disqusSpider"
    start_urls = ["https://disqus.com/by/disqus_sAggacVY39/", "https://disqus.com/by/VladimirUlayanov/", "https://disqus.com/by/Beasleyhillman/", "https://disqus.com/by/Slick312/"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse_basic, dont_filter = True, endpoint='render.json',
                        args={
                            'wait': 2,
                            'html': 1
                        })
P. Vaden
  • Thank you! I will test your solution tomorrow. It would be even better if you explained the changes and differences in arguments. :) – Milos Aug 11 '17 at 00:23
  • I just changed scrapy.Request to SplashRequest, which is the preferred method for splash requests. endpoint='render.json' means that it returns a json encoded dictionary, and 'html': 1 means it also gets the html. It would also work if you just set endpoint='render.html'. Let me know if this works for you, I'm not exactly sure why it makes a difference. – P. Vaden Aug 11 '17 at 17:02
  • Your solution works, so I accepted the answer. :) Thanks. :) – Milos Aug 12 '17 at 23:36
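
For reference, the `endpoint='render.json'` plus `'html': 1` combination described in the comment above makes Splash return a JSON document rather than raw HTML. A rough sketch of that payload's shape (the values below are illustrative, not captured from a real crawl; key names follow the Splash HTTP API):

```python
import json

# Hypothetical render.json response body: with 'html': 1, Splash includes
# the rendered HTML alongside metadata keys such as "url" and "title".
raw_body = json.dumps({
    "url": "https://disqus.com/by/Slick312/followers/",
    "title": "Disqus Profile",
    "html": "<html><body><div class='user-info'>"
            "<a href='/by/someone/'>someone</a></div></body></html>",
})

payload = json.loads(raw_body)
print(sorted(payload))                  # ['html', 'title', 'url']
print("user-info" in payload["html"])   # True
```

With scrapy_splash, the callback still receives a response whose selectors work against the rendered HTML, so the question's `response.css(...)` extraction code can stay as-is after switching to `SplashRequest`.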