
Every time I run my code my IP gets banned. I need to delay each request by 10 seconds. I've tried placing DOWNLOAD_DELAY in the code, but it has no effect. Any help is appreciated.

# item class included here
import re

import scrapy


class DmozItem(scrapy.Item):
    # define the fields for your item here like:
    link = scrapy.Field()
    attr = scrapy.Field()


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["craigslist.org"]
    start_urls = [
        "https://washingtondc.craigslist.org/search/fua"
    ]

    BASE_URL = 'https://washingtondc.craigslist.org/'

    def parse(self, response):
        # follow each listing link found on the search results page
        links = response.xpath('//a[@class="hdrlnk"]/@href').extract()
        for link in links:
            absolute_url = self.BASE_URL + link
            yield scrapy.Request(absolute_url, callback=self.parse_attr)

    def parse_attr(self, response):
        # extract the listing id from the URL and request the contact page
        match = re.search(r"(\w+)\.html", response.url)
        if match:
            item_id = match.group(1)
            url = self.BASE_URL + "reply/nos/vgm/" + item_id

            item = DmozItem()
            item["link"] = response.url

            return scrapy.Request(url, meta={'item': item}, callback=self.parse_contact)

    def parse_contact(self, response):
        # pull the anonymised email address out of the contact page
        item = response.meta['item']
        item["attr"] = "".join(response.xpath("//div[@class='anonemail']//text()").extract())
        return item
Arkan Kalu
  • Try this before your request: time.sleep(10) – Ajay May 22 '15 at 19:21
  • Where should I put time.sleep() exactly? – Arkan Kalu May 22 '15 at 21:29
  • Maybe after this line, I guess: absolute_url = self.BASE_URL + link – Ajay May 22 '15 at 21:43
  • I like scraping in general, but I'm also mostly in favour of content owners being able to IP block scrapers if they wish. If your project is based entirely on scraping Craigslist, bear in mind you might encounter legal as well as technical restrictions, and that you may be forced to gather your data from elsewhere. – halfer May 22 '15 at 21:50
  • Where did you try to put DOWNLOAD_DELAY? You should put it in `settings.py` if you're in a Scrapy project. If you're running the spider with `scrapy runspider file.py`, you need to use `-s DOWNLOAD_DELAY=10` on the command line (see the sketch after these comments). – Elias Dorneles May 23 '15 at 01:34
  • Don't use time.sleep(); use DOWNLOAD_DELAY instead. If you use time.sleep, the asynchronous functionality of Twisted (Scrapy's underlying library) will not work properly. If even that doesn't work, you can use ScraperAPI or another third-party proxy provider. That basically makes it look like many people are each downloading small amounts of data instead of you downloading everything. – Anmol Deep Mar 03 '23 at 11:05
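
Following up on the `scrapy runspider` comment above, a minimal sketch of the command-line form, assuming the spider lives in a standalone script (the file name `file.py` is just a placeholder):

scrapy runspider file.py -s DOWNLOAD_DELAY=10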

1 Answer


You need to set DOWNLOAD_DELAY in settings.py of your project. Note that you may also need to limit concurrency: by default it is 8, so you are hitting the website with 8 simultaneous requests.

# settings.py
DOWNLOAD_DELAY = 1
CONCURRENT_REQUESTS_PER_DOMAIN = 2

Starting with Scrapy 1.0 you can also place custom settings in the spider, so you could do something like this:

from scrapy import Spider


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    custom_settings = {
        "DOWNLOAD_DELAY": 5,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 2
    }

Delay and concurrency are set per downloader slot, not per request. To check what delay you actually get, you could try something like this:

def parse(self, response):
    """
    """
    delay = self.crawler.engine.downloader.slots["www.dmoz.org"].delay
    concurrency = self.crawler.engine.downloader.slots["www.dmoz.org"].concurrency
    self.log("Delay {}, concurrency {} for request {}".format(delay, concurrency, response.request))
    return
Pawel Miech
  • Just to note that it's possible to configure `download_delay` per spider even in version 0.24, as stated in the URL you linked to: `You can also change this setting per spider by setting download_delay spider attribute.` – bosnjak May 23 '15 at 12:02
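
A minimal sketch of the spider-attribute approach described in the comment above, reusing the spider name and start URL from the question (the logging in parse is only illustrative):

import scrapy


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["craigslist.org"]
    start_urls = ["https://washingtondc.craigslist.org/search/fua"]

    # per-spider delay in seconds, equivalent to DOWNLOAD_DELAY but applied to this spider only
    download_delay = 10

    def parse(self, response):
        self.log("Fetched {}".format(response.url))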