Questions tagged [scrapy]

Scrapy is an open-source, high-level screen-scraping and web-crawling framework written in Python, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data extraction to monitoring and automated testing. Despite issuing many requests concurrently, Scrapy is not multi-threaded: it is built on Twisted and handles concurrency asynchronously in a single thread.

Features of Scrapy include:

  • Designed with simplicity in mind
  • You only need to write the rules to extract the data from web pages; Scrapy crawls the entire site for you
  • Designed with extensibility in mind, providing several mechanisms to plug in new code without touching the framework core
  • Portable, open-source, 100% Python
  • Written in Python and runs on Linux, Windows, macOS, and BSD

History:

Scrapy was born at London-based web-aggregation and e-commerce company Mydeco, where it was developed and maintained by employees of Mydeco and Insophia (a web-consulting company based in Montevideo, Uruguay). The first public release was in August 2008 under the BSD license, with a milestone 1.0 release happening in June 2015. In 2011, Zyte (formerly Scrapinghub) became the new official maintainer.


Installing Scrapy

You can install Scrapy and its dependencies from PyPI with:

pip install Scrapy

or to install Scrapy using conda, run:

conda install -c conda-forge scrapy

Example

Here’s the code for a spider that scrapes famous quotes from the website http://quotes.toscrape.com, following the pagination:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Put this in a text file, name it something like quotes_spider.py, and run the spider using the runspider command:

scrapy runspider quotes_spider.py -o quotes.json
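The -o quotes.json option collects every item the spider yields and serializes them as a single JSON array. As a rough sketch of that serialization (the quote texts below are made-up placeholders, not real output from the site):

```python
import json

# Items as the spider's parse() callback would yield them:
# plain dicts with the extracted fields (placeholder values here).
items = [
    {"text": "An example quote.", "author": "Author One"},
    {"text": "Another example quote.", "author": "Author Two"},
]

# The JSON feed exporter writes all collected items as one JSON array.
feed = json.dumps(items, indent=2)
print(feed)
```

Scrapy also supports the JSON Lines format (for example `-o quotes.jsonl`), which writes one JSON object per line and is better suited to large or incrementally written feeds.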



Architecture

Scrapy consists of multiple components working together in an event-driven architecture. The main components are the Engine, the Spiders, the Scheduler, and the Downloader. The data flow between these components is described in detail in the official documentation.
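To make the division of labour concrete, here is a deliberately simplified, single-threaded sketch of that data flow in plain Python. The real Engine drives everything through Twisted's asynchronous event loop; the function names and URLs below are illustrative, not Scrapy's actual API:

```python
from collections import deque

# Toy stand-ins for Scrapy's components; names are illustrative only.

def downloader(url):
    """Downloader: pretend to fetch a page, returning a fake response."""
    return {"url": url, "body": f"<html>page at {url}</html>"}

def spider_parse(response):
    """Spider callback: yield extracted items and follow-up requests."""
    yield {"item": {"seen": response["url"]}}
    if response["url"] == "http://example.com/page1":
        yield {"request": "http://example.com/page2"}

def crawl(start_url):
    scheduler = deque([start_url])   # Scheduler: queue of pending requests
    items = []
    while scheduler:                 # Engine: drive the loop until idle
        url = scheduler.popleft()    # Engine asks the Scheduler for a request
        response = downloader(url)   # Downloader fetches it
        for result in spider_parse(response):  # response goes to the Spider
            if "item" in result:
                items.append(result["item"])         # items flow onward
            else:
                scheduler.append(result["request"])  # new requests go back
    return items

print(crawl("http://example.com/page1"))
```

Here the loop is synchronous for clarity; in Scrapy the Engine interleaves these steps so that many downloads are in flight at once within one thread.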



17743 questions

Highly voted questions tagged [scrapy]:

  • Scrapy: AttributeError: 'list' object has no attribute 'iteritems' (20 votes, 2 answers)
  • Set headers for scrapy shell request (20 votes, 1 answer)
  • Relative URL to absolute URL Scrapy (20 votes, 1 answer)
  • How do Scrapy rules work with crawl spider (20 votes, 3 answers)
  • How to set different scrapy-settings for different spiders? (20 votes, 5 answers)
  • Following hyperlink and "Filtered offsite request" (20 votes, 2 answers)
  • Speed up web scraper (20 votes, 4 answers)
  • Scrapy crawl from script always blocks script execution after scraping (20 votes, 2 answers)
  • Scrapy: How to print request referrer (20 votes, 2 answers)
  • Scrapy and response status code: how to check against it? (19 votes, 2 answers)
  • Access session cookie in scrapy spiders (19 votes, 4 answers)
  • How do I set up Scrapy to deal with a captcha (19 votes, 1 answer)
  • An array field in scrapy.Item (19 votes, 1 answer)
  • Is Scrapy single-threaded or multi-threaded? (19 votes, 4 answers)
  • how to handle 302 redirect in scrapy (19 votes, 6 answers)