Questions tagged [scrapy]

Scrapy is an open-source, high-level screen-scraping and web-crawling framework written in Python, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data extraction to monitoring and automated testing. Despite issuing many requests concurrently, Scrapy is not multi-threaded: it is built on Twisted and handles concurrency asynchronously in a single thread.

Features of Scrapy include:

  • Designed with simplicity in mind
  • You only need to write the rules to extract the data from web pages; Scrapy crawls the entire site for you
  • Designed with extensibility in mind, providing several mechanisms to plug in new code without touching the framework core
  • Portable, open-source, 100% Python
  • Written in Python and runs on Linux, Windows, macOS, and BSD

History:

Scrapy was born at London-based web-aggregation and e-commerce company Mydeco, where it was developed and maintained by employees of Mydeco and Insophia (a web-consulting company based in Montevideo, Uruguay). The first public release was in August 2008 under the BSD license, with a milestone 1.0 release happening in June 2015. In 2011, Zyte (formerly Scrapinghub) became the new official maintainer.


Installing Scrapy

You can install Scrapy and its dependencies from PyPI with:

pip install Scrapy

or to install Scrapy using conda, run:

conda install -c conda-forge scrapy

Example

Here’s the code for a spider that scrapes famous quotes from the website http://quotes.toscrape.com, following the pagination:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Put this in a text file, name it something like quotes_spider.py, and run the spider using the runspider command:

scrapy runspider quotes_spider.py -o quotes.json
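The -o quotes.json option collects every item the spider yields and serializes them as a single JSON array. As a rough sketch of that serialization (the quote texts below are made-up placeholders, not real output from the site):

```python
import json

# Items as the spider's parse() callback would yield them:
# plain dicts with the extracted fields (placeholder values here).
items = [
    {"text": "An example quote.", "author": "Author One"},
    {"text": "Another example quote.", "author": "Author Two"},
]

# The JSON feed exporter writes all collected items as one JSON array.
feed = json.dumps(items, indent=2)
print(feed)
```

Scrapy also supports the JSON Lines format (for example `-o quotes.jsonl`), which writes one JSON object per line and is better suited to large or incrementally written feeds.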



Architecture

Scrapy consists of multiple components working together in an event-driven architecture. The main components are the Engine, the Spiders, the Scheduler, and the Downloader. The data flow between these components is described in detail in the official documentation.
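To make the division of labour concrete, here is a deliberately simplified, single-threaded sketch of that data flow in plain Python. The real Engine drives everything through Twisted's asynchronous event loop; the function names and URLs below are illustrative, not Scrapy's actual API:

```python
from collections import deque

# Toy stand-ins for Scrapy's components; names are illustrative only.

def downloader(url):
    """Downloader: pretend to fetch a page, returning a fake response."""
    return {"url": url, "body": f"<html>page at {url}</html>"}

def spider_parse(response):
    """Spider callback: yield extracted items and follow-up requests."""
    yield {"item": {"seen": response["url"]}}
    if response["url"] == "http://example.com/page1":
        yield {"request": "http://example.com/page2"}

def crawl(start_url):
    scheduler = deque([start_url])   # Scheduler: queue of pending requests
    items = []
    while scheduler:                 # Engine: drive the loop until idle
        url = scheduler.popleft()    # Engine asks the Scheduler for a request
        response = downloader(url)   # Downloader fetches it
        for result in spider_parse(response):  # response goes to the Spider
            if "item" in result:
                items.append(result["item"])         # items flow onward
            else:
                scheduler.append(result["request"])  # new requests go back
    return items

print(crawl("http://example.com/page1"))
```

Here the loop is synchronous for clarity; in Scrapy the Engine interleaves these steps so that many downloads are in flight at once within one thread.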



17743 questions

Highly voted questions tagged [scrapy]:

  • Scrapy: AttributeError: 'list' object has no attribute 'iteritems' (20 votes, 2 answers)
  • Set headers for scrapy shell request (20 votes, 1 answer)
  • Relative URL to absolute URL Scrapy (20 votes, 1 answer)
  • How do Scrapy rules work with crawl spider (20 votes, 3 answers)
  • How to set different scrapy-settings for different spiders? (20 votes, 5 answers)
  • Following hyperlink and "Filtered offsite request" (20 votes, 2 answers)
  • Speed up web scraper (20 votes, 4 answers)
  • Scrapy crawl from script always blocks script execution after scraping (20 votes, 2 answers)
  • Scrapy: How to print request referrer (20 votes, 2 answers)
  • Scrapy and response status code: how to check against it? (19 votes, 2 answers)
  • Access session cookie in scrapy spiders (19 votes, 4 answers)
  • How do I set up Scrapy to deal with a captcha (19 votes, 1 answer)
  • An array field in scrapy.Item (19 votes, 1 answer)
  • Is Scrapy single-threaded or multi-threaded? (19 votes, 4 answers)
  • how to handle 302 redirect in scrapy (19 votes, 6 answers)