Questions tagged [scrapy]

Scrapy is a fast, open-source, high-level screen scraping and web crawling framework written in Python, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining and monitoring to automated testing. Features of Scrapy include:

  • Designed with simplicity in mind
  • You only need to write the rules to extract the data from web pages; Scrapy crawls the entire website for you
  • Designed with extensibility in mind, so it provides several mechanisms to plug in new code without having to touch the framework core (see the pipeline sketch after this list)
  • Portable, open-source, 100% Python
  • Written in Python and runs on Linux, Windows, macOS, and BSD
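
As an illustration of that extensibility, here is a minimal item pipeline sketch. The class name and the settings line are hypothetical, but process_item is the standard hook Scrapy calls for every scraped item:

from scrapy.exceptions import DropItem

class RequireTextPipeline:
    # Hypothetical pipeline: discard items that lack a 'text' field
    # and normalize whitespace on the ones that pass.
    def process_item(self, item, spider):
        if not item.get('text'):
            raise DropItem('missing text field')
        item['text'] = item['text'].strip()
        return item

To activate it, register the class in your project's settings.py, e.g. ITEM_PIPELINES = {'myproject.pipelines.RequireTextPipeline': 300}.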

History:

Scrapy was born at Mydeco, a London-based web aggregation and e-commerce company, where it was developed and maintained by employees of Mydeco and Insophia (a web consulting company based in Montevideo, Uruguay). The first public release was in August 2008 under the BSD license. In 2011, Zyte (formerly Scrapinghub) became the new official maintainer, and the milestone 1.0 release followed in June 2015.


Installing Scrapy

You can install Scrapy and its dependencies from PyPI with:

pip install Scrapy

or to install Scrapy using conda, run:

conda install -c conda-forge scrapy
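
Either way, you can verify the installation by asking Scrapy for its version:

scrapy version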

Example

Here’s the code for a spider that scrapes famous quotes from the website http://quotes.toscrape.com, following the pagination:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Put this in a text file, name it something like quotes_spider.py, and run the spider using the runspider command:

scrapy runspider quotes_spider.py -o quotes.json
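
Before hard-coding selectors in a spider, you can test them interactively with Scrapy's built-in shell. A short session against the same page (the returned strings below are illustrative placeholders, not captured output):

scrapy shell 'http://quotes.toscrape.com/tag/humor/'

>>> response.css('div.quote span.text::text').get()
'...first quote on the page...'
>>> response.css('li.next a::attr("href")').get()
'/tag/humor/page/2/'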



Architecture

Scrapy contains multiple components working together in an event-driven architecture. The main components are the Engine, Spiders, the Scheduler, and the Downloader. The data flow between these components is described in detail in the official documentation.
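
Because the Engine drives everything through that event loop, a spider can also be run from a plain Python script rather than the scrapy command. A minimal sketch, assuming the QuotesSpider class above lives in quotes_spider.py and that you are on Scrapy 2.1+ (required for the FEEDS setting):

from scrapy.crawler import CrawlerProcess
from quotes_spider import QuotesSpider  # the spider defined above

process = CrawlerProcess(settings={
    # Export scraped items to quotes.json (FEEDS requires Scrapy 2.1+).
    'FEEDS': {'quotes.json': {'format': 'json'}},
})
process.crawl(QuotesSpider)
process.start()  # blocks until the crawl finishes and the reactor stops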


Online resources:

  • Official website: https://scrapy.org
  • Documentation: https://docs.scrapy.org
  • Source code: https://github.com/scrapy/scrapy
