Questions tagged [scrapy]

Scrapy is a fast, open-source, high-level screen scraping and web crawling framework written in Python, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining and monitoring to automated testing. Features of Scrapy include:

  • Designed with simplicity in mind
  • You only need to write the rules to extract the data from web pages; Scrapy crawls the entire website for you
  • Designed with extensibility in mind, so it provides several mechanisms to plug in new code without having to touch the framework core (see the pipeline sketch after this list)
  • Portable, open-source, 100% Python
  • Written in Python and runs on Linux, Windows, macOS, and BSD
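
As an illustration of that extensibility, here is a minimal item pipeline sketch. The class name and the settings line are hypothetical, but process_item is the standard hook Scrapy calls for every scraped item:

from scrapy.exceptions import DropItem

class RequireTextPipeline:
    # Hypothetical pipeline: discard items that lack a 'text' field
    # and normalize whitespace on the ones that pass.
    def process_item(self, item, spider):
        if not item.get('text'):
            raise DropItem('missing text field')
        item['text'] = item['text'].strip()
        return item

To activate it, register the class in your project's settings.py, e.g. ITEM_PIPELINES = {'myproject.pipelines.RequireTextPipeline': 300}.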

History:

Scrapy was born at Mydeco, a London-based web aggregation and e-commerce company, where it was developed and maintained by employees of Mydeco and Insophia (a web consulting company based in Montevideo, Uruguay). The first public release was in August 2008 under the BSD license. In 2011, Zyte (formerly Scrapinghub) became the new official maintainer, and the milestone 1.0 release followed in June 2015.


Installing Scrapy

You can install Scrapy and its dependencies from PyPI with:

pip install Scrapy

or to install Scrapy using conda, run:

conda install -c conda-forge scrapy
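
Either way, you can verify the installation by asking Scrapy for its version:

scrapy version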

Example

Here’s the code for a spider that scrapes famous quotes from the website http://quotes.toscrape.com, following the pagination:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Put this in a text file, name it something like quotes_spider.py, and run the spider using the runspider command:

scrapy runspider quotes_spider.py -o quotes.json
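
Before hard-coding selectors in a spider, you can test them interactively with Scrapy's built-in shell. A short session against the same page (the returned strings below are illustrative placeholders, not captured output):

scrapy shell 'http://quotes.toscrape.com/tag/humor/'

>>> response.css('div.quote span.text::text').get()
'...first quote on the page...'
>>> response.css('li.next a::attr("href")').get()
'/tag/humor/page/2/'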



Architecture

Scrapy contains multiple components working together in an event-driven architecture. The main components are the Engine, Spiders, the Scheduler, and the Downloader. The data flow between these components is described in detail in the official documentation.
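
Because the Engine drives everything through that event loop, a spider can also be run from a plain Python script rather than the scrapy command. A minimal sketch, assuming the QuotesSpider class above lives in quotes_spider.py and that you are on Scrapy 2.1+ (required for the FEEDS setting):

from scrapy.crawler import CrawlerProcess
from quotes_spider import QuotesSpider  # the spider defined above

process = CrawlerProcess(settings={
    # Export scraped items to quotes.json (FEEDS requires Scrapy 2.1+).
    'FEEDS': {'quotes.json': {'format': 'json'}},
})
process.crawl(QuotesSpider)
process.start()  # blocks until the crawl finishes and the reactor stops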


Online resources:

  • Official website: https://scrapy.org
  • Documentation: https://docs.scrapy.org
  • Source code: https://github.com/scrapy/scrapy
