Questions tagged [scrapy]

Scrapy is a fast, open-source, high-level screen scraping and web crawling framework written in Python, used to crawl websites and extract structured data from their pages. It is not multi-threaded; instead it is built on the Twisted asynchronous networking library, which lets it handle many requests concurrently in a single thread. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Features of Scrapy include:

  • Designed with simplicity in mind
  • You only need to write the rules to extract the data from web pages, and Scrapy crawls the entire website for you
  • Designed with extensibility in mind, providing several mechanisms to plug in new code without touching the framework core
  • Portable, open-source, 100% Python
  • Written in Python and runs on Linux, Windows, macOS, and BSD.

History:

Scrapy was born at London-based web-aggregation and e-commerce company Mydeco, where it was developed and maintained by employees of Mydeco and Insophia (a web-consulting company based in Montevideo, Uruguay). The first public release was in August 2008 under the BSD license, with a milestone 1.0 release happening in June 2015. In 2011, Zyte (formerly Scrapinghub) became the new official maintainer.


Installing Scrapy

You can install Scrapy and its dependencies from PyPI with:

pip install Scrapy

or to install Scrapy using conda, run:

conda install -c conda-forge scrapy

Example

Here’s the code for a spider that scrapes famous quotes from the website http://quotes.toscrape.com, following the pagination:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Put this in a text file, name it something like quotes_spider.py, and run the spider using the runspider command:

scrapy runspider quotes_spider.py -o quotes.json


Architecture

Scrapy contains multiple components working together in an event-driven architecture. The main components are the Engine, Spiders, the Scheduler, and the Downloader. The data flow between these components is described in detail in the official documentation.
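That data flow can be illustrated with a toy, single-threaded model. This is plain Python with no Scrapy code: Engine, Scheduler, and Downloader here are simplified stand-ins for the real components, and the URLs are placeholders:

```python
from collections import deque


class Scheduler:
    """Queues requests the engine hands it and serves them back one at a time."""
    def __init__(self):
        self.queue = deque()

    def enqueue(self, request):
        self.queue.append(request)

    def next_request(self):
        return self.queue.popleft() if self.queue else None


def downloader(request):
    """Stand-in for the real Downloader: pretend to fetch the URL."""
    return {'url': request, 'body': f'<html>page for {request}</html>'}


def spider_parse(response):
    """Stand-in for a Spider callback: yield an item and maybe a new request."""
    yield {'item': response['url']}
    if response['url'] == 'http://example.com/page1':
        yield 'http://example.com/page2'       # a follow-up request


def engine(start_urls):
    """The Engine drives the loop: scheduler -> downloader -> spider -> scheduler."""
    scheduler = Scheduler()
    items = []
    for url in start_urls:
        scheduler.enqueue(url)
    while (request := scheduler.next_request()) is not None:
        response = downloader(request)
        for result in spider_parse(response):
            if isinstance(result, dict):
                items.append(result)           # item -> item pipeline
            else:
                scheduler.enqueue(result)      # new request -> back to scheduler
    return items


print(engine(['http://example.com/page1']))   # items for page1 and the follow-up page2
```

In real Scrapy the loop is asynchronous (driven by Twisted) and middlewares sit between the components, but the circular request/response/item flow is the same.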



17743 questions
34 votes, 8 answers

Access django models inside of Scrapy

Is it possible to access my django models inside of a Scrapy pipeline, so that I can save my scraped data straight to my model? I've seen this, but I don't really get how to set it up?
imns
34 votes, 6 answers

Scrapy throws ImportError: cannot import name xmlrpc_client

After install Scrapy via pip, and having Python 2.7.10: scrapy Traceback (most recent call last): File "/usr/local/bin/scrapy", line 7, in from scrapy.cmdline import execute File "/Library/Python/2.7/site-packages/scrapy/__init__.py", line…
Ignasi
34 votes, 3 answers

Scrapy: how to disable or change log?

I've followed the official tutoral of Scrapy, it's wonderful! I'd like to remove all of DEBUG messages from console output. Is there a way? 2013-06-08 14:51:48+0000 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6029 2013-06-08 14:51:48+0000…
realtebo
34 votes, 4 answers

How to access scrapy settings from item Pipeline

How do I access the scrapy settings in settings.py from the item pipeline. The documentation mentions it can be accessed through the crawler in extensions, but I don't see how to access the crawler in the pipelines.
avaleske
33 votes, 4 answers

Crawling with an authenticated session in Scrapy

In my previous question, I wasn't very specific over my problem (scraping with an authenticated session with Scrapy), in the hopes of being able to deduce the solution from a more general answer. I should probably rather have used the word…
Herman Schaaf
32 votes, 6 answers

Best way for a beginner to learn screen scraping by Python

This might be one of those questions that are difficult to answer, but here goes: I don't consider my self programmer - but I would like to :-) I've learned R, because I was sick and tired of spss, and because a friend introduced me to the language…
Andreas
32 votes, 3 answers

InterfaceError: connection already closed (using django + celery + Scrapy)

I am getting this when using a Scrapy parsing function (that can take till 10 minutes sometimes) inside a Celery task. I use: - Django==1.6.5 - django-celery==3.1.16 - celery==3.1.16 - psycopg2==2.5.5 (I used also psycopg2==2.5.4) [2015-07-19…
mou55
31 votes, 3 answers

Send Post Request in Scrapy

I am trying to crawl the latest reviews from google play store and to get that I need to make a post request. With the Postman, it works and I get desired response. but a post request in terminal gives me a server error For ex: this page…
Amit Tripathi
31 votes, 9 answers

suppress Scrapy Item printed in logs after pipeline

I have a scrapy project where the item that ultimately enters my pipeline is relatively large and stores lots of metadata and content. Everything is working properly in my spider and pipelines. The logs, however, are printing out the entire scrapy…
dino
31 votes, 7 answers

scraping the file with html saved in local system

For example i had a site "www.example.com" Actually i want to scrape the html of this site by saving on to local system. so for testing i saved that page on my desktop as example.html Now i had written the spider code for this as below class…
Shiva Krishna Bavandla
30 votes, 5 answers

MongoDB InvalidDocument: Cannot encode object

I am using scrapy to scrap blogs and then store the data in mongodb. At first i got the InvalidDocument Exception. So obvious to me is that the data is not in the right encoding. So before persisting the object, in my MongoPipeline i check if the…
Codious-JR
30 votes, 3 answers

scrapy: convert html string to HtmlResponse object

I have a raw html string that I want to convert to scrapy HTML response object so that I can use the selectors css and xpath, similar to scrapy's response. How can I do it?
yayu
30 votes, 3 answers

scrapy - parsing items that are paginated

I have a url of the form: example.com/foo/bar/page_1.html There are a total of 53 pages, each one of them has ~20 rows. I basically want to get all the rows from all the pages, i.e. ~53*20 items. I have working code in my parse method, that parses…
AlexBrand
29 votes, 6 answers

Scrapy - Reactor not Restartable

with: from twisted.internet import reactor from scrapy.crawler import CrawlerProcess I've always ran this process sucessfully: process = CrawlerProcess(get_project_settings()) process.crawl(*args) # the script will block here until the crawling is…
8-Bit Borges
29 votes, 3 answers

How to bypass cloudflare bot/ddos protection in Scrapy?

I used to scrape e-commerce webpage occasionally to get product prices information. I have not used the scraper built using Scrapy in a while and yesterday was trying to use it - I run into a problem with bot protection. It is using CloudFlare’s…
Kulbi