Questions tagged [scrapy]

Scrapy is a fast, open-source, high-level screen scraping and web crawling framework written in Python, used to crawl websites and extract structured data from their pages. It is not multi-threaded; instead it is built on the Twisted asynchronous networking library, which lets it handle many requests concurrently in a single thread. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Features of Scrapy include:

  • Designed with simplicity in mind
  • You only need to write the rules to extract the data from web pages, and Scrapy crawls the entire website for you
  • Designed with extensibility in mind, providing several mechanisms to plug in new code without touching the framework core
  • Portable, open-source, 100% Python
  • Written in Python and runs on Linux, Windows, macOS, and BSD.

History:

Scrapy was born at London-based web-aggregation and e-commerce company Mydeco, where it was developed and maintained by employees of Mydeco and Insophia (a web-consulting company based in Montevideo, Uruguay). The first public release was in August 2008 under the BSD license, with a milestone 1.0 release happening in June 2015. In 2011, Zyte (formerly Scrapinghub) became the new official maintainer.


Installing Scrapy

You can install Scrapy and its dependencies from PyPI with:

pip install Scrapy

or to install Scrapy using conda, run:

conda install -c conda-forge scrapy

Example

Here’s the code for a spider that scrapes famous quotes from the website http://quotes.toscrape.com, following the pagination:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Put this in a text file, name it something like quotes_spider.py, and run the spider using the runspider command:

scrapy runspider quotes_spider.py -o quotes.json


Architecture

Scrapy contains multiple components working together in an event-driven architecture. The main components are the Engine, Spiders, the Scheduler, and the Downloader. The data flow between these components is described in detail in the official documentation.
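That data flow can be illustrated with a toy, single-threaded model. This is plain Python with no Scrapy code: Engine, Scheduler, and Downloader here are simplified stand-ins for the real components, and the URLs are placeholders:

```python
from collections import deque


class Scheduler:
    """Queues requests the engine hands it and serves them back one at a time."""
    def __init__(self):
        self.queue = deque()

    def enqueue(self, request):
        self.queue.append(request)

    def next_request(self):
        return self.queue.popleft() if self.queue else None


def downloader(request):
    """Stand-in for the real Downloader: pretend to fetch the URL."""
    return {'url': request, 'body': f'<html>page for {request}</html>'}


def spider_parse(response):
    """Stand-in for a Spider callback: yield an item and maybe a new request."""
    yield {'item': response['url']}
    if response['url'] == 'http://example.com/page1':
        yield 'http://example.com/page2'       # a follow-up request


def engine(start_urls):
    """The Engine drives the loop: scheduler -> downloader -> spider -> scheduler."""
    scheduler = Scheduler()
    items = []
    for url in start_urls:
        scheduler.enqueue(url)
    while (request := scheduler.next_request()) is not None:
        response = downloader(request)
        for result in spider_parse(response):
            if isinstance(result, dict):
                items.append(result)           # item -> item pipeline
            else:
                scheduler.enqueue(result)      # new request -> back to scheduler
    return items


print(engine(['http://example.com/page1']))   # items for page1 and the follow-up page2
```

In real Scrapy the loop is asynchronous (driven by Twisted) and middlewares sit between the components, but the circular request/response/item flow is the same.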



17743 questions
34 votes, 8 answers

Access django models inside of Scrapy

Is it possible to access my django models inside of a Scrapy pipeline, so that I can save my scraped data straight to my model? I've seen this, but I don't really get how to set it up?
imns
34 votes, 6 answers

Scrapy throws ImportError: cannot import name xmlrpc_client

After install Scrapy via pip, and having Python 2.7.10: scrapy Traceback (most recent call last): File "/usr/local/bin/scrapy", line 7, in from scrapy.cmdline import execute File "/Library/Python/2.7/site-packages/scrapy/__init__.py", line…
Ignasi
34 votes, 3 answers

Scrapy: how to disable or change log?

I've followed the official tutoral of Scrapy, it's wonderful! I'd like to remove all of DEBUG messages from console output. Is there a way? 2013-06-08 14:51:48+0000 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6029 2013-06-08 14:51:48+0000…
realtebo
34 votes, 4 answers

How to access scrapy settings from item Pipeline

How do I access the scrapy settings in settings.py from the item pipeline. The documentation mentions it can be accessed through the crawler in extensions, but I don't see how to access the crawler in the pipelines.
avaleske
33 votes, 4 answers

Crawling with an authenticated session in Scrapy

In my previous question, I wasn't very specific over my problem (scraping with an authenticated session with Scrapy), in the hopes of being able to deduce the solution from a more general answer. I should probably rather have used the word…
Herman Schaaf
32 votes, 6 answers

Best way for a beginner to learn screen scraping by Python

This might be one of those questions that are difficult to answer, but here goes: I don't consider my self programmer - but I would like to :-) I've learned R, because I was sick and tired of spss, and because a friend introduced me to the language…
Andreas
32 votes, 3 answers

InterfaceError: connection already closed (using django + celery + Scrapy)

I am getting this when using a Scrapy parsing function (that can take till 10 minutes sometimes) inside a Celery task. I use: - Django==1.6.5 - django-celery==3.1.16 - celery==3.1.16 - psycopg2==2.5.5 (I used also psycopg2==2.5.4) [2015-07-19…
mou55
31 votes, 3 answers

Send Post Request in Scrapy

I am trying to crawl the latest reviews from google play store and to get that I need to make a post request. With the Postman, it works and I get desired response. but a post request in terminal gives me a server error For ex: this page…
Amit Tripathi
31 votes, 9 answers

suppress Scrapy Item printed in logs after pipeline

I have a scrapy project where the item that ultimately enters my pipeline is relatively large and stores lots of metadata and content. Everything is working properly in my spider and pipelines. The logs, however, are printing out the entire scrapy…
dino
31 votes, 7 answers

scraping the file with html saved in local system

For example i had a site "www.example.com" Actually i want to scrape the html of this site by saving on to local system. so for testing i saved that page on my desktop as example.html Now i had written the spider code for this as below class…
Shiva Krishna Bavandla
30 votes, 5 answers

MongoDB InvalidDocument: Cannot encode object

I am using scrapy to scrap blogs and then store the data in mongodb. At first i got the InvalidDocument Exception. So obvious to me is that the data is not in the right encoding. So before persisting the object, in my MongoPipeline i check if the…
Codious-JR
30 votes, 3 answers

scrapy: convert html string to HtmlResponse object

I have a raw html string that I want to convert to scrapy HTML response object so that I can use the selectors css and xpath, similar to scrapy's response. How can I do it?
yayu
30 votes, 3 answers

scrapy - parsing items that are paginated

I have a url of the form: example.com/foo/bar/page_1.html There are a total of 53 pages, each one of them has ~20 rows. I basically want to get all the rows from all the pages, i.e. ~53*20 items. I have working code in my parse method, that parses…
AlexBrand
29 votes, 6 answers

Scrapy - Reactor not Restartable

with: from twisted.internet import reactor from scrapy.crawler import CrawlerProcess I've always ran this process sucessfully: process = CrawlerProcess(get_project_settings()) process.crawl(*args) # the script will block here until the crawling is…
8-Bit Borges
29 votes, 3 answers

How to bypass cloudflare bot/ddos protection in Scrapy?

I used to scrape e-commerce webpage occasionally to get product prices information. I have not used the scraper built using Scrapy in a while and yesterday was trying to use it - I run into a problem with bot protection. It is using CloudFlare’s…
Kulbi