Questions tagged [scrapy]

Scrapy is a fast, open-source, high-level screen scraping and web crawling framework written in Python, used to crawl websites and extract structured data from their pages. It is built on an asynchronous, event-driven networking core (Twisted) rather than on threads, and it can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Features of Scrapy include:

  • Designed with simplicity in mind
  • You only need to write the rules to extract the data from web pages and let Scrapy crawl the entire website for you (see the sketch after this list)
  • Designed with extensibility in mind and so it provides several mechanisms to plug new code without having to touch the framework core
  • Portable, open-source, 100% Python
  • Runs on Linux, Windows, macOS, and BSD.
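
As a sketch of the "write the rules and let Scrapy crawl" point, a CrawlSpider declares link-extraction rules and leaves the crawling loop to the framework. The spider below is illustrative only (the site, selectors, and names are examples, not part of the tag description):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class RulesExampleSpider(CrawlSpider):
    name = 'rules_example'
    start_urls = ['http://quotes.toscrape.com/']

    # One rule: follow every pagination link and pass each followed page to parse_page.
    rules = (
        Rule(LinkExtractor(restrict_css='li.next'), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        # The extraction logic lives here; Scrapy drives the crawl itself.
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').get()}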

History:

Scrapy was born at London-based web-aggregation and e-commerce company Mydeco, where it was developed and maintained by employees of Mydeco and Insophia (a web-consulting company based in Montevideo, Uruguay). The first public release was in August 2008 under the BSD license, with a milestone 1.0 release happening in June 2015. In 2011, Zyte (formerly Scrapinghub) became the new official maintainer.


Installing Scrapy

You can install Scrapy and its dependencies from PyPI with:

pip install Scrapy

Or, to install Scrapy using conda, run:

conda install -c conda-forge scrapy
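
Either way, a quick sanity check (not part of the official install instructions) is to print the installed version from Python:

import scrapy

print(scrapy.__version__)   # prints the installed Scrapy version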

Example

Here’s the code for a spider that scrapes famous quotes from the website http://quotes.toscrape.com, following the pagination:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        # Extract every quote block on the page.
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }

        # Follow the "next page" link, if any, and parse it the same way.
        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Put this in a text file, name it something like quotes_spider.py, and run the spider using the runspider command:

scrapy runspider quotes_spider.py -o quotes.json
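
When the crawl finishes, quotes.json contains the scraped items as a JSON array; a small sketch of reading it back (the file name is the one produced by the command above):

import json

with open('quotes.json') as f:
    quotes = json.load(f)    # a list of dicts with 'text' and 'author' keys

print(len(quotes), 'quotes scraped')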



Architecture

Scrapy's components work together in an event-driven architecture. The main components are the Engine, the Scheduler, the Downloader, Spiders, and Item Pipelines, with downloader and spider middlewares providing hooks into the flow between them. The data flow between these components is described in detail in the official documentation.
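
To illustrate how new code plugs into that flow without touching the framework core, here is a minimal downloader-middleware sketch (the class name and header value are made up); a downloader middleware sits between the Engine and the Downloader and sees every outgoing request:

class CustomHeadersMiddleware:
    def process_request(self, request, spider):
        # Called for every request before it is handed to the Downloader.
        request.headers.setdefault('User-Agent', 'my-crawler (+https://example.com)')
        return None   # returning None lets the request continue through the normal flow

It would be enabled by adding its class path to the DOWNLOADER_MIDDLEWARES setting with an order number, for example {'myproject.middlewares.CustomHeadersMiddleware': 543}.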


Highly voted questions

17,743 questions are tagged [scrapy]; a selection of the most highly voted ones follows.

How to force scrapy to crawl duplicate url? (29 votes, 2 answers)

I am learning Scrapy, a web crawling framework. By default it does not crawl duplicate urls or urls which Scrapy has already crawled. How can I make Scrapy crawl duplicate urls or urls which have already been crawled? I tried to find out on the internet…
Asked by Alok

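A common answer (a hedged sketch, not taken from the question's answers) is to pass dont_filter=True so that the scheduler's duplicate filter lets the request through:

import scrapy


class RevisitSpider(scrapy.Spider):
    name = 'revisit'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # dont_filter=True bypasses the duplicate filter, so the same URL
        # can be requested and crawled again.
        yield scrapy.Request(response.url, callback=self.parse_again, dont_filter=True)

    def parse_again(self, response):
        self.logger.info('Re-crawled %s', response.url)
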
Missing scheme in request URL (29 votes, 7 answers)

I've been stuck on this bug for a while; the error message is as follows: File "C:\Python27\lib\site-packages\scrapy-0.20.2-py2.7.egg\scrapy\http\request\__init__.py", line 61, in _set_url raise ValueError('Missing scheme in…
Asked by Toby

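That error is raised when a request URL has no http:// or https:// prefix, which typically happens with bare domains in start_urls or with relative links taken straight from a page. A minimal sketch of avoiding it (site and selectors are illustrative):

import scrapy


class PagesSpider(scrapy.Spider):
    name = 'pages'
    # URLs must include the scheme; 'quotes.toscrape.com' on its own would
    # raise "Missing scheme in request URL".
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Relative hrefs such as '/page/2/' also lack a scheme; join them with
        # the page URL (or use response.follow, which does this for you).
        href = response.css('li.next a::attr(href)').get()
        if href:
            yield scrapy.Request(response.urljoin(href), callback=self.parse)
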
pyconfig.h missing during "pip install cryptography" (28 votes, 6 answers)

I want to set up a scrapy cluster following this link: scrapy-cluster. Everything is ok before I run this command: pip install -r requirements.txt. The requirements.txt looks…
Asked by FancyXun

Scrapy - Silently drop an item (28 votes, 6 answers)

I am using Scrapy to crawl several websites, which may share redundant information. For each page I scrape, I store the url of the page, its title and its html code in MongoDB. I want to avoid duplication in the database, so I implement a pipeline…
Asked by Balthazar Rouberol

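A pipeline along these lines usually raises DropItem for duplicates; a minimal sketch (the field name 'url' is an assumption, not taken from the question):

from scrapy.exceptions import DropItem


class DeduplicationPipeline:

    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        # Raising DropItem discards the item; Scrapy logs a message for each
        # dropped item, which is what the "silently" part of the question is about.
        if item['url'] in self.seen_urls:
            raise DropItem('Duplicate page: %s' % item['url'])
        self.seen_urls.add(item['url'])
        return item
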
How to generate the start_urls dynamically in crawling? (27 votes, 2 answers)

I am crawling a site which may contain a lot of start_urls, like: http://www.a.com/list_1_2_3.htm. I want to populate start_urls like [list_\d+_\d+_\d+\.htm], and extract items from URLs like [node_\d+\.htm] during crawling. Can I use CrawlSpider…
Asked by user1215269

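Rather than a static start_urls list, a spider can build its requests in start_requests(); a sketch using the URL patterns from the question (the numeric ranges and names are made up):

import scrapy


class ListSpider(scrapy.Spider):
    name = 'list'

    def start_requests(self):
        # Generate the start URLs programmatically instead of hard-coding them.
        for a in range(1, 4):
            for b in range(1, 4):
                for c in range(1, 4):
                    url = 'http://www.a.com/list_%d_%d_%d.htm' % (a, b, c)
                    yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Follow node_<id>.htm links found on the listing pages.
        for href in response.css('a::attr(href)').re(r'node_\d+\.htm'):
            yield response.follow(href, callback=self.parse_node)

    def parse_node(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}
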
Scrapy Crawl URLs in Order (27 votes, 13 answers)

So, my problem is relatively simple. I have one spider crawling multiple sites, and I need it to return the data in the order I write it in my code. It's posted below. from scrapy.spider import BaseSpider from scrapy.selector import…
Asked by Jeff

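One commonly suggested approach (a sketch, not necessarily the accepted answer) is to give each request an explicit priority, since the scheduler dispatches higher-priority requests first:

import scrapy


class OrderedSpider(scrapy.Spider):
    name = 'ordered'
    urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
        'http://quotes.toscrape.com/page/3/',
    ]

    def start_requests(self):
        # Earlier URLs get a higher priority, so they are scheduled first.
        # With concurrency above 1 responses can still arrive out of order,
        # so strict ordering also needs CONCURRENT_REQUESTS = 1.
        for i, url in enumerate(self.urls):
            yield scrapy.Request(url, callback=self.parse, priority=len(self.urls) - i)

    def parse(self, response):
        yield {'url': response.url}
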
Scrapy Shell and Scrapy Splash (27 votes, 3 answers)

We've been using scrapy-splash middleware to pass the scraped HTML source through the Splash javascript engine running inside a docker container. If we want to use Splash in the spider, we configure several required project settings and yield a…
Asked by alecxe

For scrapy/selenium is there a way to go back to a previous page? (27 votes, 3 answers)

I essentially have a start_url that has my javascript search form and button, hence the need for selenium. I use selenium to select the appropriate items in my select box objects and click the search button. On the following page, I do some scrapy…
Asked by petermaxstack

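On the Selenium side, going back is a single call; a minimal sketch (browser choice and URLs are illustrative):

from selenium import webdriver

driver = webdriver.Firefox()                       # any WebDriver works here
driver.get('http://quotes.toscrape.com/')
driver.get('http://quotes.toscrape.com/page/2/')   # move to another page
driver.back()                                      # return to the previous page
print(driver.current_url)                          # the first URL again
driver.quit()
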
How to disable or change the path of ghostdriver.log? (27 votes, 2 answers)

The question is straightforward, but some context may help. I'm trying to deploy scrapy while using selenium and phantomjs as the downloader. But the problem is that it keeps saying permission denied when trying to deploy. So I want to change the path of…
Asked by Sam Stoelinga

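With older Selenium releases that still ship the PhantomJS driver, the log location is controlled by the service_log_path argument; a hedged sketch:

import os
from selenium import webdriver

# Point the ghostdriver log somewhere writable, or at os.devnull to discard it.
driver = webdriver.PhantomJS(service_log_path=os.devnull)
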
How to drop a collection with pymongo? (26 votes, 1 answer)

I use Scrapy to crawl data and save it successfully to cloud-hosted mLab with MongoDB. My collection name is recently and its data count is 5. I want to crawl the data again and update my collection recently, so I try to drop the collection and then…
Asked by Morton

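Dropping a collection with pymongo is a one-liner; a sketch using the collection name from the question (connection details and database name are illustrative):

from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017/')
db = client['mydatabase']

db['recently'].drop()            # drops the 'recently' collection
# or, equivalently:
db.drop_collection('recently')
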
ScrapyRT vs Scrapyd (26 votes, 1 answer)

We've been using the Scrapyd service for a while up until now. It provides a nice wrapper around a scrapy project and its spiders, letting you control the spiders via an HTTP API: "Scrapyd is a service for running Scrapy spiders. It allows you to deploy…
Asked by alecxe

Geopy: catch timeout error (26 votes, 4 answers)

I am using geopy to geocode some addresses and I want to catch the timeout errors and print them out so I can do some quality control on the input. I am putting the geocode request in a try/catch but it's not working. Any ideas on what I need to do?…
Asked by MoreScratch

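geopy raises GeocoderTimedOut on timeouts, so the except clause has to name that exception; a sketch (the geocoder choice and user agent are assumptions):

from geopy.geocoders import Nominatim
from geopy.exc import GeocoderTimedOut

geolocator = Nominatim(user_agent='my-geocoder')


def geocode_or_report(address):
    try:
        return geolocator.geocode(address, timeout=10)
    except GeocoderTimedOut:
        # Print the failing input so it can be reviewed later.
        print('Timed out while geocoding: %r' % address)
        return None
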
Can Scrapy be replaced by pyspider? (26 votes, 2 answers)

I've been using the Scrapy web-scraping framework pretty extensively, but recently I've discovered that there is another framework/system called pyspider, which, according to its github page, is fresh, actively developed and popular. pyspider's home…
Asked by alecxe

scrapy from script output in json (26 votes, 4 answers)

I am running scrapy in a python script: def setup_crawler(domain): dispatcher.connect(stop_reactor, signal=signals.spider_closed) spider = ArgosSpider(domain=domain) settings = get_project_settings() crawler = Crawler(settings) …
Asked by Wasif Khalil

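In current Scrapy versions this is usually done with CrawlerProcess and the FEEDS setting rather than the older Crawler/dispatcher API shown in the question; a sketch reusing the spider from the example further up this page:

from scrapy.crawler import CrawlerProcess

from quotes_spider import QuotesSpider   # the spider file from the example above

process = CrawlerProcess(settings={
    # Write all scraped items to result.json in JSON format.
    'FEEDS': {'result.json': {'format': 'json'}},
})
process.crawl(QuotesSpider)
process.start()   # blocks until the crawl is finished
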
Scrapy Very Basic Example (26 votes, 2 answers)

Hi, I have Python Scrapy installed on my mac and I was trying to follow the very first example on their website. They were trying to run the command: scrapy crawl mininova.org -o scraped_data.json -t json. I don't quite understand what this means…
Asked by B.Mr.W.