Questions tagged [scrapy]

Scrapy is an open-source, high-level screen scraping and web crawling framework written in Python, used to crawl websites and extract structured data from their pages. It is built on an asynchronous, event-driven networking engine (Twisted) rather than threads. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Features of Scrapy include:

  • Designed with simplicity in mind
  • You only need to write the rules to extract the data from web pages; Scrapy crawls the entire web site for you
  • Designed with extensibility in mind and so it provides several mechanisms to plug new code without having to touch the framework core
  • Portable, open-source, 100% Python
  • Written in Python and runs on Linux, Windows, macOS, and BSD
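As a sketch of the extensibility point above, here is a minimal item pipeline. The class and field names are illustrative, not from this page; in a real project the class would live in `pipelines.py` and be enabled via the `ITEM_PIPELINES` setting:

```python
# Minimal item pipeline sketch (illustrative names). Scrapy calls
# process_item() for every item a spider yields; enabling the class in
# settings.py would look like:
#   ITEM_PIPELINES = {"myproject.pipelines.CleanQuotesPipeline": 300}

class CleanQuotesPipeline:
    def process_item(self, item, spider):
        # Plain dict items behave like dictionaries here.
        text = (item.get("text") or "").strip()
        if not text:
            # In a real pipeline you would raise scrapy.exceptions.DropItem
            # instead of returning None.
            return None
        item["text"] = text
        return item
```

Because pipelines are plain Python classes wired up through settings, new behaviour plugs in without touching the framework core, which is the extensibility mechanism the bullet refers to.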

History:

Scrapy was born at London-based web-aggregation and e-commerce company Mydeco, where it was developed and maintained by employees of Mydeco and Insophia (a web-consulting company based in Montevideo, Uruguay). The first public release was in August 2008 under the BSD license, with a milestone 1.0 release happening in June 2015. In 2011, Zyte (formerly Scrapinghub) became the new official maintainer.


Installing Scrapy

You can install Scrapy and its dependencies from PyPI with:

pip install Scrapy

or to install Scrapy using conda, run:

conda install -c conda-forge scrapy

Example

Here’s the code for a spider that scrapes famous quotes from the website http://quotes.toscrape.com, following the pagination:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Put this in a text file, name it something like quotes_spider.py, and run the spider using the runspider command:

scrapy runspider quotes_spider.py -o quotes.json



Architecture

Scrapy contains multiple components working together in an event-driven architecture. The main components are the Engine, Spiders, Scheduler, and Downloader. The data flow between these components is described in detail in the official documentation.
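As a hedged sketch of how a component plugs into that data flow, here is a hypothetical downloader middleware (the class name and header value are illustrative): the Engine passes each request through process_request on its way to the Downloader, and each response through process_response on its way back to the Spider.

```python
# Hypothetical downloader middleware sketch (illustrative names). In a
# real project it would be enabled in settings.py via the
# DOWNLOADER_MIDDLEWARES setting.

class DefaultHeadersMiddleware:
    def process_request(self, request, spider):
        # Returning None lets the request continue to the Downloader.
        request.headers.setdefault("User-Agent", "my-crawler/0.1 (example)")
        return None

    def process_response(self, request, response, spider):
        # Returning the response hands it on to the Spider's callback.
        return response
```

Middlewares like this sit between the Engine and the Downloader, which is why they are a common place to handle headers, cookies, proxies, and retries without touching spider code.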



17743 questions
23
votes
3 answers

Scrapy: non-blocking pause

I have a problem. I need to stop the execution of a function for a while, but not stop the implementation of parsing as a whole. That is, I need a non-blocking pause. It's looks like: class ScrapySpider(Spider): name = 'live_function' def…
JRazor
  • 2,707
  • 18
  • 27
23
votes
5 answers

How to send cookie with scrapy CrawlSpider requests?

I am trying to create this Reddit scraper using Python's Scrapy framework. I have used the CrawSpider to crawl through Reddit and its subreddits. But, when I come across pages that have adult content, the site asks for a cookie over18=1. So, I have…
Parthapratim Neog
  • 4,352
  • 6
  • 43
  • 79
23
votes
2 answers

Scrapy Shell - How to change USER_AGENT

I have a fully functioning scrapy script to extract data from a website. During setup, the target site banned me based on my USER_AGENT information. I subsequently added a RotateUserAgentMiddleware to rotate the USER_AGENT randomly. This works…
dfriestedt
  • 483
  • 1
  • 3
  • 18
23
votes
1 answer

How does scrapy use rules?

I'm new to using Scrapy and I wanted to understand how the rules are being used within the CrawlSpider. If I have a rule where I'm crawling through the yellowpages for cupcake listings in Tucson, AZ, how does yielding a URL request activate the…
OfLettersAndNumbers
  • 822
  • 1
  • 12
  • 22
23
votes
3 answers

When and how should use multiple spiders in one Scrapy project

I am using Scrapy, it is great! so fast to build a crawler. with the number of web sites are increasing, need to create new spiders, but these web sits are the same type, all these spiders use same items, pipelines, parsing process the contents…
user3337861
  • 245
  • 2
  • 8
23
votes
1 answer

scrapy item loader return list not single value

I am using scrapy 0.20. I want to use item loader this is my code: l = XPathItemLoader(item=MyItemClass(), response=response) l.add_value('url', response.url) l.add_xpath('title',"my xpath") l.add_xpath('developer', "my…
Marco Dinatsoli
  • 10,322
  • 37
  • 139
  • 253
23
votes
2 answers

Scrapy, scraping data inside a Javascript

I am using scrapy to screen scrape data from a website. However, the data I wanted wasn't inside the html itself, instead, it is from a javascript. So, my question is: How to get the values (text values) of such cases? This, is the site I'm trying…
HeadAboutToExplode
  • 275
  • 1
  • 3
  • 7
22
votes
2 answers

Scrapy with Privoxy and Tor: how to renew IP

I am dealing with Scrapy, Privoxy and Tor. I have all installed and properly working. But Tor connects with the same IP everytime, so I can easily be banned. Is it possible to tell Tor to reconnect each X seconds or connections? Thanks! EDIT about…
user7499416
22
votes
8 answers

Strip \n \t \r in scrapy

I'm trying to strip \r \n \t characters with a scrapy spider, making then a json file. I have a "description" object which is full of new lines, and it doesn't do what I want: matching each description to a title. I tried with map(unicode.strip())…
Lara M.
  • 855
  • 2
  • 10
  • 23
22
votes
5 answers

How To Turn Off Logging in Scrapy (Python)

I have created a spider using Scrapy but I cannot figure out how to turn off the default logging. From the documentation it appears that I should be able to turn it off by doing logging.basicConfig(level=logging.ERROR) But this has no…
Dr. Pain
  • 689
  • 1
  • 7
  • 16
22
votes
5 answers

How can i extract only text in scrapy selector in python

I have this code site = hxs.select("//h1[@class='state']") log.msg(str(site[0].extract()),level=log.ERROR) The ouput is [scrapy] ERROR:

1 job containing php

Mirage
  • 30,868
  • 62
  • 166
  • 261
21
votes
3 answers

How can I make scrapy crawl break and exit when encountering the first exception?

For development purposes, I would like to stop all scrapy crawling activity as soon a first exception (in a spider or a pipeline) occurs. Any advice?
Udi
  • 29,222
  • 9
  • 96
  • 129
21
votes
4 answers

How do I merge results from target page to current page in scrapy?

Need example in scrapy on how to get a link from one page, then follow this link, get more info from the linked page, and merge back with some data from first page.
Jas
  • 14,493
  • 27
  • 97
  • 148
21
votes
4 answers

Scrapy: HTTP status code is not handled or not allowed?

I want to get product title,link,price in category https://tiki.vn/dien-thoai-may-tinh-bang/c1789 But it fails "HTTP status code is not handled or not allowed": My file: spiders/tiki.py import scrapy from scrapy.linkextractors import…
gait
  • 331
  • 1
  • 3
  • 11
21
votes
2 answers

Fatal error C1083: Cannot open include file: 'openssl/opensslv.h'

I'm trying to install Scrapy, but got this error during installing: build\temp.win-amd64-2.7\Release_openssl.c(429) : fatal error C1083: Cannot open include file: 'openssl/opensslv.h': No such file or directory I've checked that the file…
kiral
  • 211
  • 1
  • 2
  • 3