Questions tagged [scrapy]

Scrapy is an open-source, high-level screen scraping and web crawling framework written in Python, used to crawl websites and extract structured data from their pages. It is built on an asynchronous, event-driven networking engine (Twisted) rather than threads. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Features of Scrapy include:

  • Designed with simplicity in mind
  • You only need to write the rules to extract the data from web pages; Scrapy crawls the entire web site for you
  • Designed with extensibility in mind and so it provides several mechanisms to plug new code without having to touch the framework core
  • Portable, open-source, 100% Python
  • Written in Python and runs on Linux, Windows, macOS, and BSD
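As a sketch of the extensibility point above, here is a minimal item pipeline. The class and field names are illustrative, not from this page; in a real project the class would live in `pipelines.py` and be enabled via the `ITEM_PIPELINES` setting:

```python
# Minimal item pipeline sketch (illustrative names). Scrapy calls
# process_item() for every item a spider yields; enabling the class in
# settings.py would look like:
#   ITEM_PIPELINES = {"myproject.pipelines.CleanQuotesPipeline": 300}

class CleanQuotesPipeline:
    def process_item(self, item, spider):
        # Plain dict items behave like dictionaries here.
        text = (item.get("text") or "").strip()
        if not text:
            # In a real pipeline you would raise scrapy.exceptions.DropItem
            # instead of returning None.
            return None
        item["text"] = text
        return item
```

Because pipelines are plain Python classes wired up through settings, new behaviour plugs in without touching the framework core, which is the extensibility mechanism the bullet refers to.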

History:

Scrapy was born at London-based web-aggregation and e-commerce company Mydeco, where it was developed and maintained by employees of Mydeco and Insophia (a web-consulting company based in Montevideo, Uruguay). The first public release was in August 2008 under the BSD license, with a milestone 1.0 release happening in June 2015. In 2011, Zyte (formerly Scrapinghub) became the new official maintainer.


Installing Scrapy

You can install Scrapy and its dependencies from PyPI with:

pip install Scrapy

or to install Scrapy using conda, run:

conda install -c conda-forge scrapy

Example

Here’s the code for a spider that scrapes famous quotes from the website http://quotes.toscrape.com, following the pagination:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Put this in a text file, name it something like quotes_spider.py, and run the spider using the runspider command:

scrapy runspider quotes_spider.py -o quotes.json



Architecture

Scrapy contains multiple components working together in an event-driven architecture. The main components are the Engine, Spiders, Scheduler, and Downloader. The data flow between these components is described in detail in the official documentation.
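As a hedged sketch of how a component plugs into that data flow, here is a hypothetical downloader middleware (the class name and header value are illustrative): the Engine passes each request through process_request on its way to the Downloader, and each response through process_response on its way back to the Spider.

```python
# Hypothetical downloader middleware sketch (illustrative names). In a
# real project it would be enabled in settings.py via the
# DOWNLOADER_MIDDLEWARES setting.

class DefaultHeadersMiddleware:
    def process_request(self, request, spider):
        # Returning None lets the request continue to the Downloader.
        request.headers.setdefault("User-Agent", "my-crawler/0.1 (example)")
        return None

    def process_response(self, request, response, spider):
        # Returning the response hands it on to the Spider's callback.
        return response
```

Middlewares like this sit between the Engine and the Downloader, which is why they are a common place to handle headers, cookies, proxies, and retries without touching spider code.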



17743 questions
23
votes
3 answers

Scrapy: non-blocking pause

I have a problem. I need to stop the execution of a function for a while, but not stop the implementation of parsing as a whole. That is, I need a non-blocking pause. It's looks like: class ScrapySpider(Spider): name = 'live_function' def…
JRazor
  • 2,707
  • 18
  • 27
23
votes
5 answers

How to send cookie with scrapy CrawlSpider requests?

I am trying to create this Reddit scraper using Python's Scrapy framework. I have used the CrawSpider to crawl through Reddit and its subreddits. But, when I come across pages that have adult content, the site asks for a cookie over18=1. So, I have…
Parthapratim Neog
  • 4,352
  • 6
  • 43
  • 79
23
votes
2 answers

Scrapy Shell - How to change USER_AGENT

I have a fully functioning scrapy script to extract data from a website. During setup, the target site banned me based on my USER_AGENT information. I subsequently added a RotateUserAgentMiddleware to rotate the USER_AGENT randomly. This works…
dfriestedt
  • 483
  • 1
  • 3
  • 18
23
votes
1 answer

How does scrapy use rules?

I'm new to using Scrapy and I wanted to understand how the rules are being used within the CrawlSpider. If I have a rule where I'm crawling through the yellowpages for cupcake listings in Tucson, AZ, how does yielding a URL request activate the…
OfLettersAndNumbers
  • 822
  • 1
  • 12
  • 22
23
votes
3 answers

When and how should use multiple spiders in one Scrapy project

I am using Scrapy, it is great! so fast to build a crawler. with the number of web sites are increasing, need to create new spiders, but these web sits are the same type, all these spiders use same items, pipelines, parsing process the contents…
user3337861
  • 245
  • 2
  • 8
23
votes
1 answer

scrapy item loader return list not single value

I am using scrapy 0.20. I want to use item loader this is my code: l = XPathItemLoader(item=MyItemClass(), response=response) l.add_value('url', response.url) l.add_xpath('title',"my xpath") l.add_xpath('developer', "my…
Marco Dinatsoli
  • 10,322
  • 37
  • 139
  • 253
23
votes
2 answers

Scrapy, scraping data inside a Javascript

I am using scrapy to screen scrape data from a website. However, the data I wanted wasn't inside the html itself, instead, it is from a javascript. So, my question is: How to get the values (text values) of such cases? This, is the site I'm trying…
HeadAboutToExplode
  • 275
  • 1
  • 3
  • 7
22
votes
2 answers

Scrapy with Privoxy and Tor: how to renew IP

I am dealing with Scrapy, Privoxy and Tor. I have all installed and properly working. But Tor connects with the same IP everytime, so I can easily be banned. Is it possible to tell Tor to reconnect each X seconds or connections? Thanks! EDIT about…
user7499416
22
votes
8 answers

Strip \n \t \r in scrapy

I'm trying to strip \r \n \t characters with a scrapy spider, making then a json file. I have a "description" object which is full of new lines, and it doesn't do what I want: matching each description to a title. I tried with map(unicode.strip())…
Lara M.
  • 855
  • 2
  • 10
  • 23
22
votes
5 answers

How To Turn Off Logging in Scrapy (Python)

I have created a spider using Scrapy but I cannot figure out how to turn off the default logging. From the documentation it appears that I should be able to turn it off by doing logging.basicConfig(level=logging.ERROR) But this has no…
Dr. Pain
  • 689
  • 1
  • 7
  • 16
22
votes
5 answers

How can i extract only text in scrapy selector in python

I have this code site = hxs.select("//h1[@class='state']") log.msg(str(site[0].extract()),level=log.ERROR) The ouput is [scrapy] ERROR:

1 job containing php

Mirage
  • 30,868
  • 62
  • 166
  • 261
21
votes
3 answers

How can I make scrapy crawl break and exit when encountering the first exception?

For development purposes, I would like to stop all scrapy crawling activity as soon a first exception (in a spider or a pipeline) occurs. Any advice?
Udi
  • 29,222
  • 9
  • 96
  • 129
21
votes
4 answers

How do I merge results from target page to current page in scrapy?

Need example in scrapy on how to get a link from one page, then follow this link, get more info from the linked page, and merge back with some data from first page.
Jas
  • 14,493
  • 27
  • 97
  • 148
21
votes
4 answers

Scrapy: HTTP status code is not handled or not allowed?

I want to get product title,link,price in category https://tiki.vn/dien-thoai-may-tinh-bang/c1789 But it fails "HTTP status code is not handled or not allowed": My file: spiders/tiki.py import scrapy from scrapy.linkextractors import…
gait
  • 331
  • 1
  • 3
  • 11
21
votes
2 answers

Fatal error C1083: Cannot open include file: 'openssl/opensslv.h'

I'm trying to install Scrapy, but got this error during installing: build\temp.win-amd64-2.7\Release_openssl.c(429) : fatal error C1083: Cannot open include file: 'openssl/opensslv.h': No such file or directory I've checked that the file…
kiral
  • 211
  • 1
  • 2
  • 3