Questions tagged [scrapy]

Scrapy is a fast, open-source, high-level screen scraping and web crawling framework written in Python, used to crawl websites and extract structured data from their pages. It is built on an asynchronous, event-driven networking core (Twisted) rather than on threads, and it can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Features of Scrapy include:

  • Designed with simplicity in mind
  • You only need to write the rules to extract the data from web pages and let Scrapy crawl the entire website for you (see the sketch after this list)
  • Designed with extensibility in mind and so it provides several mechanisms to plug new code without having to touch the framework core
  • Portable, open-source, 100% Python
  • Runs on Linux, Windows, macOS, and BSD.
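
As a sketch of the "write the rules and let Scrapy crawl" point, a CrawlSpider declares link-extraction rules and leaves the crawling loop to the framework. The spider below is illustrative only (the site, selectors, and names are examples, not part of the tag description):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class RulesExampleSpider(CrawlSpider):
    name = 'rules_example'
    start_urls = ['http://quotes.toscrape.com/']

    # One rule: follow every pagination link and pass each followed page to parse_page.
    rules = (
        Rule(LinkExtractor(restrict_css='li.next'), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        # The extraction logic lives here; Scrapy drives the crawl itself.
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').get()}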

History:

Scrapy was born at London-based web-aggregation and e-commerce company Mydeco, where it was developed and maintained by employees of Mydeco and Insophia (a web-consulting company based in Montevideo, Uruguay). The first public release was in August 2008 under the BSD license, with a milestone 1.0 release happening in June 2015. In 2011, Zyte (formerly Scrapinghub) became the new official maintainer.


Installing Scrapy

You can install Scrapy and its dependencies from PyPI with:

pip install Scrapy

Or, to install Scrapy using conda, run:

conda install -c conda-forge scrapy
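
Either way, a quick sanity check (not part of the official install instructions) is to print the installed version from Python:

import scrapy

print(scrapy.__version__)   # prints the installed Scrapy version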

Example

Here’s the code for a spider that scrapes famous quotes from the website http://quotes.toscrape.com, following the pagination:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        # Extract every quote block on the page.
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }

        # Follow the "next page" link, if any, and parse it the same way.
        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Put this in a text file, name it something like quotes_spider.py, and run the spider using the runspider command:

scrapy runspider quotes_spider.py -o quotes.json
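
When the crawl finishes, quotes.json contains the scraped items as a JSON array; a small sketch of reading it back (the file name is the one produced by the command above):

import json

with open('quotes.json') as f:
    quotes = json.load(f)    # a list of dicts with 'text' and 'author' keys

print(len(quotes), 'quotes scraped')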



Architecture

Scrapy's components work together in an event-driven architecture. The main components are the Engine, the Scheduler, the Downloader, Spiders, and Item Pipelines, with downloader and spider middlewares providing hooks into the flow between them. The data flow between these components is described in detail in the official documentation.
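
To illustrate how new code plugs into that flow without touching the framework core, here is a minimal downloader-middleware sketch (the class name and header value are made up); a downloader middleware sits between the Engine and the Downloader and sees every outgoing request:

class CustomHeadersMiddleware:
    def process_request(self, request, spider):
        # Called for every request before it is handed to the Downloader.
        request.headers.setdefault('User-Agent', 'my-crawler (+https://example.com)')
        return None   # returning None lets the request continue through the normal flow

It would be enabled by adding its class path to the DOWNLOADER_MIDDLEWARES setting with an order number, for example {'myproject.middlewares.CustomHeadersMiddleware': 543}.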


Highly voted questions

17,743 questions are tagged [scrapy]; a selection of the most highly voted ones follows.

How to force scrapy to crawl duplicate url? (29 votes, 2 answers)

I am learning Scrapy, a web crawling framework. By default it does not crawl duplicate urls or urls which Scrapy has already crawled. How can I make Scrapy crawl duplicate urls or urls which have already been crawled? I tried to find out on the internet…
Asked by Alok

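A common answer (a hedged sketch, not taken from the question's answers) is to pass dont_filter=True so that the scheduler's duplicate filter lets the request through:

import scrapy


class RevisitSpider(scrapy.Spider):
    name = 'revisit'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # dont_filter=True bypasses the duplicate filter, so the same URL
        # can be requested and crawled again.
        yield scrapy.Request(response.url, callback=self.parse_again, dont_filter=True)

    def parse_again(self, response):
        self.logger.info('Re-crawled %s', response.url)
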
Missing scheme in request URL (29 votes, 7 answers)

I've been stuck on this bug for a while; the error message is as follows: File "C:\Python27\lib\site-packages\scrapy-0.20.2-py2.7.egg\scrapy\http\request\__init__.py", line 61, in _set_url raise ValueError('Missing scheme in…
Asked by Toby

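That error is raised when a request URL has no http:// or https:// prefix, which typically happens with bare domains in start_urls or with relative links taken straight from a page. A minimal sketch of avoiding it (site and selectors are illustrative):

import scrapy


class PagesSpider(scrapy.Spider):
    name = 'pages'
    # URLs must include the scheme; 'quotes.toscrape.com' on its own would
    # raise "Missing scheme in request URL".
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Relative hrefs such as '/page/2/' also lack a scheme; join them with
        # the page URL (or use response.follow, which does this for you).
        href = response.css('li.next a::attr(href)').get()
        if href:
            yield scrapy.Request(response.urljoin(href), callback=self.parse)
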
pyconfig.h missing during "pip install cryptography" (28 votes, 6 answers)

I want to set up a scrapy cluster following this link: scrapy-cluster. Everything is ok before I run this command: pip install -r requirements.txt. The requirements.txt looks…
Asked by FancyXun

Scrapy - Silently drop an item (28 votes, 6 answers)

I am using Scrapy to crawl several websites, which may share redundant information. For each page I scrape, I store the url of the page, its title and its html code in MongoDB. I want to avoid duplication in the database, so I implement a pipeline…
Asked by Balthazar Rouberol

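A pipeline along these lines usually raises DropItem for duplicates; a minimal sketch (the field name 'url' is an assumption, not taken from the question):

from scrapy.exceptions import DropItem


class DeduplicationPipeline:

    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        # Raising DropItem discards the item; Scrapy logs a message for each
        # dropped item, which is what the "silently" part of the question is about.
        if item['url'] in self.seen_urls:
            raise DropItem('Duplicate page: %s' % item['url'])
        self.seen_urls.add(item['url'])
        return item
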
How to generate the start_urls dynamically in crawling? (27 votes, 2 answers)

I am crawling a site which may contain a lot of start_urls, like: http://www.a.com/list_1_2_3.htm. I want to populate start_urls like [list_\d+_\d+_\d+\.htm], and extract items from URLs like [node_\d+\.htm] during crawling. Can I use CrawlSpider…
Asked by user1215269

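Rather than a static start_urls list, a spider can build its requests in start_requests(); a sketch using the URL patterns from the question (the numeric ranges and names are made up):

import scrapy


class ListSpider(scrapy.Spider):
    name = 'list'

    def start_requests(self):
        # Generate the start URLs programmatically instead of hard-coding them.
        for a in range(1, 4):
            for b in range(1, 4):
                for c in range(1, 4):
                    url = 'http://www.a.com/list_%d_%d_%d.htm' % (a, b, c)
                    yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Follow node_<id>.htm links found on the listing pages.
        for href in response.css('a::attr(href)').re(r'node_\d+\.htm'):
            yield response.follow(href, callback=self.parse_node)

    def parse_node(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}
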
Scrapy Crawl URLs in Order (27 votes, 13 answers)

So, my problem is relatively simple. I have one spider crawling multiple sites, and I need it to return the data in the order I write it in my code. It's posted below. from scrapy.spider import BaseSpider from scrapy.selector import…
Asked by Jeff

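One commonly suggested approach (a sketch, not necessarily the accepted answer) is to give each request an explicit priority, since the scheduler dispatches higher-priority requests first:

import scrapy


class OrderedSpider(scrapy.Spider):
    name = 'ordered'
    urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
        'http://quotes.toscrape.com/page/3/',
    ]

    def start_requests(self):
        # Earlier URLs get a higher priority, so they are scheduled first.
        # With concurrency above 1 responses can still arrive out of order,
        # so strict ordering also needs CONCURRENT_REQUESTS = 1.
        for i, url in enumerate(self.urls):
            yield scrapy.Request(url, callback=self.parse, priority=len(self.urls) - i)

    def parse(self, response):
        yield {'url': response.url}
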
Scrapy Shell and Scrapy Splash (27 votes, 3 answers)

We've been using scrapy-splash middleware to pass the scraped HTML source through the Splash javascript engine running inside a docker container. If we want to use Splash in the spider, we configure several required project settings and yield a…
Asked by alecxe

For scrapy/selenium is there a way to go back to a previous page? (27 votes, 3 answers)

I essentially have a start_url that has my javascript search form and button, hence the need for selenium. I use selenium to select the appropriate items in my select box objects and click the search button. On the following page, I do some scrapy…
Asked by petermaxstack

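On the Selenium side, going back is a single call; a minimal sketch (browser choice and URLs are illustrative):

from selenium import webdriver

driver = webdriver.Firefox()                       # any WebDriver works here
driver.get('http://quotes.toscrape.com/')
driver.get('http://quotes.toscrape.com/page/2/')   # move to another page
driver.back()                                      # return to the previous page
print(driver.current_url)                          # the first URL again
driver.quit()
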
How to disable or change the path of ghostdriver.log? (27 votes, 2 answers)

The question is straightforward, but some context may help. I'm trying to deploy scrapy while using selenium and phantomjs as the downloader. But the problem is that it keeps saying permission denied when trying to deploy. So I want to change the path of…
Asked by Sam Stoelinga

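With older Selenium releases that still ship the PhantomJS driver, the log location is controlled by the service_log_path argument; a hedged sketch:

import os
from selenium import webdriver

# Point the ghostdriver log somewhere writable, or at os.devnull to discard it.
driver = webdriver.PhantomJS(service_log_path=os.devnull)
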
How to drop a collection with pymongo? (26 votes, 1 answer)

I use Scrapy to crawl data and save it successfully to cloud-hosted mLab with MongoDB. My collection name is recently and its data count is 5. I want to crawl the data again and update my collection recently, so I try to drop the collection and then…
Asked by Morton

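Dropping a collection with pymongo is a one-liner; a sketch using the collection name from the question (connection details and database name are illustrative):

from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017/')
db = client['mydatabase']

db['recently'].drop()            # drops the 'recently' collection
# or, equivalently:
db.drop_collection('recently')
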
ScrapyRT vs Scrapyd (26 votes, 1 answer)

We've been using the Scrapyd service for a while up until now. It provides a nice wrapper around a scrapy project and its spiders, letting you control the spiders via an HTTP API: "Scrapyd is a service for running Scrapy spiders. It allows you to deploy…
Asked by alecxe

Geopy: catch timeout error (26 votes, 4 answers)

I am using geopy to geocode some addresses and I want to catch the timeout errors and print them out so I can do some quality control on the input. I am putting the geocode request in a try/catch but it's not working. Any ideas on what I need to do?…
Asked by MoreScratch

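geopy raises GeocoderTimedOut on timeouts, so the except clause has to name that exception; a sketch (the geocoder choice and user agent are assumptions):

from geopy.geocoders import Nominatim
from geopy.exc import GeocoderTimedOut

geolocator = Nominatim(user_agent='my-geocoder')


def geocode_or_report(address):
    try:
        return geolocator.geocode(address, timeout=10)
    except GeocoderTimedOut:
        # Print the failing input so it can be reviewed later.
        print('Timed out while geocoding: %r' % address)
        return None
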
Can Scrapy be replaced by pyspider? (26 votes, 2 answers)

I've been using the Scrapy web-scraping framework pretty extensively, but recently I've discovered that there is another framework/system called pyspider, which, according to its github page, is fresh, actively developed and popular. pyspider's home…
Asked by alecxe

scrapy from script output in json (26 votes, 4 answers)

I am running scrapy in a python script: def setup_crawler(domain): dispatcher.connect(stop_reactor, signal=signals.spider_closed) spider = ArgosSpider(domain=domain) settings = get_project_settings() crawler = Crawler(settings) …
Asked by Wasif Khalil

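In current Scrapy versions this is usually done with CrawlerProcess and the FEEDS setting rather than the older Crawler/dispatcher API shown in the question; a sketch reusing the spider from the example further up this page:

from scrapy.crawler import CrawlerProcess

from quotes_spider import QuotesSpider   # the spider file from the example above

process = CrawlerProcess(settings={
    # Write all scraped items to result.json in JSON format.
    'FEEDS': {'result.json': {'format': 'json'}},
})
process.crawl(QuotesSpider)
process.start()   # blocks until the crawl is finished
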
Scrapy Very Basic Example (26 votes, 2 answers)

Hi, I have Python Scrapy installed on my mac and I was trying to follow the very first example on their website. They were trying to run the command: scrapy crawl mininova.org -o scraped_data.json -t json. I don't quite understand what this means…
Asked by B.Mr.W.