Questions tagged [scrapy]

Scrapy is an open-source, high-level screen scraping and web crawling framework written in Python, used to crawl websites and extract structured data from their pages. Rather than using multiple threads, it is built on an event-driven, asynchronous networking engine (Twisted). It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Features of Scrapy include:

  • Designed with simplicity in mind
  • Only need to write the rules to extract the data from web pages and let Scrapy crawl the entire web site for you
  • Designed with extensibility in mind and so it provides several mechanisms to plug new code without having to touch the framework core
  • Portable, open-source, 100% Python
  • Written in Python and runs on Linux, Windows, macOS, and BSD.

History:

Scrapy was born at London-based web-aggregation and e-commerce company Mydeco, where it was developed and maintained by employees of Mydeco and Insophia (a web-consulting company based in Montevideo, Uruguay). The first public release was in August 2008 under the BSD license, with a milestone 1.0 release happening in June 2015. In 2011, Zyte (formerly Scrapinghub) became the new official maintainer.


Installing Scrapy

You can install Scrapy and its dependencies from PyPI with:

pip install Scrapy

or to install Scrapy using conda, run:

conda install -c conda-forge scrapy
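To confirm the installation succeeded, one quick check is to look the package up from Python itself (a minimal sketch; it reports a hint instead of failing if Scrapy is absent):

```python
import importlib.util

# Probe for the installed scrapy package without assuming it is present.
spec = importlib.util.find_spec("scrapy")
if spec is None:
    message = "Scrapy is not installed in this environment"
else:
    import scrapy
    message = f"Scrapy {scrapy.__version__}"
print(message)
```

Alternatively, running `scrapy version` on the command line prints the same information.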

Example

Here’s the code for a spider that scrapes famous quotes from the website http://quotes.toscrape.com, following the pagination:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Put this in a text file, name it something like quotes_spider.py, and run the spider using the runspider command:

scrapy runspider quotes_spider.py -o quotes.json
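With a .json extension, the -o option serializes the yielded items into a single JSON array. As a sketch of consuming that output (using a hypothetical sample record in place of a real crawl):

```python
import json

# Sample records in the shape the spider above yields; a real run of
# `scrapy runspider quotes_spider.py -o quotes.json` would produce a
# JSON array of such objects.
sample = [
    {"text": "A day without sunshine is like, you know, night.",
     "author": "Steve Martin"},
]
with open("quotes.json", "w") as f:
    json.dump(sample, f)

# Read the export back the way any downstream consumer would.
with open("quotes.json") as f:
    quotes = json.load(f)
print(quotes[0]["author"])  # Steve Martin
```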



Architecture

Scrapy contains multiple components working together in an event-driven architecture. The main components are the Engine, Spiders, the Scheduler, and the Downloader. The data flow between these components is described in detail in the official documentation.
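As a rough illustration (a toy model, not Scrapy's actual code), the Engine's loop can be thought of as: take a request from the Scheduler, hand it to the Downloader, pass the response to the Spider's callback, and route whatever the Spider yields — items go onward, new requests go back to the Scheduler:

```python
from collections import deque

# Toy model of Scrapy's data flow; the names mirror the real components
# but the implementations are stand-ins.
def downloader(url):
    # Stand-in for a real HTTP fetch.
    return f"<html>page at {url}</html>"

def parse(url, body):
    # Stand-in Spider callback: yields one item per page, plus a
    # follow-up request from the first page.
    yield {"url": url, "length": len(body)}       # an item
    if url == "http://example.com/1":
        yield "http://example.com/2"              # a new request

scheduler = deque(["http://example.com/1"])       # Scheduler queue
items = []
while scheduler:                                  # Engine loop
    url = scheduler.popleft()                     # Scheduler -> Engine
    body = downloader(url)                        # Engine -> Downloader
    for result in parse(url, body):               # Response -> Spider
        if isinstance(result, dict):
            items.append(result)                  # item -> pipelines
        else:
            scheduler.append(result)              # request -> Scheduler

print(len(items))  # 2
```

Real Scrapy does all of this asynchronously on top of Twisted, with middleware hooks between each pair of components.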


Online resources:

17743 questions

26 votes, 2 answers — how to implement nested item in scrapy?
I am scraping some data with complex hierarchical info and need to export the result to json. I defined the items as class FamilyItem(): name = Field() sons = Field() class SonsItem(): name = Field() grandsons = Field() class…
— Shadow Lau

25 votes, 4 answers — How to integrate Flask & Scrapy?
I'm using scrapy to get data and I want to use flask web framework to show the results in webpage. But I don't know how to call the spiders in the flask app. I've tried to use CrawlerProcess to call my spiders,but I got the error like this…
— Coding_Rabbit

25 votes, 1 answer — Using Scrapy to find and download pdf files from a website
I've been tasked with pulling pdf files from websites using Scrapy. I'm not new to Python, but Scrapy is a very new to me. I've been experimenting with the console and a few rudimentary spiders. I've found and modified this code: import…
— Murface

25 votes, 2 answers — signal only works in main thread
i am new to django. I am trying to run my scrapy spider through django view. My scrapy code works perfectly when i run through command prompt. but when I try to run it on django it fails. The error message: signal only works in main thread. my code…
— Jijo

25 votes, 2 answers — How to use CrawlSpider from scrapy to click a link with javascript onclick?
I want scrapy to crawl pages where going on to the next link looks like this: Next Will scrapy be able to interpret javascript code of that? With livehttpheaders extension I found out that clicking…
— ria

25 votes, 3 answers — How can I get all the plain text from a website with Scrapy?
I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework. With xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for…
— tomasyany

25 votes, 8 answers — Scrapy crawler in Cron job
I want to execute my scrapy crawler from cron job . i create bash file getdata.sh where scrapy project is located with it's spiders #!/bin/bash cd /myfolder/crawlers/ scrapy crawl my_spider_name My crontab looks like this , I want to execute it in…
— Beka Tomashvili

25 votes, 4 answers — scrapy- how to stop Redirect (302)
I'm trying to crawl a url using Scrapy. But it redirects me to page that doesn't exist. Redirecting (302) to
— user_2000

25 votes, 4 answers — How to setup and launch a Scrapy spider programmatically (urls and settings)
I've written a working crawler using scrapy, now I want to control it through a Django webapp, that is to say: Set 1 or several start_urls Set 1 or several allowed_domains Set settings values Start the spider Stop / pause / resume a…
— arno

24 votes, 2 answers — How do I use the Python Scrapy module to list all the URLs from my website?
I want to use the Python Scrapy module to scrape all the URLs from my website and write the list to a file. I looked in the examples but didn't see any simple example to do this.
— Adam F

24 votes, 2 answers — Scrapy - parse a page to extract items - then follow and store item url contents
I have a question on how to do this thing in scrapy. I have a spider that crawls for listing pages of items. Every time a listing page is found, with items, there's the parse_item() callback that is called for extracting items data, and yielding…
— StefanH

24 votes, 3 answers — ImportError: No module named win32api while using Scrapy
I am a new learner of Scrapy. I installed python 2.7 and all other engines needed. Then I tried to build a Scrapy project following the tutorial http://doc.scrapy.org/en/latest/intro/tutorial.html. In the crawling step, after I typed scrapy crawl…
— 李皓伟

24 votes, 4 answers — scrapy run spider from script
I want to run my spider from a script rather than a scrap crawl I found this page http://doc.scrapy.org/en/latest/topics/practices.html but actually it doesn't say where to put that script. any help please?
— Marco Dinatsoli

24 votes, 1 answer — What is the difference between Scrapy's spider middleware and downloader middleware?
Both middleware can process Request and Response. But what is the difference?
— Zhang Jiuzhou

23 votes, 7 answers — Debugging Scrapy Project in Visual Studio Code
I have Visual Studio Code on a Windows Machine, on which I am making a new Scrapy Crawler. The crawler is working fine but I want to debug the code, for which I am adding this in my launch.json file: { "name": "Scrapy with Integrated…
— naqushab