Questions tagged [scrapy]

Scrapy is an open-source, high-level screen scraping and web crawling framework written in Python, used to crawl websites and extract structured data from their pages. Rather than using multiple threads, it is built on an event-driven, asynchronous networking engine (Twisted). It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Features of Scrapy include:

  • Designed with simplicity in mind
  • Only need to write the rules to extract the data from web pages and let Scrapy crawl the entire web site for you
  • Designed with extensibility in mind and so it provides several mechanisms to plug new code without having to touch the framework core
  • Portable, open-source, 100% Python
  • Written in Python and runs on Linux, Windows, macOS, and BSD.

History:

Scrapy was born at London-based web-aggregation and e-commerce company Mydeco, where it was developed and maintained by employees of Mydeco and Insophia (a web-consulting company based in Montevideo, Uruguay). The first public release was in August 2008 under the BSD license, with a milestone 1.0 release happening in June 2015. In 2011, Zyte (formerly Scrapinghub) became the new official maintainer.


Installing Scrapy

You can install Scrapy and its dependencies from PyPI with:

pip install Scrapy

or to install Scrapy using conda, run:

conda install -c conda-forge scrapy
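To confirm the installation succeeded, one quick check is to look the package up from Python itself (a minimal sketch; it reports a hint instead of failing if Scrapy is absent):

```python
import importlib.util

# Probe for the installed scrapy package without assuming it is present.
spec = importlib.util.find_spec("scrapy")
if spec is None:
    message = "Scrapy is not installed in this environment"
else:
    import scrapy
    message = f"Scrapy {scrapy.__version__}"
print(message)
```

Alternatively, running `scrapy version` on the command line prints the same information.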

Example

Here’s the code for a spider that scrapes famous quotes from the website http://quotes.toscrape.com, following the pagination:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Put this in a text file, name it something like quotes_spider.py, and run the spider using the runspider command:

scrapy runspider quotes_spider.py -o quotes.json
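With a .json extension, the -o option serializes the yielded items into a single JSON array. As a sketch of consuming that output (using a hypothetical sample record in place of a real crawl):

```python
import json

# Sample records in the shape the spider above yields; a real run of
# `scrapy runspider quotes_spider.py -o quotes.json` would produce a
# JSON array of such objects.
sample = [
    {"text": "A day without sunshine is like, you know, night.",
     "author": "Steve Martin"},
]
with open("quotes.json", "w") as f:
    json.dump(sample, f)

# Read the export back the way any downstream consumer would.
with open("quotes.json") as f:
    quotes = json.load(f)
print(quotes[0]["author"])  # Steve Martin
```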



Architecture

Scrapy contains multiple components working together in an event-driven architecture. The main components are the Engine, Spiders, the Scheduler, and the Downloader. The data flow between these components is described in detail in the official documentation.
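As a rough illustration (a toy model, not Scrapy's actual code), the Engine's loop can be thought of as: take a request from the Scheduler, hand it to the Downloader, pass the response to the Spider's callback, and route whatever the Spider yields — items go onward, new requests go back to the Scheduler:

```python
from collections import deque

# Toy model of Scrapy's data flow; the names mirror the real components
# but the implementations are stand-ins.
def downloader(url):
    # Stand-in for a real HTTP fetch.
    return f"<html>page at {url}</html>"

def parse(url, body):
    # Stand-in Spider callback: yields one item per page, plus a
    # follow-up request from the first page.
    yield {"url": url, "length": len(body)}       # an item
    if url == "http://example.com/1":
        yield "http://example.com/2"              # a new request

scheduler = deque(["http://example.com/1"])       # Scheduler queue
items = []
while scheduler:                                  # Engine loop
    url = scheduler.popleft()                     # Scheduler -> Engine
    body = downloader(url)                        # Engine -> Downloader
    for result in parse(url, body):               # Response -> Spider
        if isinstance(result, dict):
            items.append(result)                  # item -> pipelines
        else:
            scheduler.append(result)              # request -> Scheduler

print(len(items))  # 2
```

Real Scrapy does all of this asynchronously on top of Twisted, with middleware hooks between each pair of components.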


Online resources:

17743 questions

26 votes, 2 answers — how to implement nested item in scrapy?
I am scraping some data with complex hierarchical info and need to export the result to json. I defined the items as class FamilyItem(): name = Field() sons = Field() class SonsItem(): name = Field() grandsons = Field() class…
— Shadow Lau

25 votes, 4 answers — How to integrate Flask & Scrapy?
I'm using scrapy to get data and I want to use flask web framework to show the results in webpage. But I don't know how to call the spiders in the flask app. I've tried to use CrawlerProcess to call my spiders,but I got the error like this…
— Coding_Rabbit

25 votes, 1 answer — Using Scrapy to find and download pdf files from a website
I've been tasked with pulling pdf files from websites using Scrapy. I'm not new to Python, but Scrapy is a very new to me. I've been experimenting with the console and a few rudimentary spiders. I've found and modified this code: import…
— Murface

25 votes, 2 answers — signal only works in main thread
i am new to django. I am trying to run my scrapy spider through django view. My scrapy code works perfectly when i run through command prompt. but when I try to run it on django it fails. The error message: signal only works in main thread. my code…
— Jijo

25 votes, 2 answers — How to use CrawlSpider from scrapy to click a link with javascript onclick?
I want scrapy to crawl pages where going on to the next link looks like this: Next Will scrapy be able to interpret javascript code of that? With livehttpheaders extension I found out that clicking…
— ria

25 votes, 3 answers — How can I get all the plain text from a website with Scrapy?
I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework. With xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for…
— tomasyany

25 votes, 8 answers — Scrapy crawler in Cron job
I want to execute my scrapy crawler from cron job . i create bash file getdata.sh where scrapy project is located with it's spiders #!/bin/bash cd /myfolder/crawlers/ scrapy crawl my_spider_name My crontab looks like this , I want to execute it in…
— Beka Tomashvili

25 votes, 4 answers — scrapy- how to stop Redirect (302)
I'm trying to crawl a url using Scrapy. But it redirects me to page that doesn't exist. Redirecting (302) to
— user_2000

25 votes, 4 answers — How to setup and launch a Scrapy spider programmatically (urls and settings)
I've written a working crawler using scrapy, now I want to control it through a Django webapp, that is to say: Set 1 or several start_urls Set 1 or several allowed_domains Set settings values Start the spider Stop / pause / resume a…
— arno

24 votes, 2 answers — How do I use the Python Scrapy module to list all the URLs from my website?
I want to use the Python Scrapy module to scrape all the URLs from my website and write the list to a file. I looked in the examples but didn't see any simple example to do this.
— Adam F

24 votes, 2 answers — Scrapy - parse a page to extract items - then follow and store item url contents
I have a question on how to do this thing in scrapy. I have a spider that crawls for listing pages of items. Every time a listing page is found, with items, there's the parse_item() callback that is called for extracting items data, and yielding…
— StefanH

24 votes, 3 answers — ImportError: No module named win32api while using Scrapy
I am a new learner of Scrapy. I installed python 2.7 and all other engines needed. Then I tried to build a Scrapy project following the tutorial http://doc.scrapy.org/en/latest/intro/tutorial.html. In the crawling step, after I typed scrapy crawl…
— 李皓伟

24 votes, 4 answers — scrapy run spider from script
I want to run my spider from a script rather than a scrap crawl I found this page http://doc.scrapy.org/en/latest/topics/practices.html but actually it doesn't say where to put that script. any help please?
— Marco Dinatsoli

24 votes, 1 answer — What is the difference between Scrapy's spider middleware and downloader middleware?
Both middleware can process Request and Response. But what is the difference?
— Zhang Jiuzhou

23 votes, 7 answers — Debugging Scrapy Project in Visual Studio Code
I have Visual Studio Code on a Windows Machine, on which I am making a new Scrapy Crawler. The crawler is working fine but I want to debug the code, for which I am adding this in my launch.json file: { "name": "Scrapy with Integrated…
— naqushab