Questions tagged [scrapy]

Scrapy is an open-source, high-level screen scraping and web crawling framework written in Python, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing. (Note that Scrapy is built on Twisted and is asynchronous and event-driven rather than multi-threaded.)

Scrapy is a fast, high-level screen scraping and web crawling framework used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing. Features of Scrapy include:

  • Designed with simplicity in mind
  • You only need to write the rules to extract the data from web pages; Scrapy crawls the entire website for you
  • Designed with extensibility in mind, providing several mechanisms to plug in new code without touching the framework core
  • Portable, open-source, 100% Python
  • Written in Python and runs on Linux, Windows, macOS, and BSD.
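The extensibility point can be made concrete with a small downloader middleware. The process_request hook and the DOWNLOADER_MIDDLEWARES setting are standard Scrapy APIs, but the middleware class and project path below are hypothetical, a minimal sketch rather than production code:

```python
import logging

# Hypothetical middleware: logs each outgoing request without touching the
# framework core. In a real project it would be enabled via settings.py, e.g.:
# DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.RequestLogMiddleware': 543}

class RequestLogMiddleware:
    def process_request(self, request, spider):
        # Log the URL Scrapy is about to fetch.
        spider.logger.info('Fetching %s', request.url)
        return None  # returning None lets Scrapy continue handling the request
```

Scrapy calls process_request for every request passing through the downloader; similar hooks exist for responses, spider output, and item processing.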

History:

Scrapy was born at London-based web-aggregation and e-commerce company Mydeco, where it was developed and maintained by employees of Mydeco and Insophia (a web-consulting company based in Montevideo, Uruguay). The first public release was in August 2008 under the BSD license, with a milestone 1.0 release happening in June 2015. In 2011, Zyte (formerly Scrapinghub) became the new official maintainer.


Installing Scrapy

You can install Scrapy and its dependencies from PyPI with:

pip install Scrapy

or to install Scrapy using conda, run:

conda install -c conda-forge scrapy

Example

Here’s the code for a spider that scrapes famous quotes from the website http://quotes.toscrape.com, following the pagination:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Put this in a text file, name it something like quotes_spider.py and run the spider using the runspider command:

scrapy runspider quotes_spider.py -o quotes.json



Architecture

Scrapy contains multiple components working together in an event-driven architecture. The main components are the Engine, the Spiders, the Scheduler, and the Downloader. The data flow between these components is described in detail in the official documentation.



17743 questions

21 votes, 7 answers: Scrapy, Python: Multiple Item Classes in one pipeline?
I have a Spider that scrapes data which cannot be saved in one item class. For illustration, I have one Profile Item, and each Profile Item might have an unknown number of Comments. That is why I want to implement Profile Item and Comment Item. I…
asked by Nina

21 votes, 2 answers: How do I catch errors with scrapy so I can do something when I get User Timeout error?
ERROR: Error downloading : User timeout caused connection failure. I get this issue every now and then when using my scraper. Is there a way I can catch this issue and run a function when it happens? I can't find out how to do it…
asked by Ryan Weinstein

21 votes, 2 answers: Scrapy: Extract links and text
I am new to scrapy and I am trying to scrape the Ikea website webpage. The basic page with the list of locations as given here. My items.py file is given below: import scrapy class IkeaItem(scrapy.Item): name = scrapy.Field() link =…
asked by praxmon

21 votes, 4 answers: Export csv file from scrapy (not via command line)
I successfully tried to export my items into a csv file from the command line like: scrapy crawl spiderName -o filename.csv My question is: What is the easiest solution to do the same in the code? I need this as i extract the filename from…
asked by Chris

21 votes, 2 answers: TypeError: '_sre.SRE_Match' object has no attribute '__getitem__'
I'm currently getting this error and don't know what is means. Its a scrapy python project, this is the error I'm seeing: File "/bp_scraper/bp_scraper/httpmiddleware.py", line 22, in from_crawler return cls(crawler.settings) File…
asked by user3403945

21 votes, 3 answers: How can I use the fields_to_export attribute in BaseItemExporter to order my Scrapy CSV data?
I have made a simple Scrapy spider that I use from the command line to export my data into the CSV format, but the order of the data seem random. How can I order the CSV fields in my output? I use the following command line to get CSV data: scrapy…
asked by not2qubit

21 votes, 3 answers: Scrapy: Passing item between methods
Suppose I have a Bookitem, I need to add information to it in both the parse phase and detail phase def parse(self, response) data = json.loads(response) for book in data['result']: item = BookItem(); item['id'] = book['id'] …
asked by Dionysian

21 votes, 5 answers: How can scrapy export items to separate csv files per item
I am scraping a soccer site and the spider (a single spider) gets several kinds of items from the site's pages: Team, Match, Club etc. I am trying to use the CSVItemExporter to store these items in separate csv files, teams.csv, matches.csv,…
asked by Diomedes

21 votes, 3 answers: Writing items to a MySQL database in Scrapy
I am new to Scrapy, I had the spider code class Example_spider(BaseSpider): name = "example" allowed_domains = ["www.example.com"] def start_requests(self): yield self.make_requests_from_url("http://www.example.com/bookstore/new") …
asked by Shiva Krishna Bavandla

21 votes, 3 answers: how to overwrite / use cookies in scrapy
I want to scrap http://www.3andena.com/, this web site starts first in Arabic, and it stores the language settings in cookies. If you tried to access the language version directly through URL (http://www.3andena.com/home.php?sl=en), it makes a…
asked by Mahmoud M. Abdel-Fattah

21 votes, 6 answers: Scrapy: ImportError: No module named items
When I try to run scrapy I get this error ImportError: No module named items I just added in items.py the list of things I want to scrape and in the spider.py I have imported the class with from spider.items import SpiderItem Dont know why its not…
asked by jsjc

20 votes, 8 answers: Scrapy - logging to file and stdout simultaneously, with spider names
I've decided to use the Python logging module because the messages generated by Twisted on std error is too long, and I want to INFO level meaningful messages such as those generated by the StatsCollector to be written on a separate log file while…
asked by goh

20 votes, 5 answers: Python Scrapy: Convert relative paths to absolute paths
I have amended the code based on solutions offered below by the great folks here; I get the error shown below the code here. from scrapy.spider import BaseSpider from scrapy.selector import HtmlXPathSelector from scrapy.utils.response import…
asked by user818190

20 votes, 2 answers: How to bypass Incapsula with Python
I use Scrapy and I try to scrape this site that uses Incapsula I had already asked a Question about this issue …
asked by parik

20 votes, 1 answer: Scrapy .css select element with a specific attribute name and value
How can Scrapy be used to select the text of an element that has a particular attribute name and value? For example, Montreal I tried the following but received a…
asked by Nyxynyx