Questions tagged [scrapy]

Scrapy is an open-source, high-level screen scraping and web crawling framework written in Python, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing. (Note that Scrapy is built on Twisted and is asynchronous and event-driven rather than multi-threaded.)

Scrapy is a fast, high-level screen scraping and web crawling framework used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing. Features of Scrapy include:

  • Designed with simplicity in mind
  • You only need to write the rules to extract the data from web pages; Scrapy crawls the entire website for you
  • Designed with extensibility in mind, providing several mechanisms to plug in new code without touching the framework core
  • Portable, open-source, 100% Python
  • Written in Python and runs on Linux, Windows, macOS, and BSD.
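The extensibility point can be made concrete with a small downloader middleware. The process_request hook and the DOWNLOADER_MIDDLEWARES setting are standard Scrapy APIs, but the middleware class and project path below are hypothetical, a minimal sketch rather than production code:

```python
import logging

# Hypothetical middleware: logs each outgoing request without touching the
# framework core. In a real project it would be enabled via settings.py, e.g.:
# DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.RequestLogMiddleware': 543}

class RequestLogMiddleware:
    def process_request(self, request, spider):
        # Log the URL Scrapy is about to fetch.
        spider.logger.info('Fetching %s', request.url)
        return None  # returning None lets Scrapy continue handling the request
```

Scrapy calls process_request for every request passing through the downloader; similar hooks exist for responses, spider output, and item processing.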

History:

Scrapy was born at London-based web-aggregation and e-commerce company Mydeco, where it was developed and maintained by employees of Mydeco and Insophia (a web-consulting company based in Montevideo, Uruguay). The first public release was in August 2008 under the BSD license, with a milestone 1.0 release happening in June 2015. In 2011, Zyte (formerly Scrapinghub) became the new official maintainer.


Installing Scrapy

You can install Scrapy and its dependencies from PyPI with:

pip install Scrapy

or to install Scrapy using conda, run:

conda install -c conda-forge scrapy

Example

Here’s the code for a spider that scrapes famous quotes from the website http://quotes.toscrape.com, following the pagination:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Put this in a text file, name it something like quotes_spider.py and run the spider using the runspider command:

scrapy runspider quotes_spider.py -o quotes.json



Architecture

Scrapy contains multiple components working together in an event-driven architecture. The main components are the Engine, the Spiders, the Scheduler, and the Downloader. The data flow between these components is described in detail in the official documentation.



17743 questions

21 votes, 7 answers: Scrapy, Python: Multiple Item Classes in one pipeline?
I have a Spider that scrapes data which cannot be saved in one item class. For illustration, I have one Profile Item, and each Profile Item might have an unknown number of Comments. That is why I want to implement Profile Item and Comment Item. I…
asked by Nina

21 votes, 2 answers: How do I catch errors with scrapy so I can do something when I get User Timeout error?
ERROR: Error downloading : User timeout caused connection failure. I get this issue every now and then when using my scraper. Is there a way I can catch this issue and run a function when it happens? I can't find out how to do it…
asked by Ryan Weinstein

21 votes, 2 answers: Scrapy: Extract links and text
I am new to scrapy and I am trying to scrape the Ikea website webpage. The basic page with the list of locations as given here. My items.py file is given below: import scrapy class IkeaItem(scrapy.Item): name = scrapy.Field() link =…
asked by praxmon

21 votes, 4 answers: Export csv file from scrapy (not via command line)
I successfully tried to export my items into a csv file from the command line like: scrapy crawl spiderName -o filename.csv My question is: What is the easiest solution to do the same in the code? I need this as i extract the filename from…
asked by Chris

21 votes, 2 answers: TypeError: '_sre.SRE_Match' object has no attribute '__getitem__'
I'm currently getting this error and don't know what is means. Its a scrapy python project, this is the error I'm seeing: File "/bp_scraper/bp_scraper/httpmiddleware.py", line 22, in from_crawler return cls(crawler.settings) File…
asked by user3403945

21 votes, 3 answers: How can I use the fields_to_export attribute in BaseItemExporter to order my Scrapy CSV data?
I have made a simple Scrapy spider that I use from the command line to export my data into the CSV format, but the order of the data seem random. How can I order the CSV fields in my output? I use the following command line to get CSV data: scrapy…
asked by not2qubit

21 votes, 3 answers: Scrapy: Passing item between methods
Suppose I have a Bookitem, I need to add information to it in both the parse phase and detail phase def parse(self, response) data = json.loads(response) for book in data['result']: item = BookItem(); item['id'] = book['id'] …
asked by Dionysian

21 votes, 5 answers: How can scrapy export items to separate csv files per item
I am scraping a soccer site and the spider (a single spider) gets several kinds of items from the site's pages: Team, Match, Club etc. I am trying to use the CSVItemExporter to store these items in separate csv files, teams.csv, matches.csv,…
asked by Diomedes

21 votes, 3 answers: Writing items to a MySQL database in Scrapy
I am new to Scrapy, I had the spider code class Example_spider(BaseSpider): name = "example" allowed_domains = ["www.example.com"] def start_requests(self): yield self.make_requests_from_url("http://www.example.com/bookstore/new") …
asked by Shiva Krishna Bavandla

21 votes, 3 answers: how to overwrite / use cookies in scrapy
I want to scrap http://www.3andena.com/, this web site starts first in Arabic, and it stores the language settings in cookies. If you tried to access the language version directly through URL (http://www.3andena.com/home.php?sl=en), it makes a…
asked by Mahmoud M. Abdel-Fattah

21 votes, 6 answers: Scrapy: ImportError: No module named items
When I try to run scrapy I get this error ImportError: No module named items I just added in items.py the list of things I want to scrape and in the spider.py I have imported the class with from spider.items import SpiderItem Dont know why its not…
asked by jsjc

20 votes, 8 answers: Scrapy - logging to file and stdout simultaneously, with spider names
I've decided to use the Python logging module because the messages generated by Twisted on std error is too long, and I want to INFO level meaningful messages such as those generated by the StatsCollector to be written on a separate log file while…
asked by goh

20 votes, 5 answers: Python Scrapy: Convert relative paths to absolute paths
I have amended the code based on solutions offered below by the great folks here; I get the error shown below the code here. from scrapy.spider import BaseSpider from scrapy.selector import HtmlXPathSelector from scrapy.utils.response import…
asked by user818190

20 votes, 2 answers: How to bypass Incapsula with Python
I use Scrapy and I try to scrape this site that uses Incapsula I had already asked a Question about this issue …
asked by parik

20 votes, 1 answer: Scrapy .css select element with a specific attribute name and value
How can Scrapy be used to select the text of an element that has a particular attribute name and value? For example, Montreal I tried the following but received a…
asked by Nyxynyx