Questions tagged [scrapy]

Scrapy is a fast, open-source, high-level web crawling and web scraping framework written in Python, used to crawl websites and extract structured data from their pages. It is built on the Twisted asynchronous networking library, so it handles many requests concurrently in a single thread rather than with multi-threading. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Features of Scrapy include:

  • Designed with simplicity in mind
  • You only need to write the rules to extract the data from web pages; Scrapy crawls the entire website for you
  • Designed with extensibility in mind, providing several mechanisms to plug in new code without touching the framework core (see the middleware sketch after this list)
  • Portable, open-source, 100% Python
  • Written in Python and runs on Linux, Windows, macOS, and BSD
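
As an example of that extensibility, here is a minimal downloader-middleware sketch; the class name, the header it sets, and the module path are illustrative, not part of Scrapy itself:

class CustomHeaderMiddleware:
    # Downloader middleware that stamps each outgoing request with an extra header.
    def process_request(self, request, spider):
        request.headers.setdefault('X-Crawled-By', spider.name)
        return None  # returning None lets Scrapy continue processing the request

To enable it, register the class in your project's DOWNLOADER_MIDDLEWARES setting, e.g. {'myproject.middlewares.CustomHeaderMiddleware': 543}; the module path and priority value are assumptions about your project layout.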

History:

Scrapy was born at Mydeco, a London-based web-aggregation and e-commerce company, where it was developed and maintained by employees of Mydeco and Insophia (a web-consulting company based in Montevideo, Uruguay). The first public release was in August 2008 under the BSD license. In 2011, Scrapinghub (now Zyte) became the new official maintainer, and the milestone 1.0 release followed in June 2015.


Installing Scrapy

You can install Scrapy and its dependencies from PyPI with:

pip install Scrapy

or to install Scrapy using conda, run:

conda install -c conda-forge scrapy
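
To verify the installation, and to scaffold a full project rather than a single-file spider, you can use the standard Scrapy command-line tools (the project name myproject below is just a placeholder):

scrapy version                   # print the installed Scrapy version
scrapy startproject myproject    # generate a project skeleton in ./myproject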

Example

Here’s the code for a spider that scrapes famous quotes from the website http://quotes.toscrape.com, following the pagination:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        # Extract each quote block on the current page
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }

        # Follow the "Next" pagination link, if there is one
        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Put this in a text file, name it something like quotes_spider.py, and run the spider using the runspider command:

scrapy runspider quotes_spider.py -o quotes.json
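
When the run finishes, quotes.json should contain a JSON list of objects with text and author keys, one per scraped quote; the record below only illustrates the shape, not the exact values:

[
    {"text": "“The person, be it gentleman or lady, …”", "author": "Jane Austen"},
    ...
]

Note that on Scrapy 2.0 and later, -o appends to an existing output file; pass -O (capital) instead to overwrite it.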



Architecture

Scrapy contains multiple components working together in an event-driven architecture. The main components are the Engine, Spiders, the Scheduler, and the Downloader. The data flow between these components is described in detail in the official architecture documentation (https://docs.scrapy.org/en/latest/topics/architecture.html).
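
The same spider can also be driven programmatically, which makes the component roles visible: CrawlerProcess boots the Twisted reactor, and the Engine then routes requests from the Spider through the Scheduler to the Downloader and feeds responses back. A minimal sketch, assuming the QuotesSpider above is importable from quotes_spider.py:

from scrapy.crawler import CrawlerProcess

from quotes_spider import QuotesSpider  # the example spider above; the import path is an assumption

process = CrawlerProcess(settings={
    'FEEDS': {'quotes.json': {'format': 'json'}},  # same output as the -o option above
})
process.crawl(QuotesSpider)  # schedule the spider; the Engine wires up Scheduler and Downloader
process.start()              # start the reactor; blocks until the crawl finishes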



17,743 questions. A sample:

  • python connect signal not being called (3 votes, 1 answer)
    I have below file and code import logging from scrapy import signals from scrapy.exceptions import NotConfigured logger = logging.getLogger(__name__) class SpiderOpenCloseLogging: def __init__(self, item_count): self.item_count =…

  • Why can't I get cookie value in Playwright? (3 votes, 2 answers) — SSjoewvv
    Firstly, sry for my poor English I want to use playwright to get the cookie, but I can't. I tried 3 ways I've found, and got nothing. Using page.on page.on('request',get_cookie) page.on('response',get_cookie) def get_cookie(request): …

  • How to check if text is Japanese Hiragana in Python? (3 votes, 3 answers) — Shojib Hasan
    I'm making a web crawler using python scrapy to collect text from websites. I only want to collect Japanese Hiragana text. Is there a solution to detect Japanese Hiragana text?

  • Scrapy only scraping and crawling HTML and TXT (3 votes, 2 answers)
    For learning purposes, I've been trying to recursively crawl and scrape all URLs on https://triniate.com/images/, but it seems that Scrapy only wants to crawl and scrape TXT, HTML, and PHP URLs. Here is my spider code from scrapy.spiders import…

  • Scrapy spider closing after first request to start_urls (3 votes, 1 answer) — dovexz12323
    I am running my spider in the same structure as my other ones, but for this specific website and this specific spider, it closes after the very first request to starting url. What could possibly be the problem? Terminal Output: ... 2022-04-03…

  • Add Scrapy data to csv without the header row (3 votes, 1 answer) — jack
    We have a local website that tracks the number of people using a certain license. I have create a scraper with that should run every hour. The only issue I have it's creating data that looks like…

  • twisted.internet.error.ReactorAlreadyInstalledError: reactor already installed (3 votes, 4 answers)
    I am having this error when I run a crawl process multiples times. I am using scrapy 2.6 This is my code: from scrapy.crawler import CrawlerProcess from football.spiders.laliga import LaligaSpider from scrapy.utils.project import…

  • Deploy Scrapy Project with Streamlit (3 votes, 2 answers)
    I have a scrapy spider that scrapes products information from amazon based on the product link. I want to deploy this project with streamlit and take the product link as web input, and product information as output data on the web. I don't know alot…

  • Problem logging into Facebook with Scrapy (3 votes, 1 answer) — Cygorger
    (I have asked this question on the Scrapy google-group without luck.) I am trying to log into Facebook using Scrapy. I tried the following in the interactive shell: I set the headers and created a request as follows: header_vals={'Accept-Language':…

  • How do we configure webshare proxy with api key in scrapy and also make use of scrapy-proxy-pool? (3 votes, 0 answers)
    I have webshare proxy API; and would like to use it in a scrapy script. Which all configuration changes I will need to make in my script files as well enable it to make use of scrapy-proxy-pool also.

  • Yielding values from consecutive parallel parse functions via meta in Scrapy (3 votes, 1 answer) — avakado0
    In my scrapy code I'm trying to yield the following figures from parliament's website where all the members of parliament (MPs) are listed. Opening the links for each MP, I'm making parallel requests to get the figures I'm trying to count. I'm…

  • scrapy-playwright:- Downloader/handlers: scrapy.exceptions.NotSupported: AsyncioSelectorReactor (3 votes, 3 answers) — user17063618
    I tried to extract some data from dynamically loaded javascript website using scrapy-playwright but I stuck at the very beginning. From where I'm facing trubles in settings.py file is as follows: #playwright DOWNLOAD_HANDLERS = { "http":…

  • How to scrape site protected by cloudfare (3 votes, 2 answers) — nfon jeannoel
    So I'm trying to scrape https://craft.co/tesla When I visit from the browser, it opens correctly. However, when I use scrapy, it fetches the site but when I view the response, view(response) It shows the cloudfare site instead of the actual…

  • Scrapy parse function not called (3 votes, 2 answers) — Ele975
    I have this simply code: import scrapy import re import json # from scrapy.http import FormRequest from scrapy.spiders import CrawlSpider, Rule from scrapy.linkextractors import LinkExtractor class SpiderRecipe(CrawlSpider): name = "recipe" …

  • Scrapy crawls duplicate data (3 votes, 2 answers) — SyrixGG
    unfortunately I currently have a problem with Scrapy. I am still new to Scrapy and would like to scrap information on Rolex watches. I started with the site Watch.de, where I first go through the Rolex site and want to open the individual watches to…