Questions tagged [scrapy]

Scrapy is a fast, open-source, high-level web crawling and web scraping framework written in Python, used to crawl websites and extract structured data from their pages. It is built on the Twisted asynchronous networking library, so it handles many requests concurrently in a single thread rather than with multi-threading. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Features of Scrapy include:

  • Designed with simplicity in mind
  • You only need to write the rules to extract the data from web pages; Scrapy crawls the entire website for you
  • Designed with extensibility in mind, providing several mechanisms to plug in new code without touching the framework core (see the middleware sketch after this list)
  • Portable, open-source, 100% Python
  • Written in Python and runs on Linux, Windows, macOS, and BSD
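
As an example of that extensibility, here is a minimal downloader-middleware sketch; the class name, the header it sets, and the module path are illustrative, not part of Scrapy itself:

class CustomHeaderMiddleware:
    # Downloader middleware that stamps each outgoing request with an extra header.
    def process_request(self, request, spider):
        request.headers.setdefault('X-Crawled-By', spider.name)
        return None  # returning None lets Scrapy continue processing the request

To enable it, register the class in your project's DOWNLOADER_MIDDLEWARES setting, e.g. {'myproject.middlewares.CustomHeaderMiddleware': 543}; the module path and priority value are assumptions about your project layout.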

History:

Scrapy was born at Mydeco, a London-based web-aggregation and e-commerce company, where it was developed and maintained by employees of Mydeco and Insophia (a web-consulting company based in Montevideo, Uruguay). The first public release was in August 2008 under the BSD license. In 2011, Scrapinghub (now Zyte) became the new official maintainer, and the milestone 1.0 release followed in June 2015.


Installing Scrapy

You can install Scrapy and its dependencies from PyPI with:

pip install Scrapy

or to install Scrapy using conda, run:

conda install -c conda-forge scrapy
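
To verify the installation, and to scaffold a full project rather than a single-file spider, you can use the standard Scrapy command-line tools (the project name myproject below is just a placeholder):

scrapy version                   # print the installed Scrapy version
scrapy startproject myproject    # generate a project skeleton in ./myproject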

Example

Here’s the code for a spider that scrapes famous quotes from the website http://quotes.toscrape.com, following the pagination:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        # Extract each quote block on the current page
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }

        # Follow the "Next" pagination link, if there is one
        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Put this in a text file, name it something like quotes_spider.py, and run the spider using the runspider command:

scrapy runspider quotes_spider.py -o quotes.json
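
When the run finishes, quotes.json should contain a JSON list of objects with text and author keys, one per scraped quote; the record below only illustrates the shape, not the exact values:

[
    {"text": "“The person, be it gentleman or lady, …”", "author": "Jane Austen"},
    ...
]

Note that on Scrapy 2.0 and later, -o appends to an existing output file; pass -O (capital) instead to overwrite it.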



Architecture

Scrapy contains multiple components working together in an event-driven architecture. The main components are the Engine, Spiders, the Scheduler, and the Downloader. The data flow between these components is described in detail in the official architecture documentation (https://docs.scrapy.org/en/latest/topics/architecture.html).
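
The same spider can also be driven programmatically, which makes the component roles visible: CrawlerProcess boots the Twisted reactor, and the Engine then routes requests from the Spider through the Scheduler to the Downloader and feeds responses back. A minimal sketch, assuming the QuotesSpider above is importable from quotes_spider.py:

from scrapy.crawler import CrawlerProcess

from quotes_spider import QuotesSpider  # the example spider above; the import path is an assumption

process = CrawlerProcess(settings={
    'FEEDS': {'quotes.json': {'format': 'json'}},  # same output as the -o option above
})
process.crawl(QuotesSpider)  # schedule the spider; the Engine wires up Scheduler and Downloader
process.start()              # start the reactor; blocks until the crawl finishes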



17,743 questions. A sample:

  • python connect signal not being called (3 votes, 1 answer)
    I have below file and code import logging from scrapy import signals from scrapy.exceptions import NotConfigured logger = logging.getLogger(__name__) class SpiderOpenCloseLogging: def __init__(self, item_count): self.item_count =…

  • Why can't I get cookie value in Playwright? (3 votes, 2 answers) — SSjoewvv
    Firstly, sry for my poor English I want to use playwright to get the cookie, but I can't. I tried 3 ways I've found, and got nothing. Using page.on page.on('request',get_cookie) page.on('response',get_cookie) def get_cookie(request): …

  • How to check if text is Japanese Hiragana in Python? (3 votes, 3 answers) — Shojib Hasan
    I'm making a web crawler using python scrapy to collect text from websites. I only want to collect Japanese Hiragana text. Is there a solution to detect Japanese Hiragana text?

  • Scrapy only scraping and crawling HTML and TXT (3 votes, 2 answers)
    For learning purposes, I've been trying to recursively crawl and scrape all URLs on https://triniate.com/images/, but it seems that Scrapy only wants to crawl and scrape TXT, HTML, and PHP URLs. Here is my spider code from scrapy.spiders import…

  • Scrapy spider closing after first request to start_urls (3 votes, 1 answer) — dovexz12323
    I am running my spider in the same structure as my other ones, but for this specific website and this specific spider, it closes after the very first request to starting url. What could possibly be the problem? Terminal Output: ... 2022-04-03…

  • Add Scrapy data to csv without the header row (3 votes, 1 answer) — jack
    We have a local website that tracks the number of people using a certain license. I have create a scraper with that should run every hour. The only issue I have it's creating data that looks like…

  • twisted.internet.error.ReactorAlreadyInstalledError: reactor already installed (3 votes, 4 answers)
    I am having this error when I run a crawl process multiples times. I am using scrapy 2.6 This is my code: from scrapy.crawler import CrawlerProcess from football.spiders.laliga import LaligaSpider from scrapy.utils.project import…

  • Deploy Scrapy Project with Streamlit (3 votes, 2 answers)
    I have a scrapy spider that scrapes products information from amazon based on the product link. I want to deploy this project with streamlit and take the product link as web input, and product information as output data on the web. I don't know alot…

  • Problem logging into Facebook with Scrapy (3 votes, 1 answer) — Cygorger
    (I have asked this question on the Scrapy google-group without luck.) I am trying to log into Facebook using Scrapy. I tried the following in the interactive shell: I set the headers and created a request as follows: header_vals={'Accept-Language':…

  • How do we configure webshare proxy with api key in scrapy and also make use of scrapy-proxy-pool? (3 votes, 0 answers)
    I have webshare proxy API; and would like to use it in a scrapy script. Which all configuration changes I will need to make in my script files as well enable it to make use of scrapy-proxy-pool also.

  • Yielding values from consecutive parallel parse functions via meta in Scrapy (3 votes, 1 answer) — avakado0
    In my scrapy code I'm trying to yield the following figures from parliament's website where all the members of parliament (MPs) are listed. Opening the links for each MP, I'm making parallel requests to get the figures I'm trying to count. I'm…

  • scrapy-playwright:- Downloader/handlers: scrapy.exceptions.NotSupported: AsyncioSelectorReactor (3 votes, 3 answers) — user17063618
    I tried to extract some data from dynamically loaded javascript website using scrapy-playwright but I stuck at the very beginning. From where I'm facing trubles in settings.py file is as follows: #playwright DOWNLOAD_HANDLERS = { "http":…

  • How to scrape site protected by cloudfare (3 votes, 2 answers) — nfon jeannoel
    So I'm trying to scrape https://craft.co/tesla When I visit from the browser, it opens correctly. However, when I use scrapy, it fetches the site but when I view the response, view(response) It shows the cloudfare site instead of the actual…

  • Scrapy parse function not called (3 votes, 2 answers) — Ele975
    I have this simply code: import scrapy import re import json # from scrapy.http import FormRequest from scrapy.spiders import CrawlSpider, Rule from scrapy.linkextractors import LinkExtractor class SpiderRecipe(CrawlSpider): name = "recipe" …

  • Scrapy crawls duplicate data (3 votes, 2 answers) — SyrixGG
    unfortunately I currently have a problem with Scrapy. I am still new to Scrapy and would like to scrap information on Rolex watches. I started with the site Watch.de, where I first go through the Rolex site and want to open the individual watches to…