Questions tagged [scrapy]

Scrapy is a fast, open-source, high-level web crawling and web scraping framework written in Python, used to crawl websites and extract structured data from their pages. It is built on the Twisted asynchronous networking library, so it is event-driven rather than multi-threaded. It can be used for a wide range of purposes, from data extraction to monitoring and automated testing.

Features of Scrapy include:

  • Designed with simplicity in mind
  • You only need to write the rules to extract the data from web pages and let Scrapy crawl the entire site for you (see the sketch after this list)
  • Designed with extensibility in mind, providing several mechanisms to plug in new code without touching the framework core
  • Portable, open-source, 100% Python
  • Runs on Linux, Windows, macOS, and BSD
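As a sketch of the rule-based crawling mentioned above, here is a minimal CrawlSpider that follows category links and parses product pages. The site, selectors, and regular expressions are illustrative assumptions, not part of the tag description:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class BookSpider(CrawlSpider):
    name = 'books'
    # books.toscrape.com is a public practice site; an assumption for this sketch
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    # Rules describe which links to follow and which pages to parse;
    # Scrapy handles scheduling, deduplication, and the crawl itself.
    rules = (
        Rule(LinkExtractor(allow=r'/catalogue/category/'), follow=True),
        Rule(LinkExtractor(allow=r'/catalogue/[^/]+/index\.html'),
             callback='parse_item'),
    )

    def parse_item(self, response):
        # The CSS selectors below are assumptions about the page layout
        yield {
            'title': response.css('h1::text').get(),
            'price': response.css('p.price_color::text').get(),
        }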

History:

Scrapy was born at London-based web-aggregation and e-commerce company Mydeco, where it was developed and maintained by employees of Mydeco and Insophia (a web-consulting company based in Montevideo, Uruguay). The first public release was in August 2008 under the BSD license, with a milestone 1.0 release happening in June 2015. In 2011, Zyte (formerly Scrapinghub) became the new official maintainer.


Installing Scrapy

You can install Scrapy and its dependencies from PyPI with:

pip install Scrapy

or to install Scrapy using conda, run:

conda install -c conda-forge scrapy
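Either way, a quick check that the installation worked is to print the installed version from the command line (the exact number shown depends on your environment):

scrapy version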

Example

Here’s the code for a spider that scrapes famous quotes from the website http://quotes.toscrape.com, following the pagination:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Put this in a text file, name it something like quotes_spider.py and run the spider using the runspider command:

scrapy runspider quotes_spider.py -o quotes.json
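When the run finishes, quotes.json contains the scraped items serialized as a JSON array (note that -o appends to an existing file; recent Scrapy versions also offer -O to overwrite). A quick way to inspect the output, assuming the file name used above:

import json

# quotes.json is the file produced by the runspider command above
with open('quotes.json') as f:
    quotes = json.load(f)

print(len(quotes), 'quotes scraped')
print(quotes[0]['text'], '-', quotes[0]['author'])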



Architecture

Scrapy contains multiple components working together in an event-driven architecture. The main components are the Engine, Spiders, the Scheduler, and the Downloader. The data flow between these components is described in detail in the official documentation.
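One common way to plug into this data flow is a downloader middleware, which sits between the Engine and the Downloader. Below is a hedged sketch; the module path, header name, and priority value are arbitrary assumptions for illustration:

# middlewares.py
class CustomHeaderMiddleware:
    # The Engine passes every outgoing request through process_request()
    # on its way to the Downloader.
    def process_request(self, request, spider):
        # Tag each request; returning None lets normal processing continue.
        request.headers.setdefault('X-Example', 'demo')
        return None

# settings.py -- the priority number 543 is an arbitrary example
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomHeaderMiddleware': 543,
}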


Online resources:

  • Official website: https://scrapy.org
  • Documentation: https://docs.scrapy.org
  • Source code: https://github.com/scrapy/scrapy

17743 questions
3 votes, 2 answers

Scrapy spider shows errors of another unrelated spider in the same project

I'm trying to create a new spider by running scrapy genspider -t crawl newspider "example.com". This is run in my recently created spider project directory C:\Users\donik\bo_gui\gui_project. As a result I get an error message: File…
d1spstack
3 votes, 0 answers

Scrapy not running in Docker

I am trying to run my Scrapy script main.py in a Docker container. The script runs 3 spiders sequentially and writes their scraped items to a local DB. Here is the source code of main.py: from twisted.internet import reactor, defer from…
giulio di zio
3 votes, 1 answer

Search for specific text in XML tree and extract text in next node

Trying to scrape the weight of smartwatches from www.currys.co.uk. The website does not follow the same structure for all products, so to get the weight of each product I am trying to use a keyword search using…
sophocles
3 votes, 1 answer

With Scrapy, how do I check whether links on a single page are allowed by the robots.txt file?

With Scrapy, I want to scrape a single page (via a script, not from the console) to check whether all the links on this page are allowed by the robots.txt file. In the scrapy.robotstxt.RobotParser abstract base class, I found the method allowed(url,…
LeMoussel
3 votes, 0 answers

Does the "value" property of twisted.python.failure.Failure have a traceback? If not, how do I build the traceback?

I have a project that depends on Scrapy 2.3.0, which uses Twisted 20.3.0 as its network engine. I am trying to convert the callback-based approach used by Scrapy to coroutines and run it with Python's asyncio. To make an HTTP request, one needs to…
hldev
3 votes, 3 answers

How to scrape the same URL in a loop with Scrapy

The needed content is located on the same page with a static URL. I created a spider that scrapes this page and stores the items in a CSV file, but it does so only once and then finishes the crawling process. I need to repeat the operation continuously. How can…
3 votes, 0 answers

Scrapy multiple pages in same structure

I have the following code: import scrapy import re class NamePriceSpider(scrapy.Spider): name = 'namePrice' start_urls = [ 'https://www.cotodigital3.com.ar/sitios/cdigi/browse/' ] def parse(self, response): …
3 votes, 1 answer

How to insert multiple items into a database when using Scrapy?

Nowadays most databases support inserting multiple records in one run. That is much faster than inserting records one by one, because only one transaction is needed. The SQL syntax is similar to this: INSERT INTO tbl_name…
Just a learner
3 votes, 1 answer

Extract a URL where the text matches a regex, with XPath 1.0

I would like to extract the URL of this type (the link text is a number with any number of digits and the href is random text) using an XPath in Scrapy.
user
3 votes, 0 answers

Call to deprecated function retry_on_eintr. retry_on_eintr(check_call, [sys.executable, 'setup.py', 'clean', '-a', 'bdist_egg', '-d', d]

I have to deploy my Scrapy project with scrapyd on Windows Server 2016. I am using the command scrapyd-deploy local to deploy my project, but it generates the following error: Call to deprecated function retry_on_eintr. …
3 votes, 3 answers

Unable to force a script to retry five times unless a 200 status comes in between

I've created a script using Scrapy which is capable of retrying some links from a list recursively, even when those links are invalid and get a 404 response. I used dont_filter=True and 'handle_httpstatus_list': [404] within meta to achieve the current…
MITHU
3 votes, 1 answer

How to trigger a JS ASP.NET next-page event using Scrapy?

I'm scraping content off this website. I start by sending a FormRequest that yields the search result, based on Wim Herman's answer to my other question here. I scrape what is needed and want to move to the next page, which does not consist of a URL,…
user12690225
3 votes, 1 answer

How to resolve Splash 405 <https://www.controller.com/listings/aircraft/for-sale/list>: HTTP status code is not handled or not allowed

I am trying to access a website using Scrapy-Splash but I get error 405 Ignoring response <405 https://www.controller.com/>: HTTP status code is not handled or not allowed. The code I use: import scrapy from scrapy_splash import SplashRequest class…
Muhammad Zeeshan
3 votes, 0 answers

Generate an exe from a Scrapy project

I'm trying to use PyInstaller (more specifically, the auto-py-to-exe GUI) to generate an exe file from a project that uses Scrapy. The main file executes the two spiders sequentially: from scrapy.crawler import CrawlerRunner from twisted.internet…
Sinayra
3 votes, 1 answer

builtins.ModuleNotFoundError: No module named 'itemadapter'

I am trying to run a spider on Scrapinghub and getting this error, but the spider works well on my local machine. I have already changed the name of the spider project and spider module. File "/usr/local/lib/python3.8/site-packages/scrapy/crawler.py",…
Tarun