Questions tagged [scrapy]

Scrapy is a fast, open-source, high-level web crawling and web scraping framework written in Python, used to crawl websites and extract structured data from their pages. It is built on the Twisted asynchronous networking library, so it is event-driven rather than multi-threaded. It can be used for a wide range of purposes, from data extraction to monitoring and automated testing.

Features of Scrapy include:

  • Designed with simplicity in mind
  • You only need to write the rules to extract the data from web pages and let Scrapy crawl the entire site for you (see the sketch after this list)
  • Designed with extensibility in mind, providing several mechanisms to plug in new code without touching the framework core
  • Portable, open-source, 100% Python
  • Runs on Linux, Windows, macOS, and BSD
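As a sketch of the rule-based crawling mentioned above, here is a minimal CrawlSpider that follows category links and parses product pages. The site, selectors, and regular expressions are illustrative assumptions, not part of the tag description:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class BookSpider(CrawlSpider):
    name = 'books'
    # books.toscrape.com is a public practice site; an assumption for this sketch
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    # Rules describe which links to follow and which pages to parse;
    # Scrapy handles scheduling, deduplication, and the crawl itself.
    rules = (
        Rule(LinkExtractor(allow=r'/catalogue/category/'), follow=True),
        Rule(LinkExtractor(allow=r'/catalogue/[^/]+/index\.html'),
             callback='parse_item'),
    )

    def parse_item(self, response):
        # The CSS selectors below are assumptions about the page layout
        yield {
            'title': response.css('h1::text').get(),
            'price': response.css('p.price_color::text').get(),
        }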

History:

Scrapy was born at London-based web-aggregation and e-commerce company Mydeco, where it was developed and maintained by employees of Mydeco and Insophia (a web-consulting company based in Montevideo, Uruguay). The first public release was in August 2008 under the BSD license, with a milestone 1.0 release happening in June 2015. In 2011, Zyte (formerly Scrapinghub) became the new official maintainer.


Installing Scrapy

You can install Scrapy and its dependencies from PyPI with:

pip install Scrapy

or to install Scrapy using conda, run:

conda install -c conda-forge scrapy
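Either way, a quick check that the installation worked is to print the installed version from the command line (the exact number shown depends on your environment):

scrapy version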

Example

Here’s the code for a spider that scrapes famous quotes from the website http://quotes.toscrape.com, following the pagination:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Put this in a text file, name it something like quotes_spider.py and run the spider using the runspider command:

scrapy runspider quotes_spider.py -o quotes.json
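When the run finishes, quotes.json contains the scraped items serialized as a JSON array (note that -o appends to an existing file; recent Scrapy versions also offer -O to overwrite). A quick way to inspect the output, assuming the file name used above:

import json

# quotes.json is the file produced by the runspider command above
with open('quotes.json') as f:
    quotes = json.load(f)

print(len(quotes), 'quotes scraped')
print(quotes[0]['text'], '-', quotes[0]['author'])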



Architecture

Scrapy contains multiple components working together in an event-driven architecture. The main components are the Engine, Spiders, the Scheduler, and the Downloader. The data flow between these components is described in detail in the official documentation.
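One common way to plug into this data flow is a downloader middleware, which sits between the Engine and the Downloader. Below is a hedged sketch; the module path, header name, and priority value are arbitrary assumptions for illustration:

# middlewares.py
class CustomHeaderMiddleware:
    # The Engine passes every outgoing request through process_request()
    # on its way to the Downloader.
    def process_request(self, request, spider):
        # Tag each request; returning None lets normal processing continue.
        request.headers.setdefault('X-Example', 'demo')
        return None

# settings.py -- the priority number 543 is an arbitrary example
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomHeaderMiddleware': 543,
}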


Online resources:

  • Official website: https://scrapy.org
  • Documentation: https://docs.scrapy.org
  • Source code: https://github.com/scrapy/scrapy

17743 questions
3 votes, 2 answers

Scrapy spider shows errors of another unrelated spider in the same project

I'm trying to create a new spider by running scrapy genspider -t crawl newspider "example.com". This is run in my recently created spider project directory C:\Users\donik\bo_gui\gui_project. As a result I get an error message: File…
d1spstack
3 votes, 0 answers

Scrapy not running in Docker

I am trying to run my Scrapy script main.py in a Docker container. The script runs 3 spiders sequentially and writes their scraped items to a local DB. Here is the source code of main.py: from twisted.internet import reactor, defer from…
giulio di zio
3 votes, 1 answer

Search for specific text in XML tree and extract text in next node

Trying to scrape the weight of smartwatches from www.currys.co.uk. The website does not follow the same structure for all products, so to get the weight of each product I am trying to use a keyword search using…
sophocles
3 votes, 1 answer

With Scrapy, how do I check whether links on a single page are allowed by the robots.txt file?

With Scrapy, I want to scrape a single page (via a script, not from the console) to check whether all the links on this page are allowed by the robots.txt file. In the scrapy.robotstxt.RobotParser abstract base class, I found the method allowed(url,…
LeMoussel
3 votes, 0 answers

Does the "value" property of twisted.python.failure.Failure have a traceback? If not, how do I build the traceback?

I have a project that depends on Scrapy 2.3.0, which uses Twisted 20.3.0 as its network engine. I am trying to convert the callback-based approach used by Scrapy to coroutines and run it with Python's asyncio. To make an HTTP request, one needs to…
hldev
3 votes, 3 answers

How to scrape the same URL in a loop with Scrapy

The needed content is located on the same page with a static URL. I created a spider that scrapes this page and stores the items in a CSV file, but it does so only once and then finishes the crawling process. I need to repeat the operation continuously. How can…
3 votes, 0 answers

Scrapy multiple pages in same structure

I have the following code: import scrapy import re class NamePriceSpider(scrapy.Spider): name = 'namePrice' start_urls = [ 'https://www.cotodigital3.com.ar/sitios/cdigi/browse/' ] def parse(self, response): …
3 votes, 1 answer

How to insert multiple items into a database when using Scrapy?

Nowadays most databases support inserting multiple records in one run. That is much faster than inserting records one by one, because only one transaction is needed. The SQL syntax is similar to this: INSERT INTO tbl_name…
Just a learner
3 votes, 1 answer

Extract a URL where the text matches a regex, with XPath 1.0

I would like to extract the URL of this type (the link text is a number with any number of digits and the href is random text) using an XPath in Scrapy.
user
3 votes, 0 answers

Call to deprecated function retry_on_eintr. retry_on_eintr(check_call, [sys.executable, 'setup.py', 'clean', '-a', 'bdist_egg', '-d', d]

I have to deploy my Scrapy project with scrapyd on Windows Server 2016. I am using the command scrapyd-deploy local to deploy my project, but it generates the following error: Call to deprecated function retry_on_eintr. …
3 votes, 3 answers

Unable to force a script to retry five times unless a 200 status comes in between

I've created a script using Scrapy which is capable of retrying some links from a list recursively, even when those links are invalid and get a 404 response. I used dont_filter=True and 'handle_httpstatus_list': [404] within meta to achieve the current…
MITHU
3 votes, 1 answer

How to trigger a JS ASP.NET next-page event using Scrapy?

I'm scraping content off this website. I start by sending a FormRequest that yields the search result, based on Wim Herman's answer to my other question here. I scrape what is needed and want to move to the next page, which does not consist of a URL,…
user12690225
3 votes, 1 answer

How to resolve Splash 405 <https://www.controller.com/listings/aircraft/for-sale/list>: HTTP status code is not handled or not allowed

I am trying to access a website using Scrapy-Splash but I get error 405 Ignoring response <405 https://www.controller.com/>: HTTP status code is not handled or not allowed. The code I use: import scrapy from scrapy_splash import SplashRequest class…
Muhammad Zeeshan
3 votes, 0 answers

Generate an exe from a Scrapy project

I'm trying to use PyInstaller (more specifically, the auto-py-to-exe GUI) to generate an exe file from a project that uses Scrapy. The main file executes the two spiders sequentially: from scrapy.crawler import CrawlerRunner from twisted.internet…
Sinayra
3 votes, 1 answer

builtins.ModuleNotFoundError: No module named 'itemadapter'

I am trying to run a spider on Scrapinghub and getting this error, but the spider works well on my local machine. I have already changed the name of the spider project and spider module. File "/usr/local/lib/python3.8/site-packages/scrapy/crawler.py",…
Tarun