Questions tagged [scrapy]

Scrapy is an open-source, high-level screen scraping and web crawling framework written in Python, used to crawl websites and extract structured data from their pages. It is built on the Twisted asynchronous networking library, so it is event-driven rather than multi-threaded. It can be used for a wide range of purposes, from data extraction to monitoring and automated testing.

Features of Scrapy include:

  • Designed with simplicity in mind
  • Write only the rules to extract the data from web pages, and let Scrapy crawl the entire site for you
  • Designed with extensibility in mind and so it provides several mechanisms to plug new code without having to touch the framework core
  • Portable, open-source, 100% Python
  • Written in Python and runs on Linux, Windows, macOS, and BSD.
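The extensibility point can be illustrated with an item pipeline, one of the plug-in mechanisms Scrapy offers: a pipeline is a plain Python class with a `process_item` method, registered through the `ITEM_PIPELINES` setting. The class below is a hypothetical sketch (its name and the fields it touches are made up), not part of Scrapy itself:

```python
# A minimal item pipeline sketch: Scrapy calls process_item() for every
# item a spider yields. This hypothetical pipeline strips surrounding
# whitespace from all string fields. It is a plain Python class; Scrapy
# would discover it via a setting such as
#   ITEM_PIPELINES = {"myproject.pipelines.StripWhitespacePipeline": 300}
class StripWhitespacePipeline:
    def process_item(self, item, spider):
        # Items yielded as dicts can be edited in place.
        for key, value in item.items():
            if isinstance(value, str):
                item[key] = value.strip()
        return item
```

Because the class has no Scrapy-specific base class, it can be unit-tested without a running crawl.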

History:

Scrapy was born at London-based web-aggregation and e-commerce company Mydeco, where it was developed and maintained by employees of Mydeco and Insophia (a web-consulting company based in Montevideo, Uruguay). The first public release was in August 2008 under the BSD license, with a milestone 1.0 release happening in June 2015. In 2011, Zyte (formerly Scrapinghub) became the new official maintainer.


Installing Scrapy

You can install Scrapy and its dependencies from PyPI with:

pip install Scrapy

or to install Scrapy using conda, run:

conda install -c conda-forge scrapy

Example

Here’s the code for a spider that scrapes famous quotes from the website http://quotes.toscrape.com, following the pagination:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Put this in a text file, name it something like quotes_spider.py, and run the spider using the runspider command:

scrapy runspider quotes_spider.py -o quotes.json
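The `-o quotes.json` flag writes the yielded items as a JSON array, which you can load back with the standard library. The sample record below is illustrative only, standing in for a real crawl result:

```python
import json

# After the crawl, quotes.json holds a JSON array of items, one dict per
# quote, with the 'text' and 'author' keys yielded by the spider.
# (Sketch: this sample string stands in for the real file contents.)
sample = '[{"text": "\\u201cA day without sunshine is like, you know, night.\\u201d", "author": "Steve Martin"}]'
quotes = json.loads(sample)
for q in quotes:
    print(q["author"], "-", q["text"])
```

In a real run you would open `quotes.json` and pass the file object to `json.load` instead.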



Architecture

Scrapy contains multiple components working together in an event-driven architecture. The main components are the Engine, Spiders, the Scheduler, the Downloader, and Item Pipelines. The data flow between these components is described in detail in the official documentation.
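As a sketch of how code plugs into that data flow, a downloader middleware sits between the Engine and the Downloader and sees every request on its way out. The class below is a hypothetical example (the header name and class are made up); like a pipeline, it needs no special base class and is enabled through the `DOWNLOADER_MIDDLEWARES` setting:

```python
# A downloader middleware sketch: Scrapy calls process_request() for each
# request traveling from the Engine to the Downloader. Returning None
# tells Scrapy to continue processing the request normally.
# (Hypothetical example; it would be enabled via DOWNLOADER_MIDDLEWARES.)
class CustomHeaderMiddleware:
    def process_request(self, request, spider):
        # Tag every outgoing request with a custom header.
        request.headers["X-Crawl-Source"] = "my-crawler"
        return None
```

Because the middleware only touches the `headers` mapping of the request it is given, it can be exercised with any stand-in object that has one.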



17,743 questions
3 votes, 7 answers

Split hyphen separated words with spaces in between | Python

I want to split either comma, semicolon or hyphen (with preceding space) separated words. The reason for this is the inconsistent structure of a website I am scraping with Scrapy. So far, I am able to split either comma or semicolon separated words…
Dan
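The split this question describes can be done with the standard library's re module alone. The sketch below is a hypothetical answer; the sample input is made up, and the pattern splits on commas, semicolons, or a hyphen preceded by whitespace:

```python
import re

# Split on a comma, a semicolon, or a hyphen preceded by whitespace,
# each with optional surrounding spaces. Hyphens inside words (no
# preceding space) are left alone. (Sketch; sample input is made up.)
def split_terms(text):
    return [part for part in re.split(r"\s*[,;]\s*|\s+-\s*", text) if part]

print(split_terms("red,green; blue -yellow"))
```

Note that `co-op` would survive unsplit, since its hyphen has no preceding space.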
3 votes, 1 answer

Parse callback is not defined - Simple Webscraper (Scrapy) still not running

i googled half a day and still can't get it going. Maybe you got some insights? I tryed to start my scraper not from a terminal, but from a script. This works well without rules, just with yielding the normal parse function. As soon as I use Rules…
Mike89
3 votes, 0 answers

Using Scrapy to scrape ASP.NET pages using VIEWSTATE

I followed this post SCRAPING WEBSITES BASED ON VIEWSTATES WITH SCRAPY to scrape a site that is almost identical. It works well but the problem is that my site has many items and thus has a lot of pagination. I am able to go to the next pages but…
Phillis Peters
3 votes, 4 answers

Scrapy select HTML elements that have specific attribute name

There is this HTML:
...
I need to select the inner div that have the attribute data-id (regardless of values) only. How do I…
hydradon
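Attribute-existence selection is supported directly by Scrapy's selectors: `response.css('div[data-id]')` and `response.xpath('//div[@data-id]')` both match elements that carry the attribute regardless of its value. The same `[@data-id]` predicate works in the standard library's ElementTree, shown here on a made-up fragment:

```python
import xml.etree.ElementTree as ET

# A made-up fragment: only the middle div carries the data-id attribute.
doc = ET.fromstring(
    '<body><div id="a"/><div data-id="38"/><div id="b"/></body>'
)
# ElementTree supports the attribute-existence predicate [@data-id],
# the same idea as Scrapy's response.xpath('//div[@data-id]').
matches = doc.findall('.//div[@data-id]')
print(len(matches), matches[0].get('data-id'))
```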
3 votes, 1 answer

Scrapy and Incapsula

I'm trying to use Scrapy with Splash to retrieve data from the website "whoscored.com". Here is my settings: BOT_NAME = 'scrapy_matchs' # Crawl responsibly by identifying yourself (and your website) on the user-agent #USER_AGENT = 'scrapy_matchs…
Jérémy Octeau
3 votes, 0 answers

How to enable javascript in Splash

I have been recently introduced to Splash. I'm currently trying to render the webpage of the company that I work at (I prefer not to name the company) in the splash API. When I try to render the page in the Splash API, the html contains a message…
titusAdam
3 votes, 3 answers

Scrapy does not have command 'crawl'

I started to learn Scrapy but right away I get an error Unknown command: crawl. I do not know why im getting this, but in py Scrapy commands I do not have that command. Im using python 3.6 and pycharm as editor. (venv)…
taga
3 votes, 0 answers

ValueError: not enough values to unpack (expected 2, got 1) Json dumps

I got error while using scrapy ValueError: not enough values to unpack (expected 2, got 1) while json.dumps(form_data) My code is below is below: form_data = {"directory_search_id":"12093", "elements":{ "0" : {"id":"38", …
3 votes, 1 answer

Using Scrapy and Splash to Follow javascript pagination

I am using Scrapy and splash to extract the data. I am looking to find a way to follow pagination that was powered with javascript. The URL is not changing it is always the same no matter on what page you are.
m1k1

3 votes, 2 answers

duplicate requests post to scrapy FormRequest

I am try to learn how scrapy FormRequest works on website,I have the following scrapy code: import scrapy import json from scrapy.utils.response import open_in_browser class Test(scrapy.Spider): name = 'go2' def start_requests(self): …
hadesfv

3 votes, 2 answers

AttributeError: 'str' object has no attribute 'xpath'

Using Python 3,Scrapy 1.7.3 to Following using following link Scrapy - Extract items from table but it is giving me error of AttributeError: 'str' object has no attribute 'xpath'
Red Baron

3 votes, 1 answer

Why is scrapy with crawlera running so slow?

I am using scrapy 1.7.3 with crawlera (C100 plan from scrapinghub) and python 3.6. When running the spider with crawlera enabled I get about 20 - 40 items per minute. Without crawlera I get 750 - 1000 (but I get banned quickly of course). Have I…
Wramana

3 votes, 0 answers

scrapy-splash crawler starts fast but slows down (not throttled by website)

I have a single crawler written in scrapy using the splash browser via the scrapy-splash python package. I am using the aquarium python package to load balance the parallel scrapy requests to a splash docker cluster. The scraper uses a long list of…
user1837332

3 votes, 1 answer

deny certain links in scrapy linkextractor

with open('/home/timmy/myamazon/bannedasins.txt') as f: banned_asins = f.read().split('\n') class AmazonSpider(CrawlSpider): name = 'amazon' allowed_domains = ['amazon.com',] rules = ( …
programmerwiz32

3 votes, 2 answers

How to package Scrapy dependency to lambda?

I am writing a python application which dependents on Scrapy module. It works fine locally but failed when I run it from aws lambda test console. My python project has a requirements.txt file with below dependency: scrapy==1.6.0 I packaged all…
Joey Yi Zhao