Questions tagged [scrapy]

Scrapy is an open-source, high-level screen scraping and web crawling framework written in Python, used to crawl websites and extract structured data from their pages. It is built on the Twisted asynchronous networking library, so it is event-driven rather than multi-threaded. It can be used for a wide range of purposes, from data extraction to monitoring and automated testing.

Features of Scrapy include:

  • Designed with simplicity in mind
  • Write only the rules to extract the data from web pages, and let Scrapy crawl the entire site for you
  • Designed with extensibility in mind and so it provides several mechanisms to plug new code without having to touch the framework core
  • Portable, open-source, 100% Python
  • Written in Python and runs on Linux, Windows, macOS, and BSD.
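The extensibility point can be illustrated with an item pipeline, one of the plug-in mechanisms Scrapy offers: a pipeline is a plain Python class with a `process_item` method, registered through the `ITEM_PIPELINES` setting. The class below is a hypothetical sketch (its name and the fields it touches are made up), not part of Scrapy itself:

```python
# A minimal item pipeline sketch: Scrapy calls process_item() for every
# item a spider yields. This hypothetical pipeline strips surrounding
# whitespace from all string fields. It is a plain Python class; Scrapy
# would discover it via a setting such as
#   ITEM_PIPELINES = {"myproject.pipelines.StripWhitespacePipeline": 300}
class StripWhitespacePipeline:
    def process_item(self, item, spider):
        # Items yielded as dicts can be edited in place.
        for key, value in item.items():
            if isinstance(value, str):
                item[key] = value.strip()
        return item
```

Because the class has no Scrapy-specific base class, it can be unit-tested without a running crawl.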

History:

Scrapy was born at London-based web-aggregation and e-commerce company Mydeco, where it was developed and maintained by employees of Mydeco and Insophia (a web-consulting company based in Montevideo, Uruguay). The first public release was in August 2008 under the BSD license, with a milestone 1.0 release happening in June 2015. In 2011, Zyte (formerly Scrapinghub) became the new official maintainer.


Installing Scrapy

You can install Scrapy and its dependencies from PyPI with:

pip install Scrapy

or to install Scrapy using conda, run:

conda install -c conda-forge scrapy

Example

Here’s the code for a spider that scrapes famous quotes from the website http://quotes.toscrape.com, following the pagination:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Put this in a text file, name it something like quotes_spider.py, and run the spider using the runspider command:

scrapy runspider quotes_spider.py -o quotes.json
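The `-o quotes.json` flag writes the yielded items as a JSON array, which you can load back with the standard library. The sample record below is illustrative only, standing in for a real crawl result:

```python
import json

# After the crawl, quotes.json holds a JSON array of items, one dict per
# quote, with the 'text' and 'author' keys yielded by the spider.
# (Sketch: this sample string stands in for the real file contents.)
sample = '[{"text": "\\u201cA day without sunshine is like, you know, night.\\u201d", "author": "Steve Martin"}]'
quotes = json.loads(sample)
for q in quotes:
    print(q["author"], "-", q["text"])
```

In a real run you would open `quotes.json` and pass the file object to `json.load` instead.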



Architecture

Scrapy contains multiple components working together in an event-driven architecture. The main components are the Engine, Spiders, the Scheduler, the Downloader, and Item Pipelines. The data flow between these components is described in detail in the official documentation.
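As a sketch of how code plugs into that data flow, a downloader middleware sits between the Engine and the Downloader and sees every request on its way out. The class below is a hypothetical example (the header name and class are made up); like a pipeline, it needs no special base class and is enabled through the `DOWNLOADER_MIDDLEWARES` setting:

```python
# A downloader middleware sketch: Scrapy calls process_request() for each
# request traveling from the Engine to the Downloader. Returning None
# tells Scrapy to continue processing the request normally.
# (Hypothetical example; it would be enabled via DOWNLOADER_MIDDLEWARES.)
class CustomHeaderMiddleware:
    def process_request(self, request, spider):
        # Tag every outgoing request with a custom header.
        request.headers["X-Crawl-Source"] = "my-crawler"
        return None
```

Because the middleware only touches the `headers` mapping of the request it is given, it can be exercised with any stand-in object that has one.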



17,743 questions
3 votes, 7 answers

Split hyphen separated words with spaces in between | Python

I want to split either comma, semicolon or hyphen (with preceding space) separated words. The reason for this is the inconsistent structure of a website I am scraping with Scrapy. So far, I am able to split either comma or semicolon separated words…
Dan
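The split this question describes can be done with the standard library's re module alone. The sketch below is a hypothetical answer; the sample input is made up, and the pattern splits on commas, semicolons, or a hyphen preceded by whitespace:

```python
import re

# Split on a comma, a semicolon, or a hyphen preceded by whitespace,
# each with optional surrounding spaces. Hyphens inside words (no
# preceding space) are left alone. (Sketch; sample input is made up.)
def split_terms(text):
    return [part for part in re.split(r"\s*[,;]\s*|\s+-\s*", text) if part]

print(split_terms("red,green; blue -yellow"))
```

Note that `co-op` would survive unsplit, since its hyphen has no preceding space.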
3 votes, 1 answer

Parse callback is not defined - Simple Webscraper (Scrapy) still not running

i googled half a day and still can't get it going. Maybe you got some insights? I tryed to start my scraper not from a terminal, but from a script. This works well without rules, just with yielding the normal parse function. As soon as I use Rules…
Mike89
3 votes, 0 answers

Using Scrapy to scrape ASP.NET pages using VIEWSTATE

I followed this post SCRAPING WEBSITES BASED ON VIEWSTATES WITH SCRAPY to scrape a site that is almost identical. It works well but the problem is that my site has many items and thus has a lot of pagination. I am able to go to the next pages but…
Phillis Peters
3 votes, 4 answers

Scrapy select HTML elements that have specific attribute name

There is this HTML:
...
I need to select the inner div that have the attribute data-id (regardless of values) only. How do I…
hydradon
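Attribute-existence selection is supported directly by Scrapy's selectors: `response.css('div[data-id]')` and `response.xpath('//div[@data-id]')` both match elements that carry the attribute regardless of its value. The same `[@data-id]` predicate works in the standard library's ElementTree, shown here on a made-up fragment:

```python
import xml.etree.ElementTree as ET

# A made-up fragment: only the middle div carries the data-id attribute.
doc = ET.fromstring(
    '<body><div id="a"/><div data-id="38"/><div id="b"/></body>'
)
# ElementTree supports the attribute-existence predicate [@data-id],
# the same idea as Scrapy's response.xpath('//div[@data-id]').
matches = doc.findall('.//div[@data-id]')
print(len(matches), matches[0].get('data-id'))
```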
3 votes, 1 answer

Scrapy and Incapsula

I'm trying to use Scrapy with Splash to retrieve data from the website "whoscored.com". Here is my settings: BOT_NAME = 'scrapy_matchs' # Crawl responsibly by identifying yourself (and your website) on the user-agent #USER_AGENT = 'scrapy_matchs…
Jérémy Octeau
3 votes, 0 answers

How to enable javascript in Splash

I have been recently introduced to Splash. I'm currently trying to render the webpage of the company that I work at (I prefer not to name the company) in the splash API. When I try to render the page in the Splash API, the html contains a message…
titusAdam
3 votes, 3 answers

Scrapy does not have command 'crawl'

I started to learn Scrapy but right away I get an error Unknown command: crawl. I do not know why im getting this, but in py Scrapy commands I do not have that command. Im using python 3.6 and pycharm as editor. (venv)…
taga
3 votes, 0 answers

ValueError: not enough values to unpack (expected 2, got 1) Json dumps

I got error while using scrapy ValueError: not enough values to unpack (expected 2, got 1) while json.dumps(form_data) My code is below is below: form_data = {"directory_search_id":"12093", "elements":{ "0" : {"id":"38", …
3 votes, 1 answer

Using Scrapy and Splash to Follow javascript pagination

I am using Scrapy and splash to extract the data. I am looking to find a way to follow pagination that was powered with javascript. The URL is not changing it is always the same no matter on what page you are.
m1k1

3 votes, 2 answers

duplicate requests post to scrapy FormRequest

I am try to learn how scrapy FormRequest works on website,I have the following scrapy code: import scrapy import json from scrapy.utils.response import open_in_browser class Test(scrapy.Spider): name = 'go2' def start_requests(self): …
hadesfv

3 votes, 2 answers

AttributeError: 'str' object has no attribute 'xpath'

Using Python 3,Scrapy 1.7.3 to Following using following link Scrapy - Extract items from table but it is giving me error of AttributeError: 'str' object has no attribute 'xpath'
Red Baron

3 votes, 1 answer

Why is scrapy with crawlera running so slow?

I am using scrapy 1.7.3 with crawlera (C100 plan from scrapinghub) and python 3.6. When running the spider with crawlera enabled I get about 20 - 40 items per minute. Without crawlera I get 750 - 1000 (but I get banned quickly of course). Have I…
Wramana

3 votes, 0 answers

scrapy-splash crawler starts fast but slows down (not throttled by website)

I have a single crawler written in scrapy using the splash browser via the scrapy-splash python package. I am using the aquarium python package to load balance the parallel scrapy requests to a splash docker cluster. The scraper uses a long list of…
user1837332

3 votes, 1 answer

deny certain links in scrapy linkextractor

with open('/home/timmy/myamazon/bannedasins.txt') as f: banned_asins = f.read().split('\n') class AmazonSpider(CrawlSpider): name = 'amazon' allowed_domains = ['amazon.com',] rules = ( …
programmerwiz32

3 votes, 2 answers

How to package Scrapy dependency to lambda?

I am writing a python application which dependents on Scrapy module. It works fine locally but failed when I run it from aws lambda test console. My python project has a requirements.txt file with below dependency: scrapy==1.6.0 I packaged all…
Joey Yi Zhao