Questions tagged [scrapy]

Scrapy is an open-source, high-level screen scraping and web crawling framework written in Python, built on the asynchronous Twisted networking engine and used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data extraction to monitoring and automated testing.

Scrapy is a fast, high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing. Features of Scrapy include:

  • Designed with simplicity in mind
  • You only need to write the rules to extract the data from web pages and let Scrapy crawl the entire website for you
  • Designed with extensibility in mind and so it provides several mechanisms to plug new code without having to touch the framework core
  • Portable, open-source, 100% Python
  • Written in Python and runs on Linux, Windows, macOS, and BSD.
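As a hedged illustration of that extensibility: item pipelines are ordinary Python classes that Scrapy wires in through settings, with no framework base class required. The sketch below is dependency-free and invented for this page (the class name and the 10-character rule are our assumptions, not part of Scrapy's API):

```python
# Sketch of an item pipeline: Scrapy pipelines are plain Python classes
# exposing process_item(item, spider). The class name and the length
# rule below are invented for illustration.

class MinLengthPipeline:
    """Flag scraped items whose 'text' field is shorter than a limit."""

    MIN_LENGTH = 10

    def process_item(self, item, spider):
        text = item.get("text", "")
        if len(text) < self.MIN_LENGTH:
            # In a real project you would raise scrapy.exceptions.DropItem;
            # here we only mark the item so the sketch stays self-contained.
            item["dropped"] = True
        return item
```

Enabling such a pipeline is a one-line addition to the ITEM_PIPELINES setting (e.g. `{'myproject.pipelines.MinLengthPipeline': 300}`, where the dotted path is hypothetical), which is exactly the "plug new code without touching the framework core" mechanism the list above refers to.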

History:

Scrapy was born at London-based web-aggregation and e-commerce company Mydeco, where it was developed and maintained by employees of Mydeco and Insophia (a web-consulting company based in Montevideo, Uruguay). The first public release was in August 2008 under the BSD license, with a milestone 1.0 release happening in June 2015. In 2011, Zyte (formerly Scrapinghub) became the new official maintainer.


Installing Scrapy

You can install Scrapy and its dependencies from PyPI with:

pip install Scrapy

or to install Scrapy using conda, run:

conda install -c conda-forge scrapy

Example

Here’s the code for a spider that scrapes famous quotes from the website http://quotes.toscrape.com, following the pagination:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Put this in a text file, name it something like quotes_spider.py, and run the spider using the runspider command:

scrapy runspider quotes_spider.py -o quotes.json
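When the run finishes, the -o flag leaves a JSON array of the scraped items in quotes.json (note that re-running with -o appends to the file rather than overwriting it). A small stdlib-only sketch for inspecting that output (the helper names are ours, and the field layout assumes the spider above):

```python
import json


def load_quotes(path):
    """Read the JSON array written by `scrapy runspider ... -o quotes.json`
    and return it as a list of {'text': ..., 'author': ...} dicts."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def authors(quotes):
    """Collect distinct author names, preserving first-seen order."""
    seen = {}
    for quote in quotes:
        seen.setdefault(quote["author"], True)
    return list(seen)
```

For example, `authors(load_quotes("quotes.json"))` would list each quoted author once, in the order the spider first encountered them.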



Architecture

Scrapy contains multiple components working together in an event-driven architecture. The main components are the Engine, Spiders, the Scheduler, and the Downloader. The data flow between these components is described in detail in the official documentation.
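To make that data flow concrete: a downloader middleware sits between the Engine and the Downloader and can inspect or rewrite every outgoing request. Like pipelines, middlewares are plain duck-typed classes with process_request/process_response hooks; the class name and header value below are invented for this sketch:

```python
# Sketch of a downloader middleware: it sees each request on its way
# from the Engine to the Downloader. Name and header value are invented.

class DefaultHeadersMiddleware:
    def process_request(self, request, spider):
        # Add a header unless the spider already set one. Returning None
        # tells the Engine to continue processing the request normally.
        request.headers.setdefault("Accept-Language", "en")
        return None
```

It would be enabled through the DOWNLOADER_MIDDLEWARES setting with an order number that controls where it sits in the chain, again without touching the framework core.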


17743 questions