Questions tagged [scrapy]

Scrapy is an open-source, high-level screen scraping and web crawling framework written in Python, built on the asynchronous Twisted networking engine and used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data extraction to monitoring and automated testing.

Scrapy is a fast, high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing. Features of Scrapy include:

  • Designed with simplicity in mind
  • You only need to write the rules to extract the data from web pages and let Scrapy crawl the entire website for you
  • Designed with extensibility in mind and so it provides several mechanisms to plug new code without having to touch the framework core
  • Portable, open-source, 100% Python
  • Written in Python and runs on Linux, Windows, macOS, and BSD.
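As a hedged illustration of that extensibility: item pipelines are ordinary Python classes that Scrapy wires in through settings, with no framework base class required. The sketch below is dependency-free and invented for this page (the class name and the 10-character rule are our assumptions, not part of Scrapy's API):

```python
# Sketch of an item pipeline: Scrapy pipelines are plain Python classes
# exposing process_item(item, spider). The class name and the length
# rule below are invented for illustration.

class MinLengthPipeline:
    """Flag scraped items whose 'text' field is shorter than a limit."""

    MIN_LENGTH = 10

    def process_item(self, item, spider):
        text = item.get("text", "")
        if len(text) < self.MIN_LENGTH:
            # In a real project you would raise scrapy.exceptions.DropItem;
            # here we only mark the item so the sketch stays self-contained.
            item["dropped"] = True
        return item
```

Enabling such a pipeline is a one-line addition to the ITEM_PIPELINES setting (e.g. `{'myproject.pipelines.MinLengthPipeline': 300}`, where the dotted path is hypothetical), which is exactly the "plug new code without touching the framework core" mechanism the list above refers to.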

History:

Scrapy was born at London-based web-aggregation and e-commerce company Mydeco, where it was developed and maintained by employees of Mydeco and Insophia (a web-consulting company based in Montevideo, Uruguay). The first public release was in August 2008 under the BSD license, with a milestone 1.0 release happening in June 2015. In 2011, Zyte (formerly Scrapinghub) became the new official maintainer.


Installing Scrapy

You can install Scrapy and its dependencies from PyPI with:

pip install Scrapy

or to install Scrapy using conda, run:

conda install -c conda-forge scrapy

Example

Here’s the code for a spider that scrapes famous quotes from the website http://quotes.toscrape.com, following the pagination:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Put this in a text file, name it something like quotes_spider.py, and run the spider using the runspider command:

scrapy runspider quotes_spider.py -o quotes.json
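When the run finishes, the -o flag leaves a JSON array of the scraped items in quotes.json (note that re-running with -o appends to the file rather than overwriting it). A small stdlib-only sketch for inspecting that output (the helper names are ours, and the field layout assumes the spider above):

```python
import json


def load_quotes(path):
    """Read the JSON array written by `scrapy runspider ... -o quotes.json`
    and return it as a list of {'text': ..., 'author': ...} dicts."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def authors(quotes):
    """Collect distinct author names, preserving first-seen order."""
    seen = {}
    for quote in quotes:
        seen.setdefault(quote["author"], True)
    return list(seen)
```

For example, `authors(load_quotes("quotes.json"))` would list each quoted author once, in the order the spider first encountered them.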



Architecture

Scrapy contains multiple components working together in an event-driven architecture. The main components are the Engine, Spiders, the Scheduler, and the Downloader. The data flow between these components is described in detail in the official documentation.
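To make that data flow concrete: a downloader middleware sits between the Engine and the Downloader and can inspect or rewrite every outgoing request. Like pipelines, middlewares are plain duck-typed classes with process_request/process_response hooks; the class name and header value below are invented for this sketch:

```python
# Sketch of a downloader middleware: it sees each request on its way
# from the Engine to the Downloader. Name and header value are invented.

class DefaultHeadersMiddleware:
    def process_request(self, request, spider):
        # Add a header unless the spider already set one. Returning None
        # tells the Engine to continue processing the request normally.
        request.headers.setdefault("Accept-Language", "en")
        return None
```

It would be enabled through the DOWNLOADER_MIDDLEWARES setting with an order number that controls where it sits in the chain, again without touching the framework core.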


17743 questions