Questions tagged [scrapy]

Scrapy is a fast, open-source, high-level screen scraping and web crawling framework written in Python, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data extraction to monitoring and automated testing.

Scrapy is a fast, high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing. Features of Scrapy include:

  • Designed with simplicity in mind
  • You only need to write the rules to extract the data from web pages and let Scrapy crawl the entire website for you
  • Designed with extensibility in mind, providing several mechanisms to plug in new code without touching the framework core (see the pipeline sketch after this list)
  • Portable, open-source, 100% Python
  • Written in Python and runs on Linux, Windows, macOS, and BSD.
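
As a small illustration of the extensibility point above, here is a minimal sketch of a custom item pipeline. Scrapy calls process_item for every item a spider yields; the class name, the 'text' field, and the myproject.pipelines path are illustrative assumptions, not part of the framework.

from scrapy.exceptions import DropItem


class DropEmptyQuotesPipeline:
    """Illustrative pipeline: discard items that have no 'text' field."""

    def process_item(self, item, spider):
        # Raising DropItem removes the item from further processing.
        if not item.get('text'):
            raise DropItem(f"Missing quote text in {item!r}")
        return item

A pipeline is enabled by listing it in the project's settings, for example:

ITEM_PIPELINES = {
    'myproject.pipelines.DropEmptyQuotesPipeline': 300,
}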

History:

Scrapy was born at London-based web-aggregation and e-commerce company Mydeco, where it was developed and maintained by employees of Mydeco and Insophia (a web-consulting company based in Montevideo, Uruguay). The first public release was in August 2008 under the BSD license, with a milestone 1.0 release happening in June 2015. In 2011, Zyte (formerly Scrapinghub) became the new official maintainer.


Installing Scrapy

You can install Scrapy and its dependencies from PyPI with:

pip install Scrapy

or to install Scrapy using conda, run:

conda install -c conda-forge scrapy
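
If the installation succeeded, the scrapy command-line tool should be available; you can check the installed version with:

scrapy version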

Example

Here’s the code for a spider that scrapes famous quotes from the website http://quotes.toscrape.com, following the pagination:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Put this in a text file, name it something like quotes_spider.py, and run the spider using the runspider command:

scrapy runspider quotes_spider.py -o quotes.json
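
If you prefer to launch the spider from a Python script rather than the command line, a minimal sketch using CrawlerProcess looks like this. It assumes the spider above is saved as quotes_spider.py next to the script, and that you are on Scrapy 2.1 or newer for the FEEDS setting (older versions use FEED_URI and FEED_FORMAT instead):

# run_quotes.py -- hypothetical wrapper script
from scrapy.crawler import CrawlerProcess

from quotes_spider import QuotesSpider  # the spider defined above

process = CrawlerProcess(settings={
    # Write the scraped items to quotes.json, like -o quotes.json does.
    'FEEDS': {'quotes.json': {'format': 'json'}},
})
process.crawl(QuotesSpider)
process.start()  # blocks here until the crawl is finished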


Architecture

Scrapy consists of multiple components working together in an event-driven architecture. The main components are the Engine, the Spiders, the Scheduler, and the Downloader. The data flow between these components is described in detail in the official documentation.
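
Extension hooks sit between these components; for example, a downloader middleware sits between the Engine and the Downloader and can inspect or modify every request and response that passes through. A minimal sketch (the class name, header value, and myproject.middlewares path are illustrative assumptions):

class LoggingHeadersMiddleware:
    """Illustrative downloader middleware: tag requests and log response codes."""

    def process_request(self, request, spider):
        # Called for every request the Engine sends towards the Downloader.
        request.headers.setdefault('User-Agent', 'my-crawler (+https://example.com)')
        return None  # returning None lets processing continue normally

    def process_response(self, request, response, spider):
        # Called for every response on its way back to the Engine and Spider.
        spider.logger.debug('%s <- %s', response.status, response.url)
        return response

It would be enabled in the project's settings with:

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.LoggingHeadersMiddleware': 543,
}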


Online resources:

  • Official website: https://scrapy.org/
  • Documentation: https://docs.scrapy.org/
  • Source code: https://github.com/scrapy/scrapy

17743 questions
3 votes, 3 answers

Persist items using a POST request within a Pipeline

I want to persist items within a Pipeline, posting them to a URL. I am using this code within the Pipeline: class XPipeline(object): def process_item(self, item, spider): log.msg('in SpotifylistPipeline', level=log.DEBUG) yield…
Migsy
3 votes, 2 answers

Using Python2 and scrapy ImportError: cannot import name suppress

Hi, I am trying to run a scraper on an Ubuntu/Windows machine. I have installed Scrapy 1.8.0 using Python 2. I am able to create a project, but when I run a scraper this error is shown. Traceback (most recent call last): File…
imgroot
3 votes, 0 answers

running scrapy CrawlerProcess as async

I would like to run scrapy along with another asyncio script in the same file, but am unable to. (Using the asyncio reactor in the settings: # TWISTED_REACTOR =…
fogx
3 votes, 1 answer

Getting latest chrome user agent for Scrapy in python or other wise

Recently I have started to use Scrapy on a regular basis to analyze sites which demand the latest browser (user agent) for their content to show up. Now, this may seem like an old time problem, yet up-to-date the issue is quite open. Why? There is…
rubmz
3 votes, 2 answers

Docker Authentication required error when pulling image from dockerhub

I am on Windows and trying to pull the scrapy-splash base image with PowerShell. The command is: docker pull scrapinghub/splash. I have Docker Desktop running, and I did docker login and successfully logged in. However, every time I get this error on…
bdemirka
3 votes, 2 answers

Scrapy: Can someone tell me why this code does not let me scrape the subsequent pages?

I'm a beginner learning how to webscrape using Scrapy in Python. Can someone point out what's wrong? My goal is to scrape all the subsequent pages. from indeed.items import IndeedItem import scrapy class IndeedSpider(scrapy.Spider): name =…
filo babo
3 votes, 2 answers

Scrapy: why I can't extract my targeted data from weather underground?

I am new to Python and web scraping and this is my first ever question on stackoverflow. I watched several tutorials and then I tried to extract data from the table on this page: https://www.wunderground.com/hourly/ir/tehran/date/2021-04-14. The…
Neil
3 votes, 2 answers

Scrapy crawling through pages with PostBack data javascript url doesn't change

I'm crawling through some directories with ASP.NET programming via Scrapy. The pages to crawl through are encoded as such: javascript:__doPostBack('MoreInfoListZbgs1$Pager','X') where X is an int between 1 and 180. The problem is that the url…
Lance Liao
3 votes, 2 answers

Python CrawlSpider

I've been learning how to use scrapy though I had minimal experience in python to begin with. I started learning how to scrape using the BaseSpider. Now I'm trying to crawl websites but I've encountered a problem that has really confuzzled me. Here…
3 votes, 1 answer

rotating proxies with scrapy with authentication

Just a noob question, but I can't seem to find the answer by googling. How can I use this package https://pypi.org/project/scrapy-rotating-proxies/ if the proxy requires a user/password? Do I just put it in the rotating list like…
nivekdrol
3 votes, 1 answer

Assigning data from Scrapy spider to a variable

I'm running a Scrapy spider inside a script and I want to assign the scraped data to a variable, rather than output to a file, and read that file to get the data. Right now the spider is outputting the data to a json file, I then read this data,…
3 votes, 1 answer

Crawling issue with loading page using Python (wait up to 5 seconds)

I am trying to crawl the webpage https://sec.report/, which seems to be protected by a certain server configuration. (I need the data for my master thesis). I have a list of company names, which I would like to get certain identifiers (CIK) from the…
lkick
3 votes, 3 answers

How do you extract an embedded attribute value from a previous attribute value in an XPath query?

I'm trying to "select" the link from the onclick attribute in the following portion of html but can't get any further than the…
emish
3 votes, 2 answers

How can I scrape the text from this popup window? [Python and Scrapy]

Please note - I'm very inexperienced and this is my first 'real' project. I'm going to try to explain my problem as best as I can; apologies if some of the terms are incorrect. I'm trying to scrape the following webpage -…
3 votes, 3 answers

Using Scrapy to parse site, follow Next Page, write as XML

My script works wonderfully when I comment out one piece of code: return items. Here is my code, changed to http://example.com since that appears to be what other people do, possibly to avoid 'scraping' legality issues. class…
Geo99M6Z