Questions tagged [scrapy]

Scrapy is a fast, open-source, high-level screen scraping and web crawling framework written in Python, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data extraction to monitoring and automated testing.

Scrapy is a fast, high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing. Features of Scrapy include:

  • Designed with simplicity in mind
  • You only need to write the rules to extract the data from web pages and let Scrapy crawl the entire website for you
  • Designed with extensibility in mind, providing several mechanisms to plug in new code without touching the framework core (see the pipeline sketch after this list)
  • Portable, open-source, 100% Python
  • Written in Python and runs on Linux, Windows, macOS, and BSD.
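
As a small illustration of the extensibility point above, here is a minimal sketch of a custom item pipeline. Scrapy calls process_item for every item a spider yields; the class name, the 'text' field, and the myproject.pipelines path are illustrative assumptions, not part of the framework.

from scrapy.exceptions import DropItem


class DropEmptyQuotesPipeline:
    """Illustrative pipeline: discard items that have no 'text' field."""

    def process_item(self, item, spider):
        # Raising DropItem removes the item from further processing.
        if not item.get('text'):
            raise DropItem(f"Missing quote text in {item!r}")
        return item

A pipeline is enabled by listing it in the project's settings, for example:

ITEM_PIPELINES = {
    'myproject.pipelines.DropEmptyQuotesPipeline': 300,
}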

History:

Scrapy was born at London-based web-aggregation and e-commerce company Mydeco, where it was developed and maintained by employees of Mydeco and Insophia (a web-consulting company based in Montevideo, Uruguay). The first public release was in August 2008 under the BSD license, with a milestone 1.0 release happening in June 2015. In 2011, Zyte (formerly Scrapinghub) became the new official maintainer.


Installing Scrapy

You can install Scrapy and its dependencies from PyPI with:

pip install Scrapy

or to install Scrapy using conda, run:

conda install -c conda-forge scrapy
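
If the installation succeeded, the scrapy command-line tool should be available; you can check the installed version with:

scrapy version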

Example

Here’s the code for a spider that scrapes famous quotes from the website http://quotes.toscrape.com, following the pagination:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Put this in a text file, name it something like quotes_spider.py, and run the spider using the runspider command:

scrapy runspider quotes_spider.py -o quotes.json
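
If you prefer to launch the spider from a Python script rather than the command line, a minimal sketch using CrawlerProcess looks like this. It assumes the spider above is saved as quotes_spider.py next to the script, and that you are on Scrapy 2.1 or newer for the FEEDS setting (older versions use FEED_URI and FEED_FORMAT instead):

# run_quotes.py -- hypothetical wrapper script
from scrapy.crawler import CrawlerProcess

from quotes_spider import QuotesSpider  # the spider defined above

process = CrawlerProcess(settings={
    # Write the scraped items to quotes.json, like -o quotes.json does.
    'FEEDS': {'quotes.json': {'format': 'json'}},
})
process.crawl(QuotesSpider)
process.start()  # blocks here until the crawl is finished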


Architecture

Scrapy consists of multiple components working together in an event-driven architecture. The main components are the Engine, the Spiders, the Scheduler, and the Downloader. The data flow between these components is described in detail in the official documentation.
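
Extension hooks sit between these components; for example, a downloader middleware sits between the Engine and the Downloader and can inspect or modify every request and response that passes through. A minimal sketch (the class name, header value, and myproject.middlewares path are illustrative assumptions):

class LoggingHeadersMiddleware:
    """Illustrative downloader middleware: tag requests and log response codes."""

    def process_request(self, request, spider):
        # Called for every request the Engine sends towards the Downloader.
        request.headers.setdefault('User-Agent', 'my-crawler (+https://example.com)')
        return None  # returning None lets processing continue normally

    def process_response(self, request, response, spider):
        # Called for every response on its way back to the Engine and Spider.
        spider.logger.debug('%s <- %s', response.status, response.url)
        return response

It would be enabled in the project's settings with:

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.LoggingHeadersMiddleware': 543,
}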


Online resources:

  • Official website: https://scrapy.org/
  • Documentation: https://docs.scrapy.org/
  • Source code: https://github.com/scrapy/scrapy

17743 questions
3 votes, 3 answers

Persist items using a POST request within a Pipeline

I want to persist items within a Pipeline, posting them to a URL. I am using this code within the Pipeline: class XPipeline(object): def process_item(self, item, spider): log.msg('in SpotifylistPipeline', level=log.DEBUG) yield…
Migsy
3 votes, 2 answers

Using Python2 and scrapy ImportError: cannot import name suppress

Hi, I am trying to run a scraper on an Ubuntu/Windows machine. I have installed Scrapy 1.8.0 using Python 2. I am able to create a project, but when I run a scraper this error is shown. Traceback (most recent call last): File…
imgroot
3 votes, 0 answers

running scrapy CrawlerProcess as async

I would like to run scrapy along with another asyncio script in the same file, but am unable to. (Using the asyncio reactor in the settings: # TWISTED_REACTOR =…
fogx
3 votes, 1 answer

Getting latest chrome user agent for Scrapy in python or other wise

Recently I have started to use Scrapy on a regular basis to analyze sites which demand the latest browser (user agent) for their content to show up. Now, this may seem like an old time problem, yet up-to-date the issue is quite open. Why? There is…
rubmz
3 votes, 2 answers

Docker Authentication required error when pulling image from dockerhub

I am on Windows and trying to pull the scrapy-splash base image with PowerShell. The command is: docker pull scrapinghub/splash. I have Docker Desktop running, and I did docker login and successfully logged in. However, every time I get this error on…
bdemirka
3 votes, 2 answers

Scrapy: Can someone tell me why this code does not let me scrape the subsequent pages?

I'm a beginner learning how to webscrape using Scrapy in Python. Can someone point out what's wrong? My goal is to scrape all the subsequent pages. from indeed.items import IndeedItem import scrapy class IndeedSpider(scrapy.Spider): name =…
filo babo
3 votes, 2 answers

Scrapy: why I can't extract my targeted data from weather underground?

I am new to Python and web scraping and this is my first ever question on stackoverflow. I watched several tutorials and then I tried to extract data from the table on this page: https://www.wunderground.com/hourly/ir/tehran/date/2021-04-14. The…
Neil
3 votes, 2 answers

Scrapy crawling through pages with PostBack data javascript url doesn't change

I'm crawling through some directories with ASP.NET programming via Scrapy. The pages to crawl through are encoded as such: javascript:__doPostBack('MoreInfoListZbgs1$Pager','X') where X is an int between 1 and 180. The problem is that the url…
Lance Liao
3 votes, 2 answers

Python CrawlSpider

I've been learning how to use scrapy though I had minimal experience in python to begin with. I started learning how to scrape using the BaseSpider. Now I'm trying to crawl websites but I've encountered a problem that has really confuzzled me. Here…
3 votes, 1 answer

rotating proxies with scrapy with authentication

Just a noob question, but I can't seem to find the answer by googling. How can I use this package https://pypi.org/project/scrapy-rotating-proxies/ if the proxy requires a user/password? Do I just put it in the rotating list like…
nivekdrol
3 votes, 1 answer

Assigning data from Scrapy spider to a variable

I'm running a Scrapy spider inside a script and I want to assign the scraped data to a variable, rather than output to a file, and read that file to get the data. Right now the spider is outputting the data to a json file, I then read this data,…
3 votes, 1 answer

Crawling issue with loading page using Python (wait up to 5 seconds)

I am trying to crawl the webpage https://sec.report/, which seems to be protected by a certain server configuration. (I need the data for my master thesis). I have a list of company names, which I would like to get certain identifiers (CIK) from the…
lkick
3 votes, 3 answers

How do you extract an embedded attribute value from a previous attribute value in an XPath query?

I'm trying to "select" the link from the onclick attribute in the following portion of html but can't get any further than the…
emish
3 votes, 2 answers

How can I scrape the text from this popup window? [Python and Scrapy]

Please note - I'm very inexperienced and this is my first 'real' project. I'm going to try to explain my problem as best as I can; apologies if some of the terms are incorrect. I'm trying to scrape the following webpage -…
3 votes, 3 answers

Using Scrapy to parse site, follow Next Page, write as XML

My script works wonderfully when I comment out one piece of code: return items. Here is my code, changed to http://example.com since that appears to be what other people do, possibly to avoid 'scraping' legality issues. class…
Geo99M6Z