
I came across Scrapy while needing both crawling and scraping. But, given the application requirements, I decided not to go with a monolithic approach; everything should be service based. So I decided to design two services:

  1. Get all URLs and their HTML, and upload them to S3.
  2. Scrape items from the HTML.

Why? Simple: today I may decide to scrape 10 items from a page, tomorrow I may want 20 (an application requirement). In that case I do not want to crawl the URL and download the HTML again, since the HTML stays the same (I am crawling only blog sites, where only comments get added and the content per URL remains the same).

The first service would be based on Scrapy. I was wondering whether we could use the same for scraping if we provide the HTML instead of a start URL, or whether we have to go with BeautifulSoup or some other scraping library.
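A rough sketch of what I have in mind for the first service is below (the spider name, start URL, bucket name and the boto3 upload are just illustrative, not final code):

import hashlib

import boto3
import scrapy


class HtmlArchiveSpider(scrapy.Spider):
    """Service 1: crawl pages and archive the raw HTML to S3."""
    name = "html_archive"                      # hypothetical spider name
    start_urls = ["https://example-blog.com"]  # hypothetical blog site

    s3 = boto3.client("s3")

    def parse(self, response):
        # Key the object by a hash of the URL so re-crawls overwrite the same object.
        key = hashlib.sha1(response.url.encode("utf-8")).hexdigest() + ".html"
        self.s3.put_object(
            Bucket="my-crawl-bucket",          # hypothetical bucket name
            Key=key,
            Body=response.body,
            Metadata={"source-url": response.url},
        )
        # Follow links so the crawl continues; item extraction is left to service 2.
        for href in response.css("a::attr(href)").extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse)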


1 Answer

Scrapy selectors (which allow extracting data from HTML/XML) are now packaged as an independent project called parsel.

If you can provide Unicode HTML strings from S3 to a parsel.Selector, you can do the same data extraction as in a "regular" live scrapy project.

Example from the docs:

>>> from parsel import Selector
>>> sel = Selector(text=u"""<html>
        <body>
            <h1>Hello, Parsel!</h1>
            <ul>
                <li><a href="http://example.com">Link 1</a></li>
                <li><a href="http://scrapy.org">Link 2</a></li>
            </ul>
        </body>
        </html>""")
>>>
>>> sel.css('h1::text').extract_first()
u'Hello, Parsel!'
>>>
>>> sel.css('h1::text').re('\w+')
[u'Hello', u'Parsel']
>>>
>>> for e in sel.css('ul > li'):
        print(e.xpath('.//a/@href').extract_first())
http://example.com
http://scrapy.org
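
To connect this to your S3 setup, here is a minimal sketch of what the second service could do (the boto3 usage, the bucket/key naming and the example selectors are assumptions on my part, not something parsel prescribes):

import boto3
from parsel import Selector

s3 = boto3.client("s3")

def scrape_items(bucket, key):
    # Fetch the previously archived HTML from S3 (assuming it was stored as UTF-8)
    # and run the same parsel extraction you would use inside a live Scrapy spider.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    sel = Selector(text=body.decode("utf-8"))
    return {
        # hypothetical items; add or change selectors as the application grows
        "title": sel.css("h1::text").extract_first(),
        "links": sel.css("a::attr(href)").extract(),
    }

# e.g. scrape_items("my-crawl-bucket", "<sha1-of-url>.html")

That way, adding new items later only means re-running this service over the HTML already stored in S3, with no re-crawl.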