
I came across Scrapy while needing both crawling and scraping. But, given the application requirements, I decided not to go with a monolithic approach; everything should be service based. So I decided to design two services:

  1. Get all URLs and their HTML, and upload them to S3.
  2. Scrape items from the HTML.

Why? Simple: today I may decide to scrape 10 items from a page, tomorrow I may want 20 (an application requirement). In that case I do not want to crawl the URL and download the HTML again, since the HTML stays the same (I am crawling only blog sites, where only comments get added and the content per URL remains the same).

The first service would be based on Scrapy. I was wondering whether we could use the same for scraping if we provide the HTML instead of a start URL, or whether we have to go with BeautifulSoup or some other scraping library.
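A rough sketch of what I have in mind for the first service is below (the spider name, start URL, bucket name and the boto3 upload are just illustrative, not final code):

import hashlib

import boto3
import scrapy


class HtmlArchiveSpider(scrapy.Spider):
    """Service 1: crawl pages and archive the raw HTML to S3."""
    name = "html_archive"                      # hypothetical spider name
    start_urls = ["https://example-blog.com"]  # hypothetical blog site

    s3 = boto3.client("s3")

    def parse(self, response):
        # Key the object by a hash of the URL so re-crawls overwrite the same object.
        key = hashlib.sha1(response.url.encode("utf-8")).hexdigest() + ".html"
        self.s3.put_object(
            Bucket="my-crawl-bucket",          # hypothetical bucket name
            Key=key,
            Body=response.body,
            Metadata={"source-url": response.url},
        )
        # Follow links so the crawl continues; item extraction is left to service 2.
        for href in response.css("a::attr(href)").extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse)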


1 Answer

Scrapy selectors (which allow extracting data from HTML/XML) are now packaged as an independent project called parsel.

If you can provide Unicode HTML strings from S3 to a parsel.Selector, you can do the same data extraction as in a "regular" live scrapy project.

Example from the docs:

>>> from parsel import Selector
>>> sel = Selector(text=u"""<html>
        <body>
            <h1>Hello, Parsel!</h1>
            <ul>
                <li><a href="http://example.com">Link 1</a></li>
                <li><a href="http://scrapy.org">Link 2</a></li>
            </ul>
        </body>
        </html>""")
>>>
>>> sel.css('h1::text').extract_first()
u'Hello, Parsel!'
>>>
>>> sel.css('h1::text').re('\w+')
[u'Hello', u'Parsel']
>>>
>>> for e in sel.css('ul > li'):
        print(e.xpath('.//a/@href').extract_first())
http://example.com
http://scrapy.org
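
To connect this to your S3 setup, here is a minimal sketch of what the second service could do (the boto3 usage, the bucket/key naming and the example selectors are assumptions on my part, not something parsel prescribes):

import boto3
from parsel import Selector

s3 = boto3.client("s3")

def scrape_items(bucket, key):
    # Fetch the previously archived HTML from S3 (assuming it was stored as UTF-8)
    # and run the same parsel extraction you would use inside a live Scrapy spider.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    sel = Selector(text=body.decode("utf-8"))
    return {
        # hypothetical items; add or change selectors as the application grows
        "title": sel.css("h1::text").extract_first(),
        "links": sel.css("a::attr(href)").extract(),
    }

# e.g. scrape_items("my-crawl-bucket", "<sha1-of-url>.html")

That way, adding new items later only means re-running this service over the HTML already stored in S3, with no re-crawl.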