I came across Scrapy while looking for a tool that handles both crawling and scraping. Given my application's requirements, though, I decided against a monolithic approach: everything should be service-based. So I decided to design two services:
- Crawl all URLs, fetch their HTML, and upload it to S3.
- Scrape items from the stored HTML.
Why? Simple: today I want to scrape 10 items from each page, tomorrow I may want 20 (an application requirement). In that case I do not want to crawl the URLs and fetch the HTML again, since the HTML will be the same (I am crawling only blog sites, where only comments get added and the content per URL stays the same).
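To make the split concrete, here is a rough sketch of what I have in mind, using only the standard library. Everything in it is a placeholder of mine: `page_store` is a plain dict standing in for the S3 bucket, and `store_page`, `extract_items` and `TagTextCollector` are hypothetical names, not part of Scrapy or any other library. The point is only that the second service reads stored HTML and can be asked for more fields later without any new crawl.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the S3 bucket: maps URL -> raw HTML.
page_store = {}


def store_page(url, html):
    """Service 1 (crawler) would call this once per crawled URL."""
    page_store[url] = html


class TagTextCollector(HTMLParser):
    """Collects the text found inside each of the requested tags."""

    def __init__(self, wanted_tags):
        super().__init__()
        self.items = {tag: [] for tag in wanted_tags}
        self.open_tags = []  # stack of requested tags that are currently open

    def handle_starttag(self, tag, attrs):
        if tag in self.items:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        text = data.strip()
        if self.open_tags and text:
            self.items[self.open_tags[-1]].append(text)


def extract_items(url, wanted_tags):
    """Service 2 (scraper): reads stored HTML and never touches the network."""
    collector = TagTextCollector(wanted_tags)
    collector.feed(page_store[url])
    collector.close()
    return collector.items


# Crawl once and store the page.
store_page(
    "https://example.com/post-1",
    "<html><head><title>Post 1</title></head>"
    "<body><h1>Hello</h1><p>First paragraph.</p></body></html>",
)

# Today: one field. Tomorrow: more fields, from the same stored HTML.
print(extract_items("https://example.com/post-1", ["title"]))
print(extract_items("https://example.com/post-1", ["title", "h1", "p"]))
```

The tag-based extraction is deliberately naive; a real scraper would use proper selectors, which is exactly the part I am asking about.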
The first service will be based on Scrapy. What I would like to know is whether Scrapy can also be used for the second service, i.e. whether I can feed it HTML directly instead of a start URL, or whether I have to go with BeautifulSoup or some other scraping library.