
Having downloaded the HTML to my hard disk with Scrapy (e.g., using the built-in Item Exporters with an HTML field, or storing all HTML files to a folder), how can I use Scrapy to read the data from my hard disk again and execute the next step in the pipeline? Is there something like an Item Importer?

David
  • Not really an answer about an "item importer", but [`HTTPCACHE_ENABLED=True`](https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#std:setting-HTTPCACHE_ENABLED) activates (by default) a file-system based cache of HTTP responses, so you can replay a crawl without much effort. – paul trmbrth Jun 20 '17 at 09:35
  • What I do not like about httpcache is that it stores thousands of files and they are not human readable. I would prefer a single, human readable file. – David Jun 20 '17 at 13:57
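
As a side note on the caching suggestion in the comments, a minimal settings.py sketch (these are standard Scrapy setting names; the values shown are only illustrative assumptions):

# illustrative cache settings in settings.py (values are assumptions)
HTTPCACHE_ENABLED = True          # replay responses from the local cache
HTTPCACHE_DIR = 'httpcache'       # stored under the project's .scrapy/ directory
HTTPCACHE_EXPIRATION_SECS = 0     # 0 means cached responses never expire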

1 Answer


If the HTML pages are stored on the local PC from which you run Scrapy, you can scrape URIs like:

file:///tmp/page1.html

using Scrapy. In this example, I assume one such page is stored in the file /tmp/page1.html.
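
A minimal sketch of this first option (the spider name, file path and selector below are hypothetical):

import scrapy

class LocalHtmlSpider(scrapy.Spider):
    # hypothetical spider that reads one page stored on the local disk
    name = 'local_html'
    start_urls = ['file:///tmp/page1.html']

    def parse(self, response):
        # the locally stored page is parsed exactly like a remote one
        yield {'title': response.css('h1.title::text').extract_first()}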

The second option is to read the content of the files by whatever means you like and build a Selector object manually, like this:

import scrapy

# read the content of the stored page into the page_content variable
with open('/tmp/page1.html', encoding='utf-8') as f:
    page_content = f.read()

root_sel = scrapy.Selector(text=page_content)

You can then process the root_sel selector as usual, e.g.

title = root_sel.css('h1.title').extract_first()
Tomáš Linhart
  • I tried option 1 to store all the URIs. The problem is that some URIs contain special characters and my backup solution is not able to process them. – David Jun 20 '17 at 13:59
  • Regarding the second option, is there a way to start the spider from the command line similar to "scrapy crawl spider -o items.json"? – David Jun 20 '17 at 14:01
  • Is there another library like lxml or BeautifulSoup that is better able to achieve this? – David Jun 20 '17 at 14:09
  • @David You can use the usual `scrapy crawl spider` way with the second option. All the content loading (reading from stored HTML pages) and parsing can take place in a loop inside `start_requests` method. – Tomáš Linhart Jun 20 '17 at 14:26
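
Following up on the last comment, a sketch of how the stored pages could be fed through `start_requests` so that the usual `scrapy crawl spider -o items.json` invocation keeps working (the spider name, directory and selector are hypothetical):

import glob
import scrapy

class StoredPagesSpider(scrapy.Spider):
    # hypothetical spider that replays HTML files saved on the local disk
    name = 'stored_pages'

    def start_requests(self):
        # loop over the stored pages and hand them to the normal parse() callback
        for path in glob.glob('/tmp/pages/*.html'):
            yield scrapy.Request('file://' + path, callback=self.parse)

    def parse(self, response):
        # from here on the usual item pipeline / exporters apply
        yield {'title': response.css('h1.title::text').extract_first()}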