
Having downloaded the HTML to my hard disk with Scrapy (e.g., using the built-in Item Exporters with an HTML field, or storing all HTML files to a folder), how can I use Scrapy to read the data from my hard disk again and execute the next step in the pipeline? Is there something like an Item Importer?

David
  • Not really an answer about an "item importer", but [`HTTPCACHE_ENABLED=True`](https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#std:setting-HTTPCACHE_ENABLED) activates (by default) a file-system based cache of HTTP responses, so you can replay a crawl without much effort. – paul trmbrth Jun 20 '17 at 09:35
  • What I do not like about httpcache is that it stores thousands of files and they are not human readable. I would prefer a single, human readable file. – David Jun 20 '17 at 13:57
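
As a side note on the caching suggestion in the comments, a minimal settings.py sketch (these are standard Scrapy setting names; the values shown are only illustrative assumptions):

# illustrative cache settings in settings.py (values are assumptions)
HTTPCACHE_ENABLED = True          # replay responses from the local cache
HTTPCACHE_DIR = 'httpcache'       # stored under the project's .scrapy/ directory
HTTPCACHE_EXPIRATION_SECS = 0     # 0 means cached responses never expire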

1 Answer


If the HTML pages are stored on the local PC from which you run Scrapy, you can scrape URIs like:

file:///tmp/page1.html

using Scrapy. In this example, I assume one such page is stored in the file /tmp/page1.html.
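
A minimal sketch of this first option (the spider name, file path and selector below are hypothetical):

import scrapy

class LocalHtmlSpider(scrapy.Spider):
    # hypothetical spider that reads one page stored on the local disk
    name = 'local_html'
    start_urls = ['file:///tmp/page1.html']

    def parse(self, response):
        # the locally stored page is parsed exactly like a remote one
        yield {'title': response.css('h1.title::text').extract_first()}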

The second option is to read the content of the files by whatever means you like and build a Selector object manually, like this:

import scrapy

# read the content of the stored page into the page_content variable
with open('/tmp/page1.html', encoding='utf-8') as f:
    page_content = f.read()

root_sel = scrapy.Selector(text=page_content)

You can then process the root_sel selector as usual, e.g.

title = root_sel.css('h1.title').extract_first()
Tomáš Linhart
  • I tried option 1 to store all the URIs. The problem is that some URIs contain special characters and my backup solution is not able to process them. – David Jun 20 '17 at 13:59
  • Regarding the second option, is there a way to start the spider from the command line similar to "scrapy crawl spider -o items.json"? – David Jun 20 '17 at 14:01
  • Is there another library like lxml or BeautifulSoup that is better able to achieve this? – David Jun 20 '17 at 14:09
  • @David You can use the usual `scrapy crawl spider` way with the second option. All the content loading (reading from stored HTML pages) and parsing can take place in a loop inside `start_requests` method. – Tomáš Linhart Jun 20 '17 at 14:26
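
Following up on the last comment, a sketch of how the stored pages could be fed through `start_requests` so that the usual `scrapy crawl spider -o items.json` invocation keeps working (the spider name, directory and selector are hypothetical):

import glob
import scrapy

class StoredPagesSpider(scrapy.Spider):
    # hypothetical spider that replays HTML files saved on the local disk
    name = 'stored_pages'

    def start_requests(self):
        # loop over the stored pages and hand them to the normal parse() callback
        for path in glob.glob('/tmp/pages/*.html'):
            yield scrapy.Request('file://' + path, callback=self.parse)

    def parse(self, response):
        # from here on the usual item pipeline / exporters apply
        yield {'title': response.css('h1.title::text').extract_first()}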