
For example, I have a site "www.example.com". I actually want to scrape the HTML of this site by saving it to my local system, so for testing I saved that page on my desktop as example.html

Now I have written the spider code for this as below:

class ExampleSpider(BaseSpider):
   name = "example"
   start_urls = ["example.html"]

   def parse(self, response):
       print response
       hxs = HtmlXPathSelector(response)

But when I run the above code I get this error:

ValueError: Missing scheme in request url: example.html

Finally, my intention is to scrape the example.html file, which contains the www.example.com HTML code, saved on my local system.

Can anyone suggest how to assign that example.html file in start_urls?

Thanks in advance

Shiva Krishna Bavandla

7 Answers


You can crawl a local file using a URL of the following form:

file:///path/to/file.html
iodbh
    it doesn't work with `127.0.0.1` in the path, but `file:///path/to/file.html` does – d-d Feb 06 '18 at 07:38
    You don't need the `127.0.0.1` unless you're serving the file over http, which then requires `http://127.0.0.1/...` instead of `file:///...` – Shadi Jul 05 '18 at 10:42
  • @iodbh It seems by the comments that the answer should be updated as file:///path/to/file.html. – Way Too Simple Oct 24 '19 at 12:21
  • @Armando Thanks for @'ing me, I've edited the answer. Interestingly, at the time of writing it, the odd `file:///127.0.0.1/...` was the correct way to do this for some mysterious reason. – iodbh Oct 24 '19 at 16:17
  • is there any way to crawl/get a local file with only the relative path? – oldboy May 23 '21 at 22:35
    For example: start_urls = ['file:///home/vagrant/code/project/storage/scraping.htm'] – Sadee Dec 20 '21 at 15:49

You can use the HTTPCacheMiddleware, which will give you the ability to do a spider run from cache. The documentation for the HTTPCacheMiddleware settings is located here.

Basically, adding the following settings to your settings.py will make it work:

HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0 # Set to 0 to never expire

This, however, requires an initial spider run from the web to populate the cache.

Sjaak Trekhaak

In Scrapy, you can scrape a local file using:

class ExampleSpider(BaseSpider):
   name = "example"
   start_urls = ["file:///path_of_directory/example.html"]

   def parse(self, response):
       print response
       hxs = HtmlXPathSelector(response)

I suggest you check it using: scrapy shell 'file:///path_of_directory/example.html'

Gautam Kumar

Just to share the way that I like to do this scraping with local files:

import scrapy
import os

LOCAL_FILENAME = 'example.html'
LOCAL_FOLDER = 'html_files'
BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))


class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = [
        f"file://{BASE_DIR}/{LOCAL_FOLDER}/{LOCAL_FILENAME}"
    ]

I'm using f-strings (Python 3.6+, https://www.python.org/dev/peps/pep-0498/), but you can switch to %-formatting or str.format() if you prefer.
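A related standard-library option, not from the answer above: `pathlib.Path.as_uri()` builds a well-formed `file://` URL from a path, which also handles Windows drive letters and percent-encoding. A minimal sketch (the folder and filename are the ones assumed above):

```python
from pathlib import Path

# resolve() makes the path absolute; as_uri() turns it
# into a well-formed file:// URL.
local_file = Path("html_files") / "example.html"
file_url = local_file.resolve().as_uri()
print(file_url)  # e.g. file:///home/user/project/html_files/example.html
```

The resulting string can go straight into `start_urls`.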

neosergio
scrapy shell "file:E:\folder\to\your\script\Scrapy\teste1\teste1.html"

This works for me today on Windows 10. I had to put the full path, without the `////` after `file:`.

Rhuan Barros

You can simply do:

from scrapy import Request

def start_requests(self):
    yield Request(url='file:///path_of_directory/example.html')
Umair Ayub

If you view the source code of Scrapy's Request class (for example on GitHub), you can see that Scrapy sends a request to an HTTP server and gets the needed page back in the response. Your filesystem is not an HTTP server. For testing purposes with Scrapy, you must set up an HTTP server, and then you can assign URLs to Scrapy like

http://127.0.0.1/example.html
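If you do want to go the HTTP-server route, the standard library can serve a directory for you; a minimal sketch (the port is OS-assigned and the filename is an assumption, matching the question):

```python
import http.server
import threading

# Serve the current directory over HTTP on an OS-assigned free port,
# in a daemon thread, so a spider can fetch pages from it.
server = http.server.ThreadingHTTPServer(
    ("127.0.0.1", 0), http.server.SimpleHTTPRequestHandler
)
threading.Thread(target=server.serve_forever, daemon=True).start()

port = server.server_address[1]
# A start URL for the spider would then look like:
url = f"http://127.0.0.1:{port}/example.html"
print(url)
```

Note that, per the other answers, Scrapy does handle `file://` URLs directly, so a server like this is only needed if you specifically want to test over HTTP.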
Denis