
For example, I have a site "www.example.com". I actually want to scrape the HTML of this site by saving it to my local system, so for testing I saved that page on my desktop as example.html

Now I have written the spider code for this as below:

class ExampleSpider(BaseSpider):
   name = "example"
   start_urls = ["example.html"]

   def parse(self, response):
       print response
       hxs = HtmlXPathSelector(response)

But when I run the above code I get this error:

ValueError: Missing scheme in request url: example.html

Finally, my intention is to scrape the example.html file, which contains the www.example.com HTML code, saved on my local system.

Can anyone suggest how to assign that example.html file in start_urls?

Thanks in advance

Shiva Krishna Bavandla

7 Answers


You can crawl a local file using a URL of the following form:

file:///path/to/file.html
iodbh
    it doesn't work with `127.0.0.1` in the path, but `file:///path/to/file.html` does – d-d Feb 06 '18 at 07:38
    You don't need the `127.0.0.1` unless you're serving the file over http, which then requires `http://127.0.0.1/...` instead of `file:///...` – Shadi Jul 05 '18 at 10:42
  • @iodbh It seems by the comments that the answer should be updated as file:///path/to/file.html. – Way Too Simple Oct 24 '19 at 12:21
  • @Armando Thanks for @'ing me, I've edited the answer. Interestingly, at the time of writing it, the odd `file:///127.0.0.1/...` was the correct way to do this for some mysterious reason. – iodbh Oct 24 '19 at 16:17
  • is there any way to crawl/get a local file with only the relative path? – oldboy May 23 '21 at 22:35
    For example: start_urls = ['file:///home/vagrant/code/project/storage/scraping.htm'] – Sadee Dec 20 '21 at 15:49

You can use the HTTPCacheMiddleware, which will give you the ability to do a spider run from cache. The documentation for the HTTPCacheMiddleware settings is located here.

Basically, adding the following settings to your settings.py will make it work:

HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0 # Set to 0 to never expire

This, however, requires an initial spider run from the web to populate the cache.

Sjaak Trekhaak

In Scrapy, you can scrape a local file using:

class ExampleSpider(BaseSpider):
   name = "example"
   start_urls = ["file:///path_of_directory/example.html"]

   def parse(self, response):
       print response
       hxs = HtmlXPathSelector(response)

I suggest you check it using: scrapy shell 'file:///path_of_directory/example.html'

Gautam Kumar

Just to share the way that I like to do this scraping with local files:

import scrapy
import os

LOCAL_FILENAME = 'example.html'
LOCAL_FOLDER = 'html_files'
BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))


class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = [
        f"file://{BASE_DIR}/{LOCAL_FOLDER}/{LOCAL_FILENAME}"
    ]

I'm using f-strings (Python 3.6+, https://www.python.org/dev/peps/pep-0498/), but you can switch to %-formatting or str.format() if you prefer.
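A related standard-library option, not from the answer above: `pathlib.Path.as_uri()` builds a well-formed `file://` URL from a path, which also handles Windows drive letters and percent-encoding. A minimal sketch (the folder and filename are the ones assumed above):

```python
from pathlib import Path

# resolve() makes the path absolute; as_uri() turns it
# into a well-formed file:// URL.
local_file = Path("html_files") / "example.html"
file_url = local_file.resolve().as_uri()
print(file_url)  # e.g. file:///home/user/project/html_files/example.html
```

The resulting string can go straight into `start_urls`.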

neosergio
scrapy shell "file:E:\folder\to\your\script\Scrapy\teste1\teste1.html"

This works for me today on Windows 10. I had to put the full path, without the `////` after `file:`.

Rhuan Barros

You can simply do:

from scrapy import Request

def start_requests(self):
    yield Request(url='file:///path_of_directory/example.html')
Umair Ayub

If you view the source code of Scrapy's Request class (for example on GitHub), you can see that Scrapy sends a request to an HTTP server and gets the needed page back in the response. Your filesystem is not an HTTP server. For testing purposes with Scrapy, you must set up an HTTP server, and then you can assign URLs to Scrapy like

http://127.0.0.1/example.html
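If you do want to go the HTTP-server route, the standard library can serve a directory for you; a minimal sketch (the port is OS-assigned and the filename is an assumption, matching the question):

```python
import http.server
import threading

# Serve the current directory over HTTP on an OS-assigned free port,
# in a daemon thread, so a spider can fetch pages from it.
server = http.server.ThreadingHTTPServer(
    ("127.0.0.1", 0), http.server.SimpleHTTPRequestHandler
)
threading.Thread(target=server.serve_forever, daemon=True).start()

port = server.server_address[1]
# A start URL for the spider would then look like:
url = f"http://127.0.0.1:{port}/example.html"
print(url)
```

Note that, per the other answers, Scrapy does handle `file://` URLs directly, so a server like this is only needed if you specifically want to test over HTTP.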
Denis