
I need to parse hundreds of HTML files that are archived on a server. The files are accessed via a UNC path, and I use pathlib's as_uri() method to convert the UNC path to a URI.

An example of a full UNC path: \\dmsupportfs\~images\sandbox\test.html

from urllib.request import urlopen
from bs4 import BeautifulSoup
import os, pathlib

source_path = os.path.normpath('//dmsupportfs/~images/sandbox/') + os.sep
filename = 'test.html'

full_path = source_path + filename
url = pathlib.Path(full_path).as_uri()
print('URL -> ' + url)
url_html = urlopen(url).read()

So the URI(L) I'm passing to urlopen is: file://dmsupportfs/%7Eimages/sandbox/test.html

I can plug this into any web browser and get the page back. However, when urlopen goes to read the page, it ignores/removes the server name (dmsupportfs) from the URI, so the read fails because the file can't be found. I assume this has something to do with how urlopen processes the URI, but I'm stumped at this point (it's likely something quick and easy to resolve; sorry, I'm a bit new to Python). If I map the UNC location to a drive letter and use the mapped drive letter instead of the UNC path, it works without any issue. I'd rather not depend on a mapped drive to accomplish this, though. Any advice?

Below is the output from the above code showing the error:

Traceback (most recent call last):
  File "C:\Anaconda3\lib\urllib\request.py", line 1474, in open_local_file
    stats = os.stat(localfile)
FileNotFoundError: [WinError 3] The system cannot find the path specified: '\\~images\\sandbox\\test.html'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "url_test.py", line 10, in <module>
    url_html = urlopen(url).read()
  File "C:\Anaconda3\lib\urllib\request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Anaconda3\lib\urllib\request.py", line 526, in open
    response = self._open(req, data)
  File "C:\Anaconda3\lib\urllib\request.py", line 544, in _open
    '_open', req)
  File "C:\Anaconda3\lib\urllib\request.py", line 504, in _call_chain
    result = func(*args)
  File "C:\Anaconda3\lib\urllib\request.py", line 1452, in file_open
    return self.open_local_file(req)
  File "C:\Anaconda3\lib\urllib\request.py", line 1491, in open_local_file
    raise URLError(exp)
urllib.error.URLError: <urlopen error [WinError 3] The system cannot find the path specified: '\\~images\\sandbox\\test.html'>

UPDATE: Digging through the traceback above and the urllib source, I found the method below, which essentially tells me that what I'm trying to do with a file:// URI isn't going to work for a remote server.

def file_open(self, req):
    url = req.selector
    if url[:2] == '//' and url[2:3] != '/' and (req.host and
            req.host != 'localhost'):
        if not req.host in self.get_names():
            raise URLError("file:// scheme is supported only on localhost")
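To see where the server name ends up, here is a minimal sketch that only inspects how urllib.request splits the URI, without touching the network or the file system. My reading of the traceback (an assumption, not something confirmed beyond the output above) is that open_local_file builds the local filename from the path portion only, which would explain why the host is missing from the path in the error message.

```python
from urllib.request import Request

# The same URI that pathlib produced for the UNC path in the question.
url = 'file://dmsupportfs/%7Eimages/sandbox/test.html'

req = Request(url)

# urllib stores the server name and the path as separate attributes.
print('scheme   ->', req.type)
print('host     ->', req.host)
print('selector ->', req.selector)
```

The server name is kept in req.host, while req.selector holds only the path, which lines up with the '\\~images\\sandbox\\test.html' seen in the traceback.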

Any ideas then on how to get this to work without mapping a drive?

1 Answer

So I replaced this:

url = pathlib.Path(full_path).as_uri()    
url_html = urlopen(url).read()

with this:

with open(full_path) as url_html:
    soup = BeautifulSoup(url_html, 'html.parser')  # parser choice is an assumption

and was able to pass that into BeautifulSoup and parse as needed...
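Since the original goal was to parse hundreds of files, the same idea can be extended to a whole folder. This is only a sketch under a few assumptions: the helper name iter_html_files is made up for illustration, the files are assumed to sit directly in one directory with an .html extension, and UTF-8 is assumed as the encoding (undecodable bytes are replaced rather than raising).

```python
from pathlib import Path


def iter_html_files(source_dir):
    """Yield (path, text) for each .html file directly inside source_dir.

    Hypothetical helper: the name, the '*.html' pattern and the UTF-8
    encoding are assumptions, not something taken from the question.
    """
    for path in sorted(Path(source_dir).glob('*.html')):
        # errors='replace' keeps one badly encoded file from stopping the run.
        yield path, path.read_text(encoding='utf-8', errors='replace')


# Possible usage with the UNC path from the question (not executed here):
#
#     from bs4 import BeautifulSoup
#     for path, html in iter_html_files(r'\\dmsupportfs\~images\sandbox'):
#         soup = BeautifulSoup(html, 'html.parser')
#         print(path.name, soup.title)
```

Because the files are opened as ordinary paths, no file:// URI or mapped drive is involved.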