I need to parse hundreds of HTML files that are archived on a server. The files are accessed via a UNC path, and I use pathlib's as_uri() method to convert that UNC path to a URI.
An example of a full UNC path: \\dmsupportfs\~images\sandbox\test.html
from urllib.request import urlopen
from bs4 import BeautifulSoup
import os, pathlib
source_path = os.path.normpath('//dmsupportfs/~images/sandbox/') + os.sep
filename = 'test.html'
full_path = source_path + filename
url = pathlib.Path(full_path).as_uri()
print('URL -> ' + url)
url_html = urlopen(url).read()
So the URI I'm passing to urlopen is: file://dmsupportfs/%7Eimages/sandbox/test.html
I can paste this into any web browser and the page loads. However, when urlopen goes to read the page, it drops the server name (dmsupportfs) from the URI, so the read fails because the file cannot be found. I assume this has to do with how urlopen processes the URI, but I'm stumped at this point (likely something quick and easy to resolve; sorry, I'm a bit new to Python). If I map the UNC location to a drive letter and use the mapped drive letter instead of the UNC path, it works without any issue. I'd rather not depend on a mapped drive, though. Any advice?
Below is the output from the above code showing the error:
Traceback (most recent call last):
  File "C:\Anaconda3\lib\urllib\request.py", line 1474, in open_local_file
    stats = os.stat(localfile)
FileNotFoundError: [WinError 3] The system cannot find the path specified: '\\~images\\sandbox\\test.html'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "url_test.py", line 10, in <module>
    url_html = urlopen(url).read()
  File "C:\Anaconda3\lib\urllib\request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Anaconda3\lib\urllib\request.py", line 526, in open
    response = self._open(req, data)
  File "C:\Anaconda3\lib\urllib\request.py", line 544, in _open
    '_open', req)
  File "C:\Anaconda3\lib\urllib\request.py", line 504, in _call_chain
    result = func(*args)
  File "C:\Anaconda3\lib\urllib\request.py", line 1452, in file_open
    return self.open_local_file(req)
  File "C:\Anaconda3\lib\urllib\request.py", line 1491, in open_local_file
    raise URLError(exp)
urllib.error.URLError: <urlopen error [WinError 3] The system cannot find the path specified: '\\~images\\sandbox\\test.html'>
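The path in the error is missing the server name. A small sketch of how the URI splits may help explain why: the server ends up in the host part, and only the path part appears to be used as the local file name. This uses only urllib.parse and the example URI from this question; the variable names are illustrative.

```python
from urllib.parse import urlsplit, unquote

url = 'file://dmsupportfs/%7Eimages/sandbox/test.html'
parts = urlsplit(url)

# The server name is parsed as the host, separate from the path.
host = parts.netloc
# Only the path component is percent-decoded here; the host is not part of it.
local_part = unquote(parts.path)

print('host ->', host)        # host -> dmsupportfs
print('path ->', local_part)  # path -> /~images/sandbox/test.html
```

On Windows the path part is then converted to backslashes, which matches the '\\~images\\sandbox\\test.html' shown in the error.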
UPDATE: Digging through the traceback above and the methods it points to, I found the following, which essentially tells me that what I'm trying to do with a file:// URI isn't going to work for a remote server.
def file_open(self, req):
    url = req.selector
    if url[:2] == '//' and url[2:3] != '/' and (req.host and
            req.host != 'localhost'):
        if not req.host in self.get_names():
            raise URLError("file:// scheme is supported only on localhost")
Any ideas then on how to get this to work without mapping a drive?
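One possible direction, sketched under the assumption that the share is readable by the account running the script: since the files are reachable as ordinary paths, Python's regular file APIs can read a UNC path directly, with no file:// URI or urlopen involved. The helper name read_html_bytes is hypothetical, and the commented usage lines assume the example path from this question.

```python
from pathlib import Path


def read_html_bytes(path):
    """Return the raw bytes of the file at `path` (hypothetical helper)."""
    return Path(path).read_bytes()


# Hypothetical usage on Windows with the share from this question:
# html = read_html_bytes(r'\\dmsupportfs\~images\sandbox\test.html')
# soup = BeautifulSoup(html, 'html.parser')
```

The bytes returned this way can be handed to BeautifulSoup just like the result of urlopen(url).read().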