Distinguish a filename from a URL

Question

In WeasyPrint’s public API I accept either filenames or URLs (among other types) for the HTML input:

document = HTML(filename='/foo/bar/baz.html')
document = HTML(url='http://example.net/bar/baz.html')

There is also the option not to name the argument and let WeasyPrint guess its type:

document = HTML(sys.argv[1])

Some cases are easy: if it starts with a / on Unix it’s a filename, if it starts with http:// it’s probably a URL. But we need a general algorithm that gives an answer for any string.

Currently I try to match this regexp: ^([a-z][a-z0-9.+-]*):. A string that matches starts with a valid URI scheme according to RFC 3986 (URI). This is not bad on Unix, but fails utterly on Windows: C:\foo\bar.html matches and is treated like a URL.

I could change the * to + in the regexp and only match URI schemes that are at least two characters long. Apparently there is no known URI scheme shorter than that.
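A rough sketch of that two-character rule, for illustration only. The helper name `looks_like_url` is made up here, and the pattern assumes the digit range `0-9` that RFC 3986 allows in a scheme:

```python
import re

# RFC 3986 scheme: a letter followed by letters, digits, "+", "-" or ".".
# Requiring at least two characters keeps Windows drive letters such as
# "C:" from being mistaken for a scheme.
SCHEME_RE = re.compile(r'^[a-z][a-z0-9.+-]+:', re.IGNORECASE)


def looks_like_url(string):
    """Hypothetical helper: True if the string starts with a 2+ char scheme."""
    return SCHEME_RE.match(string) is not None


print(looks_like_url('http://example.net/bar/baz.html'))  # True
print(looks_like_url(r'C:\foo\bar.html'))                 # False
print(looks_like_url('/foo/bar/baz.html'))                # False
```

This still misclassifies a relative Unix filename that happens to contain a colon early on, such as `notes:draft.html`, so it remains a guess.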

Or is there a better criterion? Maybe I should just restrict "guessed" URLs to a handful of schemes. More exotic cases can still use HTML(url=foo).

url.startswith(('http:', 'https:', 'ftp:', 'data:'))

Possible duplicate of [Argument is URL or path](https://stackoverflow.com/questions/7849818/argument-is-url-or-path) — akaihola, Aug 15 '18 at 13:10

score 4 · Accepted Answer · answered Jul 27 '12 at 12:42

If you really must guess between filenames and URLs, I'd say that a string starting with two or more word characters followed by a colon is a URL and anything else is a file, just as you suggest.

Another option: try to open it as a file. If it fails, try to open it as a URL.
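A minimal sketch of that fallback, assuming Python 3 and the standard library only. The function name `open_guessed` is hypothetical and not part of WeasyPrint:

```python
from urllib.request import urlopen


def open_guessed(string):
    """Hypothetical helper: return a binary file-like object for the input."""
    try:
        # First guess: the string names a local file.
        return open(string, 'rb')
    except OSError:
        # Assumption: anything that cannot be opened as a file is a URL.
        return urlopen(string)
```

One drawback of this order is that a mistyped filename ends up reported as a URL error, which can confuse the user.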

Better might be to listen to the Zen of Python: "In the face of ambiguity, refuse the temptation to guess." Doesn't the caller know whether they're talking about a filename or a URL? Have them specify it.

Ned Batchelder

I like both suggestions, thanks. The caller should know and can name the argument to avoid the guessing, I’d just like to give more options. In particular the guessing is used when taking strings from `sys.argv` in the command-line API. – Simon Sapin Jul 27 '12 at 13:04

score 2 · Answer 2 · answered Jul 27 '12 at 12:40

The correct thing is to accept file-like objects, not paths.

Then I can pass you a file, a retrieved URL, or some other thing you haven't thought of.
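Roughly what that could look like, as a sketch. `parse_html` is a made-up name standing in for whatever the library's entry point would be:

```python
import io


def parse_html(source):
    """Hypothetical entry point that only accepts file-like objects."""
    if not hasattr(source, 'read'):
        raise TypeError('expected an object with a read() method')
    return source.read()


# The caller decides where the bytes come from: an opened file, a
# retrieved URL, or an in-memory buffer like this one.
print(parse_html(io.BytesIO(b'<p>in memory</p>')))  # b'<p>in memory</p>'
```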

Julian

Actually [the API](http://weasyprint.org/using/#the-weasyprint-html-class) also accepts file-like objects, either explicitly with `HTML(file_obj=foo)` or as the unnamed argument if it has a `read()` method. Still, having a filename or a URL in a string can be more convenient, especially for the command-line API. – Simon Sapin Jul 27 '12 at 12:51
Seems like it's just making you work more to gain no extra functionality then :) (and speaking as a user, questionable convenience). For a command line API I'd expect you to be opening a file for the user using `argv`, yes, but that'd be before passing to this function. – Julian Jul 27 '12 at 12:57
Indeed there is more important functionality in this project :) For the command-line API do you mean trying to `open()` the file and treating it as a URL if that fails, as Ned suggested? – Simon Sapin Jul 27 '12 at 13:08
That'd work, yep. Or you could add a `--url` flag if that looks easier to you. – Julian Jul 27 '12 at 13:10
A `--url` flag is the kind of burden for the user I want to avoid. `weasyprint http://acid2.acidtests.org/ out.pdf` and `weasyprint ../foo.html out.pdf` should both Just Work®, IMO. – Simon Sapin Jul 27 '12 at 13:12
Well, then the other option :). – Julian Jul 27 '12 at 13:13

score 0 · Answer 3 · answered Jul 27 '12 at 12:57

If you want, you could check the scheme returned by urlparse.

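try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2

# `url` is the string passed in by the caller
scheme = urlparse(url).scheme
if not scheme or scheme == 'file':
    pass  # treat it as a file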

Jon Clements

You seem to assume that everything is a proper URL and some of these have the `file://` scheme. file URLs are fine, but they are just URLs. I’m interested in filenames in OS-specific format. – Simon Sapin Jul 27 '12 at 12:59
It doesn't work with Windows paths: `>>> urllib.parse.urlparse(r"C:\WinPython\basedir35").scheme` `'c'` – rlaverde Jul 16 '17 at 13:30
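One way to keep the urlparse approach while sidestepping the drive-letter problem is to ignore one-letter schemes. This is only a sketch for Python 3, and `guess_is_url` is a hypothetical name:

```python
from urllib.parse import urlparse


def guess_is_url(string):
    """Hypothetical helper built on urlparse rather than a regexp."""
    scheme = urlparse(string).scheme
    # A one-letter "scheme" is far more likely to be a Windows drive letter.
    return len(scheme) >= 2


print(urlparse(r'C:\WinPython\basedir35').scheme)   # c
print(guess_is_url(r'C:\WinPython\basedir35'))      # False
print(guess_is_url('http://acid2.acidtests.org/'))  # True
print(guess_is_url('../foo.html'))                  # False
```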
