In WeasyPrint’s public API I accept either filenames or URLs (among other types) for the HTML input:
document = HTML(filename='/foo/bar/baz.html')
document = HTML(url='http://example.net/bar/baz.html')
There is also the option not to name the argument and let WeasyPrint guess its type:
document = HTML(sys.argv[1])
Some cases are easy: if it starts with a / on Unix it’s a filename, and if it starts with http:// it’s probably a URL. But we need a general algorithm that gives an answer for any string.
Currently I try to match the regexp ^([a-z][a-z0-9.+-]*): against the string. A string that matches starts with a valid URI scheme according to RFC 3986 (URI). This is not bad on Unix, but it utterly fails on Windows: C:\foo\bar.html matches and is treated like a URL.
I could change the * to + in the regexp and only match URI schemes that are at least two characters long. Apparently there is no known URI scheme shorter than that.
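To make the difference concrete, here is a small sketch that runs both variants of the regexp against a URL, a Windows path, and a Unix path. The names ANY_SCHEME, LONG_SCHEME, samples and results are made up for this illustration, and the re.IGNORECASE flag is an assumption: the drive letter in C:\foo\bar.html is uppercase, so the pattern only matches it if matching is case-insensitive (URI schemes themselves are case-insensitive).

```python
import re

# Scheme syntax from RFC 3986: a letter, then letters, digits, "+", "-" or ".".
# re.IGNORECASE is an assumption made for this sketch.
ANY_SCHEME = re.compile(r'^([a-z][a-z0-9.+-]*):', re.IGNORECASE)
# Same pattern with "+" instead of "*": the scheme needs at least two characters.
LONG_SCHEME = re.compile(r'^([a-z][a-z0-9.+-]+):', re.IGNORECASE)

samples = [
    'http://example.net/bar/baz.html',
    'C:\\foo\\bar.html',
    '/foo/bar/baz.html',
]

# Map each sample to (matches the "*" pattern, matches the "+" pattern).
results = {
    s: (bool(ANY_SCHEME.match(s)), bool(LONG_SCHEME.match(s)))
    for s in samples
}

for s, (any_length, two_or_more) in results.items():
    print(f'{s!r}: any-length scheme={any_length}, two-or-more={two_or_more}')
```

With the * pattern the Windows path is taken for a URL with scheme "C"; with the + pattern it is not, while the http:// URL matches either way.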
Or is there a better criterion? Maybe I should just restrict "guessed" URLs to a handful of schemes, with a check like the one below. More exotic cases can still use HTML(url=foo).
url.startswith(('http:', 'https:', 'ftp:', 'data:'))
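A minimal sketch of the whitelist idea, wrapped in a function so it can be tried on a few inputs. The function name guess_input_type and the constant GUESSED_URL_PREFIXES are hypothetical and not part of WeasyPrint's API; lowercasing the input before the check is also an assumption, made because URI schemes are case-insensitive. Note that str.startswith takes a tuple of prefixes, not a list.

```python
# Hypothetical whitelist for this sketch: the four schemes mentioned above.
GUESSED_URL_PREFIXES = ('http:', 'https:', 'ftp:', 'data:')


def guess_input_type(string):
    """Return 'url' if the string starts with a whitelisted scheme, else 'filename'."""
    # str.startswith accepts a single string or a tuple of strings.
    # Lowercasing is an assumption: it makes 'HTTP://...' count as a URL too.
    if string.lower().startswith(GUESSED_URL_PREFIXES):
        return 'url'
    return 'filename'


for candidate in ['http://example.net/bar/baz.html',
                  'C:\\foo\\bar.html',
                  '/foo/bar/baz.html',
                  'mailto:someone@example.net']:
    print(repr(candidate), '->', guess_input_type(candidate))
```

The trade-off is that anything outside the whitelist, such as the mailto: example, is guessed to be a filename, so those cases would have to go through the explicit url= argument.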