Python urlparse: small issue

Question

I'm making an app that parses HTML and gets images from it. Parsing is easy with Beautiful Soup, and downloading the HTML and the images with urllib2 works too.

I do have a problem using urlparse to turn relative paths into absolute ones. The problem is best explained with an example:

>>> import urlparse
>>> urlparse.urljoin("http://www.example.com/", "../test.png")
'http://www.example.com/../test.png'

As you can see, urlparse doesn't strip the ../ from the result. This causes a problem when I try to download the image:

HTTPError: HTTP Error 400: Bad Request

Is there a way to fix this problem in urllib?

A relative href="../test.png" works but not href="http://www.example.com/../test.png" ? — Paulo Scardine, Nov 06 '10 at 17:46

score 3 · Answer 1 · answered Nov 06 '10 at 17:30

".." would bring you up one directory ("." is the current directory), so combining it with a URL that has no path beyond the domain name doesn't make much sense. Maybe what you need is:

>>> urlparse.urljoin("http://www.example.com","./test.png")
'http://www.example.com/test.png'

rtpg

While this is a solution, it won't work in my case: my application has to be able to retrieve images from any website. I can't just replace "../" with "./" because that would break other sites where the link really is supposed to point at the parent directory. – Mew Nov 06 '10 at 17:35
urlparse.urljoin("http://www.example.com/dir/","../test.png") works for me (I get 'http://www.example.com/test.png'). I guess it's just that ".." doesn't mean anything in your context (what is one directory up from the root?). At least I don't think it does. – rtpg Nov 06 '10 at 17:41

vhallac · Accepted Answer · 2010-11-06T17:55:00.520

2

I think the best you can do is to pre-parse the base URL and check its path component. A simple test is

if len(urlparse.urlparse(baseurl).path) > 1:

which is true only when the base URL has a path deeper than the root. You can then combine it with the indexing suggested by demas. For example:

# Drop the leading ".." only when the base URL has no directory to climb out of.
start_offset = 0 if len(urlparse.urlparse(baseurl).path) > 1 else 2
img_url = urlparse.urljoin(baseurl, "../test.png"[start_offset:])

This way, you will not attempt to go to the parent of the root URL.
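
For what it's worth, here is a rough sketch of the same idea wrapped in a function. It assumes Python 3, where the functions live in urllib.parse rather than the urlparse module used in this thread, and the name join_from_root_safe is made up for illustration. It is also a slight variation on the slicing above: it strips whole "../" prefixes (possibly several) instead of two characters, and it only handles the case where the base URL is the site root.

```python
from urllib.parse import urljoin, urlparse


def join_from_root_safe(base_url, relative):
    """Join relative onto base_url, ignoring leading '../' when base_url is the site root.

    Hypothetical helper, not part of any library.
    """
    base_path = urlparse(base_url).path
    if len(base_path) <= 1:
        # The path is "" or "/": there is no parent directory to climb to,
        # so leading "../" segments are dropped before joining.
        while relative.startswith("../"):
            relative = relative[3:]
    return urljoin(base_url, relative)


if __name__ == "__main__":
    print(join_from_root_safe("http://www.example.com/", "../test.png"))
    print(join_from_root_safe("http://www.example.com/dir/", "../test.png"))
```

When the base URL does have a directory, the relative reference is passed to urljoin untouched, so sites that really mean "parent directory" keep working.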

edited Nov 06 '10 at 17:55

answered Nov 06 '10 at 17:48

Thanks, I'll go this route and implement something like that. – Mew Nov 06 '10 at 18:09

jfs · Answer 3 · 2010-11-07T20:06:51.093

1

If you'd like /../test to mean the same as /test, the way paths behave in a file system, then you could use normpath():

>>> import posixpath, urlparse
>>> url = urlparse.urljoin("http://example.com/", "../test")
>>> p = urlparse.urlparse(url)
>>> path = posixpath.normpath(p.path)
>>> urlparse.urlunparse((p.scheme, p.netloc, path, p.params, p.query, p.fragment))
'http://example.com/test'
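
A possible way to package this for reuse, as a sketch only: it assumes Python 3 (urllib.parse instead of the urlparse module), and normalize_url is a hypothetical name. Two details are additions of mine rather than part of the snippet above: an empty path is left alone, because normpath("") returns ".", and a trailing slash is restored, because normpath strips it. As far as I know, urljoin in Python 3.5 and later already drops the excess ".." itself, so the normalization mostly matters for URLs that arrive already joined.

```python
import posixpath
from urllib.parse import urlparse, urlunparse


def normalize_url(url):
    """Collapse '.' and '..' segments in the path of an absolute URL.

    Hypothetical helper, not part of any library.
    """
    parts = urlparse(url)
    if not parts.path:
        # normpath("") would return ".", so leave an empty path untouched.
        return url
    path = posixpath.normpath(parts.path)
    # normpath removes a trailing slash; put it back so directory URLs stay directories.
    if parts.path.endswith("/") and not path.endswith("/"):
        path += "/"
    return urlunparse(parts._replace(path=path))


if __name__ == "__main__":
    print(normalize_url("http://example.com/../test.png"))
    print(normalize_url("http://example.com/a/b/../c/?x=1"))
```

The query string and fragment pass through unchanged because only the path component is replaced.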

edited Nov 07 '10 at 20:06

answered Nov 07 '10 at 19:50

score 0 · Answer 4 · answered Nov 06 '10 at 17:31

urlparse.urljoin("http://www.example.com/", "../test.png"[2:])

Is this what you need?

ceth

This has the same problem as Dasuraga's solution: it would only work for that certain website, while breaking others. – Mew Nov 06 '10 at 17:37
