2

I'm making an app that parses HTML and extracts the images from it. Parsing is easy with Beautiful Soup, and downloading the HTML and the images works fine with urllib2.

However, I have a problem using urlparse to turn relative paths into absolute ones. The problem is best explained with an example:

>>> import urlparse
>>> urlparse.urljoin("http://www.example.com/", "../test.png")
'http://www.example.com/../test.png'

As you can see, urlparse doesn't remove the ../ segment. This causes a problem when I try to download the image:

HTTPError: HTTP Error 400: Bad Request

Is there a way to fix this problem in urllib?

Mew

4 Answers

3

".." moves up one directory ("." is the current directory), so combining it with a URL that is already at the domain root doesn't make much sense. Maybe what you need is:

>>> urlparse.urljoin("http://www.example.com","./test.png")
'http://www.example.com/test.png'
rtpg
  • While this is a solution, it won't work in my case: my application has to be able to retrieve images from any website. I can't just replace "../" with "./" because that would break other sites where the path really is supposed to point at the parent directory. – Mew Nov 06 '10 at 17:35
  • urlparse.urljoin("http://www.example.com/dir/", "../test.png") works for me (I get 'http://www.example.com/test.png'). I guess ".." just doesn't mean anything in your context (what is one directory up from the base one?). At least I don't think it does. – rtpg Nov 06 '10 at 17:41
2

I think the best you can do is to pre-parse the original URL, and check the path component. A simple test is

if len(urlparse.urlparse(baseurl).path) > 1:

Then you can combine it with the indexing suggested by demas. For example:

baseurl = "http://www.example.com/"
# Skip the leading ".." when the base URL is already at the site root.
start_offset = 2 if len(urlparse.urlparse(baseurl).path) <= 1 else 0
img_url = urlparse.urljoin(baseurl, "../test.png"[start_offset:])

This way, you will not attempt to go to the parent of the root URL.
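
The same idea can be wrapped in a small helper. This is only a sketch: the name join_clamped is hypothetical, the imports assume Python 3's urllib.parse (under Python 2 the same functions live in the urlparse module used above), and it strips every leading "../" rather than just the first one. It only covers the case where the base URL is already at the site root.

```python
from urllib.parse import urljoin, urlparse  # Python 2: from urlparse import urljoin, urlparse


def join_clamped(base_url, relative_url):
    """Join two URLs, ignoring leading '../' segments when the base is at the root.

    Hypothetical helper for illustration; not part of urlparse or urllib.
    """
    base_path = urlparse(base_url).path
    # A path of '' or '/' means there is no parent directory to climb into.
    if len(base_path) <= 1:
        while relative_url.startswith("../"):
            relative_url = relative_url[3:]
    return urljoin(base_url, relative_url)


if __name__ == "__main__":
    print(join_clamped("http://www.example.com/", "../test.png"))
    print(join_clamped("http://www.example.com/dir/", "../test.png"))
```

A base URL with a deeper path is passed to urljoin unchanged, so a relative URL with more "../" segments than the base has directories is not handled by this sketch.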

vhallac
1

If you'd like /../test to mean the same as /test, the way paths work in a file system, you could use posixpath.normpath():

>>> import posixpath, urlparse
>>> url = urlparse.urljoin("http://example.com/", "../test")
>>> p = urlparse.urlparse(url)
>>> path = posixpath.normpath(p.path)
>>> urlparse.urlunparse((p.scheme, p.netloc, path, p.params, p.query, p.fragment))
'http://example.com/test'
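
For use on arbitrary pages, the steps above could be collected into one function. This is a minimal sketch, not a definitive implementation: the name join_normalized is hypothetical, the imports assume Python 3's urllib.parse (Python 2 would import from urlparse instead), and two guards are added that the snippet above does not have, because normpath() turns an empty path into "." and drops a trailing slash.

```python
import posixpath
from urllib.parse import urljoin, urlparse, urlunparse  # Python 2: from urlparse import ...


def join_normalized(base_url, relative_url):
    """Join two URLs, then collapse '.' and '..' segments in the resulting path.

    Hypothetical helper for illustration; not part of urlparse or urllib.
    """
    parts = urlparse(urljoin(base_url, relative_url))
    # normpath('') would return '.', so leave an empty path alone.
    path = posixpath.normpath(parts.path) if parts.path else ""
    # normpath removes a trailing slash; put it back so directory URLs stay directories.
    if parts.path.endswith("/") and not path.endswith("/"):
        path += "/"
    return urlunparse(
        (parts.scheme, parts.netloc, path, parts.params, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    print(join_normalized("http://example.com/", "../test"))
    print(join_normalized("http://example.com/a/b/", "../../../img/x.png?s=1"))
```

Unlike slicing the relative URL, this leaves paths alone when ".." legitimately points at a parent directory, and only discards the segments that would climb above the root.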
jfs
0
urlparse.urljoin("http://www.example.com/", "../test.png"[2:])

Is this what you need?

ceth
  • This has the same problem as Dasuraga's solution: it would only work for that certain website, while breaking others. – Mew Nov 06 '10 at 17:37