
I'm currently teaching myself Python (I have no programming background, but a lot of sysadmin and scripting experience) and have written a script that looks at a site and grabs all the images. I've struggled quite a bit, but I've got it working to an extent.

My current issue is that urllib.urlretrieve(url, out_path) works fine on a URL like http://www.testsite.com/images/img.jpg, but not on something like http://www.testsite.com/../images/img.jpg. That path works fine in a browser, and urllib.urlretrieve does retrieve a file, but the file is broken when I try to open it in an image viewer.

This is my code currently:

http://pastebin.com/E9hutEGn - sorry for the pastebin link; the code was a bit long and I didn't want to make the post hard to read.

Can anyone recognize why it isn't working?

1 Answer

First of all, using pastebin is fine, and you had a good reason to use it.

As for your problem, I think it may be caused by how the path gets joined with the base URL. Let me explain with an example:

>>> import urlparse
>>> base = "http://somesite.com/level1/"
>>> path = "../page.html"
>>> urlparse.urljoin(base, path)
'http://somesite.com/page.html'

>>> base = "http://somesite.com/"
>>> urlparse.urljoin(base, path)
'http://somesite.com/../page.html'

So I guess you have to strip the ../ manually when the base URL has no parent directory left to climb out of.
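One way to do that stripping, as a rough sketch rather than a definitive fix: join the link as before, then run the path part through posixpath.normpath, which drops any ".." that would climb above the root. The function name join_and_normalize is my own invention for illustration, and the try/except import is only there so the sketch runs on both Python 2 and Python 3. I believe newer Python 3 releases of urljoin already drop such leading segments on their own, in which case the extra step is harmless.

```python
import posixpath

try:  # Python 3
    from urllib.parse import urljoin, urlsplit, urlunsplit
except ImportError:  # Python 2
    from urlparse import urljoin, urlsplit, urlunsplit


def join_and_normalize(base, link):
    """Hypothetical helper: join link onto base, then collapse '..' segments.

    Assumes http(s)-style URLs whose path starts with a single '/'.
    """
    parts = urlsplit(urljoin(base, link))
    if parts.path:
        # For an absolute path, normpath turns "/../images/img.jpg"
        # into "/images/img.jpg" and "/a/../b" into "/b".
        path = posixpath.normpath(parts.path)
        # normpath strips a trailing slash, so put it back if there was one.
        if parts.path.endswith("/") and path != "/":
            path += "/"
    else:
        path = "/"
    return urlunsplit((parts.scheme, parts.netloc, path,
                       parts.query, parts.fragment))


if __name__ == "__main__":
    print(join_and_normalize("http://www.testsite.com/", "../images/img.jpg"))
```

The result can then be passed to urllib.urlretrieve in place of the raw joined URL. If the saved file is still broken after that, it may be worth opening it in a text editor to see whether the server sent back an HTML error page instead of an image.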

A small addition: while searching around for your problem, I found this post, which may be useful too.

Samuele Mattiuzzo