
How do I download all images from a web page and prefix the image names with the web page's URL (all symbols replaced with underscores)?

For example, if I were to download all images from http://www.amazon.com/gp/product/B0029KH944/, then the main product image would be saved using this filename:

www_amazon_com_gp_product_B0029KH944_41RaFZ6S-0L._SL500_AA300_.jpg

I have installed WinHTTrack and wget and spent more time than it's probably worth trying to get them to do what I wanted, but I was not successful, so Stack Overflow is my last-ditch effort. (WinHTTrack came close if you set the build option to save files according to site structure and write a script to rename files based on their parent directories, but the problem is that the main image is hosted on a different domain.)
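For reference, the renaming scheme I'm after can be sketched as a small Python helper (a hypothetical function, not tied to any particular download tool): it strips the scheme from the page URL, replaces every symbol with an underscore, and prepends the result to the image's own filename.

```python
import re

def prefixed_image_name(page_url, image_url):
    """Build a filename: page URL (scheme stripped, symbols -> _) + image basename."""
    # Strip the scheme ("http://") from the page URL.
    page_part = re.sub(r'^[a-z]+://', '', page_url)
    # Replace every non-word character in the prefix with an underscore.
    prefix = re.sub(r'\W', '_', page_part)
    # The image's own name is everything after the last slash; it is kept as-is.
    basename = image_url.rsplit('/', 1)[-1]
    return prefix + basename

# Using the Amazon example from above (image URL assumed for illustration):
print(prefixed_image_name(
    "http://www.amazon.com/gp/product/B0029KH944/",
    "http://ecx.images-amazon.com/images/I/41RaFZ6S-0L._SL500_AA300_.jpg"))
# -> www_amazon_com_gp_product_B0029KH944_41RaFZ6S-0L._SL500_AA300_.jpg
```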

thdoan

1 Answer


Well, I added a download option to my new web scraper, xidel.

With that you can do it like this:

 xidel "http://www.amazon.com/dp/B0029KH944/" -e 'site:=translate(filter($_url, "http://(.*)", 1), "/.", "__")'  -f //img -e 'image:=filter($_url, ".*/(.*)", 1)' --download '$site;$image;'

The first -e reads the URL and turns the / and . characters into underscores, -f selects all img elements, the second -e extracts the image filename, and --download then saves each image.
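The two extraction steps map roughly onto Python's re module, if that helps to follow what each option computes (a sketch of the string handling only, not of xidel itself; the image URL is an assumed example):

```python
import re

url = "http://www.amazon.com/dp/B0029KH944/"

# Like -e 'site:=...': keep everything after "http://",
# then translate "/" and "." into "_".
site = re.search(r"http://(.*)", url).group(1) \
         .translate(str.maketrans("/.", "__"))

img_url = "http://ecx.images-amazon.com/images/I/41RaFZ6S-0L._SL500_AA300_.jpg"

# Like -e 'image:=...': keep everything after the last "/".
image = re.search(r".*/(.*)", img_url).group(1)

print(site + image)
# -> www_amazon_com_dp_B0029KH944_41RaFZ6S-0L._SL500_AA300_.jpg
```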

Although it has the disadvantage that it tries to parse every image as an HTML file, which could slow it down a little bit...

BeniBela
  • Hi BeniBela, I just downloaded xidel and ran the command you provided; however, it produced the following error: "Error Unknown option: ., (when reading argument: .,)" – thdoan Sep 14 '12 at 07:29
  • 1
    Are you using Windows or Linux? On Windows, it does not support '-single quotes at the outer level, and you need to swap the ' with the "-quotes. And I changed the default variable names, after posting the answer: the two $_url variables should now be replace by the simpler $url – BeniBela Sep 14 '12 at 09:05
  • After poring over all the docs, and through much trial and error, I think I have finally gotten a grip on your wonderful and extremely flexible scraper, BeniBela :-). Here is the final command that does exactly what I wanted in my question: `xidel http://www.amazon.com/dp/B0029KH944/ -e "site:=fn:replace(filter($url, 'http://(.+)', 1), '\W', '_')" -f "//img[@id='prodImage']" -e "image:=filter($url, '.+/(.+)', 1)" --download "$site;$image;"` – thdoan Sep 17 '12 at 10:24
  • **Explanation:** I'm using the XPath 2.0 'replace' function to replace all symbols with underscores, using the URL with "http://" stripped as the input, saving everything into the 'site' variable; then I extract all images with id="prodImage" (main product image), saving just the image name to the 'image' variable; finally, I download the image using the concatenated string $site;$image; as the image's filename. Mission accomplished! – thdoan Sep 17 '12 at 10:35