How to scrape relative images

Question

Take a look at Amazon's button for adding items from other sites to your wish list. You can see it here:

http://www.amazon.co.uk/wishlist/get-button

How does it work? I'm pretty sure it scrapes the page somehow, but it seems to get every image, whether it's a Flash image, a JPG or anything else, even when the site in question uses relative img src values as opposed to absolute, full-site URLs.

Example page below. All the images shown are JPGs, which is fine, but every img src is relative, meaning there is no "http://blah.com" in front of it.

http://gadgets.guardianoffers.co.uk/p-788-Casio-Solar-Powered-Edifice-Watch.html

Is there a better way to get the images than parsing the HTML source?

Or are they just doing a million ifs if they don't get a hit straight away?

[That's the script](https://www.amazon.co.uk/wishlist/add.js?loc=http://gadgets.guardianoffers.co.uk/p-788-Casio-Solar-Powered-Edifice-Watch.html&b=AUWLBookenGB) which is loaded by clicking on the bookmarklet. Have fun reading/learning ;) — Andreas, Aug 24 '12 at 16:39

score 0 · Answer 1 · answered Aug 24 '12 at 16:43

It looks like it parses the HTML of the page and looks for whatever is semantically identifiable as the primary image, name and price. If you try it on a page that doesn't have any e-commerce products, such as http://www.theglobeandmail.com/, it takes the page's h1 element as the product name and the primary image (the front-page story image) as the product image.

So behind the scenes they are doing a lot of guessing. With HTML5 semantic markup you could establish a standard for this kind of thing, but unless everyone uses it, you are just making educated guesses.
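As a rough illustration of that kind of guessing (not Amazon's actual script), here is a minimal Python sketch that parses a page, takes the first h1 as the product name, and resolves every img src against the page URL so that relative paths become absolute. The `ProductGuesser` class, the sample markup and the image paths are all made up for the example; only the page URL comes from the question.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ProductGuesser(HTMLParser):
    """Hypothetical helper: collects absolute image URLs and the first <h1> text."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.base_url = page_url  # replaced if the page declares <base href>
        self.images = []
        self.title = None
        self._in_h1 = False
        self._h1_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "base" and attrs.get("href"):
            # A <base> element changes what relative URLs resolve against.
            self.base_url = urljoin(self.page_url, attrs["href"])
        elif tag == "img" and attrs.get("src"):
            # urljoin leaves absolute URLs alone and resolves relative ones.
            self.images.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "h1" and self.title is None:
            self._in_h1 = True

    def handle_data(self, data):
        if self._in_h1:
            self._h1_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "h1" and self._in_h1:
            self._in_h1 = False
            self.title = "".join(self._h1_parts).strip()


# Made-up markup standing in for the real page source.
SAMPLE_HTML = """
<html><body>
<h1>Casio Solar Powered Edifice Watch</h1>
<img src="images/product/watch.jpg">
<img src="/static/logo.jpg">
<img src="http://cdn.example.com/banner.jpg">
</body></html>
"""
PAGE_URL = "http://gadgets.guardianoffers.co.uk/p-788-Casio-Solar-Powered-Edifice-Watch.html"

guesser = ProductGuesser(PAGE_URL)
guesser.feed(SAMPLE_HTML)
print(guesser.title)
for url in guesser.images:
    print(url)
```

Choosing which of the collected images is the "primary" one is still a heuristic (largest dimensions, position near the h1, and so on), which this sketch does not attempt.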
