
I've developed an image scraper that pulls specific images from remote sites and displays them when a URL is pasted into a text field. The logic includes finding images whose URLs end in .jpg, .jpeg, .png, etc.

I'm running into an issue where a lot of sites generate images via JavaScript and/or don't include a file extension in the displayed image's URL. Example sites like

www.express.com and www.underarmour.com have this issue, and there are many more.

What function could I use to find images at a given URL that don't have a file extension, and then display them accordingly?

Thanks again.

Chris Favaloro
  • AFAIK if you don't have the file extension, you can't just "guess" and append an extension. – Matt Aug 01 '12 at 19:48
  • If you don't have permission from those sites, you can't do this. The TOS on both named sites are quite clear. – Aug 01 '12 at 19:49
    Look for `img` tags instead of extensions – Steve Robbins Aug 01 '12 at 19:50
  • I'm clueless about your exact problem. In HTTP, file names are irrelevant since we have the `Content-Type` header. I don't think anyone actually *generates* images with client-side JavaScript :-? – Álvaro González Aug 01 '12 at 20:00
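Not from the thread itself, but as an illustration of the `img`-tag suggestion in the comments above, here is a minimal PHP sketch (the function name is made up; relative-URL resolution and error handling are mostly left out):

```php
<?php
// Rough sketch of the "look for img tags" suggestion: collect every src
// from the markup, regardless of whether the URL ends in .jpg/.png/etc.

function collectImageUrls($pageUrl)
{
    $html = @file_get_contents($pageUrl); // assumes allow_url_fopen is enabled
    if ($html === false) {
        return array();
    }

    $doc = new DOMDocument();
    @$doc->loadHTML($html); // suppress warnings from sloppy real-world markup

    $urls = array();
    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        if ($src !== '') {
            // Handle protocol-relative URLs; a full resolver would also
            // resolve relative paths against $pageUrl.
            if (strpos($src, '//') === 0) {
                $src = 'http:' . $src;
            }
            $urls[] = $src;
        }
    }
    return array_unique($urls);
}
```

This still won't see images that are injected by JavaScript after page load, since `file_get_contents()` only fetches the static markup.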

2 Answers


Unless the URL comes from an `<img src="...">` tag, there is NO way to tell what you'll get from a particular URL. http://example.com/index.html could very well be a PHP script that serves up a zip file.

It is IMPOSSIBLE to reliably tell what a URL will give you until you actually hit the URL and check the headers + downloaded data.
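Not part of the original answer, but roughly what "check the headers + downloaded data" could look like with PHP and cURL (the function name and the 2 MB cap are arbitrary choices for illustration):

```php
<?php
// Download the body (capped), check the Content-Type header, and then
// verify that the bytes themselves really are an image.

function fetchIfImage($url, $maxBytes = 2097152) // 2 MB cap, arbitrary
{
    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXFILESIZE    => $maxBytes,  // best-effort size limit
        CURLOPT_TIMEOUT        => 10,
    ));
    $body = curl_exec($ch);
    $type = (string) curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
    curl_close($ch);

    if ($body === false || strpos($type, 'image/') !== 0) {
        return null; // server did not claim this is an image
    }

    // Headers can lie, so double-check the actual bytes (PHP >= 5.4).
    if (@getimagesizefromstring($body) === false) {
        return null;
    }
    return $body; // raw image data, safe to save or display
}
```

On PHP versions before 5.4 you could instead write the body to a temporary file and call `getimagesize()` on that file.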

Marc B
  • Essentially, think of the script as working the way a Facebook share functions: it takes the link and generates a thumbnail preview of an image from the site. So the script itself would be scouring the HTML of the site. – Chris Favaloro Aug 01 '12 at 19:54
  • Most likely FB is only pulling URLs out of img tags, and not following every wonky URL on a page on the off chance it's pointing at a picture. – Marc B Aug 01 '12 at 19:57

I think you have two options:

  1. Use a heuristic to guess whether a URL could be an image (for example, finding a part like /images/ in the URL)

  2. Load every URL and check whether the returned data actually is an image (using for example getimagesize())

The second option is more general, but quite heavy on both bandwidth and resources.
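A rough sketch of how the two options could be combined, with invented helper names, assuming the image candidate URLs have already been scraped from the page:

```php
<?php
// Cheap URL heuristic first (option 1), then the expensive
// getimagesize() check only for promising candidates (option 2).

function looksLikeImageUrl($url)
{
    $path = (string) parse_url($url, PHP_URL_PATH);
    return preg_match('~\.(jpe?g|png|gif|webp)$~i', $path)         // extension hint
        || preg_match('~/(images?|img|media|photos?)/~i', $path);  // path hint
}

function confirmImage($url)
{
    // getimagesize() accepts a URL when allow_url_fopen is on, but it
    // downloads the whole resource, so keep it for candidates only.
    $info = @getimagesize($url);
    return $info !== false ? $info['mime'] : null;
}

// Usage: filter scraped URLs cheaply, confirm the survivors.
// foreach ($urls as $url) {
//     if (looksLikeImageUrl($url) && confirmImage($url)) { /* display it */ }
// }
```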

apfelbox
  • getimagesize downloads the whole URL before doing its thing. No biggie if you're actually pointing it at what turns out to be a 200-byte .gif icon. Very big ugly deal if that weird link turns out to be a 4-gigabyte ISO image. – Marc B Aug 01 '12 at 19:54
  • I agree; I'm doing getimagesize at the moment, but only on images that have a file extension. The big issue is that some of these sites are printing the images via JavaScript. – Chris Favaloro Aug 01 '12 at 19:55
  • @Marc B: That is correct. But you could also preload the data and call `getimagesize()` on the local data. You could, for example, use cURL to just get the headers of the response (a `HEAD` request), decide whether it is an image (and look at the file size, which should be in the headers too), and only then load it. But a) these headers could be wrong and b) you are issuing a ton of requests if you do it this way. – apfelbox Aug 01 '12 at 19:57
  • @Chris Favaloro: "printing via JavaScript" = loading it dynamically into the DOM, or actually drawing the image with JavaScript in a `<canvas>`, SVG, etc.? – apfelbox Aug 01 '12 at 19:58
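For illustration only, a minimal PHP sketch of the HEAD-first check apfelbox describes in the comment above, keeping in mind his caveat that the headers may be missing or wrong (the function name and the 5 MB cap are arbitrary):

```php
<?php
// Ask only for the response headers, and skip anything that is not a
// reasonably small image before committing to a full download.

function headSaysSmallImage($url, $maxBytes = 5242880) // 5 MB cap, arbitrary
{
    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_NOBODY         => true,  // issue a HEAD request
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 5,
    ));
    curl_exec($ch);
    $type   = (string) curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
    $length = curl_getinfo($ch, CURLINFO_CONTENT_LENGTH_DOWNLOAD);
    curl_close($ch);

    return strpos($type, 'image/') === 0
        && $length > 0            // -1 means the server sent no Content-Length
        && $length <= $maxBytes;
}
```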