
I'm having a little trouble with this pattern - "/([a-z\-_0-9\/\:\.]*\.(jpg|jpeg|png))/i" - within the preg_match_all function. Admittedly, my regex is a little weak, so I suspect something is wrong in there.

Here's what I have at the moment -

preg_match_all("/([a-z\-_0-9\/\:\.]*\.(jpg|jpeg|png))/i", $raw, $matching);

With $raw being just the HTML from this page - http://www.topshop.com/webapp/wcs/stores/servlet/ProductDisplay?beginIndex=0&viewAllFlag=&catalogId=33057&storeId=12556&productId=13936776&langId=-1&categoryId=&parent_category_rn=&searchTerm=TS05K01FBLC&resultCount=1&geoip=home

There are a bunch of images on the page that aren't being pulled in. All I'm getting is the following ([0] of the $matching array; the rest is repeat data in a different format):

array(8) {
    [0]=>
    string(77) "http://media.topshop.com/wcsstore/TopShop/images/catalog/05K01FBLC_normal.jpg"
    [1]=>
    string(143) "/wcsstore/ConsumerDirectStorefrontAssetStore/images/colors/color7/cms/pages/static/static-0000067510/images/tact-wk24-LFWshipping_UK-ROW-EU.jpg"
    [2]=>
    string(76) "http://media.topshop.com/wcsstore/TopShop/images/catalog/05K01FBLC_large.jpg"
    [3]=>
    string(77) "http://media.topshop.com/wcsstore/TopShop/images/catalog/05K01FBLC_normal.jpg"
    [4]=>
    string(40) "//assets.pinterest.com/images/PinExt.png"
    [5]=>
    string(41) "http://platform.tumblr.com/v1/share_4.png"
    [6]=>
    string(163) "http://media.topshop.com/wcsstore/ConsumerDirectStorefrontAssetStore/images/colors/color7/cms/pages/static/static-0000067528/images/PDP-wk24-LFWshipping_ROW-EU.jpg"
    [7]=>
    string(119) "/wcsstore/ConsumerDirectStorefrontAssetStore/images/colors/color7/cms/pages/static/static-0000008560/images/onthego.png"
  }

Could anyone explain why this is pulling in only these 8 images rather than every image on the page?

Is there something in the regular expression that's limiting what I get?

I'm not getting this jpg link - http://media.topshop.com/wcsstore/TopShop/images/catalog/05K01FBLC_3_large.jpg - even though it's on the page.

Any help would be most appreciated.

Greg

lionysis
  • Regular expressions are a bad way to parse HTML; try using `DOMDocument` instead -> http://stackoverflow.com/questions/15895773/scraping-all-images-from-a-website-using-domdocument – Crisp Feb 18 '14 at 00:52
  • That jpg isn't on the page anymore. I got 25 images with the following regexp: `preg_match_all("/(?<='|\")[^'\"]+(jpg|jpeg|png)(?='|\"|\?)/i", $raw, $matching);` Were gifs skipped on purpose? – Michael Livach Feb 18 '14 at 00:56
  • [Your regex is fine mostly](http://regex101.com/r/bM2cF2) — it basically comes down to what Crisp mentioned. – l'L'l Feb 18 '14 at 00:59
  • Yep, your regex seems to be working. It picks up 25 images, and if you add .gif it picks up 36 images. – Bryan Elliott Feb 18 '14 at 01:04
  • Hi all, I'm skipping GIFs on purpose, just because the relevant images are JPGs. So then perhaps preg_match_all isn't pulling in the correct amount. I will have a look into DOMDocument, but doesn't this just allow you to find IMG tags? I want all URLs that point to the image extensions mentioned. Thanks for the help. :-) – lionysis Feb 18 '14 at 02:32
  • I've tried the DOMDocument example and it doesn't seem to work - I don't get anything back from the given page. – lionysis Feb 18 '14 at 03:17
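
On the question of whether DOMDocument can only find IMG tags: it is not limited to them. The following is a minimal sketch, not the approach from the linked question, that walks every attribute in the document with DOMXPath and keeps values ending in one of the wanted extensions. The sample markup and the `$urls` variable are made up for illustration; with the real page, `$raw` would be the fetched HTML. Note that this only sees URLs stored in attributes, so URLs that appear inside inline scripts or are added by JavaScript after the page loads would not be found this way.

```php
<?php
// Hypothetical sample markup standing in for the fetched page ($raw).
$raw = '<html><body>
<img src="http://example.com/a_large.jpg">
<a href="/images/b.png">zoom</a>
<div data-zoom="//cdn.example.com/c.jpeg?v=2"></div>
<img src="/images/spinner.gif">
</body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // real-world HTML is rarely valid; collect warnings quietly
$doc->loadHTML($raw);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$urls = [];

// "//@*" selects every attribute node in the document, not only <img src>.
foreach ($xpath->query('//@*') as $attr) {
    // Keep values ending in .jpg, .jpeg or .png, optionally followed by a query string.
    if (preg_match('/\.(jpe?g|png)(\?.*)?$/i', $attr->nodeValue)) {
        $urls[] = $attr->nodeValue;
    }
}

// Drop duplicates and reindex.
$urls = array_values(array_unique($urls));
print_r($urls);
```

With the sample markup, the gif is skipped and the three jpg/jpeg/png URLs are kept in document order.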

1 Answer


I used the pattern below and also got 25 images from the page, the same number MElliott reported in the comments for your original pattern.

preg_match_all('/([-a-z0-9_\/:.]+\.(jpg|jpeg|png))/i', $raw, $matches);

print "<pre>"; print_r($matches[0]); print "</pre>";

The only things I'd mention are that you don't need to escape every character in the character class - only the forward slash, since it is the delimiter you are using (the hyphen stays literal because it is placed first in the class). Also, you should use the plus sign + instead of the asterisk * after your character class to make sure at least one character is in your image name.
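
To illustrate the difference between * and +, here is a small sketch that runs both patterns against a made-up string (the sample input and the `$star`/`$plus` variable names are assumptions for the demo, not data from the page in the question). With *, a bare extension such as ".png" with no file name in front of it still counts as a match; with +, it does not.

```php
<?php
// Hypothetical sample input, not taken from the page in the question.
$raw = '<img src="http://example.com/a/photo_1.jpg"> <span>.png</span> <a href="/img/b-2.PNG">';

// Original pattern: * allows an empty name, so a bare ".png" matches.
preg_match_all('/([a-z\-_0-9\/\:\.]*\.(jpg|jpeg|png))/i', $raw, $star);

// Revised pattern: + requires at least one character before the extension.
preg_match_all('/([-a-z0-9_\/:.]+\.(jpg|jpeg|png))/i', $raw, $plus);

print_r($star[0]); // includes the bare ".png"
print_r($plus[0]); // only the two real image paths
```

Both patterns find the same full URLs, so this change tidies the results rather than explaining any missing images.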

Quixrick