
I am trying to extract the data-src and data-srcset attributes from a string containing many images in PHP. Both attributes are optional: an image can have neither, only data-src, only data-srcset, or both. The regex I have is

<img(.*?)data-src=['\"](.*?)['\"].*?|(data-srcset=['\"](.*?)['\"])?\/>

The string I am testing against is:

<li class="blocks-gallery-item">
  <figure>
    <img data-src="http://localhost:3000/wp-content/uploads/2018/11/detektivhut.gif" alt="" data-id="1037" data-link="http://localhost:3000/detektivhut/" class="wp-image-1037"/>
  </figure>
</li>
<li class="blocks-gallery-item">
  <figure>
    <img data-src="http://localhost:3000/wp-content/uploads/2018/11/DSC04828.png" alt="" data-id="948" data-link="http://localhost:3000/dsc04828-2/" class="wp-image-948" data-srcset="//localhost:3000/wp-content/uploads/2018/11/DSC04828.png 1067w, //localhost:3000/wp-content/uploads/2018/11/DSC04828-200x300.png 200w, //localhost:3000/wp-content/uploads/2018/11/DSC04828-768x1152.png 768w, //localhost:3000/wp-content/uploads/2018/11/DSC04828-683x1024.png 683w, //localhost:3000/wp-content/uploads/2018/11/DSC04828-1000x1500.png 1000w" sizes="(max-width: 1067px) 100vw, 1067px" />
  </figure>
</li>
<li class="blocks-gallery-item">
  <figure>
    <img data-src="http://localhost:3000/wp-content/uploads/2018/11/DSC04831.png" alt="" data-id="883" data-link="http://localhost:3000/2018/11/13/single-page-style-1/dsc04831-2/" class="wp-image-883" data-srcset="//localhost:3000/wp-content/uploads/2018/11/DSC04831.png 1067w, //localhost:3000/wp-content/uploads/2018/11/DSC04831-200x300.png 200w, //localhost:3000/wp-content/uploads/2018/11/DSC04831-768x1152.png 768w, //localhost:3000/wp-content/uploads/2018/11/DSC04831-683x1024.png 683w, //localhost:3000/wp-content/uploads/2018/11/DSC04831-1000x1500.png 1000w" sizes="(max-width: 1067px) 100vw, 1067px" />
  </figure>
</li>

But it is too greedy. Look here:

https://regex101.com/r/vDQE3C/1

Any help (also logical) is very much appreciated.

niklas

2 Answers

1

Don't use regex to parse HTML. It is better to use a DOM parser, like this:

$html = <<< EOF
<li class="blocks-gallery-item">
  <figure>
    <img data-src="http://localhost:3000/wp-content/uploads/2018/11/detektivhut.gif" alt="" data-id="1037" data-link="http://localhost:3000/detektivhut/" class="wp-image-1037"/>
  </figure>
</li>
<li class="blocks-gallery-item">
  <figure>
    <img data-src="http://localhost:3000/wp-content/uploads/2018/11/DSC04828.png" alt="" data-id="948" data-link="http://localhost:3000/dsc04828-2/" class="wp-image-948" data-srcset="//localhost:3000/wp-content/uploads/2018/11/DSC04828.png 1067w, //localhost:3000/wp-content/uploads/2018/11/DSC04828-200x300.png 200w, //localhost:3000/wp-content/uploads/2018/11/DSC04828-768x1152.png 768w, //localhost:3000/wp-content/uploads/2018/11/DSC04828-683x1024.png 683w, //localhost:3000/wp-content/uploads/2018/11/DSC04828-1000x1500.png 1000w" sizes="(max-width: 1067px) 100vw, 1067px" />
  </figure>
</li>
<li class="blocks-gallery-item">
  <figure>
    <img data-src="http://localhost:3000/wp-content/uploads/2018/11/DSC04831.png" alt="" data-id="883" data-link="http://localhost:3000/2018/11/13/single-page-style-1/dsc04831-2/" class="wp-image-883" data-srcset="//localhost:3000/wp-content/uploads/2018/11/DSC04831.png 1067w, //localhost:3000/wp-content/uploads/2018/11/DSC04831-200x300.png 200w, //localhost:3000/wp-content/uploads/2018/11/DSC04831-768x1152.png 768w, //localhost:3000/wp-content/uploads/2018/11/DSC04831-683x1024.png 683w, //localhost:3000/wp-content/uploads/2018/11/DSC04831-1000x1500.png 1000w" sizes="(max-width: 1067px) 100vw, 1067px" />
  </figure>
</li>
EOF;

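// loadHTML() is an instance method; calling it statically fails on PHP 8.
$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect markup warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$images = $xpath->evaluate("//img");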

foreach($images as $img){
   if (($el = $img->attributes->getNamedItem('data-src')) != null)
      echo 'data-src=' . $el->nodeValue . "\n";
   if (($el = $img->attributes->getNamedItem('data-srcset')) != null)
      echo 'data-srcset=' . $el->nodeValue . "\n";
}

Output:

data-src=http://localhost:3000/wp-content/uploads/2018/11/detektivhut.gif
data-src=http://localhost:3000/wp-content/uploads/2018/11/DSC04828.png
data-srcset=//localhost:3000/wp-content/uploads/2018/11/DSC04828.png 1067w, //localhost:3000/wp-content/uploads/2018/11/DSC04828-200x300.png 200w, //localhost:3000/wp-content/uploads/2018/11/DSC04828-768x1152.png 768w, //localhost:3000/wp-content/uploads/2018/11/DSC04828-683x1024.png 683w, //localhost:3000/wp-content/uploads/2018/11/DSC04828-1000x1500.png 1000w
data-src=http://localhost:3000/wp-content/uploads/2018/11/DSC04831.png
data-srcset=//localhost:3000/wp-content/uploads/2018/11/DSC04831.png 1067w, //localhost:3000/wp-content/uploads/2018/11/DSC04831-200x300.png 200w, //localhost:3000/wp-content/uploads/2018/11/DSC04831-768x1152.png 768w, //localhost:3000/wp-content/uploads/2018/11/DSC04831-683x1024.png 683w, //localhost:3000/wp-content/uploads/2018/11/DSC04831-1000x1500.png 1000w
anubhava
  • Thanks. How could I change the attributes? Let's say value of `data-src` to value of `src` – niklas Dec 04 '18 at 15:39
  • Just pass your attribute name to the `getNamedItem` function, so `$img->attributes->getNamedItem('src')` – anubhava Dec 04 '18 at 15:40
  • I mean: set the value of data-src to that of src, so that e.g. `` becomes `` – niklas Dec 06 '18 at 09:57
  • You may check this answer: https://stackoverflow.com/questions/11387748/change-tag-attribute-value-with-php-domdocument Basically, you need to call `$img->setAttribute('src', $el->nodeValue);` after the first `echo` – anubhava Dec 06 '18 at 10:01
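
The `setAttribute` suggestion from the comments could look roughly like the sketch below. It is only an illustration: the sample markup and file names are made up, and dropping `data-src` afterwards is an assumption about what is wanted, not something the thread states.

```php
<?php
// Hypothetical sample markup; only the attribute names come from the thread.
$html = '<figure><img data-src="a.gif" alt=""/><img src="keep.png" alt=""/></figure>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect markup warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

foreach ($doc->getElementsByTagName('img') as $img) {
    if ($img->hasAttribute('data-src')) {
        // Copy the lazy-load URL into the real src attribute.
        $img->setAttribute('src', $img->getAttribute('data-src'));
        // Assumption: the data-src attribute is no longer needed afterwards.
        $img->removeAttribute('data-src');
    }
}

// saveHTML() serializes the whole document, including the implied html/body wrapper.
$result = $doc->saveHTML();
echo $result;
```

Images without `data-src` are left untouched, so an existing `src` is not overwritten.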
0

You just need to account for anything that comes between the data-* attributes and the closing /> of the image tag. That takes one more (.*?).

<img(.*?)data-src=['\"](.*?)['\"].*?data-srcset=['\"](.*?)['\"](.*?)\/>

And if you only want to capture the data-* attributes, consider using non-capturing groups as follows, so that $1 and $2 contain only the values you want and not the rest of the image tag.

<img(?:.*?)data-src=['\"](.*?)['\"].*?data-srcset=['\"](.*?)['\"](?:.*?)\/>
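
Note that both patterns above require data-src followed later by data-srcset, so an image carrying only one of them will not match. If regex is still preferred, one possible workaround is a two-step match: isolate each img tag first, then look up each attribute on its own. The sketch below is a hedged illustration with made-up sample markup and a hypothetical `$found` result array; it assumes attribute values never contain a `>` character.

```php
<?php
// Hypothetical sample: the first tag has only data-src, the second has both, the third has neither.
$html = '<img data-src="a.gif" alt=""/>'
      . '<img data-src="b.png" class="x" data-srcset="b.png 1067w, b-200.png 200w" sizes="100vw" />'
      . '<img alt="plain"/>';

$found = [];

// Step 1: isolate each <img ...> tag; [^>]* cannot run past the tag's closing ">".
preg_match_all('/<img\b[^>]*>/i', $html, $tags);

foreach ($tags[0] as $tag) {
    $entry = ['data-src' => null, 'data-srcset' => null];

    // Step 2: search for each attribute independently, so order and presence do not matter.
    foreach (array_keys($entry) as $name) {
        // (["\']) captures the opening quote and \1 requires the same closing quote.
        if (preg_match('/\s' . preg_quote($name, '/') . '=(["\'])(.*?)\1/i', $tag, $m)) {
            $entry[$name] = $m[2];
        }
    }

    $found[] = $entry;
}

print_r($found);
```

Because each tag is matched separately, a lazy `.*?` can no longer wander into the next image, which is where the original pattern picked up more than intended.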

SubXaero