I want to use a regex to find all img tags in an HTML document and extract the content of the src attribute.

This is my regex (see it online at https://regex101.com/r/EE08dw/1):

<img(?<prepend>[^>]+?)src=('|")?(?<src>[^\2>]+)[\2]?(?<append>[^>]*)>

On a test string:

<img src="aaa.jpg">

the output is:

Full match    `<img src="aaa.jpg">`
Group prepend ` `
Group 2.      "
Group src     `aaa.jpg"`
Group append  ``

but the expected output is:

Full match    `<img src="aaa.jpg">`
Group prepend ` `
Group 2.      "
Group src     `aaa.jpg`
Group append  ``

The problem is that group src also matches the closing " character:

Output:   Group src `aaa.jpg"`
Expected: Group src `aaa.jpg`

How can I fix it?

Side note: using a regex is safe in my context.

Simone Nigro
  • 2
    [H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) - use a parser – ctwheels Jan 25 '18 at 19:41
  • @ctwheels What is with the image on your comment? –  Jan 25 '18 at 19:47
  • Lots of diacritical marks and stuff on the bottom of the text. all accents i think. –  Jan 25 '18 at 19:52

3 Answers

4

Since you specified in the comments below your question that using regex in your case is safe...

You can't put backreferences in a set (character class). Inside a set the escape is interpreted literally, so in your case \2 matches the character with index 2 (octal) literally, not the quote captured by group 2. Use a tempered greedy token instead.
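To make the literal interpretation concrete, here is a small check in JavaScript. That choice is an assumption on my part: the question uses regex101's PCRE flavor, but JavaScript without the u flag also reads \2 inside a set as the control character U+0002, and it reproduces the output from the question. The variable names are only illustrative.

```javascript
// Pattern copied from the question: \2 appears inside two character classes.
const original = /<img(?<prepend>[^>]+?)src=('|")?(?<src>[^\2>]+)[\2]?(?<append>[^>]*)>/;

// Inside [...] the escape \2 denotes the character U+0002,
// not "whatever group 2 captured".
const classMatchesControlChar = /[\2]/.test("\u0002"); // true
const classMatchesQuote = /[\2]/.test('"');            // false

// So [^\2>]+ happily consumes the closing quote as part of src.
const originalSrc = '<img src="aaa.jpg">'.match(original).groups.src;

console.log(classMatchesControlChar, classMatchesQuote, originalSrc);
// prints: true false aaa.jpg"
```

Because the closing quote is neither U+0002 nor >, the negated set accepts it, which is why it ends up inside the src group.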

See regex in use here

<img(?<prepend>[^>]+?)src=(['"])?(?<src>(?:(?!\2)[^>])+)\2?(?<append>[^>]*)>
                          ^^^^^^        ^^^^^^^^^^^^^^  ^^
                          1             2               3
1: Uses a set; an alternation with | works as well, but a set performs better
2: Tempered greedy token
3: Take backreference out of set
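As a rough usage sketch, the corrected pattern can be run over a whole document in JavaScript with matchAll. The g flag, the sample markup and the variable names are my own additions for illustration, not part of the question.

```javascript
// Corrected pattern from this answer, with the g flag added so that
// matchAll can walk through every <img> tag in the input.
const fixed = /<img(?<prepend>[^>]+?)src=(['"])?(?<src>(?:(?!\2)[^>])+)\2?(?<append>[^>]*)>/g;

// Hypothetical sample markup: one double-quoted src, and one single-quoted
// src whose value contains a double quote.
const html = `<p><img src="aaa.jpg"></p><img class='x' src='b"b.png' alt="">`;

// (?!\2) stops the src group at the same quote character that opened it.
const srcs = Array.from(html.matchAll(fixed), (m) => m.groups.src);

console.log(srcs);
// prints: [ 'aaa.jpg', 'b"b.png' ]
```

One portability caveat: in JavaScript a backreference to a group that did not participate matches the empty string, so for an unquoted value such as src=aaa.jpg the lookahead (?!\2) always fails and the tag is not matched. In PCRE an unset backreference fails to match instead, so there the pattern also handles unquoted values.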
ctwheels
2
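function getAllSrc() {
  // Collect the src of every <img> in the current document (needs a browser DOM).
  var imgs = document.getElementsByTagName("img");
  var srcs = [];
  for (var i = 0; i < imgs.length; i++) {
    // getAttribute returns the value as written in the markup;
    // imgs[i].src would return the resolved absolute URL instead.
    srcs.push(imgs[i].getAttribute("src"));
  }
  return srcs;
}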
0

If you use PHP, try this code:

$thehtml = '<p>lol&nbsp;</p><p><img src="" data-filename="LOGO80x80.png" style="width: 25%;"></p><p>hhhhh</p><p><img src="https://avatars2.githubusercontent1.com/u/12745270?s=52&amp;v=4" alt="lol" style="width: 25%;"><br></p>';


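function getImgFromPost($html){
    // Step 1: collect every <img ...> tag.
    preg_match_all('/<img[^>]+>/i', $html, $result);
    $img = array();
    foreach ($result[0] as $i => $img_tag) {
        // Step 2: capture the attribute name and its value.
        // [^"]* (rather than +) also accepts an empty src="".
        preg_match_all('/(src)="([^"]*)"/i', $img_tag, $img[$i]);
    }

    $arr0 = array();
    for ($x0 = 0; $x0 < count($img); $x0++) {
        for ($x1 = 0; $x1 < count($img[$x0][1]); $x1++) {
            // Key each value by the attribute name captured for this tag.
            $arr0[$x0][$img[$x0][1][$x1]] = $img[$x0][2][$x1];
        }
    }
    return $arr0;
}

print_r(getImgFromPost($thehtml));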

The output will look like this:

Array
(
    [0] => Array
        (
            [src] => 
        )

    [1] => Array
        (
            [src] => https://avatars2.githubusercontent1.com/u/12745270?s=52&amp;v=4
        )

)
A. El-zahaby