PHP Simple HTML DOM can't read "data-src" or "img src" without http: in path

Question

I'm working with PHP Simple HTML DOM and just discovered that it can't read images from a data-src attribute, or from an <img src> without http: in the path, e.g. <img src="//static.mysite.com/123.jpg">

Is there any way to make it happen?

My code is:

if ($htm->find('img')) {
    foreach ($htm->find('img') as $element) {
        $raw = file_get_contents_curl($element->src);
        $im = @imagecreatefromstring($raw);
        $width = @imagesx($im);
        $height = @imagesy($im);
        if ($width > 500 && $height >= 350) {
            $hasimg = '1';
            echo '<img src=\'' . $element->src . '\'>';
        }
    } // end foreach
} // end if htm

If the site doesn't respond without `http`, then add it manually in the foreach loop. – nanobash, Apr 13 '14 at 16:59
Try putting a `var_dump($raw); die();` after `file_get_contents_curl(...);` and verify that the function is working correctly. I would guess you're not getting any errors because of all of the error suppression operators. – Alex W, Apr 13 '14 at 17:01
It's not the site URLs; it's about `http:` in img paths on remote URLs. – wp student, Apr 13 '14 at 17:01
@AlexW The above function is working properly. But it doesn't respond to `` or an img tag without `http:` in the path. – wp student, Apr 13 '14 at 17:03

score 11 · Answer 1 · answered Apr 14 '14 at 02:02

It works for me:

$doc = str_get_html('<img data-src="foo">');
echo $doc->find('img', 0)->getAttribute('data-src');
//=> outputs: foo
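
A related pitfall, as a minimal sketch: property-style access to a hyphenated attribute needs braces in PHP. The example below uses a plain `stdClass` object as a stand-in for a Simple HTML DOM element (an assumption made so the snippet runs without the library); the brace syntax is ordinary PHP and should apply in the same way to the library's magic properties.

```php
<?php
// Stand-in for a parsed <img> element; a real Simple HTML DOM node is assumed
// to expose attributes as properties in a similar way.
$element = new stdClass();
$element->{'data-src'} = '//static.mysite.com/123.jpg';

// Correct: the braces make the whole hyphenated name the property name.
$url = $element->{'data-src'};
echo $url, PHP_EOL;

// Incorrect: PHP parses `$element->data-src` as `$element->data - src`,
// a subtraction, so it never reads the data-src attribute.
```

With the library itself, `getAttribute('data-src')` as shown above avoids the problem entirely.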

answered Apr 14 '14 at 02:02

pguardiario

score 1 · Answer 2 · edited Mar 15 '22 at 02:14

echo $htm->find('img', 0)->getAttribute('data-src');
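
For comparison, here is a rough sketch of the same lookup with PHP's built-in `DOMDocument`, which is a different parser from Simple HTML DOM. It prefers `data-src` and falls back to `src`. The function name `collect_image_urls` and the `example.com` URL are made up for illustration.

```php
<?php
// Collect image URLs from an HTML string, preferring data-src over src.
// Uses the built-in DOM extension, not Simple HTML DOM.
function collect_image_urls(string $html): array
{
    // Silence libxml parse warnings for sloppy real-world HTML.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        // getAttribute() returns an empty string when the attribute is absent.
        $url = $img->getAttribute('data-src');
        if ($url === '') {
            $url = $img->getAttribute('src');
        }
        if ($url !== '') {
            $urls[] = $url;
        }
    }
    return $urls;
}

$urls = collect_image_urls(
    '<img data-src="//static.mysite.com/123.jpg"><img src="http://example.com/a.jpg">'
);
print_r($urls);
```

The URLs come back exactly as written in the markup, so a protocol-relative URL still has to be given a scheme before it is fetched.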

edited Mar 15 '22 at 02:14

Simas Joneliunas

answered Mar 11 '22 at 15:06

Ali Emadzadeh

score 0 · Answer 3 · edited May 23 '17 at 11:59

If you're using file_get_contents_curl() as a function you defined in your code, like the one in this question, you need to set the default protocol to use for cURL:

curl_setopt($ch, CURLOPT_PROTOCOLS, CURLPROTO_HTTP);

That way, if the image src attribute has a protocol-relative URL, cURL will just use HTTP.

edited May 23 '17 at 11:59

Community

answered Apr 13 '14 at 17:21

Alex W

Yes I'm using `file_get_contents_curl()`. I added the above line. It didn't solve it. – wp student Apr 13 '14 at 17:31

score 0 · Answer 4 · edited Apr 13 '14 at 18:05

A URL that leaves out the protocol (http/https) is called a "network-path reference". It means that the protocol of the page the URL is embedded in should be used. This makes no sense with file_get_contents() or cURL, because they are not aware of any page.

Long story short, you have to add the protocol yourself.

Try this:

$url=$element->src;
if (substr($url, 0, 2)=='//') $url='http:'.$url;
$raw=file_get_contents_curl($url);
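
If the URL of the page being scraped is known, the scheme can be borrowed from it instead of hard-coding `http:`. This is only a sketch: the helper name `resolve_network_path` and the page URL `https://www.mysite.com/gallery` are hypothetical, and it handles nothing beyond the leading `//` case.

```php
<?php
// Give a network-path reference ("//host/path") the scheme of the page it came from.
// Falls back to "http" when the page URL has no usable scheme.
function resolve_network_path(string $url, string $pageUrl): string
{
    if (strncmp($url, '//', 2) !== 0) {
        return $url; // not a network-path reference; leave it untouched
    }
    $scheme = parse_url($pageUrl, PHP_URL_SCHEME);
    if (!is_string($scheme) || $scheme === '') {
        $scheme = 'http';
    }
    return $scheme . ':' . $url;
}

echo resolve_network_path('//static.mysite.com/123.jpg', 'https://www.mysite.com/gallery'), PHP_EOL;
```

The result can then be passed to file_get_contents_curl() in place of the raw attribute value.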

edited Apr 13 '14 at 18:05

answered Apr 13 '14 at 17:52

Wolfgang Stengel

Short story, you didn't understand the question :D – wp student Apr 13 '14 at 17:57
Your question is "Is there any way to make it happen?" The only way to make that happen is to add http or https yourself. What part did I not understand? – Wolfgang Stengel Apr 13 '14 at 17:58
What I get from your answer is adding `http` to source URLs, which are already being used. What the function is failing to do is extract the URLs of images (from a data-src attribute or an img tag without http in the image path) from the provided HTML content, e.g. `` – wp student Apr 13 '14 at 18:01
I think that the extraction of the URL from the HTML content works just fine. The problem is that file_get_contents_curl() does not understand URLs without http: in front of it. – Wolfgang Stengel Apr 13 '14 at 18:03
Yes, Wolfgang Stengel got it. Plus, I tried `file_get_contents_curl($element->data-src);` and it didn't work either. – wp student Apr 13 '14 at 18:05
Did you test it with a URL that starts with //? – Wolfgang Stengel Apr 13 '14 at 18:05
@Waqas The example URLs you provided [are not valid URLs](http://webmasters.stackexchange.com/questions/8354/what-does-the-double-slash-mean-in-urls). – Alex W Apr 13 '14 at 18:09
They are valid URLs in an HTML/browser context, but not for file_get_contents_curl(). – Wolfgang Stengel Apr 13 '14 at 18:10
Yes I tested `` and `` and `` all are not working. Only `` works. – wp student Apr 13 '14 at 18:13
It seems there are two problems here then. If you use your code needs to be $element->src. The attribute name and the member name need to be equal. Additionally, any kind of server side fetching of URLs will never work with URLs like "//path", you need to add http or https if it's missing. – Wolfgang Stengel Apr 13 '14 at 18:17
OK, if they are not valid, then I think it's a limitation of **Simple_html_dom.php**. I'll have to quit on this :D – wp student Apr 13 '14 at 18:18
Or maybe not quit, but just add it yourself. Like everyone is recommending. – Wolfgang Stengel Apr 13 '14 at 18:19
It's not a limitation of the DOM parser. You can var_dump($element->src) to see what comes out. – Wolfgang Stengel Apr 13 '14 at 18:20
