
I am executing the following code:

<?php
$html = file_get_contents('http://actualidad.rt.com/actualidad');
var_dump($html);
?>

And the result is more than strange: garbled, unreadable output. I have been working with file_get_contents() for a long time, but I have no idea what this could be.

Any help? Thanks a lot for reading.

    That's gzipped data. My guess would be that the website is sending the response compressed whether or not they get an appropriate Accept-Encoding header. – Matt Gibson Dec 09 '14 at 19:47
  • Are you using UTF-8? You should be, as this is an international page. – Jonathan M Dec 09 '14 at 19:47
  • What are you expecting to get from this URL? Is it a compressed or encoded file? It could be an issue with your headers. – grim Dec 09 '14 at 19:48
  • @JonathanM Web browsers will generally decompress gzipped pages automatically. In fact, many/most sites will send their pages compressed to save bandwidth, and browsers transparently uncompress them. However, sites are only *meant* to send compressed data if an appropriate Accept-Encoding header is transmitted, which is why file_get_contents should work, as it shouldn't send the header. The site is misconfigured, I'd say, but all modern browsers will just cope with that. – Matt Gibson Dec 09 '14 at 19:50
  • @JonathanM The web server is compressing the data. If you look at the header you will see `Content-Encoding: gzip` – Machavity Dec 09 '14 at 19:51
  • @Machavity Do you know any possible way of getting the original HTML? Thanks a lot for sharing your knowledge on here. – Niccolas Ray Dec 09 '14 at 19:53
  • @MattGibson, thanks, man. I learned something today. :) – Jonathan M Dec 09 '14 at 19:53

1 Answer


The site is technically broken. It's sending the page back gzip-encoded whether or not the client has indicated that it can cope with that. This works in all modern web browsers, as they either request the page compressed by default, or they cope with a gzipped response even if they didn't ask for one.
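
If you want to confirm the diagnosis yourself, something along these lines should do it (a quick diagnostic sketch; as far as I know, PHP's HTTP stream wrapper sends no Accept-Encoding header by default, yet the server still labels the body as gzip):

$headers = get_headers('http://actualidad.rt.com/actualidad', 1);
// Shows "gzip" even though we never advertised support for it.
var_dump($headers['Content-Encoding']);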

You could go down the route suggested in the answer to the question that Wouter points out, but I'd suggest using PHP's curl library instead. That should be able to decode the requested page transparently.

For example:

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://actualidad.rt.com/actualidad');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the body rather than echoing it immediately
curl_setopt($ch, CURLOPT_ENCODING, 'gzip');  // ask for gzip and have curl decompress the response
echo curl_exec($ch);
curl_close($ch);

You should find that this outputs the actual HTML of the web page. The key is the CURLOPT_ENCODING option set to "gzip": with that set, curl advertises gzip support in its request and transparently decompresses the gzipped response for you.
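
(As far as I know, passing an empty string, '', for CURLOPT_ENCODING tells curl to accept whatever encodings it supports and decode them accordingly, which is slightly more future-proof than hard-coding "gzip".)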

I think this is a better solution than unzipping the page manually: if the site is ever fixed so that it sensibly returns an uncompressed page when the client doesn't advertise gzip support, this code should carry on working.
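
That said, if you do want to unzip it manually, a minimal sketch (assuming the body really is gzip-compressed, and PHP 5.4+ for gzdecode()):

$html = file_get_contents('http://actualidad.rt.com/actualidad');
// Decompress the gzip-encoded response body by hand.
echo gzdecode($html);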

– Matt Gibson