Ok, so what I am looking for is something like the code below, which is a very dummy example and not working for some reason I totally don't care about right now (please read the question under the code!):

$url  = urldecode($_GET["link"]);
$host = parse_url($url, PHP_URL_HOST);
$port = (preg_match("/^https:\/\//", $url) > 0 ? 443 : 80);

$headers  = "GET / HTTP/1.1\r\n";
$headers .= "Host: $host\r\n";
$headers .= "Accept-Charset: ISO-8859-2,utf-8;q=0.7,*;q=0.3\r\n";
$headers .= "Accept-Encoding: gzip,deflate,sdch\r\n";
$headers .= "Accept-Language: hu-HU,hu;q=0.8,en-US;q=0.6,en;q=0.4\r\n";
$headers .= "Cache-Control: no-cache\r\n";
$headers .= "Connection: close\r\n"; // close, so the feof() loop below actually ends
$headers .= "User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.52 Safari/536.5\r\n\r\n";
//yea, I'm using Google Chrome's userAgent

// fsockopen() wants a hostname (prefixed with ssl:// for TLS), not a full URL
$socket = @fsockopen(($port === 443 ? "ssl://" : "") . $host, $port) or die("Could not connect to $host");

if ($socket) {

    fwrite($socket, $headers);

    while (!feof($socket)) {
        echo fgets($socket, 128);
    }

    fclose($socket);
}

As you can see, what I am trying to achieve is to somehow fetch the HTML or any other output from the URL given in the GET global. Again, the code is not working and I don't care; I don't need code correction, I need info/guidance.

Now, I am not a PHP guru, so the question is somewhat complex:

  • what options do I have to achieve the above mentioned need?
  • what do I have to take care of before/after doing that specific method?
  • any dependencies (libraries)?
  • pros/cons/previous experiences?

Also, I am very thankful even if you answer with just a bunch of links; I'm not exactly looking for a droid answer like "this is the most sacred and only way you should do it!", I am more about gathering infos and options, knowledge. =)

I have no idea whether this matters or not (as it does for, say, the MongoDB driver): I am currently using WAMP Server on Windows 7 x64, and later I plan to move it to my CentOS 6.2 webserver, so please consider that too (there might be dependencies on Linux).

benqus
  • Check out [cURL](http://php.net/curl) for starters – drew010 May 29 '12 at 22:10
  • or [file_get_contents](http://php.net/file_get_contents) – lsl May 29 '12 at 22:15
  • I should also note, hackers will love you if you put this up on the public web: they will effectively be able to use your server as a nice little proxy. – lsl May 29 '12 at 22:19
  • Haha, that's true buddy. =) Thanks for mentioning that! =) I will be aware of that too. =) At first I want to experiment a little with this thingy; it might be very important to one of my home projects, only (as mentioned) I'm not a PHP guru. =) – benqus May 29 '12 at 22:21

2 Answers


You have a couple of options if you want to change the user agent and fetch the page content:

First, and best IMO, is cURL; 99.9% of hosts have it enabled, and if it's your own VPS etc. then it's easy to set up (http://bit.ly/KUn3AS):

<?php 
function curl_get($url){
    if (!function_exists('curl_init')){
        die('Sorry cURL is not installed!');
    }
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0");
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $output = curl_exec($ch);
    curl_close($ch);
    return $output;
}
?>

Second is file_get_contents with a custom stream context:

<?php
function fgc_get($url) {
    $opts = array(
      'http'=>array(
        'method'=>"GET",
        'header'=>"Accept-language: en\r\n" .
                  "Cookie: foo=bar\r\n" .
                  "User-Agent: MozillaXYZ/1.0\r\n"
      )
    );
    $context = stream_context_create($opts);
    return file_get_contents($url, false, $context);
}
?>

Whichever method you choose, if you're accepting an arbitrary URL from user-supplied $_GET input then you're open to abuse in some cases. If you're looking to make a proxy for your site's AJAX requests, you can add some security, like only allowing specific domains, or checking that it's an XMLHttpRequest/AJAX request etc. before doing any external scraping, though you could just leave it open; it's your choice:

<?php 
if(!empty($_GET['url']) && !empty($_SERVER['HTTP_X_REQUESTED_WITH']) && strtolower($_SERVER['HTTP_X_REQUESTED_WITH']) == 'xmlhttprequest') {

    $allowed = array('somesite.com','someothersite.com');

    $url = parse_url($_GET['url']);

    if(in_array($url['host'],$allowed)){
        echo curl_get($_GET['url']);
    }
    die;
}
?>
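One caveat with the gate above: parse_url() can return false for seriously malformed input, and the scheme is never checked, so schemes like file:// or php:// could slip through depending on how the fetched URL is later used. A slightly stricter check might look like this (a sketch; is_allowed_url is a made-up name):

```php
<?php
// A stricter whitelist check (a sketch): rejects malformed URLs,
// non-http(s) schemes and hosts that are not explicitly allowed.
function is_allowed_url($url, array $allowed) {
    $parts = parse_url($url);
    if ($parts === false || empty($parts['host'])) {
        return false; // malformed URL or no host component
    }
    $scheme = isset($parts['scheme']) ? strtolower($parts['scheme']) : '';
    if (!in_array($scheme, array('http', 'https'), true)) {
        return false; // blocks file://, php://, ftp://, ...
    }
    return in_array(strtolower($parts['host']), $allowed, true);
}
?>
```

The strict (third) in_array parameter avoids loose type comparisons against the whitelist.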
Lawrence Cherone
  • Darn. I wanted to provide this answer. :) – ghoti May 30 '12 at 00:05
  • @benqus curl hands down, faster than fgc and much more configurable; plus it's non-blocking and has an option to grab multiple URLs at once with [curl_multi](http://www.php.net/manual/en/function.curl-multi-init.php) – Lawrence Cherone May 30 '12 at 09:55
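As a follow-up to the curl_multi comment: fetching several URLs in parallel can be sketched roughly like this (curl_multi_get is a made-up name; the curl_multi_* functions are from the standard cURL extension):

```php
<?php
// Fetch several URLs in parallel with the curl_multi API (a sketch).
function curl_multi_get(array $urls) {
    $mh = curl_multi_init();
    $handles = array();
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }
    // drive all transfers until every handle is finished
    do {
        curl_multi_exec($mh, $running);
        curl_multi_select($mh); // wait for activity instead of busy-looping
    } while ($running > 0);
    $results = array();
    foreach ($handles as $url => $ch) {
        $results[$url] = curl_multi_getcontent($ch); // needs RETURNTRANSFER
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $results;
}
?>
```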

SIMPLE WAY TO GET CONTENT FROM URL

1) first method

Enable allow_url_fopen on your hosting (in php.ini or similar)

<?php
// readfile() streams the response straight to output and returns the
// number of bytes read, so its return value is not the content itself:
readfile("http://example.com/");

// if you need the content in a variable, use file_get_contents() instead:
$variablee = file_get_contents("http://example.com/");
echo $variablee;
?> 
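You can also check at runtime whether the directive is actually on before relying on the http:// wrappers (a small sketch):

```php
<?php
// ini_get() reads the current value of a php.ini directive at runtime;
// allow_url_fopen must be On for readfile()/file_get_contents() to
// accept http:// URLs.
if (ini_get('allow_url_fopen')) {
    echo "allow_url_fopen is on, url wrappers will work\n";
} else {
    echo "allow_url_fopen is off, enable it in php.ini or use cURL\n";
}
?>
```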

or

2) second method

Enable php_curl (and php_openssl if you need HTTPS)

<?php
// you can add other curl options too
// see here - http://php.net/manual/en/function.curl-setopt.php
function get_data($url) {
  $ch = curl_init();
  $timeout = 5;
  curl_setopt($ch, CURLOPT_URL, $url);
  curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)");
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
  // note: disabling SSL verification is insecure; only do this for testing
  curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
  curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
  curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
  curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
  $data = curl_exec($ch);
  curl_close($ch);
  return $data;
}

$variablee = get_data('http://example.com');
echo $variablee;
?>
T.Todua