I'm doing some HTML DOM parsing, and the parsing depends on the URI of the page. The problem is that when I load an HTML file like this:
// Create HTML DOM
$dom_document = new DOMDocument();
@$dom_document->loadHTMLFile('http://www.google.com/');
I am sometimes redirected by the site (e.g. Google may redirect me to a country-specific domain). Questions:
- How do I prevent being redirected? I want to state explicitly which page to parse and not be sent to another page. I don't have to use DOMDocument.
- If there is no way to prevent being redirected, is there at least a way to find out which URI I was sent to?
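For the DOMDocument route, here is a minimal sketch of how redirects might be switched off, assuming the page is fetched through PHP's http:// stream wrapper. The helper names make_no_redirect_context() and load_dom_without_redirect() are made up for illustration, and the 5-second timeout is an arbitrary choice. I have not verified how every site behaves when its redirect response is parsed as HTML.

```php
<?php
// Hypothetical helper (not part of the original code): builds an HTTP stream
// context that tells PHP's http:// wrapper not to follow redirects.
function make_no_redirect_context()
{
    return stream_context_create(array(
        'http' => array(
            'follow_location' => 0, // do not follow Location headers
            'timeout'         => 5, // seconds; arbitrary value for this sketch
        ),
    ));
}

// Hypothetical helper: loads a page into a DOMDocument without following redirects.
function load_dom_without_redirect($url)
{
    // libxml-based loaders such as loadHTMLFile() use this context for their next load.
    libxml_set_streams_context(make_no_redirect_context());

    $dom_document = new DOMDocument();
    @$dom_document->loadHTMLFile($url);
    return $dom_document;
}
```

It would be called as $dom_document = load_dom_without_redirect('http://www.google.com/');. If the site answers with a redirect, the document then holds whatever body came with that redirect response, not the target page.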
EDIT 1:
function get_html_content($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_ENCODING, 'gzip');
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, FALSE); // not good for 301 redirects
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);  // return the body instead of printing it
    curl_setopt($ch, CURLOPT_URL, $url);
    $data = curl_exec($ch);

    // Check if any error occurred
    if (curl_errno($ch))
    {
        echo 'Curl error: ' . curl_error($ch);
        assert(FALSE);
        die();
    }

    curl_close($ch);
    return $data;
}
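For the second question (finding out where I was sent), a possible sketch along the same lines as the function above: leave CURLOPT_FOLLOWLOCATION off and ask cURL for the status code and the redirect target after the request. The names get_html_content_with_redirect_info() and is_redirect_status() are hypothetical, and throwing an exception instead of calling die() is my own substitution. CURLINFO_REDIRECT_URL needs PHP 5.3.7 or later.

```php
<?php
// Hypothetical helper (not part of the original code): true for 3xx status codes.
function is_redirect_status($status)
{
    return $status >= 300 && $status < 400;
}

// Hypothetical variant of get_html_content() that also reports where the
// server tried to redirect the request, without following the redirect.
function get_html_content_with_redirect_info($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_ENCODING, 'gzip');
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, FALSE); // stay on the requested URL
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($ch, CURLOPT_URL, $url);
    $data = curl_exec($ch);

    if ($data === false) {
        $error = curl_error($ch);
        curl_close($ch);
        throw new RuntimeException('Curl error: ' . $error);
    }

    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
    // With FOLLOWLOCATION disabled, this is the URL cURL would have requested next.
    $redirect_url = curl_getinfo($ch, CURLINFO_REDIRECT_URL);
    curl_close($ch);

    return array(
        'data'         => $data,
        'status'       => $status,
        'redirect_url' => (is_redirect_status($status) && $redirect_url) ? $redirect_url : null,
    );
}
```

A caller could then check $result['redirect_url'] and decide whether to parse $result['data'] or to request the reported URI explicitly. If redirects are followed instead (CURLOPT_FOLLOWLOCATION set to TRUE), curl_getinfo($ch, CURLINFO_EFFECTIVE_URL) reports the URI that was finally fetched.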