I'm trying to do some HTML DOM parsing, and the parsing depends on the URI of the page. The problem is that when I load a page like this:

// Create HTML DOM
$dom_document = new DOMDocument();
@$dom_document->loadHTMLFile('http://www.google.com/');

I am sometimes redirected by the site (e.g. Google may redirect me to a country-specific domain). Questions:

  1. How do I prevent being redirected? I want to state explicitly which page I want to parse, and not be sent to another page. I don't need to use DOMDocument.
  2. If there is no way to prevent being redirected, is there at least a way to find out which URI I was sent to?

EDIT 1:

function get_html_content($url)
{
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_ENCODING, 'gzip');
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, FALSE); // not good for 301 redirects
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($ch, CURLOPT_URL, $url);

    $data = curl_exec($ch);

    // Check whether any error occurred
    if (curl_errno($ch))
    {
        echo 'Curl error: ' . curl_error($ch);
        assert(FALSE);
        die();
    }

    curl_close($ch);

    return $data;
}

1 Answer

The answer is "yes" on both counts, but not using loadHTMLFile().

If you can, use curl. It provides much more detailed control over redirections.

Fetch the contents with it, and import them to your DOMDocument using loadHTML().
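
As a rough sketch of that approach (not a definitive implementation): the helper names `fetch_page()` and `html_to_dom()` below are made up for illustration, and the code assumes PHP 7+ with the cURL and DOM extensions loaded. With `CURLOPT_FOLLOWLOCATION` enabled, `CURLINFO_EFFECTIVE_URL` reports the last URL cURL actually requested, which covers question 2.

```php
<?php
// Hypothetical helper: fetch a page with cURL and report where it ended up.
function fetch_page(string $url, bool $follow_redirects = true): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,              // return the body as a string
        CURLOPT_FOLLOWLOCATION => $follow_redirects, // follow Location headers or not
        CURLOPT_MAXREDIRS      => 5,                 // give up on redirect loops
        CURLOPT_CONNECTTIMEOUT => 5,
    ]);

    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL error: ' . curl_error($ch));
    }

    // The handle is released when $ch goes out of scope.
    return [
        'body'      => $body,
        'status'    => (int) curl_getinfo($ch, CURLINFO_HTTP_CODE),
        'final_url' => curl_getinfo($ch, CURLINFO_EFFECTIVE_URL),
    ];
}

// Hypothetical helper: parse an HTML string without emitting libxml warnings,
// instead of silencing them with the @ operator.
function html_to_dom(string $html): DOMDocument
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html); // note: PHP 8 rejects an empty string here
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $dom;
}

// Usage (performs a network request, so it is left commented out):
// $page = fetch_page('http://www.google.com/');
// echo $page['final_url'], "\n";
// $dom = html_to_dom($page['body']);
```

Passing `false` as the second argument keeps cURL on the URL you asked for, at the cost of receiving whatever the server sends for that URL, which may be a redirect response rather than the page.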

  • Please see Edit 1. I have decided I want to prevent the redirect, so I set FOLLOWLOCATION to FALSE. The problem now is that when I try to go to http://www.yahoo.com, it tries to redirect me, but cURL prevents it, and now I get nothing. Do you see what's wrong with my code? – StackOverflowNewbie Dec 04 '10 at 14:40
  • This is what I get: string(80) " " – StackOverflowNewbie Dec 04 '10 at 14:42
  • @StackOverflowNewbie What do you expect to get? If you want the .com page, it is best to open the site in your browser and see which address explicitly gets you there. You may need to set a cookie or some specific parameters to get the .com page. – Pekka Dec 04 '10 at 15:08
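
One way to check what is happening in the situation described in the comments, sketched under assumptions: with `CURLOPT_FOLLOWLOCATION` set to `FALSE`, a short or near-empty body may simply be the body of the redirect response itself. The helper names `is_redirect_status()` and `fetch_without_following()` are hypothetical, and the code assumes PHP 7+ with the cURL extension. `CURLINFO_REDIRECT_URL` (available since PHP 5.3.7) reports where the server wanted to send the request when redirects are not followed.

```php
<?php
// Hypothetical helper: true for the 3xx codes that normally carry a Location header.
function is_redirect_status(int $status): bool
{
    return in_array($status, [301, 302, 303, 307, 308], true);
}

// Hypothetical helper: request exactly $url and report whether the server
// answered with a redirect instead of the page.
function fetch_without_following(string $url): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => false, // stay on the requested URL
        CURLOPT_CONNECTTIMEOUT => 5,
    ]);

    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL error: ' . curl_error($ch));
    }

    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
    $target = curl_getinfo($ch, CURLINFO_REDIRECT_URL);

    return [
        'status'       => $status,
        'is_redirect'  => is_redirect_status($status),
        'redirect_url' => $target ?: null, // null when no redirect was requested
        'body'         => $body,
    ];
}

// Usage (performs a network request, so it is left commented out):
// $result = fetch_without_following('http://www.yahoo.com');
// var_dump($result['status'], $result['redirect_url']);
```

If `is_redirect` comes back true, the server is not offering the page at that URL at all, so the remaining options are to follow the redirect or to find the request (URL, cookie, or parameters) that yields the page directly.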