Tried to crawl this ny times article with Curl: Article
the function
function get_content_curl($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
He fails to crawl the article redirecting to a login page and crawl the login page but not the article. why?
how to prevent redirecting? i also tried CURLOPT_FOLLOWLOCATION, false
but dosent work. how to fix this?
Answer of my own question:
Added those 2 lines for creating and reading the cookies and it works.
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookies.txt');
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookies.txt');