I am trying to get some information from a UK retailer's website, and on many of the scrapes I have done it's quite simple. However, there are a number where I just cannot get around issues that, most of the time, are caused by cookies. This was a great SO question, but it hasn't helped.
I have the following PHP function...
function file_get_contents_curl_many_redir2( $url, $timeout = 15 ) {
    // Temporary cookie jar, plus an in-memory stream for cURL's verbose log
    $cookie  = tempnam( "/tmp", "CURLCOOKIE" );
    $verbose = fopen( 'php://temp', 'rw+' );

    $ch = curl_init();
    curl_setopt( $ch, CURLOPT_URL, $url );
    curl_setopt( $ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0" );
    curl_setopt( $ch, CURLOPT_COOKIEJAR, $cookie );      // write cookies here when the handle is closed
    curl_setopt( $ch, CURLOPT_COOKIEFILE, $cookie );     // read cookies from here
    curl_setopt( $ch, CURLOPT_COOKIESESSION, true );
    curl_setopt( $ch, CURLOPT_SSL_VERIFYPEER, false );   // certificate verification is disabled
    curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );
    curl_setopt( $ch, CURLOPT_ENCODING, "" );
    curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
    curl_setopt( $ch, CURLOPT_AUTOREFERER, true );
    curl_setopt( $ch, CURLOPT_CONNECTTIMEOUT, $timeout );
    curl_setopt( $ch, CURLOPT_TIMEOUT, $timeout );
    curl_setopt( $ch, CURLOPT_MAXREDIRS, 100 );
    curl_setopt( $ch, CURLOPT_VERBOSE, true );
    curl_setopt( $ch, CURLOPT_STDERR, $verbose );

    $data = curl_exec( $ch );

    // Dump the verbose log
    rewind( $verbose );
    $verboseLog = stream_get_contents( $verbose );
    echo "Verbose information:\n<pre>", htmlspecialchars( $verboseLog ), "</pre>\n";

    // Note: extract() overwrites $url with the effective (final) URL
    $curlVersion = curl_version();
    extract( curl_getinfo( $ch ) );
    $metrics = <<<EOD
URL....: $url
Code...: $http_code ($redirect_count redirect(s) in $redirect_time secs)
Content: $content_type Size: $download_content_length (Own: $size_download) Filetime: $filetime
Time...: $total_time Start @ $starttransfer_time (DNS: $namelookup_time Connect: $connect_time Request: $pretransfer_time)
Speed..: Down: $speed_download (avg.) Up: $speed_upload (avg.)
Curl...: v{$curlVersion['version']}
EOD;
    var_dump( $metrics );

    if ( curl_errno( $ch ) ) {
        echo 'Curl error: ' . curl_error( $ch ) . " on url: " . $url;
        var_dump( curl_getinfo( $ch, CURLINFO_HTTP_CODE ) );
    }

    curl_close( $ch );
    return $data;
}
On the majority of websites this works (in fact, I have simpler functions, as most don't redirect many times).
But with www.homebase.co.uk, when I use this URL: http://www.homebase.co.uk/SearchDisplay?pageSize=43&searchSource=Q&resultCatEntryType=2&pageView=&catalogId=10011&showResultsPage=true&beginIndex=0&langId=110&categoryId=&storeId=10201&sType=SimpleSearch&searchTerm=$sku where $sku is the 6-digit SKU that Homebase uses (which is also the last 6 digits of any product page URL), I get no information.
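For reference, here is a minimal sketch of how the search URL above can be assembled from a SKU. The helper name build_homebase_search_url is hypothetical (it is not part of my real code); the parameter names and values are copied verbatim from the URL above, and http_build_query() takes care of URL-encoding the search term.

```php
<?php
// Hypothetical helper: builds the search URL quoted above from a SKU.
// All parameter names and values are copied from that URL; only
// 'searchTerm' varies.
function build_homebase_search_url( $sku ) {
    $params = array(
        'pageSize'           => 43,
        'searchSource'       => 'Q',
        'resultCatEntryType' => 2,
        'pageView'           => '',      // empty in the original URL
        'catalogId'          => 10011,
        'showResultsPage'    => 'true',
        'beginIndex'         => 0,
        'langId'             => 110,
        'categoryId'         => '',      // empty in the original URL
        'storeId'            => 10201,
        'sType'              => 'SimpleSearch',
        'searchTerm'         => $sku,
    );

    // Pass '&' explicitly so the result does not depend on the
    // arg_separator.output ini setting.
    return 'http://www.homebase.co.uk/SearchDisplay?' . http_build_query( $params, '', '&' );
}

$url = build_homebase_search_url( 323526 );
echo $url, "\n";
```

The resulting $url is what I then pass to file_get_contents_curl_many_redir2().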
When I run the URL through Redirect Detective, it seems to be an unsupported browser issue, I presume because it's not a browser accessing the website, and the site knows this.
The Main Question:
How do I fix it so that I can get the correct source code for this page? (I am currently just getting blank text, not the product page(s) I want.)
Related questions
There's a handful of other websites that don't "let me in" either. Is this just cookie related, or is the coding of their website "smart enough" to know whether it's a browser or not? Can I fool it into thinking it is one? (This may be answered via my primary question.)
Related extra information that may help answer the above.
* When I set CURLOPT_MAXREDIRS to 100, it seems to only let me go up to 40. Could this be the issue?
* Using $sku = 323526; you should arrive, after some redirects, at the final URL http://www.homebase.co.uk/en/homebaseuk/sovereign-petrol-self-propelled-rotary-mower---1493cc---40cm-323526. I do not need to know where cURL ends up; I just want to be able to pinch the title, image and other info from the product page from knowing the SKU!