I am trying to get some information from a UK retailer's website. On many of the scrapes I have done this is quite simple, but there are a number of sites where I cannot get around issues that are, most of the time, caused by cookies. This was a great SO question, but it has not helped.

I have the following PHP function...

function file_get_contents_curl_many_redir2( $url, $timeout = 15 ) {

  $cookie = tempnam ("/tmp", "CURLCOOKIE");
  $verbose = fopen('php://temp', 'rw+');

  $ch = curl_init();
  curl_setopt( $ch, CURLOPT_URL, $url );
  curl_setopt( $ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0" );
  curl_setopt( $ch, CURLOPT_COOKIEJAR, $cookie );
  curl_setopt( $ch, CURLOPT_COOKIEFILE, $cookie ); // read cookies back from the same jar
  curl_setopt( $ch, CURLOPT_COOKIESESSION, true );
  curl_setopt( $ch, CURLOPT_SSL_VERIFYPEER, false); // disables TLS certificate checks; insecure, debugging only
  curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );
  curl_setopt( $ch, CURLOPT_ENCODING, "" );
  curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
  curl_setopt( $ch, CURLOPT_AUTOREFERER, true );
  curl_setopt( $ch, CURLOPT_CONNECTTIMEOUT, $timeout );
  curl_setopt( $ch, CURLOPT_TIMEOUT, $timeout );
  curl_setopt( $ch, CURLOPT_MAXREDIRS, 100 );

  curl_setopt( $ch, CURLOPT_VERBOSE, true);
  curl_setopt( $ch, CURLOPT_STDERR, $verbose);

  $data = curl_exec($ch);

  rewind($verbose);
  $verboseLog = stream_get_contents($verbose);
  echo "Verbose information:\n<pre>", htmlspecialchars($verboseLog), "</pre>\n";

  $curlVersion = curl_version();
  extract(curl_getinfo($ch));
  $metrics = <<<EOD
URL....: $url
Code...: $http_code ($redirect_count redirect(s) in $redirect_time secs)
Content: $content_type Size: $download_content_length (Own: $size_download) Filetime: $filetime
Time...: $total_time Start @ $starttransfer_time (DNS: $namelookup_time Connect: $connect_time Request: $pretransfer_time)
Speed..: Down: $speed_download (avg.) Up: $speed_upload (avg.)
Curl...: v{$curlVersion['version']}
EOD;
  var_dump($metrics);

  if(curl_errno($ch)){
    echo 'Curl error: ' . curl_error($ch) . " on url: " . $url;
    var_dump(curl_getinfo($ch, CURLINFO_HTTP_CODE));
  }
  curl_close($ch);

  return $data;
}

On the majority of websites this works (in fact, I use simpler functions for most of them, as few redirect many times).

But with www.homebase.co.uk, when I use the URL http://www.homebase.co.uk/SearchDisplay?pageSize=43&searchSource=Q&resultCatEntryType=2&pageView=&catalogId=10011&showResultsPage=true&beginIndex=0&langId=110&categoryId=&storeId=10201&sType=SimpleSearch&searchTerm=$sku where $sku is the six-digit SKU that Homebase uses (which is also the last six digits of any product page URL), I get no information.
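For reference, a minimal sketch of how the search URL above could be assembled from a SKU with `http_build_query()`. The function name `homebase_search_url` is hypothetical; the parameter names and values are copied from the URL quoted above, and nothing here is confirmed to be what the site requires.

```php
<?php
// Hypothetical helper: rebuilds the search URL quoted above for a given SKU.
// All parameter names and values are taken verbatim from that URL.
function homebase_search_url(string $sku): string
{
    $params = [
        'pageSize'           => 43,
        'searchSource'       => 'Q',
        'resultCatEntryType' => 2,
        'pageView'           => '',   // empty strings are kept as "pageView="
        'catalogId'          => 10011,
        'showResultsPage'    => 'true',
        'beginIndex'         => 0,
        'langId'             => 110,
        'categoryId'         => '',
        'storeId'            => 10201,
        'sType'              => 'SimpleSearch',
        'searchTerm'         => $sku, // URL-encoded by http_build_query()
    ];

    return 'http://www.homebase.co.uk/SearchDisplay?' . http_build_query($params);
}

echo homebase_search_url('323526'), "\n";
```

Building the query this way keeps the SKU URL-encoded and makes it easy to change a single parameter while testing.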

When I run the URL through Redirect Detective, it seems to be an unsupported-browser issue. I presume this is because it's not a browser accessing the website, and the site knows this.

The Main Question:
How do I fix it so that I get the correct source code of this page? (I am currently just getting blank text, not the product page(s) I want.)

Related questions
There's a handful of other websites that don't "let me in" either. Is this just cookie related, or is the coding of their website "smart enough" to know whether it's a browser or not? Can I fool it into thinking it is one? (This may be answered via my primary question.)

Related extra information that may help answer the above.
* When I set CURLOPT_MAXREDIRS to 100, it seems to only let me go up to 40. Could this be the issue?
* Using $sku = 323526, you should arrive, after some redirects, at the final URL http://www.homebase.co.uk/en/homebaseuk/sovereign-petrol-self-propelled-rotary-mower---1493cc---40cm-323526. I do not need to know where cURL ends up; I just want to be able to pull the title, image and other info from the product page knowing only the SKU!
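To see where the redirect chain stalls, one option is to turn off CURLOPT_FOLLOWLOCATION and follow each hop by hand, recording the status code along the way. The following is only a diagnostic sketch: `trace_redirects` is a hypothetical name, the User-Agent is the one from the function above, and it assumes the site redirects with ordinary 3xx responses (JavaScript or meta-refresh redirects would not show up).

```php
<?php
// Hypothetical diagnostic helper: follows redirects one hop at a time and
// records each URL and status code, instead of relying on FOLLOWLOCATION.
function trace_redirects(string $url, int $maxHops = 100): array
{
    $hops = [];
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => false, // we follow redirects ourselves
        CURLOPT_COOKIEFILE     => '',    // empty string: in-memory cookie engine
        CURLOPT_ENCODING       => '',
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0',
        CURLOPT_CONNECTTIMEOUT => 15,
        CURLOPT_TIMEOUT        => 15,
    ]);

    for ($i = 0; $i < $maxHops; $i++) {
        curl_setopt($ch, CURLOPT_URL, $url);
        $body = curl_exec($ch);
        if ($body === false) {
            // Transport-level failure: record it and stop.
            $hops[] = ['url' => $url, 'code' => 0, 'error' => curl_error($ch)];
            break;
        }

        $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        // Absolute URL of the next hop when the response is a redirect.
        $next = curl_getinfo($ch, CURLINFO_REDIRECT_URL);
        $hops[] = ['url' => $url, 'code' => $code, 'bytes' => strlen($body)];

        if ($code < 300 || $code >= 400 || !$next) {
            break; // final response, or a 3xx without a usable Location
        }
        $url = $next;
    }

    unset($ch); // release the handle
    return $hops;
}
```

Calling `print_r(trace_redirects($url))` with the search URL would list every hop, which should show whether the chain loops (for example, the same URL repeating because a cookie is not being accepted) or ends on an "unsupported browser" page.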

themrflibble
  • why are you switching browser UA strings partway through? it makes no sense to set FF 0.10.1 at the start, then change it to FF 31 before you do your exec call. – Marc B Apr 06 '15 at 18:21
  • That's because I copied two different functions together and didn't "see" this. Thank you, I have now changed it, but it has not changed my output. – themrflibble Apr 06 '15 at 18:34
