I am trying to scrape a shopping cart that uses cookies to select the display currency. When I load the site in Chrome and inspect it with Cookie Inspector for Chrome, it shows the following cookies. (Screenshot: cookie list, with two cookies highlighted.)

When I load the same link with cURL, the cookie jar contains only the following:

.example.com    TRUE    /   FALSE   1462357306  SSNC    CCSUBMIT-N
.example.com    TRUE    /   FALSE   1462357306  SSOE    PSORT-Y::CWR-on
.example.com    TRUE    /   FALSE   1464947780  SSLB    1
.example.com    TRUE    /   FALSE   1493891506  SSID_C  CACeuh1GAAAAAAAYxilXjl6BJhjGKVcBAAAAAABEVFFXGMYpVwANyBJPAAP1PQoAGMYpVwEAF04AA6sdCgAYxilXAQAOUAAD7V4KABjGKVcBACNQAAFUYgoAGMYpVwEAbk8AAQBICgAYxilXAQA
.example.com    TRUE    /   FALSE   0   SSSC_C  333.G6280768962372394638.1|19991.662955:20242.671221:20334.673792:20494.679661:20515.680532
.example.com    TRUE    /   FALSE   1493891506  SSRT_C  MsYpVwIBAw
.example.com    TRUE    /   FALSE   0   JSESSIONID  CDZHXpGSHymLMz4v!-751026475
.example.com    TRUE    /   FALSE   3609839127  mapp    0
.example.com    TRUE    /   FALSE   3609839153  dpi 2097201|2|release20160420v10t155721155722
.example.com    TRUE    /   FALSE   3609839153  lpi 2114737|2|release20160420v10t155721155722
.example.com    TRUE    /   FALSE   0   TS0119d048  01efad4706976f70b8f767b422999889abdfa7e7a9a300a247ca3f6dec4997a3ea8a5c9dbe800783f83027f6f389b2fc4134a3806b1de11ca96bf39add105698b8c22f1d300d568ea4395ae6adf29723d2f482180be92caa38977c2da954baebe461814696e5ca8be3f2f7087360909df7e5694ec8f5965475bfd2591cc6c843a2b4aac4752758d5cb2659b390c7632b7047ffdfe2
www.example.com FALSE   /   FALSE   0   TS01472329  01efad4706512021fdee50b1b891941c232f4ef7f5bf2d184606446c9ebf492848a3eab610
.example.com    TRUE    /   FALSE   3609839153  uui 800.606.6969%20/%20212.444.6615|
.example.com    TRUE    /   FALSE   0   ci  NS=Y|CM_MMC=|
.example.com    TRUE    /   FALSE   0   TS01c1e793  01efad47067448a038c37bf93bcdabbce3f89810c9711adfcf2561c8b38484b01c4523479562e5435383034ba6b231a0e3428234fab56386e2af0810f02b7abcf5f2d79d6e
.example.com    TRUE    /   FALSE   3609839153  sessionKey  CDZHXpGSHymLMz4v!-751026475!1462355506133
.example.com    TRUE    /   FALSE   3609839127  cookieID    89789790961462355480485
.example.com    TRUE    /   FALSE   0   dlc NS=Y|CM_MMC=|EMLH=|

This list clearly lacks the cookies highlighted in the image. I also tried deleting all cookies, disabling JavaScript, and reloading the page in the browser, and those two cookies were still set. So these cookies are not created by JavaScript.
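To make the comparison between the browser and the cURL cookie jar repeatable, a small helper along these lines could list which expected cookies are absent from the jar. This is only a sketch: the function names `cookieJarNames` and `missingCookies` are hypothetical, and the expected cookie name in the usage note is a placeholder, since the names of the highlighted cookies appear only in the screenshot. It assumes the tab-separated Netscape cookie file format that cURL writes for `CURLOPT_COOKIEJAR`.

```php
<?php
// Hypothetical helper: return the cookie names found in a Netscape-format
// cookie jar (the tab-separated format cURL writes via CURLOPT_COOKIEJAR).
function cookieJarNames(string $jarContents): array
{
    $names = [];
    foreach (preg_split('/\R/', $jarContents) as $line) {
        if ($line === '') {
            continue; // skip blank lines
        }
        if (strpos($line, '#HttpOnly_') === 0) {
            // cURL prefixes HttpOnly cookies with "#HttpOnly_"; these are real entries
            $line = substr($line, strlen('#HttpOnly_'));
        } elseif ($line[0] === '#') {
            continue; // any other "#" line is a comment
        }
        // Fields: domain, subdomain flag, path, secure flag, expiry, name, value
        $fields = explode("\t", $line, 7);
        if (count($fields) >= 6) {
            $names[] = $fields[5];
        }
    }
    return $names;
}

// Hypothetical helper: return the expected cookie names that the jar lacks.
function missingCookies(string $jarContents, array $expectedNames): array
{
    return array_values(array_diff($expectedNames, cookieJarNames($jarContents)));
}
```

For example, `missingCookies(file_get_contents('cookies.txt'), ['someExpectedCookie'])` would return the listed names that cURL never stored, where `someExpectedCookie` stands in for a name read off the browser's cookie inspector.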

The code that I have used:

$URL = "http://www.example.com/";
//ini_set('user_agent', 'Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.0 FirePHP/0.5 ');
//$context = stream_context_create (array ('http' => array ('timeout' => 60)));
$this->ch = curl_init();
$curlHeaders = array(
        'Host: www.example.com',
        'Connection: keep-alive',
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Upgrade-Insecure-Requests: 1',
        'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36',
        'Accept-Encoding: gzip, deflate, sdch',
        'Accept-Language: en-US,en;q=0.8',
        'Cookie: _gat=1'
);


$cookie = 'cookies.txt';

// visit the homepage to set the cookie properly
//$ch = curl_init();

$agent= 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.A.B.C Safari/525.13';
curl_setopt($this->ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($this->ch, CURLOPT_VERBOSE, true);
curl_setopt($this->ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($this->ch, CURLOPT_HEADER, false);
curl_setopt($this->ch, CURLOPT_HTTPGET, true);
curl_setopt($this->ch, CURLOPT_USERAGENT, $agent);
curl_setopt($this->ch, CURLOPT_HTTPHEADER, $curlHeaders);
curl_setopt($this->ch, CURLOPT_URL, $URL);
curl_setopt($this->ch, CURLOPT_COOKIEJAR, $cookie);
curl_setopt($this->ch, CURLOPT_COOKIESESSION, true);
curl_setopt($this->ch, CURLOPT_FOLLOWLOCATION, true);

ob_start();      // prevent any output
curl_exec ($this->ch); // execute the curl command
ob_end_clean();  // stop preventing output

// URL that loads when I change the currency from USD to AUD
$ausURL = "http://www.example.com/bnh/controller/home?O=RootPage.jsp&A=SetCurrency&Q=&saveCUR=Y&code=AUD";
curl_setopt($this->ch, CURLOPT_URL, $ausURL);
curl_exec($this->ch); // request the currency-switch URL

// then request the product page on the same handle
$url = "www.example.com/productPage/";
curl_setopt($this->ch, CURLOPT_ENCODING, "gzip");
curl_setopt($this->ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($this->ch, CURLOPT_REFERER, $ausURL);
curl_setopt($this->ch, CURLOPT_URL, $url);
curl_setopt($this->ch, CURLOPT_COOKIEFILE, $cookie);
$buffer = curl_exec($this->ch);
$fh = fopen($this->myFile, 'w') or die("can't open file");
fwrite($fh, $buffer . " -----------------buffer--------------------");
fclose($fh); // close the handle so the dump is flushed to disk
return $buffer;

The page fetched through cURL still shows USD pricing.

dharanbro

1 Answer

The site you are trying to parse is protected by Distil Networks (http://www.distilnetworks.com/). It uses various methods to detect automated scraping and to prevent price grabbing.

Distil injects hidden scripts into each page to validate the browser, so the site also requires JavaScript to be enabled in order to work normally.