I'm trying to scrape some content from a site. I eventually discovered that it requires cookies, so I solved that with the Guzzle cookie plugin. Strangely, `var_dump` on the response doesn't show me the page content, but `echo` does display the page, which makes me think some dynamic call fetches the data. I'm quite used to consuming APIs with Guzzle, but I'm not sure how I should treat this. Thanks.

If I use DomCrawler I get an error.

Code:

    use Symfony\Bundle\FrameworkBundle\Controller\Controller;
    use Symfony\Component\DomCrawler\Crawler;
    use Guzzle\Http\Client;
    use Guzzle\Plugin\Cookie\CookiePlugin;
    use Guzzle\Plugin\Cookie\CookieJar\ArrayCookieJar;

    $cookiePlugin = new CookiePlugin(new ArrayCookieJar());

    $url = 'http://www.myurl.com';

    // Add the cookie plugin to a client
    $client = new Client();
    $client->get();
    $client->addSubscriber($cookiePlugin);

    // Send the request with no cookies and parse the returned cookies
    $client->get($url)->send();

    // Send the request again, noticing that cookies are being sent
    $request = $client->get($url);
    $response = $request->send();

    var_dump($response);
    $crawler = new Crawler($response);

    foreach ($crawler as $domElement) {
        print $domElement->filter('a')->links();
    }

Error:

    Expecting a DOMNodeList or DOMNode instance, an array, a string, or null, but got "Guzzle\Http\Message\Response"
GAV

2 Answers

Try this:

For Guzzle 5

$crawler = new Crawler($response->getBody()->getContents());

http://docs.guzzlephp.org/en/latest/http-messages.html#id2 http://docs.guzzlephp.org/en/latest/streams.html#creating-streams

For Guzzle 3

$crawler = new Crawler($response->getBody());

http://guzzle3.readthedocs.org/http-client/response.html#response-body

Update

Basic usage of Guzzle 5 with getContents method.

include 'vendor/autoload.php';

use GuzzleHttp\Client;

$client = new Client();
echo $client->get('http://stackoverflow.com')->getBody()->getContents();

The rest is in the docs (including cookie handling).
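
As a rough illustration of how the pieces might fit together under Guzzle 5, here is a minimal sketch. It assumes `guzzlehttp/guzzle` 5.x and `symfony/dom-crawler` are installed through Composer. The helper names `fetchHtmlWithCookies` and `extractHrefs` are made up for this example, and the `cookies` request option with a `CookieJar` stands in for the Guzzle 3 cookie plugin used in the question.

```php
<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Cookie\CookieJar;
use Symfony\Component\DomCrawler\Crawler;

/**
 * Hypothetical helper: requests $url twice with a shared cookie jar
 * (the first request collects cookies, the second sends them back)
 * and returns the body of the second response as a string.
 */
function fetchHtmlWithCookies($url)
{
    $client = new Client();
    $jar = new CookieJar();

    $client->get($url, ['cookies' => $jar]);
    $response = $client->get($url, ['cookies' => $jar]);

    // Crawler wants a string, not a Response or stream object.
    return (string) $response->getBody();
}

/**
 * Hypothetical helper: returns the raw href attribute of every link.
 */
function extractHrefs($html)
{
    $crawler = new Crawler($html);

    $hrefs = [];
    // Iterating a Crawler yields plain DOMElement nodes.
    foreach ($crawler->filterXPath('//a[@href]') as $node) {
        $hrefs[] = $node->getAttribute('href');
    }

    return $hrefs;
}
```

Something like `print_r(extractHrefs(fetchHtmlWithCookies('http://www.myurl.com')));` would then list the links on the page. `filterXPath()` is used here instead of `filter()` so that the sketch does not also depend on `symfony/css-selector`.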

kba
  • It doesn't recognise getContents – GAV Apr 27 '15 at 16:04
  • I get the following error - Attempted to call method "getContents" on class "Guzzle\Http\EntityBody" – GAV Apr 27 '15 at 16:14
  • You are probably using the outdated version 3 of Guzzle. Try calling only the `getBody()` method, or `$response->getBody()->__toString()`. – kba Apr 27 '15 at 17:46
  • No I'm using the latest. – GAV Apr 28 '15 at 08:57
  • Trust me, you aren't. The class `Guzzle\Http\EntityBody` belongs to version 3, which is now deprecated. The newest version is [5.2.0](https://packagist.org/packages/guzzlehttp/guzzle). Did you try my last suggestion? – kba Apr 28 '15 at 09:04
  • I actually had version 5.0, I upgraded to 5.2 tried - $response->getBody()->getContents() got error - Attempted to call method "getContents" on class "Guzzle\Http\EntityBody" – GAV Apr 28 '15 at 14:39
  • OK, I believe you, but all the classes you use are from Guzzle 3: `Guzzle\Http\Client`, `Guzzle\Plugin\Cookie\CookiePlugin` and `Guzzle\Http\EntityBody`. Guzzle 5 classes live under the `GuzzleHttp\*` namespace. Could you show your composer.json? Maybe you have both versions installed and are using Guzzle 3 by accident. – kba Apr 28 '15 at 14:52
  • "require": { "php": ">=5.4.0", "symfony/symfony": "2.5.*", "sensio/framework-extra-bundle": "~3.0", "incenteev/composer-parameter-handler": "~2.0", "guzzle/http/guzzle": "~5.2", "symfony/dom-crawler": "3.0.*@dev", "guzzle/plugin-cookie": "3.7.*@dev" }, if I do what you suggest GuzzleHttp\ I get cURL error 3 – GAV Apr 29 '15 at 08:37
  • in the terminal if I do "composer.phar update" after changing the composer.json I get this which is strange 'Package guzzle/common is abandoned, you should avoid using it. Use guzzle/guzzle instead. Package guzzle/stream is abandoned, you should avoid using it. Use guzzle/guzzle instead. Package guzzle/parser is abandoned, you should avoid using it. Use guzzle/guzzle instead. Package guzzle/http is abandoned, you should avoid using it. Use guzzle/guzzle instead. Package guzzle/plugin-cookie is abandoned, you should avoid using it. Use guzzle/guzzle instead.' – GAV Apr 29 '15 at 08:51
  • You have a typo in your dependencies: `"guzzle/http/guzzle"` should be `"guzzlehttp/guzzle"`. Delete `"guzzle/plugin-cookie"` and update your dependencies. I updated my answer with a working example without cookies. If you have problems with Composer, delete the vendor folder and composer.lock, then run the install command. – kba Apr 29 '15 at 09:19
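
Applying the two changes from the last comment to the `require` section quoted earlier in the thread would give roughly the following. The other entries are copied unchanged from that comment and have not been checked for mutual compatibility, so treat this as a sketch rather than a verified configuration.

```json
{
    "require": {
        "php": ">=5.4.0",
        "symfony/symfony": "2.5.*",
        "sensio/framework-extra-bundle": "~3.0",
        "incenteev/composer-parameter-handler": "~2.0",
        "guzzlehttp/guzzle": "~5.2",
        "symfony/dom-crawler": "3.0.*@dev"
    }
}
```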
If you instantiate your crawler object like `$crawler = new Crawler($response);`, you will receive all kinds of URI-based errors when you attempt to use any of the form- or link-based features of the Crawler object.

I recommend instantiating your Crawler object like:

$crawler = new Symfony\Component\DomCrawler\Crawler(null, $response->getEffectiveUrl());

$crawler->addContent(
    $response->getBody()->__toString(),
    $response->getHeader('Content-Type')
);

This is also how Symfony\Component\BrowserKit\Client does it within its createCrawlerFromContent method. Symfony\Component\BrowserKit\Client is used internally by Goutte.
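
To show what the base URI buys you, here is a small sketch that needs only `symfony/dom-crawler` (assumed to be installed through Composer). The helper `absoluteLinkUris` and the example.com URL are invented for illustration; the Crawler calls are the same ones used above.

```php
<?php
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

/**
 * Hypothetical helper (not part of DomCrawler): resolves every <a href>
 * in $html against $baseUri and returns the absolute URIs.
 */
function absoluteLinkUris($html, $baseUri, $contentType = 'text/html')
{
    // The base URI is what lets links() build Link objects.
    $crawler = new Crawler(null, $baseUri);
    $crawler->addContent($html, $contentType);

    $uris = [];
    foreach ($crawler->filterXPath('//a[@href]')->links() as $link) {
        $uris[] = $link->getUri();
    }

    return $uris;
}

$html = '<html><body><a href="/about">About</a></body></html>';
print_r(absoluteLinkUris($html, 'http://example.com/page'));
```

With a Guzzle response, the call would presumably look like `absoluteLinkUris((string) $response->getBody(), $response->getEffectiveUrl(), (string) $response->getHeader('Content-Type'))`; the string casts are there because Guzzle 3 returns objects for the body and the header.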

Shaun Bramley