Do I need a headless browser to scrape for CSS attributes?

Question

My goal is to pull particular CSS attribute values off a webpage. I've set up a scraper using Guzzle and Symfony's css-selector. However, I've realized that css-selector doesn't work the same way as jQuery; as far as I can tell, there's no .attr() method.

Am I correct in thinking that I need to use a headless browser (Mink, headless Chrome, PhantomJS) to render the page and then find the attributes?

What attribute are you trying to get? The [Symfony crawler](https://symfony.com/doc/current/testing.html#extracting-information) has an `attr` method. — Alvin Bunk, Jun 18 '17 at 00:34
CSS attributes, like font family, color, etc. I think I need to render the page with a browser in order to get the active attributes. Symfony crawler looks like it's getting the HTML attributes. — icicleking, Jun 19 '17 at 03:00

score 1 · Answer 1 · answered Jun 19 '17 at 07:53

Mink is a good option because of the API it offers and because it can work with several drivers (Goutte, Gecko/Firefox...).

If the CSS is not modified by JavaScript, Mink + Goutte may be the best option. If JavaScript does modify the CSS, a Mink + Selenium (or Mink + Zombie) configuration may be better. Keep in mind that this second approach is harder to set up and slower than the Goutte one.

The way you access the DOM is different from jQuery, but the selectors are about the same; in fact, Mink offers four types of selectors.

You can do almost everything with the "xpath" selector. I also recommend considering the "css" selector plus NodeElement methods, because it's simpler and covers most cases.

Here is an example based on Wikipedia, with two approaches:

Imagine you go to wikipedia.org and want to get the link to the English entry:

$xPath = '//a[@id="js-link-box-en"]/@href';
$nodeElement = $this->getSession()->getPage()->find('xpath', $xPath);
$theHrefValue = $nodeElement->getText();

Alternatively:

$nodeElement = $this->getSession()->getPage()->find('css', '#js-link-box-en');
$theHrefValue = $nodeElement->getAttribute('href');

I hope it will help you when making a decision :)

Mink looks like an interesting solution. I guess I have yet to see code that purports to do what I'm looking for, namely retrieve the css attribute values from a webpage. In your code above it looks like you're looking for the `href` attribute on the node. I'm looking to get, say, the color of the link. Does Goutte do this? I realize in my question I'm saying "CSS attribute", not "CSS attribute value." Updating for clarity. — icicleking, Jun 22 '17 at 18:39
I am afraid you can't do that with Goutte. Instead of Goutte, you can use Selenium2Driver and evaluate JavaScript, so you can "run jQuery on the server side" to get the same functionality as if you were using jQuery on the client side, if that makes sense to you. — Samuel Vicent, Jun 25 '17 at 11:20
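
To make the Selenium2Driver suggestion concrete, here is a minimal sketch of reading a computed CSS value by evaluating JavaScript through Mink. It assumes behat/mink and behat/mink-selenium2-driver are installed and that a Selenium server is running. The helper names computedStyleScript and fetchComputedStyle are made up for this illustration, and the Selenium URL in the usage note is only the driver's usual default. It uses the browser's own window.getComputedStyle rather than jQuery, so nothing has to be loaded into the page.

```php
<?php
// Sketch only: assumes behat/mink and behat/mink-selenium2-driver are installed.
use Behat\Mink\Session;

/**
 * Build a JavaScript expression that returns the computed value of one CSS
 * property for the first element matching $selector (or null if none matches).
 * Hypothetical helper, not part of Mink.
 */
function computedStyleScript(string $selector, string $property): string
{
    // json_encode() turns the PHP strings into quoted JavaScript string literals.
    return sprintf(
        '(function () { var el = document.querySelector(%s); '
        . 'return el ? window.getComputedStyle(el).getPropertyValue(%s) : null; })()',
        json_encode($selector),
        json_encode($property)
    );
}

/**
 * Visit $url in a started Mink session and return the computed CSS value.
 * Hypothetical helper; needs a JavaScript-capable driver such as Selenium2Driver.
 */
function fetchComputedStyle(Session $session, string $url, string $selector, string $property): ?string
{
    $session->visit($url);
    $value = $session->evaluateScript(computedStyleScript($selector, $property));

    return $value === null ? null : (string) $value;
}
```

Usage would look roughly like this, with the browser name and server URL as assumptions about your setup: create `new \Behat\Mink\Driver\Selenium2Driver('chrome', null, 'http://localhost:4444/wd/hub')`, wrap it in `new Session($driver)`, call `$session->start()`, then `fetchComputedStyle($session, 'https://www.wikipedia.org', '#js-link-box-en', 'color')`, and finally `$session->stop()`. The value comes back in whatever form the browser reports for computed styles, so expect something like an rgb(...) string rather than the keyword written in the stylesheet.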

score 1 · Accepted Answer · answered Jun 22 '17 at 18:51

PhantomJS (http://phantomjs.org/) is a good one, which I use for unit testing.

Chrome just released the ability to run the browser in headless mode in v59. However, it does not work on Windows yet. From the announcement:

Headless Chrome is shipping in Chrome 59. It's a way to run the Chrome browser in a headless environment. Essentially, running Chrome without chrome! It brings all modern web platform features provided by Chromium and the Blink rendering engine to the command line.

Why is that useful?

A headless browser is a great tool for automated testing and server environments where you don't need a visible UI shell. For example, you may want to run some tests against a real web page, create a PDF of it, or just inspect how the browser renders an URL.

Caution: Headless mode is available on Mac and Linux in Chrome 59. Windows support is coming in Chrome 60. To check what version of Chrome you have, open chrome://version.

You can find more info here: https://developers.google.com/web/updates/2017/04/headless-chrome
