How to serve up HTML snapshots of an AJAX app with a headless browser, from PHP?

Question

I'm having real trouble working out how to fire up a headless browser to serve static HTML snapshots of a site that uses JavaScript (sammy.js, to be specific) to deliver its AJAX content.

I'm working off Google's specification for making AJAX apps crawlable:

http://code.google.com/web/ajaxcrawling/docs/getting-started.html

which for the most part is great and very clear, and I'm having no problems picking up the ?_escaped_fragment_ URLs.

Most of the templating is done server side, so I was tempted to just write a PHP snapshot-building file that uses the same regex matches as the sammy app code (there are a lot of routes) to include the various template files. However, a lot of the action happens in the JavaScript app, so it would mean mirroring all of that processing in PHP, which then means maintaining both files side by side, cross-language - which is a lot of work!
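To make the regex-mirroring idea concrete, here is a minimal sketch of a route table that maps route patterns to template files. It is written in JavaScript to sit alongside the sammy.js side; the patterns, template file names, and the `resolveTemplate` function are all hypothetical and not taken from the actual app.

```javascript
// Hypothetical route table: each entry pairs a route regex with a template file.
// None of these patterns or file names come from the real application.
const routes = [
  { pattern: /^\/products\/(\d+)$/, template: "product.html" },
  { pattern: /^\/about$/, template: "about.html" },
];

// Return the template and captured parameters for a route, or null if none match.
function resolveTemplate(route) {
  for (const { pattern, template } of routes) {
    const match = pattern.exec(route);
    if (match) {
      // match[0] is the whole string; the capture groups start at index 1.
      return { template, params: match.slice(1) };
    }
  }
  return null;
}
```

A PHP version would presumably do the same thing with `preg_match`, which is exactly the duplication described above: the same patterns would have to be kept in sync in two languages.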

Now, I've read that you can use a headless browser to 'render' the page and execute all the JavaScript (matching the #!/ route and delivering the correct content for the request) and then return the entire DOM contents as HTML, which would be served to Googlebot.
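Whatever does the rendering first needs the #! URL that corresponds to the crawler's request. Under Google's scheme the crawler moves everything after `#!` into the `_escaped_fragment_` query parameter, so the mapping can be reversed. A minimal sketch, assuming a runtime with the standard `URL` class (such as Node.js); the function name `toHashBangUrl` is made up for illustration:

```javascript
// Turn a crawler request URL containing _escaped_fragment_ back into the
// #! URL that the JavaScript app would handle in a normal browser.
function toHashBangUrl(requestUrl) {
  const url = new URL(requestUrl);
  // searchParams.get() returns the percent-decoded value, or null if absent.
  const fragment = url.searchParams.get("_escaped_fragment_");
  if (fragment === null) {
    return null; // not a snapshot request
  }
  // Drop the crawler-only parameter but keep any other query parameters.
  url.searchParams.delete("_escaped_fragment_");
  url.hash = "#!" + fragment;
  return url.toString();
}
```

For example, `toHashBangUrl("http://example.com/?_escaped_fragment_=/products/42")` gives `http://example.com/#!/products/42`, and a URL without the parameter gives `null`. Loading that URL in a headless browser and capturing the resulting DOM is a separate step that this sketch does not cover.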

I've searched long and hard and can't find any step-by-step guides on running headless browsers from PHP (for total Java newbs). Which I suppose means I just don't know what to search for.

What I'm wondering: is it even more work to set up and use a headless browser to serve up these HTML snapshots? And if so, is it worth doing anyway?

Also, if there are any guides you could point me to, that'd be great!

Thanks!

Joss

Maybe not - except that I'd imagine it'd be done using java technology. — William Joss Crowcroft, Mar 21 '11 at 04:14

score 2 · Answer 1 · answered Mar 20 '11 at 15:55

I think you're better off replicating on the server what you've got on the client side. Though it might seem like an inefficient undertaking, it's at least got a clear and limited scope.

Most of the reputable headless browsers are designed as testing tools for application development. Accordingly, they are very open-ended in their structure, which is a good thing if you're responsible for the QA of an application, but not so much if you want to do just one specific thing with it.

I used Selenium-RC to do just one specific thing on a particular project, and found that dealing with all the Selenium-related concerns quickly became a project unto itself. Though Selenium-RC could certainly accomplish what you're trying to do, it just seems like a big commitment given the specificity of what you're looking to accomplish.

(Being a complete Java amateur myself, I can't really comment on HtmlUnit, but on spec alone it seems more appropriate for your needs than Selenium-RC. It wouldn't surprise me, though, if using it had some of the same setup and management demands.)

So back to the alternative of duplicating everything in PHP...

Keep in mind that the HTML snapshots don't need to be exactly identical to what a user sees in the browser: as long as you've got the core content and the key navigational links, Googlebot will have most of what it needs. Do you also need to have every single page on your site indexed? Or could you identify the handful of routes that really matter most, and just provide snapshots of those? You could also use web analytics or server log data to better inform snapshot priorities.
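As a rough illustration of using traffic data to pick which routes to snapshot first, here is a minimal JavaScript sketch. It assumes you have already extracted a flat list of visited routes from your analytics or logs; that input format and the `topRoutes` function are assumptions for the example, not something a particular analytics tool provides.

```javascript
// Rank routes by how often they appear in a list of recorded hits and
// return the most visited ones. The input is a plain array of route strings.
function topRoutes(hits, limit) {
  const counts = new Map();
  for (const route of hits) {
    counts.set(route, (counts.get(route) || 0) + 1);
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1]) // most hits first
    .slice(0, limit)
    .map(([route]) => route);
}
```

Calling `topRoutes(["/a", "/b", "/a", "/c", "/b", "/a"], 2)` returns `["/a", "/b"]`, which would be the first two routes to build snapshots for.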

Thanks for your answer! Can't vote up as not enough rep... Yeah it's funny, I think I must have had a slight misconception about the way headless browsers work.. having said that, the Google AJAX SEO guidelines do suggest it. I'm sure it's easy enough to do, if you work at Google.. I think I'll go the PHP route as you suggest. — William Joss Crowcroft, Mar 21 '11 at 04:12

score 0 · Answer 2 · answered Apr 26 '11 at 09:46


To anybody wondering - I've since worked out how to do exactly what was needed using node.js, and will publish it on github soon, and update the question...


William Joss Crowcroft


I'm still waiting too :) – Thomas Lang Sep 05 '13 at 21:48
I'm still waiting. I feel like the guardian of the Holy Grail in Indiana Jones and the Last Crusade. – J.T. Feb 12 '14 at 19:03
