
After reading Google's policy on making Ajax-generated content crawlable, along with many developers' blog posts and Stack Overflow Q&A threads on the subject, I'm left with the conclusion that there is no way to make a site with only JavaScript/Ajax-generated HTML crawlable. A site I'm currently working on isn't getting a significant amount of its content indexed. All of the presentation layer for our non-indexed content is built in JavaScript by generating HTML from JSON returned by Ajax-based web service calls, and we believe Google is not indexing the content because of that. Is that correct?

The only solution seems to be to also have a "fall-back" version of the site for search engines (specifically Google) where all the HTML and content would be generated as it traditionally has been, on the server-side. For clients with JavaScript enabled, it seems that we could use essentially the same approach that we do now: using JavaScript to generate HTML from asynchronously loaded JSON.

Reading around, my understanding is that the current best practice for applying the DRY principle in creating crawlable Ajax-generated websites as described above is to use a templating engine that can use the same templates on the client-side and the server-side. For clients with JavaScript enabled, the client-side templating engine, for example mustache.js, would transform JSON data sent from the server into HTML as defined by its copy of a template file. And for search crawlers and clients with JavaScript disabled, the server-side implementation of the same templating engine, for example mustache.java, would similarly operate on its copy of the same exact template file to output HTML.
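For example, the client-side flow we have in mind would look roughly like this sketch (the template path, JSON endpoint, and element ID are placeholders; it assumes jQuery and mustache.js are already loaded):

```javascript
// Fetch the shared template and the JSON data, then render on the client.
// A server-side engine such as mustache.java would render the same template
// file for crawlers and clients without JavaScript.
$.get('/templates/bio.mustache', function (template) {
    $.getJSON('/api/bio/42', function (data) {
        document.getElementById('bio').innerHTML = Mustache.render(template, data);
    });
});
```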

If that solution is correct, then how is this different from approaches used 4 or 5 years ago by front-end-heavy sites, where sites essentially had to maintain two copies of the templating code, one copy for users with JavaScript enabled (nearly everyone) and another copy (e.g. in FreeMarker or Velocity) for search engines and browsers without JavaScript enabled (nearly no one)? It seems like there should be a better way.

Does this imply that two templating model layers would need to be maintained, one on the client side and one on the server side? How advisable is it to combine those client-side templates with a front-end MVC (MV*/MVVM) framework like Backbone.js, Ember.js, or YUI App Library? How do these solutions affect maintenance costs? Would it be better to try doing this without introducing more frameworks -- a new templating engine and a front-end MVC framework -- into a development team's technology stack? Is there a way to do this less redundantly?

If that solution isn't correct, then is there something we're missing and could be doing better with our JavaScript to keep our existing asynchronous HTML-from-JSON structure and get it indexed, so we don't need to introduce something new to the architecture stack? We'd really rather not have to update two versions of the presentation layer when business needs change.

jqp
  • Sorry, I'm not going to be any help in this matter; I just wanted to say _"Yes! Indeed, this is a pain in the ass."_ – fguillen Apr 19 '12 at 08:56

5 Answers


Why didn't I think of this before! Just use http://phantomjs.org. It's a headless WebKit browser. You'd just build a set of actions to crawl the UI and capture the HTML at every state you'd like. Phantom can turn the captured HTML into .html files for you and save them to your web server.

The whole thing would be automated to run on every build/commit (PhantomJS is command-line driven). The JS code you write to crawl the UI would break as you change the UI, but it shouldn't be any worse than automated UI testing, and it's just JavaScript, so you can use jQuery selectors to grab buttons and click them.
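A rough sketch of what one capture script could look like (the URL, output path, and wait time are placeholders I made up):

```javascript
// snapshot.js -- run with: phantomjs snapshot.js
var fs = require('fs');
var page = require('webpage').create();

page.open('http://example.com/#!/some-page', function (status) {
    if (status !== 'success') {
        console.log('failed to load page');
        phantom.exit(1);
        return;
    }
    // Give the page's Ajax calls and rendering a moment to finish.
    window.setTimeout(function () {
        fs.write('snapshots/some-page.html', page.content, 'w');
        phantom.exit();
    }, 2000);
});
```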

If I had to solve the SEO problem, this is definitely the first approach I'd prototype. Crawl and save, baby. Yessir.

SimplGy
  • From glancing at Phantom.js, it seems like it might be a decent solution for capturing static pages, which could then be returned for clients without JavaScript enabled as you say. But would it solve the problem for dynamic pages, which have non-static content that needs to be crawled and indexed? For example, consider a page with variable values for 'title' and 'h1' elements. – jqp May 07 '12 at 17:57
  • I believe so. Sometimes 'static' and 'dynamic' are overloaded terms though. Phantom.js will render in its headless browser engine whatever a normal webkit browser would render. This includes running javascript, firing ajax requests, and (if you tell Phantom to do so) clicking on links like "load more rows of data". – SimplGy May 08 '12 at 14:32
  • The pseudocode would be something like: ' var page = Phantom.visit('www.google.com') page.find('input.search').val('find pretty hats') page.find('input[type=submit]').click() ' Of course, my syntax is all wrong. :) – SimplGy May 08 '12 at 14:34
  • By "dynamic" I mean, for example, a biography page that has fields for name, DOB, etc. but where values for those fields vary from person to person (each person having a different biography page). Say those field values (e.g. "John Doe" for "name") have SEO value. From my understanding of Phantom.js, each person's biography page would need to be scraped, have a static HTML file produced for the rendered data, and have that static HTML file copied to the server. If there were more than a handful of these biography pages, then this approach seems untenable. Is that the case with Phantom.js? – jqp May 08 '12 at 17:59
  • Ah, yes. So you'd want to write a loop in Phantom.js that accesses the dynamic page for every user. You'll want to provide a list of users. Maybe you write a simple web service that returns a list of all user ids, then in phantom you can write something like: ' var users = ajaxGet('svc/getAllUsers'); for(var i = 0; i < users.length; i++) { var page = Phantom.visit('userBio?id='+users[i].id); page.saveStaticHtml(); } – SimplGy May 11 '12 at 15:27
  • The problem with this approach is that it requires creating a new static HTML page for (continuing this example) every biography page. If the site has 10,000+ biography pages, this approach would require 10,000+ individual HTML pages. Further, it seems like it would be a huge undertaking to quickly update those static HTML pages as the data behind them were updated. This is a creative idea, but it seems like it would have major problems on a large site in which data were updated with moderate, random frequency. Server-side templates for non-JS clients seem like a much better approach. – jqp May 11 '12 at 16:52
  • I just found http://nrabinowitz.github.com/pjscrape/, a Phantom.js-based framework that crawls/recursively scrapes a site. Something like this would likely be needed with this approach. Although significant issues would remain, pjscrape makes this Phantom.js approach seem a bit more viable. It's still unclear whether server-side templates or this approach would be better. – jqp May 11 '12 at 18:21
  • This answer simply **trades one type of duplication for another**. For any data driven website, rebuilding this static html index of the entire site each build is simply far too infrequent. Doing it at runtime for the google bot not only duplicates the content, but requires extra logic to simulate clicks and user interactions. **And** you have to manage & invalidate a cache of static html. – Zachary Yates Feb 12 '13 at 18:15

I think a combination of a few technologies and one manually coded hack that you could reuse would fix you right up. Here's my crazy, half-baked idea. It's theoretical and probably not complete. Step 1:

  • Use client-side templates, like you suggest. Put every template in a separate file (so that you can reuse them easily between the client and the server).
  • Use underscore.js templating, or reconfigure Mustache. This way you'll get ERB-style delimiters in your templates, just like JSP's <%= %> syntax.
  • Since they're separate files, you'll want to start developing in CommonJS-style modules, with a module loader like curl.js or require.js to load the templates in your client-side code. If you aren't doing modular development yet, it's pretty awesome; I started about a month ago. It seems hard at first, but it's just a different way to wrap your code: http://addyosmani.com/writing-modular-js/ (see the sketch after this list)
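A rough sketch of the client side, assuming the RequireJS text plugin and that underscore and jQuery are already loaded as globals (the template path, endpoint, and element ID are made up):

```javascript
// Load the shared template file as text, compile it once with underscore,
// then render JSON from the server into the page.
require(['text!templates/bio.html'], function (bioTemplate) {
    var render = _.template(bioTemplate);   // uses <%= %> delimiters by default
    $.getJSON('/api/bio/42', function (data) {
        $('#bio').html(render(data));       // same template file the server could reuse
    });
});
```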

Ok, so now you have isolated templates. Now we just need to figure out how to build a flat page out of them on the server. I only see two approaches. Step 2:

  • You could annotate your JS so that the server can read it and see the default path for Ajax calls and which templates they link to; the server can then use those annotations to call the controller methods in the right order and fill out a flat page.
  • Or you could annotate your templates to indicate which controller they should call and provide example call params (a made-up example follows this list). This would be easy to maintain and would benefit front-end devs like me who have to look up controller URLs all the time. It would also tell your back-end code what to call.
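For instance, the annotation could be nothing more than a comment header in the template file that the server knows how to parse; the paths and field names here are entirely made up:

```html
<!-- @controller: /api/bio/:id    @example: /api/bio/42 -->
<h1><%= name %></h1>
<p>Born <%= dob %> in <%= birthplace %></p>
```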

Hope this helps. Curious to hear the best answer to this. An interesting problem.

SimplGy

Use Distal templates. Your website is static HTML, which is crawlable, and Distal treats the static HTML as a template.

Kernel James
  • This looks like a cool templating engine, but I don't see how it's different from _.template or Mustache in terms of auto-filling-out on the server side – SimplGy Apr 24 '12 at 18:13
  • If you want a site that's both crawlable and dynamic and don't want to maintain 2 sets of templates (1 generated by the server so it's crawlable and 1 for use by Javascript) then you can't use Underscore. Underscore requires you to store a copy of the templates inside the Javascript. – Kernel James Apr 25 '12 at 03:13
  • Not quite true, although that is the default use case. If you put a template from any engine in a separate file and load the files using a CommonJS loader, they are not required to be stored in javascript or in html, so you can reuse them server-side. – SimplGy Apr 25 '12 at 14:39

We do use PhantomJS for this purpose, just as described above. That works great if you have the rights to run it on your host.

If that is not an option, or if you simply don't want to deal with that yourselves, we have a free service that does this. See this post for more info: http://rogeralsing.com/2013/08/06/seo-indexing-angularjs-sites-or-other-ajax-sites-with-wombit-crawlr/

Roger Johansson

I have found a solution that does not require Java, Node.js, or any other way of making a redundant copy of a JS-code-generated website. It also supports all browsers.

So what you need to do is provide a snapshot for Google. It's the best solution, because you don't need to mess with other URLs and so on. Also, you don't have to add a noscript version to your basic website, so it stays lighter.

How do you make a snapshot? PhantomJS, HtmlUnit, and so on require a server where you can install and call them. You need to configure them and integrate them with your website, and that is a mess. Unfortunately, there is no PHP headless browser, which is understandable given the nature of PHP.

So what is another way of getting a snapshot? Well... if a user opens the website, you can capture a snapshot of what they see with JavaScript (innerHTML).

So what you need to do is:

  • check whether you already have a snapshot for the page (if you do, you don't need to take another)
  • send this snapshot to the server to be saved to a file (PHP handles the POST request with the snapshot and saves it to a file)

And when Googlebot visits your hash-bang website, you serve the saved snapshot file for the requested page.
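A minimal sketch of the client-side part (the save-snapshot.php endpoint name and the 2-second delay are just placeholders for whatever your site uses):

```javascript
// After the page has rendered its Ajax content, capture what the user sees
// and POST it to the server, which writes it to a file keyed by the page path.
window.setTimeout(function () {
    var snapshot = document.documentElement.innerHTML;
    var xhr = new XMLHttpRequest();
    xhr.open('POST', '/save-snapshot.php?page=' +
        encodeURIComponent(location.pathname + location.hash));
    xhr.setRequestHeader('Content-Type', 'text/html');
    xhr.send(snapshot);
}, 2000);
```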

Things to solve:

  • security: you don't want any script injected by a user or their browser saved into the snapshot; it may be best if only you can generate snapshots (see the sitemap idea below)
  • compatibility: you don't want to save snapshots from just any browser, only from one that supports your website best
  • don't bother mobile users: don't use mobile visitors to generate snapshots, so the page isn't slower for them
  • failover: if you don't have a snapshot, output the standard website; that's not great for Google, but it's still better than nothing

There is also one catch: not all pages will be visited by users, but you need snapshots ready for Google before anyone visits them.

So what can you do? There is a solution for this as well:

  • generate a sitemap that lists all the pages on your website (it must be generated on the fly to stay up to date, and crawler software doesn't help here because it doesn't execute JS)
  • visit, by any means, the pages from the sitemap that don't have a snapshot yet; this will trigger the snapshot code and generate one properly
  • revisit regularly (daily?)

But how do you visit all those pages? Well, there are a few options:

  • write an app in Java, C#, or another language that fetches the list of pages to visit from the server and visits them with a built-in browser control; add this to a schedule on the server
  • write a JS script that opens the required pages in an iframe, one after another (see the sketch after this list); add this to a schedule on a computer
  • just run that script manually if your site is mostly static
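A rough sketch of the iframe idea from the second bullet, assuming a hypothetical /sitemap.json endpoint that returns an array of page URLs:

```javascript
// Load each page from the sitemap in a hidden iframe so that its own
// snapshot code runs and POSTs the result to the server.
var frame = document.createElement('iframe');
frame.style.display = 'none';
document.body.appendChild(frame);

var xhr = new XMLHttpRequest();
xhr.open('GET', '/sitemap.json');
xhr.onload = function () {
    var urls = JSON.parse(xhr.responseText);
    var i = 0;
    (function next() {
        if (i >= urls.length) { return; }
        frame.src = urls[i++];          // visiting the page triggers its snapshot code
        window.setTimeout(next, 5000);  // give each page time to render and save
    })();
};
xhr.send();
```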

Also remember to refresh old snapshots occasionally to keep them up to date.

I'd like to hear what you think about this solution.

Tom Smykowski