I'm writing a web crawler and want to do what Google does when it encounters a #! URL in a page it has retrieved. If a URL doesn't contain #!, Google simply adds it to the list of pages it will eventually fetch and index, but when it sees #! it does something special, as described in Google's "Getting started with Ajax crawling" document. When Google sees a URL that contains #!, it modifies the URL, does an HTTP GET for the modified URL, then indexes the retrieved page as if it had retrieved the #! URL (rather than the URL it actually requested). I'm trying to emulate that transformation, which is not fully described.
The referenced page partly describes how Google modifies the URL, and it tells web site authors how to reverse the transformation so that they can recover the original URL and return the data they want indexed under the #! URL. One thing that page says is: "Note: The crawler escapes certain characters in the fragment during the transformation. To retrieve the original fragment, make sure to unescape all %XX characters in the fragment. More specifically, %26 should become &, %20 should become a space, %23 should become #, and %25 should become %, and so on."
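The unescaping rule in that note is ordinary percent-decoding, so (as a sketch) Python's standard `urllib.parse.unquote` performs exactly the %XX replacements the document lists:

```python
from urllib.parse import unquote

def original_fragment(escaped_fragment: str) -> str:
    """Reverse the escaping described in Google's document: turn
    every %XX sequence back into its literal character."""
    return unquote(escaped_fragment)

# The examples from the quoted paragraph:
print(original_fragment("%26"))        # -> &
print(original_fragment("%20"))        # -> a space
print(original_fragment("a%23b%25c"))  # -> a#b%c
```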
"The transformation" mentioned is to replace #!
with ?_escaped_fragment=
and to escape some special characters in the text following the #!
. That text tells web site authors to reverse the transformation by (in part) unescaping %XX in the text that -- in the modified URL -- follows ?_escaped_fragment=
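On the site-author side, that reversal amounts to pulling the `_escaped_fragment_` query parameter out of the requested URL and percent-decoding its value. A minimal Python sketch (the function name is mine, not from Google's document):

```python
from typing import Optional
from urllib.parse import urlsplit, parse_qs

def fragment_from_request(request_url: str) -> Optional[str]:
    """Given the URL the crawler actually requested, recover the
    original fragment (the text that followed #!), or None if the
    request carries no _escaped_fragment_ parameter."""
    query = urlsplit(request_url).query
    params = parse_qs(query, keep_blank_values=True)
    values = params.get("_escaped_fragment_")
    if values is None:
        return None
    # parse_qs has already unescaped the %XX sequences.
    return values[0]

print(fragment_from_request(
    "http://example.com/page?_escaped_fragment_=key=value%26x=a%20b"))
# -> key=value&x=a b
```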
The question is: how do I know which special characters to escape, so that my crawler can request the same replacement URL that Google would? In the quoted paragraph, Google lists some characters it will escape, but the "and so on" at the end suggests the full list is longer -- and it is not given anywhere. In theory every character (even letters) could be escaped as %XX, but the chances that every web site would handle that correctly are not high. How can I figure out which characters Google escapes, so my crawler will request the same URL Google would?
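For concreteness, here is a sketch of the forward transformation in Python. The `safe` set below is exactly the open question: this guess escapes the characters the document names (%, &, #, space) plus anything else `urllib.parse.quote` treats as unsafe, while leaving common fragment characters literal -- an assumption, not Google's actual behavior:

```python
from urllib.parse import quote

def escaped_fragment_url(url: str) -> str:
    """Sketch: replace #! with ?_escaped_fragment_= and
    percent-escape the fragment text. Which characters to leave
    unescaped (the `safe` argument) is a guess."""
    base, sep, fragment = url.partition("#!")
    if not sep:
        return url  # no #!; leave the URL alone
    # quote() always escapes '%'; `safe` lists characters kept literal.
    escaped = quote(fragment, safe="!$'()*+,/:;=?@-._~")
    joiner = "&" if "?" in base else "?"
    return base + joiner + "_escaped_fragment_=" + escaped

print(escaped_fragment_url("http://example.com/page#!key=value&x=a b"))
# -> http://example.com/page?_escaped_fragment_=key=value%26x=a%20b
```

Appending with & when the base URL already has a query string matches what the document implies for URLs that carry both query parameters and a #! fragment, though that too is an assumption here.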
(If I controlled a web site that logged incoming URLs, and that I could get Google to crawl, I could publish a page full of URLs with special characters after the #! and see what got escaped by inspecting the logged URLs containing ?_escaped_fragment_= -- but do I really have to set up an otherwise bogus web site to get an answer?)