Not solving the wrong problem

I like a great deal of what Google does for the open web. They sponsor standards work, they're working on an open source browser, and they're building documentation on the state of the web for web developers. It's all really great. Today they posted what they called A Proposal For Making AJAX Crawlable. It seems like a great idea. More and more of the web isn't reached by users clicking on a conventional <a href="http://..."> link, but by executing JavaScript that dynamically loads content off of the server. It's somewhere between really hard and impossible for web crawlers to fully and correctly index sites that work that way without the sites' developers taking crawlers into account.

Google's proposal is to define a convention for URLs that carry state information in the fragment (the part after the #), and a convention for retrieving the canonical, indexable contents of a URL that uses one. First, let me dismiss out of hand the suggestion that you make a headless browser available over HTTP to render your AJAX pages to HTML. If it's so easy for HtmlUnit to render your AJAX to HTML, surely Google can do it. And offering HtmlUnit as a web service on your server doesn't sound that secure or scalable to me.
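
For concreteness, here's a small sketch of the convention as I read Google's post: the AJAX state goes after a "#!" in the URL, and the crawler asks your server for an escaped version of that URL and expects rendered HTML back. The function name below is mine, not part of the proposal, and the exact escaping rules are Google's to define.

      // a rough sketch of the proposed mapping, as I read it:
      // turn a "#!" URL into the "_escaped_fragment_" URL the
      // crawler would fetch from your server
      function crawlerUrlFor(ajaxUrl) {
        var parts = ajaxUrl.split('#!');
        if (parts.length < 2) return ajaxUrl;  // no AJAX state to escape
        return parts[0] + '?_escaped_fragment_=' + encodeURIComponent(parts[1]);
      }

      // crawlerUrlFor('http://www.example.com/#!/random/page')
      //   => 'http://www.example.com/?_escaped_fragment_=%2Frandom%2Fpage'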

The bigger question is this: if your solution requires the server to be able to serve the correct HTML for any state, would you come up with the same solution as Google? There is a simple, straightforward solution that works today and is used on sites all over the internet. If the content you serve includes the static, non-AJAX URLs in anchor HREFs but uses JS click handlers to do the AJAX loads, then crawlers can scrape all of your pages, users of modern browsers get the full shiny experience, and users on old mobile browsers that don't support JS get a site that still works, for free!

To do this you can either make your AJAX templates include onclick handlers or you can write a simple piece of JS to do the right thing when any link is clicked on. A contrived example using jQuery might look like:

      $(function() {
        $('body').click(function(event) {
          // find the link that was clicked, if any
          var href = $(event.target).closest('a').attr('href');
          if (!href) return;
          // don't try to AJAX absolute URLs
          if (href.match('^https?://')) return;
          // don't let the normal browser navigation operate
          event.preventDefault();
          // based on the link's href, decide what AJAX URL to load
          $('#ajaxframe').load('/load-fragment', {path: href});
          // update the URL bar
          document.location.hash = href;
        });
      });

This will intercept clicks on relative anchor tags and let your page JS do its AJAX magic. It doesn't require special conventions. If you build your site this way, you'll probably find that the state in your URL fragments is just the relative URL for the page on your site. So http://www.example.com/random/page and http://www.example.com/#/random/page have the same meaning. That turns out to be a pretty good convention. After all, aren't our URLs supposed to refer to resources anyway?
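
One loose end the snippet above doesn't cover: someone who arrives directly at a #/random/page URL needs that state loaded too. A minimal sketch, assuming the same hypothetical #ajaxframe element and /load-fragment endpoint as above:

      $(function() {
        // if the page was opened with a fragment like #/random/page,
        // load that state the same way a click would have
        var path = document.location.hash.replace(/^#/, '');
        if (path) {
          $('#ajaxframe').load('/load-fragment', {path: path});
        }
      });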