Ajax Pages Convert To Static Agreement For Crawlers

https://developers.google.com/webmasters/ajax-crawling/docs/specification

 

This document describes an agreement between web servers and search engine crawlers that allows for dynamically created content to be visible to crawlers. Google currently supports this agreement. The hope is that other search engines will also adopt this proposal.

Basic definitions

  • Web application: In this document, a web application is an AJAX-enabled, interactive web application.
  • State: While traditional static web sites consist of many pages, a more appropriate term for AJAX applications is “state”. An application consists of a number of states, where each state constitutes a specific user experience or a response to user input. Examples of states: For a mail application, states could be base state, inbox, compose, etc. For a chess application, states could be base state, start new game, but also current state x of the chessboard, including information about past moves, which player’s turn it is, and so forth. In an AJAX application, a state often corresponds to a URL with a hash fragment.
  • Hash fragments: Traditionally, hash fragments (that is, everything after # in the URL) have been used to indicate one portion of a static HTML document. By contrast, AJAX applications often use hash fragments in another function, namely to indicate state. For example, when a user navigates to the URL http://www.example.com/ajax.html#key1=value1&key2=value2, the AJAX application will parse the hash fragment and move the application to the “key1=value1&key2=value2” state. This is similar in spirit to moving to a portion of a static document, that is, the traditional use of hash fragments. History (the back button) in AJAX applications is generally handled with these hash fragments as well. Why are hash fragments used in this way? While the same effect could often be achieved with query parameters (for example, ?key1=value1&key2=value2), hash fragments have the advantage that in and of themselves, they do not incur an HTTP request and thus no round-trip from the browser to the server and back. In other words, when navigating from www.example.com/ajax.html to www.example.com/ajax.html#key1=value1&key2=value2, the web application moves to the state key1=value1&key2=value2 without a full page refresh. As such, hash fragments are an important tool in making AJAX applications fast and responsive. Importantly, however, hash fragments are not part of HTTP requests (and as a result they are not sent to the server), which is why our approach must handle them in a new way. See RFC 3986 for more details on hash fragments.
  • Query parameters: Query parameters (for example, ?s=value in the URL) are used by web sites and applications to post to or obtain information from the server. They incur a server round-trip and full page reload. In other words, navigating from www.example.com to www.example.com?s=value is handled by an HTTP request to the server and a full page reload. See RFC 3986 for more details. Query parameters are routinely used in AJAX applications as well.
  • HTML snapshot: An HTML snapshot is the serialization of the DOM the browser will produce when loading the page, including executing any JavaScript that is needed to get the initial page.
  • Pretty URL: Any URL containing a hash fragment beginning with !, for example, www.example.com?myquery#!key1=value1&key2=value2
  • Ugly URL: Any URL containing a query parameter with the key _escaped_fragment_, for example, www.example.com?myquery&_escaped_fragment_=key1=value1%26key2=value2.

Bidirectional mapping between #! URLs and _escaped_fragment_ URLs

A bidirectional mapping exists between pretty and ugly URLs:

?_escaped_fragment_=key1=value1%26key2=value2: used for crawling only, indicates an indexable AJAX app state
#!key1=value1&key2=value2: used for normal (browser) web site interaction

 

Mapping from #! to _escaped_fragment_ format

Each URL that contains a hash fragment beginning with the exclamation mark is considered a #! URL. Note that any URL may contain at most one hash fragment. Each pretty (#!) URL has a corresponding ugly (_escaped_fragment_) URL, which is derived with the following steps:

  1. The hash fragment becomes part of the query parameters.
  2. The hash fragment is indicated in the query parameters by preceding it with _escaped_fragment_=
  3. Some characters are escaped when the hash fragment becomes part of the query parameters. These characters are listed below.
  4. All other parts of the URL (host, port, path, existing query parameters, and so on) remain unchanged.

Mapping from _escaped_fragment_ format to #! format

Any URL whose query parameters contain the special token _escaped_fragment_ as the last query parameter is considered an _escaped_fragment_ URL. Further, there must only be one _escaped_fragment_ in the URL, and it must be the last query parameter. The corresponding #! URL can be derived with the following steps:

  1. Remove from the URL all tokens beginning with _escaped_fragment_= (Note especially that the = must be removed as well).
  2. Remove from the URL the trailing ? or & (depending on whether the URL had query parameters other than _escaped_fragment_).
  3. Add to the URL the tokens #!.
  4. Add to the URL all tokens after _escaped_fragment_= after unescaping them.

Note: As is explained below, there is a special syntax for pages without hash fragments, but that still contain dynamic Ajax content. For those pages, to map from the _escaped_fragment_ URL to the original URL, omit steps 3 and 4 above.
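
The four steps above, together with the special case for pages without hash fragments, can be sketched in JavaScript as follows. This is only an illustration: the function name uglyToPretty is hypothetical, and decodeURIComponent is assumed to be an adequate unescaping step for the characters the scheme escapes.

```javascript
// Sketch: map an _escaped_fragment_ (ugly) URL back to its pretty URL.
// Assumes _escaped_fragment_ is the last query parameter, as the scheme requires.
function uglyToPretty(uglyUrl) {
  const token = '_escaped_fragment_=';
  const index = uglyUrl.lastIndexOf(token);
  const separator = index > 0 ? uglyUrl[index - 1] : '';
  if (index === -1 || (separator !== '?' && separator !== '&')) {
    return uglyUrl; // not an _escaped_fragment_ URL, leave it untouched
  }
  // Steps 1 and 2: drop the token and the '?' or '&' in front of it.
  const base = uglyUrl.slice(0, index - 1);
  const escaped = uglyUrl.slice(index + token.length);
  // Special case: an empty value means the page opted in through the meta tag,
  // so there is no hash fragment to restore and steps 3 and 4 are skipped.
  if (escaped === '') {
    return base;
  }
  // Steps 3 and 4: append '#!' followed by the unescaped fragment.
  return base + '#!' + decodeURIComponent(escaped);
}
```

Note that decodeURIComponent throws on malformed escape sequences, so a real server would probably want to catch that error and respond accordingly.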

Escaping characters in the bidirectional mapping

The following characters will be escaped when moving the hash fragment string to the query parameters of the URL, and must be unescaped by the web server to obtain the original URL:

  • %00..20
  • %23
  • %25..26
  • %2B
  • %7F..FF

Control characters (0x00..1F and 0x7F) should be avoided. Non-ASCII text will be converted to UTF-8 before escaping.
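
As a rough sketch of how the escape set above combines with the mapping from #! to _escaped_fragment_ format, the following JavaScript escapes exactly the listed byte ranges after converting the fragment to UTF-8. The function names escapeFragment and prettyToUgly are hypothetical, and the code assumes a runtime that provides TextEncoder (current browsers and Node.js do).

```javascript
// Sketch: percent-escape the bytes listed above (%00..20, %23, %25..26, %2B, %7F..FF).
function escapeFragment(fragment) {
  const bytes = new TextEncoder().encode(fragment); // non-ASCII text becomes UTF-8 first
  let out = '';
  for (const b of bytes) {
    const mustEscape =
      b <= 0x20 || b === 0x23 || b === 0x25 || b === 0x26 || b === 0x2b || b >= 0x7f;
    out += mustEscape
      ? '%' + b.toString(16).toUpperCase().padStart(2, '0')
      : String.fromCharCode(b);
  }
  return out;
}

// Sketch: derive the ugly URL for a pretty (#!) URL.
function prettyToUgly(prettyUrl) {
  const hashIndex = prettyUrl.indexOf('#');
  if (hashIndex === -1 || prettyUrl[hashIndex + 1] !== '!') {
    return prettyUrl; // no hash fragment beginning with '!'
  }
  const base = prettyUrl.slice(0, hashIndex);       // host, port, path, query stay unchanged
  const fragment = prettyUrl.slice(hashIndex + 2);  // hash fragment without the '!'
  const separator = base.includes('?') ? '&' : '?'; // '&' if query parameters already exist
  return base + separator + '_escaped_fragment_=' + escapeFragment(fragment);
}
```

For example, prettyToUgly('www.example.com?myquery#!key1=value1&key2=value2') yields www.example.com?myquery&_escaped_fragment_=key1=value1%26key2=value2, matching the ugly URL shown in the definitions.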

Role of the Search Engine Crawler

Transformation of URL

  1. URLs of the format domain[:port]/path#!hashfragment, for example, www.example.com#!key1=value1&key2=value2 are temporarily transformed into domain[:port]/path?_escaped_fragment_=hashfragment, such as www.example.com?_escaped_fragment_=key1=value1%26key2=value2. In other words, a hash fragment beginning with an exclamation mark (‘!’) is turned into a query parameter. We refer to the former as “pretty URLs” and to the latter as “ugly URLs”.
  2. URLs of the format domain[:port]/path?queryparams#!hashfragment (for example, www.example.com?user=userid#!key1=value1&key2=value2) are temporarily transformed into domain[:port]/path?queryparams&_escaped_fragment_=hashfragment (for the above example, www.example.com?user=userid&_escaped_fragment_=key1=value1%26key2=value2). In other words, a hash fragment beginning with an exclamation mark (‘!’) is made part of the existing query parameters by adding a query parameter with the key “_escaped_fragment_” and the value of the hash fragment without the “!”. As in this case the URL already contains query parameters, the new query parameter is delimited from the existing ones with the standard delimiter ‘&’. We refer to the former #! as “pretty URLs” and to the latter _escaped_fragment_ URLs as “ugly URLs”.
  3. Some characters are escaped when making a hash fragment part of the query parameters. See the previous section for more information.
  4. If a page has no hash fragments, but contains <meta name="fragment" content="!"> in the <head> of the HTML, the crawler will transform the URL of this page from domain[:port]/path to domain[:port]/path?_escaped_fragment_= (or domain[:port]/path?queryparams to domain[:port]/path?queryparams&_escaped_fragment_=) and will then access the transformed URL. For example, if www.example.com contains <meta name="fragment" content="!"> in the head, the crawler will transform this URL into www.example.com?_escaped_fragment_= and fetch www.example.com?_escaped_fragment_= from the web server.

Request

The crawler agrees to request from the server ugly URLs of the format:

  • domain[:port]/path?_escaped_fragment_=hashfragment
  • domain[:port]/path?queryparams&_escaped_fragment_=hashfragment
  • domain[:port]/path?_escaped_fragment_=
  • domain[:port]/path?queryparams&_escaped_fragment_=

Search result

The search engine agrees to display in the search results the corresponding pretty URLs:

  • domain[:port]/path#!hashfragment
  • domain[:port]/path?queryparams#!hashfragment
  • domain[:port]/path
  • domain[:port]/path?queryparams

Role of the application and web server

Opting into the AJAX crawling scheme

The application must opt into the AJAX crawling scheme to notify the crawler to request ugly URLs. An application can opt in with either or both of the following:

  • Use #! in your site’s hash fragments.
  • Add a trigger to the head of the HTML of a page without a hash fragment (for example, your home page):
    <meta name="fragment" content="!">


Once the scheme is implemented, AJAX URLs containing hash fragments with #! are eligible to be crawled and indexed by the search engine.

Transformation of URL

In response to a request of a URL that contains _escaped_fragment_ (which should always be a request from a crawler), the server agrees to return an HTML snapshot of the corresponding pretty #! URL. See above for the mapping between _escaped_fragment_ (ugly) URLs and #! (pretty) URLs.

Serving the HTML snapshot corresponding to the dynamic page

In response to an _escaped_fragment_ URL, the origin server agrees to return to the crawler an HTML snapshot of the corresponding #! URL. The HTML snapshot must contain the same content as the dynamically created page.

HTML snapshots can be obtained in an offline process or dynamically in response to a crawler request. For a guide on producing an HTML snapshot, see the HTML snapshot section.
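
To make the server's side of the agreement more concrete, here is a minimal sketch of how a server might choose between the normal application and a snapshot produced in an offline process. Everything in it is an assumption for illustration: the snapshots map, its keys, the selectResponse function and the returned object shapes are hypothetical, and for brevity the lookup key ignores query parameters other than _escaped_fragment_.

```javascript
// Hypothetical store of snapshots rendered offline, keyed by path plus #! fragment.
const snapshots = new Map([
  ['/ajax.html#!key1=value1&key2=value2', '<html><body>State key1/key2</body></html>'],
  ['/ajax.html', '<html><body>Base state</body></html>'],
]);

// Sketch: decide what to serve for a request path and query string.
function selectResponse(pathAndQuery, snapshotStore) {
  // The base URL only exists so that the relative path can be parsed.
  const url = new URL(pathAndQuery, 'http://www.example.com');
  if (!url.searchParams.has('_escaped_fragment_')) {
    return { kind: 'application' }; // regular browser request, serve the AJAX app
  }
  // searchParams.get() already percent-decodes the value.
  const fragment = url.searchParams.get('_escaped_fragment_');
  // An empty value corresponds to a page that opted in through the meta tag.
  const key = fragment === '' ? url.pathname : url.pathname + '#!' + fragment;
  const html = snapshotStore.get(key);
  return html === undefined ? { kind: 'not-found' } : { kind: 'snapshot', html: html };
}
```

A server that produces snapshots dynamically would replace the map lookup with a call into whatever rendering process it uses.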

Pages without hash fragments

It may be impossible or undesirable for some pages to have hash fragments in their URLs. For this reason, this scheme has a special provision for such pages: in order to indicate that a page without a hash fragment should be crawled again in _escaped_fragment_ form, it is possible to embed a special meta tag into the head of its HTML. The syntax for this meta tag is as follows:

 

<meta name="fragment" content="!">

The following important restrictions apply:

  1. The meta tag may only appear in pages without hash fragments.
  2. Only “!” may appear in the content field.
  3. The meta tag must appear in the head of the document.

The crawler treats this meta tag as follows: If the page www.example.com contains the meta tag in its head, the crawler will retrieve the URL www.example.com?_escaped_fragment_=. It will index the content of the page as www.example.com and will display www.example.com in search results.

As noted above, the mapping from the _escaped_fragment_ to the #! syntax is slightly different in this case: to retrieve the original URL, the web server instead simply removes the tokens _escaped_fragment_= (note the =) from the URL. In other words, you want to end up with the URL www.example.com instead of www.example.com#!.

Warning: Should the content for www.example.com?_escaped_fragment_= return a 404 code, no content will be indexed for www.example.com! So, be careful if you add this meta tag to your page and make sure an HTML snapshot is returned.

In order to crawl your site’s URLs, a crawler must be able to find them. Here are two common ways to accomplish this:

  1. Hyperlinks: An HTML page or an HTML snapshot can contain hyperlinks to pretty URLs, that is, URLs containing #! hash fragments. Note: The crawler will not follow links extracted from HTML that contain _escaped_fragment_.
  2. Sitemap: Pretty URLs may be listed in Sitemaps. For more information on Sitemaps, please see www.sitemaps.org.

Backward compatibility to current practice

Current practices will still be supported. Hijax remains a valid solution, as we describe here. Giving the crawler access to static content remains the main goal.

Existing uses of #!

A few web pages already use exclamation marks as the first character in a hash fragment. Because hash fragments are not a part of the URL that is sent to a server, such URLs have never been crawled. In other words, such URLs are not currently in the search index.

Under the new scheme, they can be crawled. In other words, a crawler will map each #! URL to its corresponding _escaped_fragment_ URL and request this URL from the web server. Because the site uses the pretty URL syntax (that is, #! hash fragments), the crawler will assume that the site has opted into the AJAX crawling scheme. This can cause problems, because the crawler will not get any meaningful content for these URLs if the web server does not return an HTML snapshot.

There are two options:

  1. The site adopts the AJAX crawling scheme and returns HTML snapshots.
  2. If this is not desired, it is possible to opt out of the scheme by adding a directive to the robots.txt file: Disallow: /*_escaped_fragment_

Changing the browser-URL without refreshing page

An often overlooked feature of HTML5 is the new “onpopstate” event.

This new feature offers you a way to change the URL displayed in the browser* through javascript without reloading the page. It will also create a back-button event and you even have a state object you can interact with.

This means you won’t have to use the hash-hack anymore if you want to add state to your AJAX-application, and search engines will be able to index your pages too.

So how does it work? Well, it’s fairly simple. In Chrome you write:

window.history.pushState("object or string", "Title", "/new-url");

Executing this line of code will change the URL to my-domain.com/new-url (3rd option). The “Title” string (2nd option) is intended to describe the new state, and will not change the title of the document as one might otherwise expect. The W3 documentation states:

“Titles associated with session history entries need not have any relation with the current title of the Document. The title of a session history entry is intended to explain the state of the document at that point, so that the user can navigate the document’s history.”

So if you want the document’s title to change to match the title of the history entry, you’ll need to write a hook for that (hint: just tie a function to the onpopstate event). Finally, “object or string” (1st option) is a way to pass an object to the state which you can then use to manipulate the page.
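
A hook of that kind could look roughly like the sketch below. The helper names pushStateWithTitle and installTitleHook are made up for this example, and the window object is passed in as the win parameter only so that the functions can also be exercised against a stand-in object outside a browser.

```javascript
// Sketch: push a new history entry and keep the document title in sync with it.
function pushStateWithTitle(win, title, url) {
  // Store the title in the state object so it can be restored later.
  win.history.pushState({ title: title }, title, url);
  win.document.title = title;
}

// Sketch: restore the stored title whenever the user moves through the history.
function installTitleHook(win) {
  win.addEventListener('popstate', function (event) {
    // event.state is the object that was passed to pushState for this entry;
    // it is null for entries that were not created through pushState.
    if (event.state && event.state.title) {
      win.document.title = event.state.title;
    }
  });
}
```

In a page you would call installTitleHook(window) once and then pushStateWithTitle(window, "Inbox", "/inbox") for each new state.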

You can programmatically invoke the back-function by running:

window.history.back();

And you can of course go forward too:

window.history.forward();

Or even move a given number of entries relative to the current one (here, two steps forward):

window.history.go(2);

The object you pass as the first option to the pushState function will stay with each state, so if you go back in the history, you’ll get the object for that state. If you need to manipulate a state (instead of creating a new one) you can use:

window.history.replaceState("object or string", "Title", "/another-new-url");

Note that while this will change the URL of the page, it will not allow the user to click the back-button to go back to the previous state because you’re replacing the current state, not adding a new one. So, this is the correct behaviour.

Personally, I think the URL should be the first parameter and then the two other options should be optional. Regardless, this feature will certainly come in handy when working with AJAX- and Flash-applications that need state (read: bookmarkable pages and back-button support). Anyone looking to make their Flash- or AJAX-application indexable by search engines so they will get better ranking in Google and the likes, should also have a look at this new feature.

The most prominent implementation of this HTML5-feature that I’ve seen is in the new Flickr layout. Here’s an example page (remember to enable the new layout if you haven’t already). Now, if you’re using the latest version of Chrome or Safari and click one of the sets, e.g. “Strobist”, it will slide open and the URL will change but you’ll notice that the page doesn’t reload.

It’s worth noting that Flickr uses replaceState instead of pushState – in other words, they don’t add a back-button event. I’m guessing they feel that switching back and forth between opened/closed sets is too small a change for a back-button event (I’d certainly agree with them on that decision), so instead they just replace the URL so if you copy/paste the link to a friend, they’ll see the exact same page that you did.

Another interesting thing is how Flickr still uses the old hash-hack as a fallback if you’re running on browsers that don’t support this new HTML5-feature. I predict/hope that a lot of the plugins that help you easily implement the hash-hack will bake this into their core so people with new browsers can start reaping the benefits.
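
I don’t know how Flickr implements its fallback, but the general idea of feature detection with a hash fallback can be sketched like this. The function name setAppUrl and its return values are invented for the example, and win stands for the browser’s window object, passed in so the sketch can also run against a stand-in.

```javascript
// Sketch: update the visible URL with the History API when it exists,
// otherwise fall back to a #! hash fragment.
function setAppUrl(win, path) {
  const supportsHistoryApi = Boolean(
    win.history && typeof win.history.replaceState === 'function'
  );
  if (supportsHistoryApi) {
    // Replace the current entry, so no extra back-button step is created.
    win.history.replaceState(null, '', path);
    return 'history';
  }
  // Older browsers: encode the state in the hash fragment instead.
  win.location.hash = '#!' + path;
  return 'hash';
}
```

Swapping replaceState for pushState in the first branch would add a back-button entry for each change instead.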

The latest versions of Chrome and Safari already have support for “onpopstate” and Firefox 4 will have support for it as well. Unfortunately, it seems like IE9 won’t be supporting this feature if we are to believe this Wikipedia article (“Trident” is IE’s layout engine).

Check out the W3 specification for more info.

* For security reasons, you can only change the path of the URL, not the domain itself. So you can change anything in the URL after my-domain.com/[change-the-stuff-here.html].


 

https://github.com/browserstate/history.js

History.js gracefully supports the HTML5 History/State APIs (pushState, replaceState, onPopState) in all browsers, including continued support for data, titles and replaceState. It supports jQuery, MooTools and Prototype. For HTML5 browsers this means that you can modify the URL directly, without needing to use hashes anymore.


 

http://www.tinywall.info/2012/02/change-browser-url-without-page-reload-refresh-with-ajax-request-using-javascript-html5-history-api-php-jquery-like-facebook-github-navigation-menu.html

When you are working with ajax, the problem is that after you have loaded some content using ajax, you can’t change the URL of the browser according to the content. Because of this, reloading the page causes the new ajax content to disappear and it shows the previous page. You can resolve this problem by putting a hash tag in the URL, but using a hash tag in the URL for navigation isn’t SEO friendly.

Have you ever wondered why, when you are working with Facebook or Github in an HTML5 supported browser and click on the links, the content is loaded into the page using ajax and at the same time the URL changes in the browser according to the specific page, but without a hash tag in the URL?

This makes use of the HTML5 History API to change the browser URL without refreshing the page.

Consider a page that has the following links to three menu items and a div to display the ajax content.


<div id="menu">
<a href="menu1.php" rel="tab">menu1</a> |
<a href="menu2.php" rel="tab">menu2</a> |
<a href="menu3.php" rel="tab">menu3</a>
</div>

To override the default action for the link (anchor tag), use the following jQuery code snippet.


$(function(){
    $("a[rel='tab']").click(function(e){
        //code for the link action
        return false;
    });
});

Now to get the ajax content and display it and change the browser URL to the specific location without refresh use the following code.


$(function(){
    $("a[rel='tab']").click(function(e){
        //e.preventDefault();
        /*
        if the above line is uncommented, browsers without HTML5 History support won't change the url but will display the ajax content;
        if it stays commented, browsers without HTML5 History support will reload the page to the specified link.
        */

        //get the link location that was clicked
        var pageurl = $(this).attr('href');

        //get the ajax content and display it in the div with id 'content'
        $.ajax({url: pageurl + '?rel=tab', success: function(data){
            $('#content').html(data);
        }});

        //change the browser URL to the given link location;
        //this.href is the resolved absolute URL, so it can be compared with the current location
        if(this.href !== window.location.href){
            window.history.pushState({path: pageurl}, '', pageurl);
        }
        //prevent the browser from navigating to the link location
        return false;
    });
});

With the HTML5 History API, the back button changes the URL but does not reload the ajax content on its own. So we need to handle the popstate event to get the ajax content without reloading the page.
To do this, add the following code snippet to the page.


/* the code below handles the back button so that the ajax content is fetched without a page reload */
$(window).bind('popstate', function() {
    $.ajax({url: location.pathname + '?rel=tab', success: function(data){
        $('#content').html(data);
    }});
});

In browsers that do not support the HTML5 History API, those links will reload the page to the specific location. But if it’s supported, you are lucky; it will get only the required content using ajax and display it without reloading the entire page.
