Today I continue my discussion of Google’s Webmaster Guidelines and what they mean for making sure your site is search engine (in this case Google) friendly. While the guidelines discussed in the previous newsletter were aimed at how your site is constructed, the guidelines we’ll be discussing today are geared more toward search engine optimization issues. The guidelines, as presented by Google, appear in bold, with my comments following.
What the average person does not usually know is that search engine spiders do not necessarily see the same things a visitor sees. When you view a site through a browser, you see the page’s code rendered into a visible format. The search engines see only the code itself and index what they judge to be relevant content. Elements that search engines cannot read are not indexed and therefore do not count toward your site’s relevance for your important keywords.
Using a text browser such as Lynx, you can see your site much as the search engines do. If something important does not show up in the text browser, you may want to consider other ways to present that information to the visitor.
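If you don’t have a text browser handy, you can approximate the same check yourself. The sketch below (the page content is made up for illustration) uses Python’s standard `html.parser` to pull out only the readable text of a page, skipping scripts and ignoring anything locked inside images, much as a spider does:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text a crawler can read, skipping <script>/<style>."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip = 0  # depth inside <script> or <style> blocks

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip and data.strip():
            self.parts.append(data.strip())

# A made-up page: note the image carries text a spider cannot see.
html_page = """
<html><head><title>Widgets Inc.</title>
<script>var tracking = 'invisible to spiders';</script></head>
<body><h1>Quality Widgets</h1><img src="hero.jpg" alt="">
<p>Our widgets ship worldwide.</p></body></html>
"""

parser = TextExtractor()
parser.feed(html_page)
print(parser.parts)
# ['Widgets Inc.', 'Quality Widgets', 'Our widgets ship worldwide.']
```

Whatever headline text lives only inside `hero.jpg` never appears in the output, which is exactly the kind of gap a text-browser check is meant to reveal.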
Allow search bots to crawl your sites without session IDs or arguments that track their path through the site. These techniques are useful for tracking individual user behavior, but the access pattern of bots is entirely different. Using these techniques may result in incomplete indexing of your site, as bots may not be able to eliminate URLs that look different but actually point to the same page.
When session IDs are required for viewing a site, each visitor is given a unique URL even when viewing the exact same page. Search engines index sites by going from one URL to another. If the search engine is given a session ID when indexing your site, the next time it visits it will get an entirely different session ID, causing the same “page” to be fed to the search engine through a different URL. This can cause search engines to believe that you have many duplicate pages on your site. That can be a big problem, as search engines penalize sites that feed them multiple pages with the same content.
Giving users session IDs is fine, and often necessary; however, you’ll want to make sure that the search engines are not forced to use them.
Make sure your web server supports the If-Modified-Since HTTP header. This feature allows your web server to tell Google whether your content has changed since we last crawled your site. Supporting this feature saves you bandwidth and overhead.
Depending on your hosting agreement, if you exceed your allotted bandwidth your site may be shut down, preventing any visitors or spiders from seeing it, and/or you may be charged additional fees for the overage. If your server can tell the search spider that a page has not been updated since its last visit, the spider can simply move on and spend its time on pages with new, changed, or recently updated content, rather than re-downloading information it already has.
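Most web servers handle If-Modified-Since for static files automatically, but the logic itself is simple. Here is a minimal sketch of the decision a server makes (the `respond` helper is hypothetical, not part of any particular server):

```python
from email.utils import parsedate_to_datetime
from datetime import datetime, timezone

def respond(last_modified, if_modified_since=None):
    """Return the HTTP status code to send, given the page's actual
    modification time and the spider's If-Modified-Since header."""
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable header: fall through to a full response
        if since and last_modified <= since:
            return 304  # Not Modified: no body sent, spider moves on
    return 200          # send the full page (with a Last-Modified header)

page_time = datetime(2024, 1, 10, tzinfo=timezone.utc)
print(respond(page_time))                                   # 200, first crawl
print(respond(page_time, "Wed, 10 Jan 2024 00:00:00 GMT"))  # 304, unchanged
print(respond(page_time, "Mon, 01 Jan 2024 00:00:00 GMT"))  # 200, page is newer
```

The 304 response carries no page body at all, which is where the bandwidth savings come from.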
Make use of the robots.txt file on your web server. This file tells crawlers which directories can or cannot be crawled. Make sure it’s current for your site so that you don’t accidentally block the Googlebot crawler.
Robots.txt files are useful for blocking search engines from indexing files that you don’t want appearing in search results. If you are worried about somebody stealing your images, for example, you can block them from Google’s image search.
Unfortunately, this file only applies to search engine spiders, so blocking content from indexing won’t hide it from visitors who know how to look around and get access to it.
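You can verify a robots.txt file before deploying it. The rules below are only an example (the directories are made up), and Python’s standard `urllib.robotparser` lets you test which URLs a given bot may fetch:

```python
from urllib.robotparser import RobotFileParser

# Example rules: keep images out of Google's image search,
# and keep every bot out of an admin directory.
robots_txt = """\
User-agent: Googlebot-Image
Disallow: /images/

User-agent: *
Disallow: /admin/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("Googlebot-Image", "/images/photo.jpg"))  # False
print(rp.can_fetch("Googlebot", "/products.html"))           # True
print(rp.can_fetch("Googlebot", "/admin/login.html"))        # False
```

A check like this is a cheap safeguard against the accident the guideline warns about: a stray `Disallow: /` that locks Googlebot out of your entire site.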
If your company buys a content management system, make sure that the system can export your content so that search engine spiders can crawl your site.