Google's Webmaster Guidelines: Technical Guidelines

Today I continue my discussion of Google’s Webmaster Guidelines and what they mean for making sure your site is search engine (in this case Google) friendly. While the guidelines discussed in the previous newsletter were aimed at how your site is constructed, the guidelines we’ll be discussing today are geared more towards search engine optimization issues. The guidelines, as presented by Google, are in bold with my comments following.

Use a text browser such as Lynx to examine your site, because most search engine spiders see your site much as Lynx would. If fancy features such as Javascript, cookies, session ID’s, frames, DHTML, or Flash keep you from seeing all of your site in a text browser, then search engine spiders may have trouble crawling your site.

What the average person does not usually know is that the search engine spiders do not necessarily see the same things that the visitor sees. When viewing a site through a browser, you see a rendering of the source code in a visible format. The search engines simply see the code and then index what they feel is relevant content. Coded elements that search engines cannot read are not indexed and therefore do not count toward site relevance for your important keywords.

Using a text browser or Lynx viewer, you can see your site as the search engines do. If something is important but does not show up in the text browser, you may want to consider other ways to present that information to the viewer.

Many common page design elements are either completely or partially unviewable by the search engines, including JavaScript menu systems, framed pages, dynamic HTML and Flash. Limiting these elements, or using them in a way that still leaves the important content readable as plain text, will help make sure your site and its important content are available for the search engine to index and analyze.
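If you do not have Lynx handy, a rough idea of the text-only view can be had with a few lines of Python. This is only a sketch under stated assumptions: it uses the standard library's HTML parser, it is not how Lynx or any search engine spider actually works, and the sample page is made up for illustration. It simply throws away anything inside script and style tags and keeps the remaining text.

```python
from html.parser import HTMLParser


class TextOnlyView(HTMLParser):
    """Collects the text a text-only client might show, skipping script/style."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def text_only_view(html):
    """Return the readable text of an HTML string as one line."""
    parser = TextOnlyView()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page: the menu is written by JavaScript, the rest is plain HTML.
sample = (
    "<html><body><h1>Blue Widgets</h1>"
    "<script>document.write('Shop menu');</script>"
    "<p>Hand-made widgets.</p></body></html>"
)
print(text_only_view(sample))  # prints: Blue Widgets Hand-made widgets.
```

Notice that the JavaScript menu never appears in the output. Anything on your own pages that disappears this way deserves a plain HTML alternative.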

Allow search bots to crawl your sites without session ID’s or arguments that track their path through the site. These techniques are useful for tracking individual user behavior, but the access pattern of bots is entirely different. Using these techniques may result in incomplete indexing of your site, as bots may not be able to eliminate URLs that look different but actually point to the same page.

When session IDs are required for viewing a site, each visitor is given a unique URL even when viewing the exact same page. Search engines index sites by going from one URL to another. If the search engine is given a session ID when indexing your site, the next time it visits it will get an entirely different session ID, causing the same “page” to be fed to the search engine through a different URL. This can cause search engines to believe that you have many duplicate pages on your site. That can be a big problem, as search engines penalize sites that feed them multiple pages with the same content.

Assigning session IDs to users is OK and oftentimes necessary; however, you’ll want to make sure that the search engines are not forced to use them as well.
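To illustrate the duplicate URL problem, here is a small Python sketch that strips a session ID from a URL so that two visits to the same page end up with the same address. The parameter names in SESSION_PARAMS and the example.com URLs are assumptions for illustration only; check which names your own software actually uses.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed session parameter names; yours may differ.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}


def strip_session_id(url):
    """Return the URL with any session ID query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# The same page, as a spider might receive it on two separate visits.
first_visit = "http://www.example.com/products.php?id=7&PHPSESSID=a1b2c3"
second_visit = "http://www.example.com/products.php?id=7&PHPSESSID=z9y8x7"

print(strip_session_id(first_visit))
# prints: http://www.example.com/products.php?id=7
print(strip_session_id(first_visit) == strip_session_id(second_visit))
# prints: True
```

With the session ID in place, the two visits look like two different pages. With it removed, they are clearly one.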

Make sure your web server supports the If-Modified-Since HTTP header. This feature allows your web server to tell Google whether your content has changed since we last crawled your site. Supporting this feature saves you bandwidth and overhead.

Most web hosts allot a certain amount of bandwidth to each web hosting account. Bandwidth is essentially the amount of traffic that you are allotted in a given timeframe, usually per month. Each time a visitor comes to your site and downloads code, images, JavaScript, text, video, Flash and anything else your web site presents to them, it uses a portion of your bandwidth. The larger the files and the more files downloaded, the more bandwidth is used. When search spiders visit your site they are using bandwidth as well.

Depending on your web host agreement, if you go over your allotted bandwidth your site gets shut down, so that no viewers or spiders can see it, and/or you get charged additional fees for the overage. If your server can communicate with the search spider and tell it that a page has not been updated since its last visit, the spider can simply move on to other pages of your site and get to pages with new, changed or recently updated content, rather than spending time getting information it already has.
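The decision the server makes can be sketched in a few lines of Python. This is a simplified illustration, not how any particular web server implements it: the function name conditional_status and the dates are made up, and a real server handles more cases. The idea is that the spider sends the date of its last visit in the If-Modified-Since header, and the server answers 304 (Not Modified, with no page body) if the page has not changed since then, or 200 with the full page otherwise.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def conditional_status(if_modified_since, last_modified):
    """Return 304 if the page is unchanged since the date sent, else 200.

    if_modified_since: raw header value from the request, or None.
    last_modified: timezone-aware datetime of the page's last change.
    """
    if not if_modified_since:
        return 200  # no header sent: send the full page
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unreadable date: play it safe and send the page
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates only have one-second resolution.
    unchanged = last_modified.replace(microsecond=0) <= since
    return 304 if unchanged else 200


# Hypothetical page last edited on 1 March 2008 at noon GMT.
page_changed = datetime(2008, 3, 1, 12, 0, 0, tzinfo=timezone.utc)

print(conditional_status("Sat, 01 Mar 2008 12:00:00 GMT", page_changed))  # 304
print(conditional_status("Fri, 29 Feb 2008 12:00:00 GMT", page_changed))  # 200
print(conditional_status(None, page_changed))                             # 200
```

A 304 response carries no page content, which is where the bandwidth savings come from.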

Make use of the robots.txt file on your web server. This file tells crawlers which directories can or cannot be crawled. Make sure it’s current for your site so that you don’t accidentally block the Googlebot crawler.

Robots.txt files are useful for blocking search engines from indexing files that you simply don’t want made public on the internet. If you are worried about somebody stealing your images, you can block them from Google’s image search.

Unfortunately, this file only applies to search engine spiders, so blocking content from indexing won’t keep it from visitors who know how to look around and get access to this information.
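One way to make sure your robots.txt does what you expect, and does not accidentally block Googlebot, is to test it before you upload it. The sketch below uses Python's standard urllib.robotparser module. The rules and the example.com paths are made up for illustration, and Python's matching may not be identical to Google's in every edge case, so treat the result as a sanity check rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep images out of image search,
# and keep every spider out of /private/.
rules = """\
User-agent: Googlebot-Image
Disallow: /images/

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

site = "http://www.example.com"

# The image spider is blocked from the images directory.
print(parser.can_fetch("Googlebot-Image", site + "/images/logo.png"))  # False

# The regular Googlebot can still crawl ordinary pages.
print(parser.can_fetch("Googlebot", site + "/products.html"))          # True

# Every spider is kept out of /private/.
print(parser.can_fetch("Googlebot", site + "/private/report.html"))    # False
```

If the second check ever prints False for a page you want indexed, the file is blocking more than you intended.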

If your company buys a content management system, make sure that the system can export your content so that search engine spiders can crawl your site.

I don’t have much personal experience with content management systems or how they operate, so I can’t provide too much depth on this particular guideline. However, with any system you use for your site (content management, shopping cart, JavaScript, etc.), it’s important that search engines are able to spider your content and crawl your site. With each system you implement on your site, you will do well to verify that it is search engine friendly. If not, find one that is.

Tagged As: Google Search & Marketing


Pole Position Marketing