E-Marketing Performance Blog

Create Infinite Page Duplication: Use URL Session IDs

There is no better way to create an infinite amount of duplicate content on your site than to force session IDs onto each visitor. Typically, session IDs are used for tracking a single visitor’s navigation path through the site, including adding or removing products from the shopping cart. They are great for tracking purposes, but really, really bad for search engines and inbound linking.

[Image: two example URLs that pull up the same page, each with a different session ID appended]

OK, first of all, that’s a bad URL shown above, but aside from that, tacked onto the end is the session ID. Both URLs pull up the same page, just opened via different browsing sessions. The bad stuff happens when the session IDs also get attached as the search engines come for a visit.

Since a new session ID is attached with each new visit, every time the search engines come around they are essentially fed all-new URLs. If you have only a ten-page site, the second time the search engines visit they add the “new” ten pages to the index, for a total of 20 pages. When they come around a third time they now have 30 pages in their index. Once they start analyzing these pages they find page after page after page of duplication.
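
To make the math concrete, here is a rough Python sketch (the example URL and the “sid” parameter name are made up purely for illustration, not taken from any real system) of how one page turns into many URLs, and how stripping the session parameter collapses them back into a single address:

    from urllib.parse import urlencode, urlparse, parse_qsl, urlunparse
    import uuid

    PAGE = "http://www.example.com/widgets.php?id=42"   # hypothetical page URL

    def with_session_id(url):
        # Simulate a server that tacks a fresh session ID onto every new visit.
        parts = urlparse(url)
        query = dict(parse_qsl(parts.query))
        query["sid"] = uuid.uuid4().hex                  # new ID per session
        return urlunparse(parts._replace(query=urlencode(query)))

    def strip_session_id(url, param="sid"):
        # "Canonicalize" a URL by dropping the session parameter.
        parts = urlparse(url)
        query = [(k, v) for k, v in parse_qsl(parts.query) if k != param]
        return urlunparse(parts._replace(query=urlencode(query)))

    visits = [with_session_id(PAGE) for _ in range(3)]
    print(len(set(visits)))                              # 3 "different" URLs...
    print(len({strip_session_id(u) for u in visits}))    # ...but only 1 real page

Three visits, three URLs, one page: that is the duplication the engines end up indexing.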

An additional problem arises as site visitors start bookmarking and linking to your site. Every link they add contains their very own session ID. The search engines follow that link to your site and now you’ve got another ten pages of duplication. If they follow another link to your site, that’s ten more. Starting to see where this is going? Essentially, you can turn a ten-page site into endless duplication.

Even with a small site you can see why the search engines would stop coming around. But if you have a site with hundreds or even thousands of products, you’ll find two things happen: 1) the search engines will stop spidering new pages because there is just too much duplication, and 2) they will start dropping pages out of the index altogether.

Now this is where my lack of programming skills shows. I know there are some systems that will withhold the session IDs from search engines. That still has the potential to create problems with inbound links. I can’t say for sure how search engines handle incoming links with session IDs in the URLs, even if those IDs get stripped once the engine hits the site. I would think the link value passes as if the ID isn’t there, but I don’t know.
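
Typically, the idea behind those systems is a simple user-agent check: if the visitor looks like a crawler, the site never issues a session ID in the first place. A rough sketch (the bot names below are just a small sample, not a complete or authoritative list):

    # Rough sketch of "withhold the session ID from search engines."
    KNOWN_BOTS = ("googlebot", "slurp", "msnbot", "teoma")   # sample names only

    def is_crawler(user_agent):
        ua = (user_agent or "").lower()
        return any(bot in ua for bot in KNOWN_BOTS)

    def should_issue_session_id(user_agent):
        # Human visitors get a session ID; crawlers get clean, stable URLs.
        return not is_crawler(user_agent)

    print(should_issue_session_id("Mozilla/5.0 (compatible; Googlebot/2.1)"))   # False
    print(should_issue_session_id("Mozilla/4.0 (compatible; MSIE 7.0)"))        # True

The obvious catch is that the list of crawler names has to be kept up to date, and this does nothing for inbound links that already carry a session ID.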

Like sex, the only guaranteed protection here is not to do it at all. There are alternate means of tracking users for whatever reason. Avoiding session IDs completely ensures that you don’t open yourself up to inadvertent site duplication.
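
Cookie-based sessions are the most common of those alternatives: the tracking information rides in an HTTP cookie instead of the URL, so every visitor, and every crawler, sees the same clean address. A minimal sketch, not tied to any particular shopping cart platform:

    import uuid
    from http.cookies import SimpleCookie

    SESSIONS = {}   # server-side store: session ID -> visitor data

    def start_session():
        sid = uuid.uuid4().hex
        SESSIONS[sid] = {"cart": []}
        return sid

    def session_cookie_header(sid):
        # Cookie approach: the session ID travels in a response header,
        # so the page URL itself never changes.
        cookie = SimpleCookie()
        cookie["sid"] = sid
        return cookie.output()                       # "Set-Cookie: sid=..."

    def url_with_session(base_url, sid):
        # URL approach: survives blocked cookies, but hands every visit
        # its own "unique" address (the duplication problem described above).
        return base_url + "?sid=" + sid

    sid = start_session()
    print(session_cookie_header(sid))
    print(url_with_session("http://www.example.com/cart.php", sid))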


Stoney G deGeyter

Stoney deGeyter is the author of The Best Damn Web Marketing Checklist, Period!. He is the founder and CEO of Pole Position Marketing, a web presence optimization firm whose pit crew has been velocitizing websites since 1998. In his free time Stoney gets involved in community services and ministries with his bride and his children. Read Stoney’s full bio.

4 Responses to Create Infinite Page Duplication: Use URL Session IDs

  1. Cameron says:

    Agree that it’s ideal to allow Google and other bots to index your site without session IDs, and they confirm this in their Webmaster Guidelines:

    “Allow search bots to crawl your sites without session IDs or arguments that track their path through the site. These techniques are useful for tracking individual user behavior, but the access pattern of bots is entirely different. Using these techniques may result in incomplete indexing of your site, as bots may not be able to eliminate URLs that look different but actually point to the same page.”

    The unfortunate problem with this is that a URL that contains variables like session IDs isn’t a “bad” URL. URLs, by design, are SUPPOSED to be able to contain variables. The URL is NOT just there to tell you where you are. Just ask Google—they use the URL to remember what you were searching for as you go through pages of search results. So clearly, using the URL for variables isn’t somehow inappropriate behavior.

    Websites are built on “stateless” technology, meaning that the web server is like a person with amnesia. It’s (purposely) designed not to remember anything, and each new page request between your browser and the web server is, to the web server, something brand new. For “flat” websites that don’t do anything but list text (and pictures), no problem. But virtually any web application—like a blog, a web store, webmail, or even a search engine—needs some other way to track who you are and what you were trying to do from page to page—your browsing “session”. You can just imagine how infuriating it would be if a shopping cart forgot who you were and what you’d put in your cart every time you went to a new page. Cookies are one way to “remember” you, but they often get blocked. The URL is another, more reliable way. (And as a website programmer, it’s rather infuriating to be told that using the URL for variables is bad when that’s what it’s there for.)

    That being said, SEO is not always compatible with how things “should” work, and getting rid of session IDs is not the first time I’ve had to do something the hard, convoluted way to keep the search engines happy.

    If your site doesn’t have a shopping cart or any other web application (if you don’t know, just ask yourself if there’s any reason your site needs to remember who the user is), then you probably don’t need session IDs and should eliminate them if you can. But chances are that if session IDs are showing up in your URLs, there’s a good reason for it. You can try disabling sessions for known bots and crawlers (ZenCart has a feature that does this, and a pretty comprehensive list of known crawlers), but that’s tricky and requires that you constantly keep up with all the search engine crawlers out there (useragentstring.com has a pretty good list). It’s probably easier to just avoid issuing a session ID until the user does something that requires it, like adding something to their shopping cart, which a search engine bot should never need (or be able) to do.
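
    A minimal sketch of that lazy-session idea (the Request and Response classes here are simplified stand-ins, not any real framework’s or cart’s API):

        import uuid

        class Request:
            # Simplified stand-in for a web framework's request object.
            def __init__(self, cookies=None):
                self.cookies = cookies or {}

        class Response:
            # Simplified stand-in for a web framework's response object.
            def __init__(self):
                self.cookies = {}
            def set_cookie(self, name, value):
                self.cookies[name] = value

        SESSIONS = {}   # server-side session store: sid -> session data

        def get_or_create_session(request, response):
            # Only stateful actions call this, so plain page views
            # (which is all a crawler ever does) never create a session.
            sid = request.cookies.get("sid")
            if sid not in SESSIONS:
                sid = uuid.uuid4().hex
                SESSIONS[sid] = {"cart": []}
                response.set_cookie("sid", sid)
            return SESSIONS[sid]

        def view_product(request, response, product_id):
            # Read-only page: no session ID is ever issued here.
            return "Product page for %s" % product_id

        def add_to_cart(request, response, product_id):
            # The first stateful action is the first moment a session exists.
            cart = get_or_create_session(request, response)["cart"]
            cart.append(product_id)
            return "%d item(s) in cart" % len(cart)

        req, res = Request(), Response()
        view_product(req, res, 42)      # no cookie set, nothing new for a bot to see
        add_to_cart(req, res, 42)       # a session is created only now
        print(res.cookies)              # {'sid': '...'}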

  2. Stoney deGeyter says:

    “(And as a website programmer, it’s rather infuriating to be told that using the URL for variables is bad when that’s what it’s there for.)”

    Cameron, great comments. Keep in mind that I write from the search engine marketing and usability perspective (failing at the latter more than succeeding, probably), not from a design and development perspective. When I say certain things are “bad” I mean it strictly in the sense of what helps or hinders online marketing efforts. I can name a hundred things that a developer can do that would be “bad” for SEO but very appealing to the user. Similarly, you could find a hundred things that we SEOs do that are “bad” from the design and development perspective.

    While we do try to keep the big picture in mind, it’s not always easy. That’s also why you’re one of my favorite developers to work with. You don’t let us get away with stuff we should know better than to do just in the name of good rankings. Where we come at things from different angles, we can often find brilliant compromises that give each side what it wants.

    “getting rid of session IDs is not the first time I’ve had to do something the hard, convoluted way to keep the search engines happy”

    This is the unfortunate side effect of search engines ruling the internet (and they do). They tell us not to do anything we wouldn’t do if search engines didn’t exist, but not only is that self-serving for them (they don’t want us spamming them), it’s also not logical. A lot of what we do is solely for the search engines.

    Of course, the argument is that if we make the site so it can be spidered and indexed by the search engines, then it helps visitors find and navigate the site too. That’s a valid and legitimate argument. Yeah, sometimes SEO requires some creativity in getting systems to work “properly”. The alternative is a site that looks good and functions great for visitors at the expense of being found by the search engines. It’s all about balance.

  3. Cameron says:

    Stoney, I agree entirely, and it sucks that Google has (just lately, it seems) put the people who market sites (you) and the people who create sites (me) into any sort of adversarial position.

    Google has over the years been a web designer’s best friend. Most often, they’ve compelled people to go back and do things the “right” way, correcting a lot of the design excesses of the ’90s, forcing people to create content-based sites, use HTML meaningfully and correctly, avoid gratuitous use of Flash, JavaScript and images, etc., all of which is refreshing, because I don’t have to be the one to harp on it. That, plus the fact that they’ve actively and vocally encouraged website owners to create sites for people and not to do anything special just for search engines (we’re just going to sit quietly in the audience, just pretend we’re not here), makes me, the webmaster, feel a great deal of affection for Google…I feel like we’re on the same team, with the same goals.

    But lately, that’s been changing. The rel=nofollow thing for instance…a proprietary HTML attribute invented by Google just for SEO? XML sitemaps just for search engines that people never see? And now, discouraging sites from using perfectly reasonable and useful query string variables like session IDs. This is one of the few times when what Google wants seriously impairs some normal, totally correct functionality of a website, encouraging people to design their sites to work around the idiosyncrasies of GoogleBot first, their human audience second. Tsk, tsk…I really hope we don’t return to the bad old days when people were employing all sorts of ridiculous (black hat SEO) tricks just for search engine visibility that have nothing whatsoever to do with creating a good and useful site for visitors…this time at the behest of Google.

    They should know better…and if they don’t want to be “evil”, then they need to fix GoogleBot so that it can tell that a page address with a query string like a session ID isn’t some spamful attempt to create duplicate content and artificially manipulate search engine ranking…any more than these two different URLs from Google that pull up the same exact results are “duplicate content”:

  4. Stoney deGeyter says:

    To be fair to Google, they are always working on improving their spidering capabilities and whatnot. Several years ago they could barely index dynamic sites; now they can without a problem. It’s not always about what Google wants, but about what Google is capable of. Variables on a URL can pull up different content, so the engines want to recognize when that’s the case. But sometimes they don’t get it right.

    We could leave it to Google to figure out which pages are dupes and not penalize any of them, but that requires making the engines think. They can, but they don’t always think correctly or give you the right result. If a page is duped, which one is the “correct” version? The search engine doesn’t know.

    Instead of waiting for the engines to figure things out, we SEOs proactively make it easier for them. This ensures we get better, more correct results more quickly.

    Many SEOs don’t use the XML sitemap, preferring to see how the engines spider the site naturally. Many also don’t use the nofollow tag; whether it’s a good tool or not is debated in the industry. We are experimenting with it ourselves, but there are some definite drawbacks to it.

    So while the engines get better and better, they also try to implement things that make their job easier. Sometimes that’s good, sometimes that’s bad. As SEOs, our job is to always try to make things easier for the engines, to help us get the results we want. It’s a balancing act.