There is little doubt that duplicate content on a site can be one of the biggest battles an SEO has to fight against. Too many content management systems are designed for and work great with content, but few SEO considerations are included in how that content is implemented throughout the website.
There are two kinds of duplicate content, onsite and offsite. Onsite duplication is content that is duplicated on two or more pages of your own site. Offsite duplication is when the content of one site is displayed on other websites. Onsite duplication is an issue you have control over while offsite duplication may be beyond your control. Both are problematic.
Why is Duplicate Content an Issue?
The best way to explain why duplicate content is bad is to first point out why unique content is good. Unique content sets you apart. It makes you different. It helps you stand out. Why? Because that content is unique to you and you alone.
When you use the same text to describe your products as the next guy, there is nothing that gives you an advantage over the next guy. When you have multiple URLs with the same information, word for word, there is nothing that makes one URL trump the other, and neither performs well.
Duplicate content essentially downgrades your content’s value. Search engines don’t want to send people to several pages that all say the same thing, so they look for content that is unique from everyone else. Unique content helps you compete with your competitors rather than with yourself.
When the search engines begin spidering your site, they pull the page’s content and put it in their index. If they start seeing page after page of duplicate content as they analyze the content of those pages, they decide to use their resources somewhere else. Perhaps on indexing unique pages on your competitors’ sites.
When you have internal site duplication, the self-competition is at its worst when you have particularly link-worthy content. Each duplicate URL of content may receive links, giving neither page the full value of the link juice pointing at that valuable content. When that content is located only on one URL, all links pointing to that content are consolidated onto a single page enhancing the authoritative value of the page as a whole.
Dealing With Offsite Duplicate Content Problems
Offsite duplicate content has two main sources you can blame: it’s either your fault or someone else’s! At its core, it is either content you stole or content someone stole from you. Whether it was taken legally, with permission or without, offsite duplicate content is likely keeping your site from performing better in the search engine rankings.
Content Scrapers and Thieves
The worst content theft offenders are those that scrape content from across the web and publish it on their own sites. The result is generally a Frankensteinian collection of content pieces that produce less of a coherent finished product than the green blockhead himself. Generally these pages are designed solely to attract visitors and get them to leave as quickly as possible by clicking on the ads scattered throughout the page. There isn’t much you can do about these types of content scrapers, and search engines are actively trying to recognize them for what they are in order to purge them from their indexes.
Not everyone stealing content does it by scraping. Some just flat out take something you have written and pass it off as their own. These sites are generally higher-quality sites than the scraper sites, but some of the content is, in fact, lifted from other sources without permission. This type of duplication is more harmful than scrapers because the sites are, for the most part, seen as quality sites and the content is likely garnering links. Should the stolen content produce more incoming links than your own content, you’re apt to be outranked by your own content!
For the most part, scrapers can be ignored; however, some egregious violators and thieves can be pursued through legal means or by filing a DMCA removal request.
In many cases content is published into distribution channels hoping to be picked up and republished on other websites. The value of this duplication is usually one or more links pointing to the author’s website. Much of the content I write for our E-Marketing Performance blog is duplicated on other blog sites. This is strategic duplication, and I have to weigh carefully the pros and cons.
Each time my articles are posted somewhere else, I get a link back to my site. These links are incredibly valuable. I also get much wider exposure than I do on my own blog, allowing me to expand my reach far beyond my own natural borders. By keeping this duplication to a minimum, rather than en masse, I’m not at risk of creating mass off-site duplication that tends to hurt sites the most.
The downside is that whole duplicate content thing. I am no longer the sole holder of my content, which means I am potentially taking traffic away from my site and driving it to these other blogs. In fact, since many of these sites have more authority than my own, they often come up first in the search results above my own site.
But this is a case where the pros outweigh the cons. At least for now. That may not always be the case.
The search engines make noise about finding the “canonical” version of such duplication to ensure the original content receives higher marks than the duplicate versions, but I have yet to see this play out in any kind of meaningful way. Years ago I asked a group of search engine engineers a question about this. My question was that if there are two pieces of identical content and the search engines clearly know which one came first, do links pointing to the duplicated version count as links to the original version?
It would be great if this were in fact the case. I’d be happy even if the search engines split the link juice 50/50 between the duplicate site and the original site. Of course, that would have to include social shares as well as links, but it is certainly something the search engines can do to reward original content over republished duplicate content, regardless of purposeful or nefarious intent.
Generic Product Descriptions
Some of the most common forms of duplicate content are through product descriptions. Thousands of sites on the web sell products, many of them the same or similar. Take for example any site selling books, CDs, DVDs or Blu-Ray discs. Each site basically has the same product library. Where do you suppose these sites get the product descriptions from? Most likely the movie studio, publisher, manufacturer or producer of the content. And since they all, ultimately, come from the same place, the descriptive content for these items is usually 100% identical.
Now multiply that across millions of different products and hundreds of thousands of websites selling those products. Unless each site takes the time to craft its own product descriptions, there’s enough duplicate content to go around the solar system several times.
So with all these thousands of sites using the same product information, how does a search engine differentiate between one or another when a search is performed? Well, first and foremost, the search engines want to produce unique content, so if you’re selling the same product but you write a unique and compelling product description, you have a greater chance of pushing your way higher in the search results.
But left with no other factors to explore, the search engines have to look to the website as a whole. In these instances, the weight of the site itself and the number and quality of backlinks tend to be the deciding factors. Given two sites with similar content, the one that is better known and has a larger user audience, a better backlink structure and stronger social reach is likely to trump the other.
Sites that provide unique product descriptions do have an advantage; however, unique content alone isn’t enough to outperform sites that have a strong historic and authoritative profile. But given a site of similar stature, unique content will almost always outperform duplicate content, providing the opportunity to grow into a stronger and stronger site. It takes time, but original content is the key to overcoming the pit of duplicate content despair.
Dealing with Onsite Duplicate Content Problems
The most problematic form of duplicate content, and the kind that you are most able to fight, is duplicate content on your own site. It’s one thing to fight a duplicate content battle with other sites that you do not control. It’s quite another to fight against your own internal duplicate content when, theoretically, you have the ability to fix it.
Duplicate onsite content generally stems from bad site architecture or, more precisely, bad website programming! When a site isn’t structured properly, all kinds of duplicate content problems surface, many of which can take some time to uncover and sort out.
Those who argue against good architecture usually cite Google propaganda about how Google can “figure out” these things and therefore can eliminate them from being an issue for your site. The problem with that scenario is it relies on Google figuring things out. Yes, Google can determine that some duplicate content shouldn’t be duplicate and the algorithms can take this into account when analyzing your site. But that’s no guarantee they will uncover it all or even apply the “fix” in the best way possible for your own site.
Just because your spouse is smart isn’t license for you to go around acting like a dumbass. And just because Google may or may not figure out your problems and may or may not apply the proper workarounds is no excuse for not fixing the problems you have. If Google fails, you’re screwed. So the less you make Google work for you, the better Google will work for you.
Here are some common in-site duplicate content issues and how to fix them.
The Problem: Product Categorization Duplication
Many sites use content management systems that allow you to organize products by categories. In doing so, a unique URL is created for each product in each specific category. The problem arises when a single product is found in multiple categories. The CMS, therefore, generates a unique URL for each category that product falls under.
I’ve seen sites like this create up to ten URLs for every single product page. This type of duplication poses a real problem for the engines. A 5,000 product site suddenly becomes a 50,000 product site. But as the search engines spider and analyze, they realize that they have 45,000 duplicate pages!
If there was ever a reason for the search engine spider to abandon your site while indexing pages, this is it. The duplication creates an unnecessary burden on the engines, causing them to expend their resources in more valuable territory and leaving you out of the search results for a large number of pages.
Below is a screenshot I took several years ago from The Home Depot’s website. I found a particular product by navigating down two different paths. A book like this could easily be tied to several different categories, each one producing a unique URL and, therefore, a duplicate page of content.
Keep in mind that even though the navigation path is different, all the content on the page is 100% identical, save perhaps for the actual breadcrumb trail displayed at the top of the page. If ten people linked to each of these pages while a competitor got the same ten links, but to a single URL, which one do you think would top the search results? You guessed it, the competitor!
The Solution: Master Categorization
An easy fix to this kind of duplication is to simply not allow any product to be found in more than one category. But that’s not exactly good for your shoppers, as it eliminates ways for this important product to be found by those who may need it if they are in an alternate category.
So, in keeping with the ability to tag products to multiple categories, there are a couple of options for preventing duplicate content. One is to manually create the URL path for each product. This can be time consuming and lead to some disorganization in your directory structure. The second is to place all products into the same directory, regardless of the product and navigational category system. I’m not a big fan of this as it somewhat destroys your overall site architecture and prevents categorization reinforcement with your product URLs.
In my opinion, the best solution is to have a master category assigned to each product. This master category will determine the URL of the product. So, for instance, the products below could be assigned to each category so the visitor can have multiple navigational paths to the product, but once they arrive the URL will be the same, regardless of how they found it.
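Here is a rough sketch of the master category idea in Python. It is not taken from any particular CMS; the `Product` class, the `product_url` and `category_listing` helpers, the `www.site.com` domain and the category paths are all made up for illustration. The point is only that the URL is built from the master category, no matter how many categories list the product.

```python
from dataclasses import dataclass, field


@dataclass
class Product:
    slug: str
    master_category: str  # the one category that determines the URL
    categories: list = field(default_factory=list)  # every category that lists the product


def product_url(product, base="http://www.site.com"):
    """Build the single product URL from the master category only."""
    return f"{base}/{product.master_category}/{product.slug}"


def category_listing(product, base="http://www.site.com"):
    """Map each navigational category to the link it should display."""
    return {category: product_url(product, base) for category in product.categories}


# Hypothetical product listed under two navigation paths.
book = Product(
    slug="123book",
    master_category="building-materials/landscaping/books",
    categories=[
        "building-materials/landscaping/books",
        "outdoors/garden-center/books",
    ],
)
links = category_listing(book)
for category, url in links.items():
    print(category, "->", url)
```

Both category listings print the same URL, so shoppers keep their multiple paths while links and search engines only ever see one address.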
Many programmers attempt to “fix” this problem by preventing the search engines from indexing all but a single URL of each product. While this does keep duplicate pages out of the search index, it doesn’t address the link splitting issue. So any link juice to a non-indexable URL is essentially lost, rather than helping that product rank better in the search results.
Band-Aid Solution: Canonical Tags
Some content management systems won’t allow for the solution presented above. If that’s the case you have two options: find a more search-friendly and robust CMS or implement a band-aid solution. Canonical tags are just that type of solution.
Canonical tags were developed by the search engines as a means to tell the engine which URL is the “correct” or canonical version. So, in our examples above, you choose which URL you want to be the canonical URL and then apply the canonical tag into the code of each and every other duplicate URL product page.
<link rel="canonical" href="http://www.thehomedepot.com/building-materials/landscaping/books/123book" />
In theory, when that tag is applied across all duplicate product URLs, the search engines will attribute any links pointing to the non-canonical URLs to the canonical. It should also keep the other URLs out of the search index, forwarding any internal link value to the canonical URL as well. But that’s only theoretical.
In reality, the search engines use this tag as a “signal” as to your intent and purpose. They will then choose to apply it as they see fit. You may or may not get all link juice passed to the correct page, and you may or may not keep non-canonical pages out of the index. Essentially, they’ll take your canonical tag into consideration.
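If your templates are generated by code, producing the tag is simple. The sketch below is only an illustration: `canonical_link_tag` and `tags_for_duplicates` are hypothetical helper names, and the second duplicate URL is invented. It builds the same tag shown above for every duplicate URL of a product.

```python
from html import escape


def canonical_link_tag(canonical_url):
    """Return the <link> element that points at the canonical URL."""
    return f'<link rel="canonical" href="{escape(canonical_url, quote=True)}" />'


def tags_for_duplicates(canonical_url, duplicate_urls):
    """Every duplicate URL gets the same tag, pointing at the one canonical URL."""
    tag = canonical_link_tag(canonical_url)
    return {url: tag for url in duplicate_urls}


canonical = "http://www.thehomedepot.com/building-materials/landscaping/books/123book"
duplicates = [
    # Hypothetical second navigation path to the same book.
    "http://www.thehomedepot.com/outdoors/garden-center/books/123book",
]
for url, tag in tags_for_duplicates(canonical, duplicates).items():
    print(url)
    print("  ", tag)
```

Remember that this only sends the signal. Whether the engines honor it is still up to them.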
The Problem: Product Summary Duplication
A common form of duplicate content is when short product description summaries are displayed throughout higher-level category pages. Let’s say you are looking for a Burton Snowboard. You click on the Burton link in the main navigation which produces a complete catalog of Burton products and product description snippets, as well as some sub-categories for filtering. You know you only want snowboards, so you select that sub-category – which produces a list of Burton snowboards – and then click on “snowboards.” This leads to a page with various Burton snowboards, each displaying a short product description snippet.
As you continue to navigate through the site, you make your way back to “all snowboards” and find snowboards of all brands along with product snippets – including the same snippets you’ve already seen twice for the Burton snowboards!
Category pages can be great pages to attain broad-level search engine rankings (i.e. searches for “burton snowboards”). However, most product category pages like this contain nothing more than product links, each with a short description summary that is duplicated on page after page of product categories. This leaves these pages almost completely valueless!
The Solution: Create Unique Content for All Pages
The goal is to make each product category page stand on its own as having valuable content and solutions for visitors. The simplest way to do that is to write a paragraph or more of unique content for each product category page. Use this opportunity to extol the virtues of Burton and Burton snowboards. Talk specifics that the visitor may not know about the products in general that may help them make a sound purchasing decision.
If you were to strip the category pages of all products, they should still remain pages worthy of being indexed by the search engines. At that point, the duplicate content snippets won’t matter, as the content of each page will hold its value despite them.
The Problem: Secure/Non-Secure URL Duplication
E-commerce sites that use secure checkout are prone to a duplicate content problem between the secure and non-secure portions of their website. The result is similar to the multiple URL issue above but with a slight twist. Instead of just the traditional product URL, the search engines also index a secure version of the same URL.
You can see that the key difference here is the “s” at the end of the “http.” That indicates that the URL is supposed to be secure. For the most part, products do not need to be secure. The only pages that need to be secure are those that require sensitive information.
This type of duplication generally happens when visitors move from the non-secure portion of the site to the secure shopping cart, but before they check out, they move back out and continue shopping. The duplicate issue is created specifically when the links out of the secure shopping cart contain the “https” instead of the “http” in the links.
The Solution: Use Absolute Links
I believe it’s a good idea to link items in a shopper’s cart back to their product pages. However, it’s a tendency of web developers to use relative links rather than absolute links for internal URLs.
For those who don’t know the difference, an absolute link contains the full URL, including the “http://www.site.com,” while a relative link will contain only the information that is required for the browser to find the page (i.e. everything after the “.com”).
Once a shopper is in the secure part of the cart, any and all relative links will automatically link to “https” pages because that part of the URL is assumed based on where the visitor currently is. Using absolute links to point back to your products is required. This forces the visitor to move from “https” back to “http” and doesn’t allow a secure URL to be visited by the shopper or the search engines.
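To see why relative links inherit the secure protocol, here is a small sketch using `urljoin` from Python’s standard `urllib.parse` module, which resolves a link against the current page the same way a browser does. The cart and product URLs are invented for illustration.

```python
from urllib.parse import urljoin

# Hypothetical secure cart page the shopper is currently on.
cart_page = "https://www.site.com/cart/view"

# A relative link picks up the protocol and host of the current page.
relative_result = urljoin(cart_page, "/products/burton-snowboard")

# An absolute link carries its own protocol, so the shopper leaves "https".
absolute_result = urljoin(cart_page, "http://www.site.com/products/burton-snowboard")

print(relative_result)  # https://www.site.com/products/burton-snowboard
print(absolute_result)  # http://www.site.com/products/burton-snowboard
```

The relative link produces the secure duplicate of the product page; the absolute link does not.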
At this point you may be wondering why anyone would use relative links at all. Back before content management systems, pages were coded by hand and actual files were created on the server for each page. This is still true of many sites today. During routine maintenance and site structure changes, page files would be moved around for better organization. Programs such as Adobe Dreamweaver and Microsoft FrontPage allowed you to move files around, and the relative links would change automatically as you did so. This prevented broken links. When absolute links were used, each link had to be changed manually.
Relative links became the de facto type of link to use for this purpose. However, I’m an advocate of absolute links, especially for site navigation and, more importantly, for shopping cart product links. The image below illustrates how you should be linking to and from your shopping cart.
Best case scenario is not to allow search engines into a shopping cart area, period. These URLs and pages should be blocked 100%. But even blocking these URLs is not enough. If a visitor moves from the blocked pages to a duplicate (secure) unblocked product page, that page can get picked up by Google’s index. Using absolute links back to the product page prevents these pages from being navigated to or indexed.
The Problem: Session ID Duplication
Session IDs create some of the worst duplicate content violations imaginable. Session IDs were created as a way to track visitors through a site and allow them to add products to a shopping cart, ensuring it was attached to them and them alone.
With every visit to a site, a unique ID number is appended to the URL, unique for that particular visitor.
That session number follows them through the site and is attached to every URL they visit on the site. Get out your calculators, because we’re gonna do some serious math. Assume you have a 50 page site. Each visitor gets a session ID attached so you have 50 unique URLs per visitor. Assuming you have 50 visitors per day, your 50 page site now has 2,500 unique, indexable URLs per day. Multiply that by 365 days in a year, and you’re looking at almost one million unique URLs, all for an itty-bitty 50 page site!
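For anyone who would rather not dig out the calculator, here is the same math as a short Python sketch. The second half shows how URLs that differ only by a session ID collapse back into one real page once the ID is removed. The `sessionid` parameter name and the `without_session_id` helper are assumptions for illustration; your system may put the ID somewhere else in the URL.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

PAGES = 50
VISITORS_PER_DAY = 50
DAYS_PER_YEAR = 365

urls_per_day = PAGES * VISITORS_PER_DAY        # 2,500
urls_per_year = urls_per_day * DAYS_PER_YEAR   # 912,500, almost one million


def without_session_id(url, param="sessionid"):
    """Drop the (assumed) session ID query parameter and keep everything else."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != param]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# One page as seen by 50 different visitors in a day.
crawled = [f"http://www.site.com/page?sessionid={visitor}" for visitor in range(VISITORS_PER_DAY)]
real_pages = {without_session_id(url) for url in crawled}

print(urls_per_day, urls_per_year)
print(len(crawled), "crawled URLs,", len(real_pages), "real page")
```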
If you were a search engine, would you want to index that?
The Solution: Don’t Use Session IDs
I’m not a programmer so my knowledge in this area is relatively limited. Here’s what I know. Session IDs suck. There are better ways to do what session IDs accomplish without the duplicate content clinging like dog poop on the bottom of your shoe! Not only can other options, such as cookies, allow you to track visitors through the site, they do it far better and can keep track beyond just one single session, though neither is cross-browser compatible!
I’ll leave it to you and your programmer to figure out which tracking option is best for your system, but you can tell them I said session IDs are the wrong answer.
The Problem: Redundant URL Duplication
One of the most basic site architectural problems revolves around how pages are accessed in the browser. Most pages can only be accessed by their primary URL, but that is not the case when the page is the first page of a sub-directory (virtual or otherwise).
This is illustrated in the image below. Each of these URLs, left unchecked, leads to the exact same page with the exact same content.
This holds true of any page at the top of a directory structure (i.e. www.site.com/page et al.). That’s one page with four distinct URLs creating duplicate content on the site and splitting your link juice.
The Solution: Server Side Redirects and Internal Link Consistency
There are a number of fixes to this kind of duplicate content issue, and I recommend implementing all of them. Each has its own merits but is prone to allowing things to slip through or around. Implementing them all creates an iron-clad duplicate content solution that can’t be breached!
Server Side Redirects
One solution that can be implemented on Apache servers is to redirect your non www. URLs to www. URLs (or vice versa) via your .htaccess file. No need to explain it in detail here, but you can follow that link to get the full scoop. This doesn’t work on all servers, but you can work with your web host and programmers to find a similar solution for the same effect.
This solution works whether you want the www in the URL or not. Just decide which way you want to go and redirect the other to that.
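The actual redirect lives in your server configuration, not in application code, but the decision it makes is easy to sketch. The Python below is only an illustration of that logic, not an .htaccess rule: `redirect_target` is a hypothetical helper and `site.com` is a placeholder domain. Swap the two host arguments if you prefer the non-www version.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def redirect_target(url, preferred_host="www.site.com", other_host="site.com") -> Optional[str]:
    """Return the URL a permanent (301) redirect should point to, or None if no redirect is needed."""
    parts = urlsplit(url)
    if parts.netloc.lower() == other_host:
        # Same path and query string, just on the preferred host.
        return urlunsplit(parts._replace(netloc=preferred_host))
    return None


print(redirect_target("http://site.com/page?color=red"))  # http://www.site.com/page?color=red
print(redirect_target("http://www.site.com/page"))        # None
```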
Internal Link Consistency
Once you decide whether your URLs will or won’t use the www, then be sure to use that in all your absolute internal linking. Yes, if you link incorrectly, the server side redirect will handle it. But if for whatever reason the redirect fails, you are now opening yourself up to duplicate pages getting indexed. I’ve seen it happen, where someone makes a change to the server and the redirects no longer work. It is often months later before the problem is detected, and then only after duplicate pages have made their way into the search index.
Never link to /index.html (or .php, etc.)
When linking to a page at the top of any directory or sub-directory, don’t link to the page file name, but instead, link to the root directory folder. These links are automatically redirected for the home page using the server side redirect, but it isn’t automatically done for internal site sub-directory pages. Making sure all your links are consistently pointing to the root sub-directory for these top level pages means you won’t have to worry about a duplicate page showing up in the search results.
www.site.com/index.html (or .asp, .php, etc.)
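If you want to check your own links for this, a small helper can rewrite any link that ends in an index file so it points at the directory instead. This is a sketch under assumptions: `directory_link` is a hypothetical name, and the list of index file extensions is my guess at the common ones, so adjust it for your server.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumed set of index file names: index.html, index.htm, index.php, index.asp
INDEX_FILE = re.compile(r"/index\.(?:html?|php|asp)$", re.IGNORECASE)


def directory_link(url):
    """Rewrite a link to an index file so it points at the directory itself."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(path=INDEX_FILE.sub("/", parts.path)))


print(directory_link("http://www.site.com/index.html"))       # http://www.site.com/
print(directory_link("http://www.site.com/page/index.php"))   # http://www.site.com/page/
print(directory_link("http://www.site.com/page/about.html"))  # unchanged
```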
Implementing ALL of these fixes may seem like duplicate content fix overkill, but most of them are so easy there is no reason not to. It takes a bit of time, but the certainty of eliminating all duplicate content problems is well worth it.
The Problem: Historical File Duplication
This isn’t something that most people think about as a duplicate content problem, but it certainly can be. Over the years, a typical site goes through designs, re-designs, development, and re-development. As things get shuffled, copied, moved around, and beta tested, there is a tendency for duplicate content to be inadvertently created. I’ve seen developers change the entire directory structure of a site and upload it without ever removing or redirecting the old original files.
What exacerbates this problem is when internal content links are not updated to point to the new URLs. Going back to the developers I mentioned in the paragraph above, once they rolled out the “finished” new website, I spent over five hours fixing links in content that were pointing to the old files!
As long as these old files remain on the server, and worse, are being linked to, the search engines continue to index these old pages, creating competition for search engine favor between the new and the old pages.
The Solution: Delete Files and Fix Broken Links
If you haven’t bothered to remove the old files from your server, a broken link check won’t do you any good. So start there. Be sure to back up your site, so as not to inadvertently delete a page you need. Once all old pages are removed, start running broken link checks with a program such as Xenu Link Sleuth.
The report provided should let you know which page contains broken links, and where the link points to. Use that to determine the correct new location of the link and fix it. Once you have them all fixed, rerun the broken link check. Chances are it will continue to find more links to fix. I’ve had to run these checks up to 20 times before I was confident all broken links had been fixed. Even then, it’s a good idea to run them periodically to check for anything new.
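To give a feel for what such a report contains, here is a bare-bones sketch of the check in Python. It is not how Xenu works internally and it doesn’t fetch anything over the network: it assumes you already have the HTML of every live page in a dictionary, and the `LinkCollector` and `broken_links` names and the sample pages are invented for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def broken_links(pages, site_root):
    """pages maps each live URL to its HTML. Return (page, target) pairs for internal links to missing pages."""
    report = []
    for source, html in pages.items():
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.hrefs:
            target, _fragment = urldefrag(urljoin(source, href))
            if target.startswith(site_root) and target not in pages:
                report.append((source, target))
    return report


# Hypothetical two-page site where the home page still links to a deleted file.
site = {
    "http://www.site.com/": '<a href="/old/about.html">About</a> <a href="/about/">About</a>',
    "http://www.site.com/about/": '<a href="http://www.site.com/">Home</a>',
}
for page, target in broken_links(site, "http://www.site.com/"):
    print(page, "links to missing page", target)
```

Each line of output tells you which page to open and which link to fix, which is all you need to work through the list.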
Not all duplicate content will destroy your on-site SEO efforts; however, some forms of it will definitely prevent you from having a top-notch performing website. The best duplicate content is found on your competitors’ websites, not your own. Give your site a chance to perform by eliminating all forms of duplicate content wherever possible. Replacing duplicate content with unique, purposeful content that has value to the searchers and engines will give you a needed boost against an otherwise non-duplicate-content-free competitor.