There are two kinds of duplicate content: content that is duplicated across multiple websites and content that is duplicated on multiple pages of a single site. I believe the search engines treat each differently and, of course, different standards may be applied to duplicate content within each of these two main categories, depending on the cause and instance.
Please note that I’ve not done any in-depth testing of this issue, so everything I’m presenting here is my own theory. But I think, as far as untested theories go, it’s pretty solid.
Multiple Site Duplicate Content
Let’s first tackle the issue of content duplicated on different sites across the web. Within this segment of duplicate content there are two obvious types: duplicated articles (or other types of lengthier content) and duplicate product descriptions.
Many ecommerce sites use nothing more than the manufacturer’s default descriptions to populate their content pages. They might throw in a custom title and description, but many times the description info is left intact. When this is the case, how do the search engines determine the relevance of one site’s product over another?
In these instances I believe the weight of the site itself, and the overall number and quality of backlinks, tend to be the primary factors. Given content similar to another site’s, a site that is better known, has a larger audience, and has a better backlink structure is likely to trump any other website.
On the other hand, a site that provides unique product descriptions suddenly has an advantage. Links and popularity will still come into play here, so the big site with dupe content may still achieve better rankings. But the site with unique content will undoubtedly outperform sites that are on the same “stature” level, and perhaps sites that are one or two rungs higher in stature, assuming, of course, that the higher stature sites are using duplicate product descriptions.
The search engines should favor the sites that take the time to develop unique content, over those that don’t, barring any other factors that might come into play.
The other type of duplicate content between different websites is longer content or articles. This comes into play with article distribution sites, scraper sites or other blogs that are republishing content. Some of this is against the content originator’s will, but not always. I’ll hold no distinction between the two here.
I don’t have a firm theory on what actually happens in these cases, but I agree with many others who have spoken on this as to what the search engines should be trying to do. It would seem that it would not be too difficult to find the original, or canonical, version of any piece of content. They can do this a couple of ways which would probably identify the originator of 90% of all duplicate content.
One way is to simply look at the cache date. If they cached the content on site X first, then when it appears on site Y, they know it’s duplicated. This way, of course, assumes that the originator always gets cached first, which is not always the case.
A second way is to look for an author’s name, or a link that points back to the author’s site. If I republish this piece on another website (and I’m sure I will), I’ll have a link back to my site in my author’s bio. The search engines can simply look for this link and if it goes back to a site that does in fact contain this “duplicated” information, then the engines can know which is the original version.
I suspect that they employ both of these methods, as well as others that I have not mentioned here, to determine which version is canonical. While the second method does not address stolen content, the first method can most likely be used to identify the content originator.
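To make the two hints concrete, here is a minimal sketch of how they might be combined. It is only an illustration of my theory, not how any search engine actually works. The `Copy` record, its fields, and the example site names are all made up for this example, and it assumes the crawler knows the date it first stored each copy.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Copy:
    site: str                        # site hosting this copy (hypothetical name)
    first_seen: date                 # date the crawler first stored the page (assumed known)
    links_to: Optional[str] = None   # site named in the author bio link, if any


def guess_original(copies: list[Copy]) -> str:
    """Guess the originating site using the link-back hint, then the cache date."""
    sites = {c.site for c in copies}
    # Second method from the text: a copy that links back to another site
    # which also holds the same content points at the likely originator.
    votes: dict[str, int] = {}
    for c in copies:
        if c.links_to in sites and c.links_to != c.site:
            votes[c.links_to] = votes.get(c.links_to, 0) + 1
    if votes:
        return max(votes, key=votes.get)
    # First method from the text: fall back to the earliest cache date.
    return min(copies, key=lambda c: c.first_seen).site


copies = [
    Copy("bigportal.example", date(2008, 3, 2), links_to="myblog.example"),
    Copy("myblog.example", date(2008, 3, 5)),
]
print(guess_original(copies))  # myblog.example
```

Here the republished copy was cached first, yet the bio link still identifies the blog as the source. Remove the link and the sketch falls back to the cache date and picks the wrong site, which is exactly the weakness of relying on the cache date alone.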
A couple of years ago I asked a question on this topic, in regard to passing link value to originating sources, to a group of search engine engineers. I never received a satisfactory answer.
My question was that if there are two pieces of identical content and the search engines clearly know which one came first, do links pointing to the duplicated version count as links to the original version? Part of the answer here would be obvious in that if the duplicated version contains a link back to the first then the first will get some second-hand link value. But I wanted to know about the passing of first-hand link value. Much to my disappointment, the search engineers refused to answer my question.
What I would like to see is that, when the search engines are confident of the canonical version of a piece of text, all links to the duplicates are attributed (at least in part) back to the original. The originating source should get the lion’s share of the link value, regardless of where that content is duplicated. This would allow the original content to gain more traction against duplicates on sites that have significantly more power and weight.
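As a rough sketch of what that wish could look like in numbers, the function below moves a share of each duplicate's link value to the original. Everything here is hypothetical: the function name, the 0.75 share, the URLs, and the idea that link value is a single number per URL are all assumptions made for illustration.

```python
def attribute_link_value(
    links: dict[str, float], original: str, share: float = 0.75
) -> dict[str, float]:
    """Credit `share` of every duplicate's link value to the original URL."""
    credited = dict.fromkeys(links, 0.0)
    credited.setdefault(original, 0.0)
    for url, value in links.items():
        if url == original:
            # Links pointing at the original stay with the original.
            credited[original] += value
        else:
            # The duplicate keeps only what is left after the transfer.
            credited[original] += value * share
            credited[url] += value * (1 - share)
    return credited


raw = {"myblog.example/post": 10.0, "bigportal.example/post": 100.0}
print(attribute_link_value(raw, original="myblog.example/post"))
# {'myblog.example/post': 85.0, 'bigportal.example/post': 25.0}
```

With these made-up numbers the original goes from being outweighed ten to one to holding most of the combined value, which is the traction I am describing.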
In-Site Duplicate Content
Again, in-site duplicate content is content that is duplicated on two or more pages within a single site. I think this type of duplication is the most prone to receiving some kind of penalty from the search engines. But penalty might not be the best word to use in most cases. I think what happens here is that search engines simply treat your site differently than they would if it didn’t have any significant duplicate content problems.
The type of in-site duplicate content that most often appears on ecommerce sites occurs when the same product is given multiple URLs, depending on the navigation path. I’ve seen sites that create up to three URLs for every single product page. This type of duplication poses a real problem for the engines. A 5,000-product site suddenly looks like a 15,000-product site. But as the search engines spider and analyze, they realize that they have 10,000 too many pages in their index due to the duplication of the product pages at different URLs.
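You can estimate how many surplus URLs your own site exposes with a small script like the sketch below. It assumes you already have each page's text keyed by URL (the `pages` dictionary and its example URLs are made up), and it only catches pages that are identical once whitespace and letter case are normalized. Near-duplicates would need something smarter.

```python
import hashlib
from collections import defaultdict


def group_by_content(pages: dict[str, str]) -> dict[str, list[str]]:
    """Group URLs whose bodies match after whitespace and case normalization."""
    groups: dict[str, list[str]] = defaultdict(list)
    for url, body in pages.items():
        normalized = " ".join(body.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        groups[digest].append(url)
    return dict(groups)


def wasted_urls(pages: dict[str, str]) -> int:
    """Count every URL beyond the first in each group of identical pages."""
    return sum(len(urls) - 1 for urls in group_by_content(pages).values())


pages = {
    "/burton/snowboards/board-a": "Board A. All-mountain board.",
    "/snowboards/board-a": "Board A.  All-mountain board.",
    "/sale/board-a": "board a. all-mountain board.",
    "/snowboards/board-b": "Board B. Freestyle board.",
}
print(wasted_urls(pages))  # 2
```

Scale that up and the arithmetic above follows: 5,000 products at three URLs each is 15,000 URLs, of which 10,000 are surplus.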
When this type of duplication is found, the search engines will often slow down or even stop spidering your site. The duplication has created an undue burden on the engines and since they are not getting much in the way of new content (in relation to pages being spidered), they have no compulsion to continue. This leaves many pages of your website out of reach of the search results.
Such duplication also leaves you open to splitting link value between multiple URLs. If someone links to a particular product, they may link to any of the multiple versions, instead of a single primary version/URL. This can cause the search engines to give weight to the “wrong” URLs.
Many people will “fix” this problem by preventing the search engines from indexing all but a single version of the content. Keep in mind though, that this only keeps the duplicate pages out of the search index, but does nothing about the link splitting issue. As long as those duplicate URLs exist, link value splitting will be an issue.
The best solution here is to find a way to resolve the duplicate content issue altogether. Don’t let the navigation path determine the URL for any given product. Let there be only one URL for each product, regardless of how the visitor navigated, or how many categories the product fits in.
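Here is a minimal sketch of the one-URL-per-product idea. It assumes a hypothetical URL scheme in which the last path segment is the product slug and everything before it is navigation, and it maps every variant to a made-up `/products/<slug>` address. Your own shopping cart will have its own scheme.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_product_url(url: str) -> str:
    """Map any navigation-path URL to a single /products/<slug> URL."""
    parts = urlsplit(url)
    # Assumption: the final path segment identifies the product.
    slug = parts.path.rstrip("/").rsplit("/", 1)[-1]
    # Drop the category path, the query string and the fragment.
    return urlunsplit((parts.scheme, parts.netloc, f"/products/{slug}", "", ""))


variants = [
    "https://shop.example/burton/snowboards/board-a",
    "https://shop.example/snowboards/board-a/?ref=nav",
    "https://shop.example/sale/board-a",
]
print({canonical_product_url(u) for u in variants})
# {'https://shop.example/products/board-a'}
```

All three navigation paths collapse to the same address, so every visitor and every incoming link ends up at one page.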
Another duplicate content issue is when short product description summaries are being displayed throughout multiple category type pages. Let’s say you are looking for a Burton Snowboard. You click on the Burton products link and then click on snowboards. This leads to a page with various Burton snowboards, each displaying a short product description/summary. But when you then navigate to the main snowboards page, which carries products from Burton and other companies, you find the same Burton product descriptions along with duplicate product descriptions for all the other products as well.
I’m not entirely certain how the search engines react to this kind of duplication, but I don’t imagine that they would give the page a whole lot of weight. A solution is to make sure that each of these pages has at least a single paragraph of unique content. This way the search engines can freely choose to ignore the obvious duplicate product descriptions, but still have something of value on the page worthy of being indexed and followed.
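If you want a quick check of how much of a category page is actually unique, a sketch like the following can help. It treats each page as a list of text blocks (an intro paragraph, then one summary per product) and reports the fraction of blocks found on no other page. The function name and the sample text are invented for this example, and exact matching of blocks is a simplifying assumption.

```python
def unique_share(page: list[str], other_pages: list[list[str]]) -> float:
    """Return the fraction of a page's text blocks that appear on no other page."""
    if not page:
        return 0.0
    seen_elsewhere = {
        block.strip().lower() for other in other_pages for block in other
    }
    unique = [b for b in page if b.strip().lower() not in seen_elsewhere]
    return len(unique) / len(page)


burton_page = [
    "Our Burton range, picked by riders on staff.",  # unique intro paragraph
    "Board A: all-mountain board.",
    "Board B: freestyle board.",
    "Board C: powder board.",
]
snowboards_page = [
    "Board A: all-mountain board.",
    "Board B: freestyle board.",
    "Board C: powder board.",
    "Board D: another brand's board.",
]
print(unique_share(burton_page, [snowboards_page]))  # 0.25
```

Without the intro paragraph the Burton page would score zero, meaning there would be nothing on it that the main snowboards page does not already carry.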
There is a time for (cautious) duplication
When analyzing the value or necessity of duplicate content, you have to evaluate your goals. I frequently allow articles from my blog to be reprinted/reposted/duplicated on other sites. I know I’m creating duplicate content, but my purpose for doing so is exposure. Some of these other sites give me far more exposure than I get on my own blog.
You can probably argue that if I never duplicated content, then my blog would get a lot more exposure than it does now, and I’ll concede that possibility. But I also know that it would take significantly more exposure to match what I get from duplicating my content on the other sites. So in this case I’m willing to live with the duplicate content issues that may arise. But I also take measures to make sure the engines know where the content originated.
The thing to keep in mind with all of this is that search engines want unique content. It does them no good to serve up ten pages with exactly the same content. So the more you can do to make each of your pages unique, the better off you’ll be.