Duplicate content is like a virus. When a virus enters your system, it begins to replicate itself until it is ready to be released and cause all kinds of nasty havoc within your body. On the web, a little duplicate content isn’t a huge problem, but the more it replicates itself, the bigger the problem you’re going to have. Too much duplicate content and your website will come down with some serious health issues.
I’m going to break this into three parts. In this post, I’ll discuss the problems caused by duplicate content. In Part II, I’ll address the causes of duplicate content, and in Part III, I’ll discuss some solutions for eliminating it.
This series is pulled from a presentation given at SMX East. Just for fun, it’s entirely Matrix-themed because, like, it’s so obscure and all.
Duplicate Content Causes Problems. Duh!
Google and other search engines like to tell us that they have the duplicate content issue all figured out. And, in the cases where they don’t, they provide a couple of band-aid solutions for you to use (we’ll get to these later). While there may be no such thing as a “duplicate content penalty”, there are certainly filters in place in the search engine algorithms that devalue content that is considered duplicate, and make your site as a whole less valuable in the eyes of the search engines.
If you trust the search engines to handle your site properly, and don’t mind having important pages filtered out of the search results, then go ahead and move on to another story… you got nothing to worry about.
Too many pages to index
Theoretically, there is no limit to the number of pages on your site that the search engines can add to their index. In practice, though, if they find too much “junk”, they’ll stop spidering pages and move on to the next site. They may come back and keep grabbing content they missed, but likely at a much slower pace than they otherwise would.
Duplicate content, in practice, creates “junk” pages. It’s not that these pages have no value, but compared to the one or two or dozen other pages on your site or throughout the web that contain the same content, there really isn’t anything unique there for the search engines to care about. It’s up to the engines to decide which pages are unnecessary and which one is the original source or most valuable page to include in the search results.
The rest is just clutter that the search engines would rather not have.
Slows search engine spidering
With so many duplicate pages to sort through, the search engines tire easily. Instead of indexing hundreds of pages of unique content, they are left sifting through thousands of pages of some original content and a whole lot of duplicate crap. Yeah, you’d tire too!
Once the engines get a whiff that a site is overrun with dupes, the spidering process will often be reduced to a slow crawl. Why rush? There are plenty of original sites out there they can be gathering information on. Maybe they’ll find a good nugget or two on your site, but it can wait, as long as they are finding gold mines elsewhere.
Splits valuable link juice
When more than one page (URL) on your site carries the same content, the question becomes which page gets the links. In practice, whichever URL the visitor lands on and bookmarks, or passes on via social media, is the page that gets the link value. But each visitor may land on a different URL with that same content.
If 10 people visit your site, 5 land on and choose to link to one URL, while the other 5 land on and choose to link to the other (both being the same content), instead of having one page that has 10 great links, you have 2 pages each with half the linking value. Now imagine you have 5 duplicate pages and the same scenario happens. Instead of 10 links going to a single page, you may end up with 2 links going to each of the 5 duplicate versions.
So, for each duplicate page on your site, you are cutting the link value that any one of the pages could achieve. When it comes to rankings, this matters. In our second scenario, all it takes, essentially, is a similarly optimized page with 3 links to outrank your page with only 2. Not really fair, because the same content really has 10 links, but it’s your own damn fault for splitting up your link juice like that.
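The arithmetic above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: links are spread as evenly as possible across the duplicate URLs, and raw link count stands in for ranking strength between similarly optimized pages. Real search engines weigh far more than that, and the function names here (`split_links`, `outranked_by`) are made up for this example.

```python
def split_links(total_links, duplicate_urls):
    """Spread total_links as evenly as possible across duplicate_urls.

    Assumption: visitors link to the duplicate URLs in roughly equal numbers.
    """
    base, extra = divmod(total_links, duplicate_urls)
    # The first `extra` URLs pick up one leftover link each.
    return [base + 1 if i < extra else base for i in range(duplicate_urls)]


def outranked_by(competitor_links, total_links, duplicate_urls):
    """True if a competitor page has more links than your strongest duplicate.

    Assumption: link count alone decides between similarly optimized pages.
    """
    return competitor_links > max(split_links(total_links, duplicate_urls))


if __name__ == "__main__":
    for urls in (1, 2, 5):
        print(urls, "URL(s):", split_links(10, urls))
    # A competitor with 3 links versus your 10 links split across 5 duplicates
    print("Outranked:", outranked_by(3, 10, 5))
```

With 10 links and a single URL, the competitor with 3 links doesn’t stand a chance. Split those same 10 links across 5 duplicate URLs and, under these assumptions, the competitor comes out ahead.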
We talked above about how duplicate content slows spidering, leaving some content out of the search engines’ index. Leaving duplicate content aside for a moment, let’s consider the page URLs themselves. We’ve all seen those URLs that are so long and complicated that you couldn’t type one out if it was dictated to you. While not all of these URLs are problematic, some of them certainly can be. Not to mention URLs that the engines simply can’t decipher as being unique pages.
We’ll talk more about these URLs in Part III, but for now, let’s just consider what it means when a URL cannot be spidered by the search engines. Well, simply put, if the search engines can’t spider it, then it won’t get indexed. The browser may pull open a page the visitors can see, but the search engines get nothin’. And when you multiply that nothin’ the search engines get by the nothin’ they’ll show in the results (don’t forget to carry the nothin’), you get a whole lot of nothin’ going on.
Pages that are inaccessible to the search engines can’t act as landing pages in the search results. That’s OK if it’s a useless page, but not if it’s something of value that you want to be driving traffic to.
There are a lot of problems caused by duplicate content and bad URL development. These problems may be minor or cataclysmic, depending on the site. Either way, small problem or large, it’s probably a good idea to figure out the cause of your duplicate content problems so you can begin to implement solutions that will pave the way for better search engine rankings.