The best information here came from Thurow (I have to look more into this whole “shingles” thing) and from Cutts and Converse at the end. The biggest takeaway is that search engines are getting better at working out inadvertent duplicate content, but you should be proactive in helping them because that will help you.
Anne Kennedy, Beyond Ink
The classic duplicate content case is one website on two domains. Most dupe content is inadvertent. Choose one domain and 301 redirect the others to it.
Shari Thurow, GrantasticDesigns.com
It is unclear what duplicate content really is. It could be an exact duplicate or a near duplicate. SEs don’t want to display dupe content because it interferes with retrieval from the index and makes for a negative search experience.
The “repeat search with omitted results” link at the bottom of a Google results page is the dupe content filter in action. Search engines remove boilerplate (common headers and footers) and then analyze the main content area. Each page should have unique linkage properties.
Having a press release on your site and then on a press release site is NOT duplicate content because the linkage properties and boilerplate are different.
65% of web content does not change on a weekly basis. A news site’s home page changes frequently, but the news content itself does not.
Don’t move content from server to server too often; that behavior is associated with spam.
All search engines use “shingles” when analyzing for duplicate content. Shingles are words or word sets (short runs of consecutive words) which appear variously on multiple pages; pages that share many shingles are likely duplicates.
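To make the “shingles” idea concrete, here is a minimal sketch of word shingling with a simple overlap score. It is an illustration only, not what any search engine actually runs: the function names, the shingle width of 4, and the use of Jaccard similarity as the score are all assumptions, and the sample sentences are made up.

```python
def shingles(text, w=4):
    """Return the set of w-word shingles (runs of w consecutive words) in text."""
    words = text.lower().split()
    if len(words) < w:
        # Text shorter than one shingle: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b, w=4):
    """Jaccard similarity of two shingle sets: 1.0 = same shingles, 0.0 = none shared."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


# Hypothetical page texts, for illustration only.
page_a = "choose one domain and redirect all of the other domains to it"
page_b = "choose one domain and redirect all of the other hosts to it"
page_c = "session ids create many versions of the same home page"

print(resemblance(page_a, page_b))  # one changed word still leaves a high score
print(resemblance(page_a, page_c))  # unrelated text shares no shingles
```

A score near 1.0 suggests a near duplicate. Where to draw the line between “duplicate” and “different” is a judgment call this sketch does not try to make.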
Use robots.txt to filter out dupe content on your own site. Archive your own site, keep records of content and register with copyright.gov. Monitor frequently.
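As a rough sketch of the robots.txt suggestion, the snippet below uses Python’s standard `urllib.robotparser` to check which URLs a set of rules would block. The `/print/` and `/archive/` paths, the `example.com` host, and the `ExampleBot` user agent are hypothetical stand-ins for wherever duplicate copies live on your own site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep crawlers out of directories that only hold
# duplicate copies (printer-friendly pages, archive copies).
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
Disallow: /archive/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def crawlable(url, agent="ExampleBot"):
    """Return True if the rules above allow the given agent to fetch url."""
    return parser.can_fetch(agent, url)


print(crawlable("http://www.example.com/articles/dupe-content"))  # original page
print(crawlable("http://www.example.com/print/dupe-content"))     # printer-friendly copy
```

Testing the rules this way before publishing them helps avoid accidentally blocking the original pages along with the copies.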
Mikkel deMib Svendsen, RedZoneGlobal
Deal with dupe content issues BEFORE the search engines do.
WWW vs. non-WWW issue: Point all links, both internal and external, to either the www OR the non-www version of your site. Set a 301 redirect so the unwanted version cannot be accessed.
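The redirect decision itself is simple, as the sketch below suggests. In practice this is usually configured in the web server rather than written by hand. The host names, the choice of the www version as canonical, and the function name `redirect_for` are assumptions for illustration, and ports are ignored for brevity.

```python
from urllib.parse import urlsplit, urlunsplit

CANONICAL_HOST = "www.example.com"  # hypothetical: the version you chose to keep
ALIAS_HOSTS = {"example.com"}       # hypothetical: versions that should redirect


def redirect_for(url):
    """Return (301, location) if url is on an alias host, else None."""
    parts = urlsplit(url)
    host = parts.hostname or ""     # hostname is lowercased and excludes any port
    if host in ALIAS_HOSTS:
        location = urlunsplit(
            (parts.scheme, CANONICAL_HOST, parts.path or "/", parts.query, parts.fragment)
        )
        return 301, location
    return None


print(redirect_for("http://example.com/about?x=1"))  # redirected to the www version
print(redirect_for("http://www.example.com/about"))  # already canonical, no redirect
```

The path and query string are preserved so that every page on the unwanted host redirects to its exact counterpart, not just to the home page.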
Session IDs: Session IDs create hundreds, if not hundreds of thousands, of “versions” of the home page (or of any page that carries a session ID). Put all the session info in a cookie and drop session IDs from the URLs. The next best solution is to identify SE spiders and strip out the session ID info for them.
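The stripping step might look like the sketch below, which removes session parameters from a URL’s query string so that every “version” collapses back to one address. The parameter names in `SESSION_PARAMS` are assumptions (use whatever your platform emits), and detecting spiders is left out entirely.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed session parameter names, compared case-insensitively.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid"}


def strip_session_id(url):
    """Return url with any session ID query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Three "versions" of the same home page collapse to a single URL.
urls = [
    "http://www.example.com/?sid=abc123",
    "http://www.example.com/?sid=def456",
    "http://www.example.com/?sid=ghi789",
]
print({strip_session_id(u) for u in urls})
```

Other query parameters are kept, so a URL such as `/shop?item=42&PHPSESSID=xyz` still points at the same product page after stripping.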
Permanent link structure: Blogs and forums will often have different URLs to access the same pages. 301 redirect all unused URLs. WordPress has a canonical URL plugin.
Unique pages should have only one address.
Matt Cutts, Google
SEs want as much diversity in results as they can get. Don’t worry about dupe sites for various countries or languages. HTML, Word and PDF docs with dupe content would not be considered duplicate content. Recommends picking one version of the site (www or non-www) and being consistent.
Tim Converse, Yahoo
Most of the examples presented are inadvertent duplication of content. Search engines can sort out most of those cases, so don’t worry about getting banned. The more we help them, the more the SEs can crawl and index the site without worrying about the dupe pages. Just stay away from deliberate abuse (multiple sites or repurposing someone else’s content).