Reusing Web Content without Getting Penalized
Web Marketing Today, Issue 126, July 9, 2003
For the July 2, 2003, Doctor Ebiz I received this question:
"Our organization creates huge amounts of content. This content is created and 'owned' by different internal divisions. Much of this content is re-usable across divisions. However, we have heard that allowing the same content to appear on multiple web properties can cause penalties from search engines. How can we reuse content without getting blacklisted?" -- Keith Seabourn, Campus Crusade for Christ, International
I shared this question with Mike Grehan, author of the highly respected Search Engine Marketing: The essential best practice guide. Drawing on his experience, I answered the question briefly in Doctor Ebiz, but felt it was important to include Mike's full answer for those who are interested:
I've had a chance to review your question re duplicate material. And as is always the case where search engines are concerned -- there's no real "cut'n'dry" answer, I'm afraid.
Search engines are aware that there are many legitimate reasons for uploading duplicate content to the web. Examples include improving access time by providing regional versions of sites, sharing research data, and using common promotional material for intermediaries selling the same products. However, the one area that really does concern them is what they term "pseudo identities". They see this as a method of spamming search engines with seemingly different websites that in fact have the same content. For example, two websites may be www.discountdvd_.com and www.xxxpass_.com, both pointing to the same "adult" material.
One of the world's leading scientists in the field of information retrieval on the web has conducted the most authoritative research into detecting duplicate material online, and his methods have been adopted very successfully by the major search engines. I'd rather not explain in detail exactly how these methods work in an open forum such as your newsletter, as this may also serve as a lesson in how to spam more effectively. Suffice it to say that search engines can detect duplicate material starting from the most obvious analysis, such as clusters of pages which are byte-wise identical or even just very similar. In the main, they treat content as duplicated when a high percentage of paths (that is, the portion of the URL after the hostname, or the file name) are present on more than one website. This is even more so if the content under those paths links to documents which have other similar content, such as duplicate page content with exactly the same outbound links.
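To make the path-overlap idea above concrete, here is a minimal Python sketch. It is only an illustration, not the method any search engine actually uses: the hostnames, the sample URLs, and the helper names paths_by_host and path_overlap are all assumptions made up for this example, and real systems would also compare page content and outbound links.

```python
from urllib.parse import urlparse


def paths_by_host(urls):
    """Group the path portion of each URL under its hostname."""
    hosts = {}
    for url in urls:
        parts = urlparse(url)
        hosts.setdefault(parts.netloc.lower(), set()).add(parts.path or "/")
    return hosts


def path_overlap(paths_a, paths_b):
    """Fraction of the smaller host's paths that also appear on the other host."""
    if not paths_a or not paths_b:
        return 0.0
    shared = paths_a & paths_b
    return len(shared) / min(len(paths_a), len(paths_b))


# Hypothetical URLs: two hosts sharing two of their three paths.
urls = [
    "http://site-a.example/articles/faith.html",
    "http://site-a.example/articles/hope.html",
    "http://site-a.example/about.html",
    "http://site-b.example/articles/faith.html",
    "http://site-b.example/articles/hope.html",
    "http://site-b.example/contact.html",
]

hosts = paths_by_host(urls)
overlap = path_overlap(hosts["site-a.example"], hosts["site-b.example"])
print(f"Shared-path ratio: {overlap:.2f}")
```

A ratio close to 1.0 would mean the two hosts expose nearly the same set of paths, which is the kind of signal described above. Any threshold for "too similar" would be a guess, so none is suggested here.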
It's safe to say that, in the widest definition, search engines do not consider hosts that replicate content but rename paths to be "mirrors". Therefore, for the purpose of syndicating the same or similar content, the best practice is always to rename the page file names on whichever server they are being hosted. Of course, many similar pages hosted under the same IP address, even with different file names, are also likely to set the "alarm bells" ringing.
To be completely sure that search engines will not penalise you or drop you from their database for reproducing content across a number of servers, you would ideally use the 'robots.txt' protocol to keep this material from being indexed by search engines in the first place (http://www.robotstxt.org/wc/exclusion.html).
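As a rough illustration of the robots.txt approach, the Python sketch below defines a small exclusion file and checks it with the standard library's urllib.robotparser. The /syndicated/ directory and the division-b.example hostname are hypothetical. The sketch assumes the reused copies live under one directory on the secondary site while the originals stay indexable elsewhere.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a site that hosts reused copies under /syndicated/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /syndicated/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A compliant crawler should skip the reused copy but may fetch other pages.
copy_allowed = parser.can_fetch(
    "*", "http://division-b.example/syndicated/article.html"
)
other_allowed = parser.can_fetch(
    "*", "http://division-b.example/original/article.html"
)

print("Reused copy crawlable:", copy_allowed)
print("Other page crawlable:", other_allowed)
```

Only the two lines inside ROBOTS_TXT would go into the actual robots.txt file at the root of the site. The rest of the code simply confirms that the rule blocks the intended paths.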

