Demystifying the "duplicate content penalty"
Tuesday, October 7, 2008
Posted by Susan Moskwa, Webmaster Trends Analyst
Original post: Demystifying the "duplicate content penalty"
Published: Friday, September 12, 2008, 8:30 AM
Duplicate content. There's just something about it. We keep writing about it, and people keep asking about it. In particular, I still hear a lot of webmasters worrying about whether they may have a "duplicate content penalty."
Let's put this to bed once and for all, folks: there's no such thing as a "duplicate content penalty." At least, not in the way most people mean when they say that.
There are some penalties that are related to the idea of having the same content as another site: for example, scraping content from other sites and republishing it, or republishing content without adding any additional value. These tactics are clearly outlined (and discouraged) in our Webmaster Guidelines:
- Don't create multiple pages, subdomains, or domains with substantially duplicate content.
- Avoid "cookie cutter" approaches such as affiliate programs with little or no original content.
- If your site participates in an affiliate program, make sure that your site adds value. Provide unique and relevant content that gives users a reason to visit your site first.
(Note that while scraping content from others is discouraged, having others scrape you is a different story; check out this post if you're worried about being scraped.)
But most site owners I hear worrying about duplicate content aren't talking about scraping or domain farms; they're talking about things like having multiple URLs on the same domain that point to the same content, such as www.example.com/skates.asp?color=black&brand=riedell and www.example.com/skates.asp?brand=riedell&color=black. Having this type of duplicate content on your site can potentially affect your site's performance in search results, but it doesn't cause penalties. From our help article on duplicate content:
Duplicate content on a site is not grounds for action on that site unless it appears that the intent of the duplicate content is to be deceptive and manipulate search engine results. If your site suffers from duplicate content issues, and you don't follow the advice listed above, we do a good job of choosing a version of the content to show in our search results.
This type of non-malicious duplication is fairly common, especially since many content management systems (CMSs) don't handle it well by default. So when people say that this type of duplicate content can affect your site, it's not because you're likely to be penalized; it's simply due to the way that websites and search engines work.
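One common source of this kind of accidental duplication is query-parameter order. As a rough illustration (not taken from any particular CMS), here is a minimal Python sketch of one way a site could canonicalize its URLs so that both spellings of the skates URL above collapse into a single form; the normalize_url helper is hypothetical:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize_url(url: str) -> str:
    """Hypothetical helper: canonicalize a URL by sorting its query
    parameters, so parameter order no longer produces distinct URLs
    for identical content."""
    parts = urlsplit(url)
    # parse_qsl keeps repeated keys; sorting makes the order deterministic.
    query = urlencode(sorted(parse_qsl(parts.query)))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))

a = "http://www.example.com/skates.asp?color=black&brand=riedell"
b = "http://www.example.com/skates.asp?brand=riedell&color=black"
assert normalize_url(a) == normalize_url(b)  # both collapse to one canonical URL
```

A site that redirects or links only to the normalized form never exposes the duplicate variants in the first place.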
Most search engines strive for a certain level of variety; they want to show you ten different results on a search results page, not ten different URLs that all have the same content. To this end, Google tries to filter out duplicate documents so that users experience less redundancy. You can find more details in this blog post, which explains that:
1. When we detect duplicate content, such as variations caused by URL parameters, we group the duplicate URLs into one cluster.
2. We select what we think is the "best" URL to represent the cluster in search results.
3. We then consolidate properties of the URLs in the cluster, such as link popularity, onto the representative URL. (A toy sketch of these three steps follows this list.)
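To make the three steps above concrete, here is a toy Python sketch. It is purely illustrative, not Google's actual pipeline: the content fingerprints and link-popularity scores are made-up inputs, and choosing the most popular URL as the "best" one is just a plausible stand-in for whatever signals a real search engine uses.

```python
from collections import defaultdict

# Made-up inputs: URL -> (content fingerprint, link-popularity score).
pages = {
    "www.example.com/skates.asp?color=black&brand=riedell": ("hash123", 12),
    "www.example.com/skates.asp?brand=riedell&color=black": ("hash123", 3),
    "www.example.com/boots.asp": ("hash456", 7),
}

# Step 1: group URLs whose content fingerprints match into clusters.
clusters = defaultdict(list)
for url, (fingerprint, popularity) in pages.items():
    clusters[fingerprint].append((url, popularity))

for fingerprint, members in clusters.items():
    # Step 2: pick a representative URL (here: simply the most popular one).
    representative = max(members, key=lambda m: m[1])[0]
    # Step 3: consolidate properties (here: sum link popularity) onto it.
    combined_popularity = sum(popularity for _, popularity in members)
    print(representative, combined_popularity)
```

In this toy run, the two skates URLs form one cluster, the more popular spelling is chosen to represent it, and the combined popularity of both URLs is credited to that one representative.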
Here's how this process could affect you as a webmaster:
- In step 2, Google's idea of what the "best" URL is might not be the same as yours. If you want control over whether www.example.com/skates.asp?color=black&brand=riedell or www.example.com/skates.asp?brand=riedell&color=black gets shown in our search results, you may want to take steps to reduce your duplicate content. One effective way of letting us know which URL you prefer is to include it in your Sitemap (see the sketch after this list).
- In step 3, if we aren't able to detect all the duplicates of a particular page, we won't be able to consolidate all of their properties when we group them. This may dilute the ranking strength of that content, because its signals are split across multiple URLs.
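As for the Sitemap suggestion in the first bullet: a Sitemap is an XML file following the standard sitemaps.org format. Here is a minimal, hypothetical Python sketch that writes one containing only the preferred version of each URL; the URL list is just the example from this post:

```python
from xml.sax.saxutils import escape

# Only the preferred version of each URL goes into the Sitemap, as one
# way of signaling which duplicate you'd like shown in search results.
preferred_urls = [
    "http://www.example.com/skates.asp?color=black&brand=riedell",
]

lines = ['<?xml version="1.0" encoding="UTF-8"?>',
         '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
for url in preferred_urls:
    # Ampersands in query strings must be XML-escaped inside <loc>.
    lines.append(f"  <url><loc>{escape(url)}</loc></url>")
lines.append("</urlset>")

with open("sitemap.xml", "w", encoding="utf-8") as f:
    f.write("\n".join(lines))
```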
In most cases Google does a good job of handling this type of duplication. However, you may also want to think about content that's duplicated across different domains. In particular, if you're deciding to build a site whose purpose inherently involves content duplication, and your business model relies on search traffic, think twice unless you can add a lot of additional value for users. For example, we sometimes hear from Amazon.com affiliates who have a hard time ranking with content that comes entirely from Amazon. Is this because Google wants to stop them from selling the book "Everyone Poops"? No; it's because how on earth are they going to outrank Amazon if they're providing exactly the same content? For online shopping, Amazon has a lot of authority (most likely more than a typical Amazon affiliate site does), and the average Google search user probably wants to see the original information on Amazon, unless the affiliate site has added a significant amount of additional value for users.
Lastly, consider the effect that duplication has on your site's bandwidth. Duplicate content leads to inefficient crawling: when Googlebot discovers ten URLs on your site, it has to crawl each of them before it knows they contain identical content (and thus, as described above, before we can group them). The more time and resources Googlebot spends crawling duplicate content, the less time it has left to crawl the rest of your content.
In summary: duplicate content on your site can affect it in a variety of ways; but unless you've been duplicating deliberately, none of those ways amounts to a penalty. This also means that:
- You typically don't need to submit a reconsideration request when you're cleaning up innocently duplicated content.
- If you're a webmaster of beginner-to-intermediate experience, you probably don't need to spend too much energy worrying about duplicate content, since most search engines have ways of handling it.
- You can help your fellow webmasters by not perpetuating the myth of duplicate content penalties! The remedies for duplicate content are entirely within your control, and there are several good articles you can consult.