According to Google’s search documentation, “duplicate content generally refers to substantial blocks of content within or between domains that completely match other content or are appreciably similar.”
Duplicate content may or may not trigger a penalty, but it can still hurt search engine rankings. When several pieces of “appreciably similar” (in Google’s words) content appear in more than one location on the Internet, search engines have a hard time deciding which version is most relevant to a given search query.
Why does duplicate content matter to search engines? Because it creates three main problems for them:
They don’t know which version to include or exclude from their indexes.
They don’t know whether to consolidate link metrics (trust, authority, anchor text, etc.) on one page or split them between multiple versions.
They don’t know which version to rank for relevant query results.
When duplicate content exists, website owners suffer losses in traffic and rankings. These losses often stem from two problems:
To provide the best search query experience, search engines will rarely display multiple versions of the same content and are therefore forced to choose which version is most likely to be the best result. This dilutes the visibility of each of the duplicates.
Link equity can be further diluted because other sites have to choose between the duplicates as well. Instead of all incoming links pointing to one piece of content, they link to multiple pieces, distributing the link equity among the duplicates. Since inbound links are a ranking factor, this can affect the search visibility of a piece of content.
The end result is that some content will not achieve the desired search visibility that it would otherwise have.
Scraped or copied content is produced by content scrapers: websites that use software tools to steal your content for their own blogs. The content at risk includes not only blog posts and editorial content but also product information pages. Scrapers republishing your blog content on their own sites may be the more familiar source of duplicate content, but eCommerce sites face a common problem of their own: product descriptions. If many different websites sell the same items and all use the manufacturer’s descriptions of those items, identical content ends up in multiple locations on the web. Such duplicate content is not penalized.
How do you fix duplicate content issues? Every fix boils down to the same core idea: specifying which of the duplicates is the “correct” one.
Whenever a site’s content can be found at multiple URLs, it should be canonicalized for search engines. Let’s go over the three main ways to do this: a 301 redirect to the correct URL, the rel=canonical attribute, and the parameter handling tool in Google Search Console.
301 Redirect – In many cases, the best way to combat duplicate content is to set up a 301 redirect from the “duplicate” page to the original content page.
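As a minimal sketch, assuming an Apache server and placeholder URLs, a 301 redirect can be configured in an .htaccess file like this:

```apache
# Hypothetical example: permanently redirect the duplicate URL
# to the original (canonical) version of the page.
Redirect 301 /duplicate-page/ https://www.example.com/original-page/
```

Both visitors and search engine crawlers requesting /duplicate-page/ are sent to the original URL, and search engines consolidate the duplicate’s link equity onto it.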
When multiple pages with the potential to rank well are combined into a single page, they not only stop competing with each other; they also create a stronger overall signal of relevance and popularity. This has a positive impact on the “correct” page’s ability to rank well.
Rel=”canonical”: Another option to deal with duplicate content is to use the rel=canonical attribute. This tells search engines that a given page should be treated as if it were a copy of a specific URL, and all links, content metrics, and “ranking power” that search engines apply to this page should be credited to the specified URL.
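A minimal sketch with a placeholder URL: the tag goes in the head section of each duplicate page and points at the preferred URL.

```html
<!-- In the <head> of the duplicate page; example.com is a placeholder -->
<link rel="canonical" href="https://www.example.com/original-page/" />
```

Unlike a 301 redirect, visitors can still view the duplicate page; only search engines consolidate its signals onto the specified URL.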
Meta Robots Noindex: One meta tag that can be particularly useful for dealing with duplicate content is meta robots, used with the values “noindex,follow”. Commonly called Meta Noindex,Follow and technically written as content=”noindex,follow”, this meta robots tag can be added to the HTML head of each individual page that should be excluded from a search engine’s index.
The meta robots tag allows search engines to crawl the links on a page but keeps them from including the page in their indexes. It’s important that the duplicate page remain crawlable, even though you’re telling Google not to index it, because Google explicitly warns against restricting crawl access to duplicate content on your website. (Search engines like to be able to see everything in case you’ve made a mistake in your code. It allows them to make a judgment call, likely automated, in otherwise ambiguous situations.) Using meta robots is a particularly good solution for pagination-related duplicate content issues.
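A minimal sketch of the tag described above, placed in the head of a page that should be crawled but not indexed:

```html
<!-- Allow crawling and link-following, but keep this page out of the index -->
<meta name="robots" content="noindex,follow" />
```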
Google Search Console allows you to set your site’s preferred domain (for example, https://yoursite.com instead of https://www.yoursite.com) and specify whether Googlebot should crawl various URL parameters differently (parameter handling).
The main drawback to using parameter handling as the primary method of handling duplicate content is that the changes you make only work for Google. Rules set through Google Search Console will not affect how Bing or any other search engine’s crawlers interpret your site; you will need to use the webmaster tools of other search engines in addition to adjusting the settings in Search Console.
While not all scrapers will copy over the full HTML of their source material, some will. For those that do, a self-referencing rel=canonical tag ensures that your site’s version gets credit as the “original” piece of content.
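A self-referencing canonical is simply a rel=canonical tag that points at the page’s own URL. As a sketch with a placeholder URL:

```html
<!-- On https://www.example.com/my-post/ itself, pointing to its own URL -->
<link rel="canonical" href="https://www.example.com/my-post/" />
```

If a scraper republishes your full HTML verbatim, the copied tag still points at your URL, telling search engines which version is the original.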
Duplicate content is fixable, and it needs to be fixed; the rewards are worth the effort. Making a concerted effort to consolidate duplicates and create quality content will be repaid with better rankings.