The new natural search spam

Feb 01, 2009 10:30 PM

I needed a backyard fire pit, so I revved up the Home Depot Website. The retailer’s navigation saved me time by presorting and filtering the hundreds of fire pit options on relevant attributes, like price, size and features. I quickly found the perfect fire pit, picked it up from the local store and had a backyard party that evening.

Many merchants have similar attribute-based or faceted navigation systems on their Websites to make it easier for customers to filter, sort, navigate and buy. But in this case, what’s good for users is not so good for search engines — or for your bottom line.

In August, Google warned that the “infinite filtering” and “resulting page duplication” produced by such faceted navigation systems are harmful to its bots. Google urged that the duplicated pages be cleaned up, because the number of possible filter combinations can grow exponentially: “This can produce thousands of URLs, all finding some subset of the items sold. This may be convenient for your users, but is not so helpful for the Googlebot, which just wants to find everything — once!”

In other words, while useful for consumers, faceted navigation as normally implemented actually confuses the search engines by generating massive numbers of URL permutations that contain mostly duplicated content. This is bot spam.

Let’s look closer at my Home Depot example. The product information page for the “Catalina Creations Celestial Cauldron Fireplace” is duplicated at 466 URLs in Google’s index, based on different combinations of parameters representing brand, category, subcategory, store and more. The “Fire Pits” category, the primary category this product is linked from, is duplicated 99 times in Google’s index.

Each page URL produced by the faceted navigation system has a distinct arrangement of parameters. These reveal the navigational pathways Googlebot crawled to arrive at the same page content. Each category and subcategory has many facets — variations on color, price, brand — that can be listed in different orders to create more URL variations.

Now consider that each product can typically be found in multiple categories, subcategories, sub-subcategories, and pagination assortments. Each of these acts as a multiplier, creating more and more unique URLs for content that either is exactly the same or sends the same keyword signals.
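To see how quickly those multipliers compound, here is a small sketch. The URL scheme, facet names and category paths below are invented for illustration, not Home Depot's actual parameters:

```python
# Sketch (hypothetical URL scheme): how facet ordering and category
# placement multiply the URLs that can point at one product page.
from itertools import permutations

facets = ["brand=catalina", "color=bronze", "price=100-200"]
categories = ["outdoor/fire-pits", "outdoor/heating", "seasonal/patio"]

urls = set()
for cat in categories:                        # each category path is a multiplier
    for r in range(1, len(facets) + 1):
        for combo in permutations(facets, r): # parameter order matters to a crawler
            urls.add(f"/{cat}/product123?" + "&".join(combo))

print(len(urls))  # 3 categories x 15 ordered facet combinations = 45 URLs
```

Just three facets and three category paths yield 45 distinct URLs for a single product; real catalogs have far more of both.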

Either way, the pages generated by these unique URLs contain duplicated content. We refer to this phenomenon as “crawl leak.” What crawl leak means for Home Depot — and for your business — is that natural search marketing and sales performance suffer in the following systemic ways:

  • It creates self-competition between your site’s own landing pages

    Home Depot may have 460,000 URLs indexed according to a “site:homedepot.com” Google query, but only a fraction are unique landing pages. Which of Home Depot’s 466 fire pit pages should Google consider as the “authority” page to show its searchers?

  • It wastes your available “crawl equity.”

    Of the millions of page requests search bots make to your site each month, the vast majority are wasted on duplicate pages that are unfit to acquire incremental searchers, while the authoritative pages get shorted.

  • It fragments your available link popularity (PageRank) among many versions of the same or similar landing pages rather than aggregating it toward single authoritative pages

    Home Depot’s powerful PR 8 home page importance score quickly evaporates within one click. Needless to say, none of the 466 “fire pit” pages have any PageRank importance. Ouch.

  • Worse, many of the indexed landing pages are irrelevant for searchers

    Some attributes, like price, are seldom used in search queries. Yet Google reports more than 1,700 matching Home Depot “fire pit” pages just within the $100 to $200 price range. Not only does this contribute to the duplicate page problem, it’s also generating a page that’s accidentally optimized for a nonexistent search market (since the price range filter doubles as the backlink anchor text, and is listed prominently in the landing page title).

This systematic page duplication is like the devaluation of a currency, preventing Home Depot from realizing the full value of related SEO investments such as content enhancement, HTML improvements, and link building.

Google reports that over 400,000 searches occur each month for the search phrase “fire pit.” Home Depot is likely capturing something just north of 0% of this natural search market right now. Obtaining even a modest 5% click-through rate on this market would produce 20,000 qualified searchers. Converting 3% at $200 average order value equals $120,000 of incremental revenue per month.
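The arithmetic behind those figures is straightforward:

```python
# Back-of-envelope check of the revenue estimate quoted above.
monthly_searches = 400_000
clicks = monthly_searches * 0.05   # 5% click-through rate -> 20,000 searchers
orders = clicks * 0.03             # 3% conversion rate -> 600 orders
revenue = orders * 200             # $200 average order value

print(clicks, orders, revenue)     # 20000.0 600.0 120000.0
```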

Learning to crawl

Breaking into these keyword markets requires cleaning up the search engine spam produced by faceted navigation systems like Home Depot’s. This involves re-engineering the site pages to retain the great usability benefits while maximizing the quantity and relevance of unique landing pages available. You have to decide which URLs to suppress and which to promote for search engines.

The best way to understand the impact faceted navigation will have on your URLs is to crawl the site while it is still in the testing environment. The scale of these crawl leaks is nearly impossible to predict without a crawl.

Crawling the site before launch lets you analyze how many URLs are created and what form they will take. Be warned, though: A crawl for a site that contains faceted navigation across 10,000 product SKUs is likely to number in the hundreds of thousands of URLs.

Analyzing such a high volume of URLs requires patience to tease out the patterns and determine which URLs should be made dominant. Is the same title tag duplicated across multiple pages? Is the same product ID number present in many different URLs?

Look for patterns like these to find potential areas of content duplication. URL crawl frequency should also be a deciding factor, as higher frequency indicates more inbound linkages, and therefore a page more likely to be dominant. Once you identify the dominant pages for each category, subcategory and product page, decide which tactics to use to suppress the nondominant pages for natural search while continuing to enable them for human users.
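A pattern analysis like the one described can be sketched as follows. The crawl data, URL scheme and product-ID convention here are invented for illustration:

```python
# Sketch (hypothetical crawl output): group crawled URLs by <title> and by
# product ID to surface duplication patterns and candidate dominant pages.
import re
from collections import defaultdict

# (url, title) pairs as a site crawler might emit them
crawl = [
    ("/outdoor/fire-pits/p123?brand=catalina", "Celestial Cauldron Fireplace"),
    ("/outdoor/fire-pits/p123?price=100-200",  "Celestial Cauldron Fireplace"),
    ("/seasonal/patio/p123",                   "Celestial Cauldron Fireplace"),
    ("/outdoor/fire-pits",                     "Fire Pits"),
]

by_title = defaultdict(list)
by_product = defaultdict(list)
for url, title in crawl:
    by_title[title].append(url)
    m = re.search(r"/p(\d+)", url)      # product ID embedded in the path
    if m:
        by_product[m.group(1)].append(url)

# Any title or product ID seen at more than one URL is a duplication candidate
dupes = {t: us for t, us in by_title.items() if len(us) > 1}
print(dupes)
```

On a real crawl you would feed in hundreds of thousands of rows and add crawl frequency per URL to help pick the dominant version of each page.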

Research the words used to create the faceted navigation scheme during the design phase as well, because they define the keyword theme of the resulting landing pages. Home Depot’s “Outdoor” category page lists “Fire Pits” as the link to the subcategory page, yet 250% more searches are conducted for “fire pit” (singular) than for “fire pits” (plural).

You should also evaluate faceted filter links that are not beneficial to natural search performance (such as price), and make these landing pages uncrawlable for bots. You can do this using a combination of rel=&#8220;nofollow&#8221; link attributes, meta robots noindex page tags and/or robots.txt rules. But it can be quite tedious work to change each page. Often algorithmic rules need to be developed by IT to save time deploying — and then monitoring — these changes.
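For illustration, the three suppression mechanisms might look like this. The URLs and parameter names are hypothetical, and note that wildcard robots.txt rules are honored by Googlebot but not by every crawler:

```
1. rel="nofollow" on the facet link, telling bots not to follow it (HTML):
   <a href="/fire-pits?price=100-200" rel="nofollow">$100 - $200</a>

2. meta robots noindex in the <head> of the filtered page (HTML):
   <meta name="robots" content="noindex, follow">

3. robots.txt rule blocking price-filter URLs from being crawled at all:
   User-agent: *
   Disallow: /*?price=
```

Each mechanism acts at a different stage: nofollow discourages discovery, noindex allows crawling but blocks indexing, and robots.txt blocks crawling outright.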

If you have launched faceted navigation without a focus on plugging crawl leaks, your cleanup job will be more difficult and time consuming to manage, but not impossible. Tactics like disallows and meta noindexing can take up to 12 months to achieve desired de-indexation effects, and all these steps need to be actively monitored for effectiveness. Or you can outsource this work and deploy a new set of landing pages that apply the needed changes automatically for search engines and their users, while leaving the faceted navigation structure as is for all other site users.

During the cleanup period, consistent tracking of basic natural search data is critical to determine progress. The team should track indexation of the critical URL types across all major engines, backlink data, PageRank flow, natural search referred traffic, and natural search referred sales. More advanced KPIs should be monitored as well.

Look for a decrease in the total number of URL types crawled and indexed over a comparable timeframe, before and after. Naturally, if you do not see a decrease, you need to be able to segment URLs crawled by Googlebot, or other bots, to diagnose further crawl leak issues.
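One way to segment URLs crawled by Googlebot is to mine your server access logs. A minimal sketch, assuming a simple (user agent, path) extract from those logs:

```python
# Sketch (hypothetical log extract): count Googlebot requests per filter
# parameter to find facets that are still leaking crawl equity.
from collections import Counter
from urllib.parse import urlparse, parse_qs

log = [  # (user_agent, requested_path) pairs pulled from access logs
    ("Googlebot/2.1", "/outdoor/fire-pits?price=100-200"),
    ("Googlebot/2.1", "/outdoor/fire-pits?price=200-300"),
    ("Googlebot/2.1", "/outdoor/fire-pits?brand=catalina"),
    ("Mozilla/5.0",   "/outdoor/fire-pits?price=100-200"),  # human, ignored
]

leaks = Counter()
for agent, path in log:
    if "Googlebot" not in agent:
        continue
    params = parse_qs(urlparse(path).query)
    for name in params:              # count bot hits per filter parameter
        leaks[name] += 1

print(leaks.most_common())           # facets Googlebot is still crawling
```

In this toy extract the price facet is the biggest remaining leak; the same counting approach works for other bots by switching the user-agent filter.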

Once you remove duplication, you should see all your remaining landing pages crawled with more frequency than before; a greater quantity of your remaining landing pages driving search traffic; and more traffic per page as a result of higher rankings of each page in relevant keyword markets. That means significant increases in brand visibility and higher margin sales.

You can see that the process of eliminating crawl leaks and duplication induced by faceted navigation is advanced work. But done right, you can make faceted navigation work harder for you to produce not only a stronger natural search presence, but increased online sales as well.

Brian Klais (brian@netconcepts.com) is executive vice president, search, for natural search marketing firm Netconcepts.