On the Right Page for Web Indexing

If you’re a consumer cataloger, you’re probably doing all you can to gain incremental holiday sales. Of course, even if the fourth quarter isn’t your major season, you’re probably doing all you can to gain incremental sales. Getting your entire Website indexed on the major Web search engines is one way to do so.

In recent months, in fact, Google has become much more aggressive in its crawling behavior, going deeper into dynamic, database-driven Websites than ever before. This is good news for online catalogers. CompUSA.com, for example, went from having a few hundred product pages in Google to more than 100,000. RadioShack.com went from fewer than 100 product pages early in the year to 15,000 pages as of late summer.

And Yahoo!’s crawler, in many instances, is even more aggressive than Google’s. CompUSA.com has twice as many pages in Yahoo! as in Google; RadioShack.com has four times as many.

A closer look, however, reveals that some bad news is mixed in with the good. Many of these indexed pages are actually duplicates. The Lands’ End Website, for example, has 26,000 pages indexed. But its top-level category pages (Men, Women, For the Home) are duplicated more than 2,000 times each on account of the session ID — the unique identification code the Web server assigns to each visitor’s browsing session — passed in those URLs.

Even the Google Store, ironically, has suffered this fate in Google, with tens of thousands of copies of the same small handful of pages clogging up Google’s index since late last year. Now that the Google Store has removed the session IDs from its URLs, this problem should soon subside, but countless other sites are in a similar pickle, with Yahoo! as well as with Google.
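
To see how these duplicates pile up, consider that the same category page can be crawled under thousands of URLs that differ only in their session parameter. The rough Python sketch below (the parameter names are hypothetical; your platform will use its own) shows how stripping that parameter collapses the variants back to a single canonical URL:

    from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

    # Hypothetical session parameters; substitute whatever your platform appends.
    SESSION_PARAMS = {"sid", "sessionid", "jsessionid"}

    def canonicalize(url):
        """Strip session-tracking parameters so duplicate URLs collapse into one."""
        parts = urlparse(url)
        kept = [(k, v) for k, v in parse_qsl(parts.query)
                if k.lower() not in SESSION_PARAMS]
        return urlunparse(parts._replace(query=urlencode(kept)))

    # Two crawls of the same category page, each carrying a different session ID...
    urls = [
        "http://www.example.com/category?dept=men&sid=A1B2C3",
        "http://www.example.com/category?dept=men&sid=X9Y8Z7",
    ]
    # ...reduce to one canonical URL.
    print({canonicalize(u) for u in urls})
    # {'http://www.example.com/category?dept=men'}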

In addition, our research on the Catalog Age 100 (Catalog Age’s exclusive ranking of the nation’s largest offline and online catalogers) shows that most of these top catalogers still have less than 20% of their product content indexed. This is much better than the 5% average we found from our research published early last year in the Catalog Age report “The State of Search Engine Marketing 1.0 — Strategies for Successful Cataloging,” but still a far cry from total representation of the catalogers’ product inventory.

The upside available for those who clear the hurdles and get fully indexed is huge. Anecdotal research across our retail clients finds each indexed page to be worth 10-25 visitors a month in incremental search traffic. So say you have 5,000 available products online, but 4,000 of them are not indexed. These hidden pages have “search value” that we would estimate at 40,000-100,000 additional monthly visitors beyond what your site now attracts. What’s more, it’s reasonable to expect a large proportion of these visitors to be new buyers. Northern Tool, after an initiative to unlock more than 10,000 product and category pages that were previously unavailable to crawlers, enjoyed a significant bump in its online sales, with 75% of these search-engine-delivered customers being new-to-file.

URL “fixes”

The purest way to ensure that everything on your Website gets indexed by Google and Yahoo! is to optimize your URLs. Most online catalogs are database-driven and thus have problematic “stop characters” — question marks, ampersands, equal signs — in the URLs of their Web pages. Even worse, a significant percentage of sites embed session IDs, user IDs, and other “flags” in their URLs.

Going back and recoding your site’s programming and architecture to eliminate these problematic elements from your URLs can prove daunting, however. There can be many interdependencies in the programming that will have to be sorted out. For instance, removing a flag such as “photos=1” from a URL could yield an ugly error message on the page until your programmers go in and revise the code.

The ideal option is to consider a “URL rewriting” Web server module, such as mod_rewrite for Apache or ISAPI_Rewrite for Microsoft IIS. Deploying even one of these modules can take months, though, depending on the complexity of your Website’s infrastructure — for in addition to server configuration and “rule writing,” there are page templates throughout the site that must be modified to link to the new URL format. And don’t forget the testing required to make sure everything still works flawlessly.
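
To illustrate the kind of translation a rewrite rule encodes (this is not actual mod_rewrite or ISAPI_Rewrite syntax, just the mapping expressed in Python, with a hypothetical URL pattern and query format), a crawler-friendly product URL gets translated back into the dynamic URL your application actually serves:

    import re

    # Hypothetical clean URL format:   /products/cordless-drill-1234.html
    # Hypothetical dynamic equivalent: /catalog?item=1234&photos=1
    CLEAN_URL = re.compile(r"^/products/(?P<slug>[a-z0-9-]+)-(?P<item>\d+)\.html$")

    def rewrite(path):
        """Translate a stop-character-free URL into its dynamic equivalent."""
        match = CLEAN_URL.match(path)
        if match is None:
            return path  # not a rewritten URL; pass it through unchanged
        return "/catalog?item={}&photos=1".format(match.group("item"))

    print(rewrite("/products/cordless-drill-1234.html"))
    # /catalog?item=1234&photos=1

A rewrite module applies a mapping of this sort at the Web server, before the request ever reaches your catalog application; the page templates then have to be updated to emit the clean format, which is where much of the work lies.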

If commandeering the IT resources to revise your site code is out of the question, particularly with holiday “code freezes” looming, there are other options to consider. Unfortunately none of them are optimal, because they don’t fix the core issue — that your site appears to be fundamentally broken from the standpoint of the search engine crawlers. Nonetheless, one of these options could be a viable stop-gap measure.

One alternative is to publish a comprehensive set of “static” pages that form a frozen-in-time replica of your online catalog that you can feed to the crawlers. These static pages would be given file names devoid of stop characters. But while static pages are generally more crawlable, this approach negates the benefits of a database-driven Website, since the collection of pages in effect becomes a separate site that you must constantly maintain. The moment you publish these pages, the pricing and inventory availability data risk falling out of sync with the information on the online catalog, since static pages are by their very nature not real-time.

In addition, these pages are mere appendages to your Website. As such, they are difficult for people to navigate to, making it highly unlikely that other sites will link to them and impart enough link popularity to influence their ranking. PageRank, Google’s secret sauce, is essentially an importance score for any given page, based on the number of inbound links, with a weighting factor applied to each link. PageRank plays a critical role in determining where a page appears in Google’s search results.

Another possible workaround is to create a massive set of site maps, broken up into many pages with links to all your category pages and product pages. This approach has proven somewhat successful because spiders are more likely to crawl dynamic pages that are linked to from a static page. The reduced maintenance required makes this approach better than creating a collection of static pages. It still is not optimal, though, mainly because this method doesn’t result in high PageRank scores. (Remember, the goal is not just indexation but top rankings that can drive traffic.)

On your Website proper, PageRank gets doled out strategically as a natural consequence of your site’s hierarchical internal linking structure and the way the products you most want to sell are featured on top-level pages. A multipage site map, by contrast, is a haphazard way to pass on PageRank, since key categories and top-selling, high-margin products aren’t typically assigned a larger share of it. Furthermore, links to your products that occur elsewhere on your site, outside of this site map, still aren’t counted as “votes” that increase the PageRank score of those products.
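
For a rough sense of how these “votes” play out, here is a toy PageRank calculation in Python on a hypothetical five-page site, using a simplified form of the published PageRank formula. Real engine scoring involves far more than this, so treat it purely as an illustration of why a product featured on a category page fares better than one reachable only from a site map:

    # Toy PageRank: each page divides its score among the pages it links to.
    # The five-page site below is hypothetical and purely illustrative.
    links = {
        "home":             ["category", "sitemap"],
        "category":         ["featured-product"],                   # a prominent, featured link
        "sitemap":          ["featured-product", "buried-product"], # the site map splits its vote
        "featured-product": ["home"],
        "buried-product":   ["home"],
    }

    DAMPING = 0.85
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}

    for _ in range(50):  # iterate until the scores settle
        new_rank = {p: (1 - DAMPING) / len(pages) for p in pages}
        for page, outlinks in links.items():
            share = DAMPING * rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += share
        rank = new_rank

    for page in sorted(rank, key=rank.get, reverse=True):
        print(f"{page:18s}{rank[page]:.3f}")
    # The featured product, linked from both the category page and the site map,
    # ends up with a noticeably higher score than the product only the site map links to.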

There is a third method, which we call proxy serving, that offers unique benefits over the previously discussed approaches. Proxy serving is a style of hosted URL rewriting that marketers such as Sara Lee Direct, Northern Tool, and Ritz Camera have been experimenting with and seeing promising results from. Proxy serving relies neither on suboptimal static pages nor on draining scarce IT resources to reconfigure the Website. Rather, it acts as a middleman, intercepting pages requested by search spiders and delivering crawler-friendly versions of those pages. When a searcher finds and clicks on a listing for a particular product, the proxy server requests the corresponding page from the native Website in real time and serves this page to the user.

This approach bridges the gap by ensuring that the indexed pages retain the integrity of your site’s internal hierarchical linking structure, resulting in greater PageRank “flow.” And because the pages are served in real time as they are requested rather than frozen in static form, they are always in sync with your online catalog when Google or Yahoo! searchers click through; as a result, those customers receive the latest pricing and inventory availability data.
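
Stripped to its essence, the mechanic looks something like the Python sketch below. It is a deliberately simplified stand-in for a real proxy-serving platform, with hypothetical URL formats and a crude user-agent check; an actual implementation would also rewrite the links inside each page and handle errors, caching, and much more:

    from urllib.request import urlopen

    ORIGIN = "http://shop.example.com"          # hypothetical native, dynamic catalog
    CRAWLER_AGENTS = ("googlebot", "slurp")     # Google's and Yahoo!'s spiders

    def to_dynamic(path):
        """Map a crawler-friendly path back to the dynamic URL on the native site.
        Hypothetical formats:  /p/1234.html  ->  /catalog?item=1234
        """
        if path.startswith("/p/") and path.endswith(".html"):
            item = path[len("/p/"):-len(".html")]
            return ORIGIN + "/catalog?item=" + item
        return ORIGIN + path

    def handle_request(path, user_agent):
        """Fetch the corresponding native page in real time and return its HTML."""
        is_crawler = any(bot in user_agent.lower() for bot in CRAWLER_AGENTS)
        html = urlopen(to_dynamic(path)).read().decode("utf-8", "replace")
        if is_crawler:
            # A real proxy would rewrite internal links to the crawler-friendly
            # format and strip session IDs here (omitted in this sketch).
            pass
        return html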

There is, not surprisingly, a downside to this option too. URLs may not be the only technical force conspiring against your Website’s crawlability. JavaScript-based links and cookie requirements are just two examples of other potential crawler barriers. If Web design elements such as these are baked into your site structure, then even rewriting URLs by proxy will not resolve the crawlability barriers your Website faces. More-drastic measures will have to be taken.

Beyond URLs

What other roadblocks to crawlability do you need to remove besides problematic URLs? Bear in mind that crawlers do not surf your site with cookies, JavaScript, or Java enabled. Nor can they fill in forms or operate drop-down navigation. And they don’t cope particularly well with Macromedia Flash.

Ideally, you should try to keep your navigation free of any of these elements, to ensure optimal accessibility to crawlers. Of course, some online marketers with brand-centric site requirements may still need to use technologies such as these. It is not uncommon for heavily or even mildly branded sites (those of J.C. Penney Co., J. Jill, and Tiffany & Co. come to mind) to use floating JavaScript navigation on their home pages or category pages to help conserve real estate or convey a particular tone. But this makes matters worse for search engines: Not only are the URLs uninviting to crawlers, but the JavaScript menus make the categories inaccessible as well.
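
A quick way to gauge the damage is to look at a page the way a crawler does: fetch the raw HTML without executing any JavaScript or accepting any cookies, and list the links actually present in the markup. The small Python diagnostic below (the URL is a placeholder for one of your own category pages) makes the gap obvious, since links generated by JavaScript menus at run time simply won’t appear:

    from html.parser import HTMLParser
    from urllib.request import urlopen

    class LinkCollector(HTMLParser):
        """Collects href values from plain <a> tags: roughly what a crawler sees.
        Links produced by JavaScript at run time never show up in this list."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    # Placeholder URL; point this at one of your own category pages.
    html = urlopen("http://www.example.com/").read().decode("utf-8", "replace")
    parser = LinkCollector()
    parser.feed(html)
    for link in parser.links:
        print(link)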

Even in these cases, though, there are workarounds that enable you to retain the branding while simultaneously reaching out to an incremental market of searchers — without compromising or gutting the look and feel, as many marketers fear. For example, by slightly recoding its JavaScript rollover menus, RedEnvelope.com could make sure those links are accessible to crawlers.

With the holidays around the corner, it may pay to have an expert conduct a search engine optimization audit of your Website to identify any such uncovered opportunities. It could mean the difference between a bah-humbug season and a very merry one.


Stephan Spencer is founder/president of Netconcepts, a Madison, WI-based Web marketing agency that offers search optimization services. Brian Klais is Netconcepts’ vice president of eBusiness Services.