Best Practices in Data Mining: The Second Five Commandments

May 01, 2007 9:30 PM  By Jim Wheaton

Data mining is enhanced, often dramatically, when the source data are improved. The ultimate goal is for data mining to be performed off a platform that we at Wheaton Group refer to as best-practices marketing database content. This, in turn, supports deep insight into the behavior patterns that form the foundation for data-driven decision-making.

Best-practices marketing database content provides a consolidated view of all customers and inquirers across all channels. The complete history of transactional detail must be captured. Everything within reason must be kept, even if its value is not immediately apparent.

There are 10 commandments that, if followed, will ensure best-practices marketing database content. In the February issue of Multichannel Merchant (see Best Practices in Data Mining: The First Five Commandments of Database Content Management), I discussed the first five:

Number 1: The data must be maintained at the atomic level.

Number 2: The data must not be archived or deleted except under rare circumstances.

Number 3: The data must be time-stamped.

Number 4: The semantics of the data must be consistent and accurate.

Number 5: The data must not be overwritten.

The following are the balance of the Ten Commandments:

NUMBER 6: Postdemand transaction activity must be kept.

Postdemand transaction activity can include cancellations, rebates, refunds, returns, exchanges, allowances, and write-offs. Tracking it is essential for exercises such as attrition modeling, a common data-mining application that identifies customers who are unlikely to purchase again without remedial action. After all, customers who are disappointed by unavailable, ill-fitting, or damaged merchandise will be less likely to purchase in the future.

The capture of postdemand activity is particularly important in sectors such as high-fashion women’s apparel, where return rates can be as high as 40%. Customers with similar gross purchase volume often have very different return rates. This, in turn, can make the difference between a profitable customer and one who will eat away at your margins. It makes sense for predictive models to take such discrepancies into account.
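
As a rough illustration of why this matters, the sketch below compares two customers with identical gross demand but very different return rates. The class name, field names, and dollar figures are hypothetical and are not drawn from any real database.

```python
from dataclasses import dataclass


@dataclass
class CustomerSales:
    customer_id: str
    gross_demand: float  # total value ordered, before any returns
    returned: float      # value of merchandise sent back

    @property
    def return_rate(self) -> float:
        # Guard against customers with no demand on file.
        return self.returned / self.gross_demand if self.gross_demand else 0.0

    @property
    def net_sales(self) -> float:
        return self.gross_demand - self.returned


# Hypothetical figures: same gross volume, very different outcomes.
customers = [
    CustomerSales("A", 1000.0, 50.0),
    CustomerSales("B", 1000.0, 400.0),
]

for c in customers:
    print(c.customer_id, f"{c.return_rate:.0%}", c.net_sales)
```

A model that sees only gross demand would treat these two customers as equivalent. One that also sees return activity would not.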

Tracking postdemand transactions can be a challenge because it requires the transactions to be retained by the underlying operational systems that feed the marketing database. Unfortunately, many operational systems are not equipped for this task, and postdemand transactions vanish after a change in shipping status. For example, a “backorder” status will disappear once the corresponding item has been shipped.

Why is this problematic? Assume that an operational system feeds a marketing database update process on the first and 15th of every month. Also assume that a backorder is generated on June 2 and that the corresponding shipment takes place on June 14. The customer had to wait 12 days for the merchandise to ship, which certainly is not ideal. If the operational system does not retain backorder statuses, then the June 1 and June 15 “snapshots” that feed the marketing database will fail to reflect the 12-day wait. With only the June 14 shipment reflected, an important aspect of the customer relationship will have been lost.
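
The June example can be sketched in a few lines. This assumes a hypothetical list of dated status events for a single order line, and the year 2007 is chosen only for illustration. The sketch contrasts what a twice-monthly snapshot feed would see with what retained event history supports.

```python
from datetime import date

# Hypothetical status-change events that the operational system would
# need to retain for one order line.
events = [
    (date(2007, 6, 2), "backordered"),
    (date(2007, 6, 14), "shipped"),
]

# The marketing database is fed on the 1st and 15th.
snapshot_dates = [date(2007, 6, 1), date(2007, 6, 15)]


def status_as_of(events, as_of):
    """Return the latest status on or before as_of, or None if none exists yet."""
    current = None
    for event_date, status in sorted(events):
        if event_date <= as_of:
            current = status
    return current


# What a snapshot-only feed sees: nothing on June 1, "shipped" on June 15.
snapshot_view = [status_as_of(events, d) for d in snapshot_dates]

# What retained event history supports: the length of the wait.
backorder_start = next(d for d, s in events if s == "backordered")
ship_date = next(d for d, s in events if s == "shipped")
wait_days = (ship_date - backorder_start).days

print(snapshot_view)  # [None, 'shipped']
print(wait_days)      # 12
```

The backorder never appears in either snapshot, so the 12-day wait can be recovered only if the events themselves are kept.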

NUMBER 7: Ship-to/bill-to linkages must be maintained.

Often these correspond to gift giver/recipient relationships. Ship-to/bill-to linkages allow targeted promotions to extend the customer universe beyond those who made the original purchase. Savvy database marketers look upon giftees as qualified prospects. In this way, you can use your customer database to drive targeted prospecting promotions, often with formal data-mining techniques.
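
A minimal sketch of how such linkages might be used, assuming hypothetical order records that carry both a bill-to and a ship-to customer identifier: it lists gift recipients who have never purchased themselves, along with the customers who sent them gifts.

```python
# Hypothetical order records; the field names are assumptions.
orders = [
    {"order_id": 1, "bill_to": "C100", "ship_to": "C100"},
    {"order_id": 2, "bill_to": "C100", "ship_to": "C200"},
    {"order_id": 3, "bill_to": "C300", "ship_to": "C200"},
    {"order_id": 4, "bill_to": "C300", "ship_to": "C400"},
]

# Anyone who has ever paid for an order counts as a buyer.
buyers = {o["bill_to"] for o in orders}


def giftee_prospects(orders, buyers):
    """Map each recipient who has never purchased to the set of gift givers."""
    prospects = {}
    for o in orders:
        is_gift = o["ship_to"] != o["bill_to"]
        if is_gift and o["ship_to"] not in buyers:
            prospects.setdefault(o["ship_to"], set()).add(o["bill_to"])
    return prospects


prospects = giftee_prospects(orders, buyers)
print(prospects)
```

Here C200 has received gifts from two different customers and C400 from one. Neither has ordered, so both are candidates for a prospecting promotion.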

NUMBER 8: All promotional history must be kept.

You must retain every promotional contact in every available channel. This is necessary to rapidly and accurately create the past-point-in-time “views” required for most data-mining projects, including predictive models. For multidivisional businesses, and especially those that have acquired other companies, it is important to appropriately handle different coding practices.
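
To make the idea of a past-point-in-time view concrete, here is a small sketch that counts the promotions each customer had received before a chosen as-of date. The records, field names, and dates are hypothetical.

```python
from datetime import date

# Hypothetical promotion-history records across channels.
promotions = [
    {"customer_id": "A", "channel": "catalog", "sent": date(2006, 9, 1)},
    {"customer_id": "A", "channel": "email", "sent": date(2006, 11, 15)},
    {"customer_id": "B", "channel": "catalog", "sent": date(2006, 9, 1)},
    {"customer_id": "A", "channel": "catalog", "sent": date(2007, 1, 10)},
]


def promo_counts_as_of(promotions, as_of):
    """Count promotions sent to each customer strictly before as_of."""
    counts = {}
    for p in promotions:
        if p["sent"] < as_of:
            counts[p["customer_id"]] = counts.get(p["customer_id"], 0) + 1
    return counts


print(promo_counts_as_of(promotions, date(2007, 1, 1)))  # {'A': 2, 'B': 1}
```

Because every contact is date-stamped and retained, the same function can rebuild the view for any earlier date without special processing.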

One multibillion-dollar retailer with a substantial catalog/e-commerce division learned the hard way the importance of retaining promotion history. Although the company spends seven figures a year on its CRM system, the underlying marketing database does not contain promotion history. As a result, most data-mining projects take a week longer than they should because of the extraneous processing required to overcome the lack of promotion history when creating analysis files.

NUMBER 9: Proper linkages across multiple database levels must be maintained.

For business-to-consumer companies, individuals must be properly linked to households. For business-to-business and business-to-institution marketers, individuals must be linked to sites, and sites to organizations. This makes it possible to calculate accurate performance metrics such as promotional financials and to understand the true nature of multibuyers.

Such links also enable the tracking of pass-along response and the implementation of innovative targeting programs. For example, b-to-b direct marketers can monitor contract compliance across multiple sites within large client organizations. Say that discounted pricing is predicated on purchases not being made from the competition. By maintaining data across multiple levels, you could identify any sites within client organizations that have not received any mission-critical merchandise; such sites might not be complying with the contract.
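
The compliance check described above might look something like the following sketch. The site-to-organization mapping, the purchase records, and the "mission_critical" category label are all hypothetical.

```python
# Hypothetical linkage of sites to their parent organizations.
site_to_org = {"S1": "Org1", "S2": "Org1", "S3": "Org1", "S4": "Org2"}

# Hypothetical purchase records: (site, merchandise category).
purchases = [
    ("S1", "mission_critical"),
    ("S2", "other"),
    ("S4", "mission_critical"),
]


def sites_without_category(site_to_org, purchases, org, category):
    """List an organization's sites that have no purchases in the category."""
    buying_sites = {site for site, cat in purchases if cat == category}
    return sorted(
        site
        for site, parent in site_to_org.items()
        if parent == org and site not in buying_sites
    )


print(sites_without_category(site_to_org, purchases, "Org1", "mission_critical"))
# ['S2', 'S3']
```

Without the site-to-organization link, S2 and S3 could not be identified as belonging to the same contract as S1.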

NUMBER 10: Overlay data must be included, as appropriate.

For b-to-c, overlay data can be appended to create a complete view of customers, inquirers, and, when applicable, prospects. One form of b-to-c overlay data is demographics for existing individuals and households on the marketing database, including date of birth, age, gender, marital status, and presence of children. Another is the identity of additional adults within households on the database, along with their corresponding individual-level demographics.

For b-to-b and b-to-i, “firmagraphics” can be added to create a complete view of customers, inquirers, sites, and organizations. These data include SIC or NAICS code, number of employees, and revenue. Also, additional individuals can be appended to sites that are resident on the database, and additional sites to organizations.

One primary data-mining application is the creation of profiles to paint a picture of customers and inquirers. The possibilities go far beyond that, however, and are limited only by the imagination. For example, you can use date-of-birth data to support birthday offers, targeting customers with upcoming birthdays with offers of special savings to “treat themselves.” You can also use the information to promote gifts to significant others within the households.
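
As one possible sketch of the birthday example, the code below selects customers whose next birthday falls within a given number of days. The customer records, the 30-day window, and the choice to treat a February 29 birthday as March 1 in non-leap years are all assumptions.

```python
from datetime import date, timedelta


def next_birthday(dob, today):
    """Return the next occurrence of the birthday on or after today."""
    def birthday_in(year):
        try:
            return dob.replace(year=year)
        except ValueError:
            # Assumption: a Feb 29 birthday is observed on Mar 1 in non-leap years.
            return date(year, 3, 1)

    candidate = birthday_in(today.year)
    return candidate if candidate >= today else birthday_in(today.year + 1)


def upcoming_birthdays(customers, today, window_days=30):
    """List customer IDs whose next birthday is within window_days of today."""
    cutoff = today + timedelta(days=window_days)
    return [
        cid for cid, dob in customers.items()
        if next_birthday(dob, today) <= cutoff
    ]


# Hypothetical date-of-birth overlay data.
customers = {
    "A": date(1970, 5, 20),
    "B": date(1985, 12, 25),
    "C": date(1962, 1, 5),
}

print(upcoming_birthdays(customers, date(2007, 5, 1)))    # ['A']
print(upcoming_birthdays(customers, date(2007, 12, 20)))  # ['B', 'C']
```

The second call shows why the year boundary needs handling: a January birthday is "upcoming" in late December.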

Consider whether you are working with best-practices marketing database content. The extent to which you are not is the extent to which you are artificially limiting the size of your firm’s revenue and profits. Also consider what methods you might use to improve database content by enhancing the functionality of your operational systems. There are all sorts of ways to do this. But that is the topic of a future article.


Jim Wheaton is a cofounder/principal at Wheaton Group, a Chicago-based data management, data mining, and decision sciences practice. The firm also offers full list processing capabilities through its Daystar Wheaton Group affiliate.

A CASE STUDY OF WHAT NOT TO DO

Last year Wheaton Group was approached about a data-mining project by a well-known gift-oriented, multibillion-dollar retail and direct marketing company that has been in decline. It soon became apparent that the firm’s marketing database content would support neither the project nor any other form of meaningful data mining.

The company archives its data after 36 months in a way that makes the information difficult to resurrect. Some portions of the database are maintained at the surname level and others at the individual level. For surname-level database records, only one individual’s identity is retained. This means that if a husband orders the first time, and the wife orders five subsequent times, the database will reflect six orders from the husband. This is particularly problematic for a gifts business. To complicate matters, the database does not track bill-to/ship-to linkages, nor does it contain gender codes.

Compounding the issue, the acquisition source is often inaccurate, which makes many worthwhile analyses, such as long-term value, problematic to say the least. Also, merchandise coding discipline does not exist, the Website does not allow source codes to be entered, and customer records generally do not reflect postdemand transactions such as merchandise returns.

Promotion history is essentially unusable because the database tracks massive amounts of “spurious” activity — for example, “event occurrences” such as records that have been sent to the service bureau for National Change of Address (NCOA) processing. There are significant problems with tying promotion history to specific names and addresses, and e-mail promotions are not tracked at all.

On the retail side, distance-to-store calculations are based on imprecise zip-to-zip centroids, and they reflect only the nearest store, not the store where the purchase actually took place.

Clearly, unless the company rectifies the appalling state of its marketing database content, it will have little chance of reversing its decline.
JW