The nonpredictive part of predictive modeling
Nov 1, 1999 12:00 PM, Jim Wheaton
The bulk of predictive modeling doesn't involve statistics, but rather
research, analysis, and implementation
Some catalogers may be intimidated by the techniques required to build a
statistics-based predictive model. But actually generating the predictive
model - that is, creating the scoring equation - makes up only about 10% of
the entire six-step process. The remaining 90% encompasses the
nonpredictive part of predictive modeling: developing a sound research
design, creating accurate analysis files, performing careful exploratory
data analysis, implementing the model, and creating ongoing quality control
procedures.
While many database articles focus on the mathematics-intensive predictive
modeling part - step #4 - the rest of the process is just as, if not more,
important. Regardless of the modeling technique that you use, if you stint
on any of the nonpredictive steps, you'll probably end up with a model that
does not perform well.
Step #1: Developing a research design
In developing a research design, you're actually coming up with a realistic
goal and a practical strategy for attaining it. A sound research design
encompasses five components:
- a solvable problem
- representative mailings
- an optimal dependent variable (the behavior that the model is trying to
predict, such as response or sales)
- identification of selection bias (factors that can misleadingly skew the
results)
- an appropriate modeling universe.
The first component may seem obvious, but many companies look to a
predictive model to solve a problem that no model could ever solve. One
catalog client wanted to use modeling to double the overall response rate
of rental lists while maintaining the size of its prospecting mailings - a
task that probably would require a marketing revolution, not a model.
Although it's possible to build a predictive model to find segments of
rental lists that perform at twice the average rate, these segments will
generally represent just a minor portion of the total universe.
After identifying a solvable problem, you must then select a subset of
representative past promotions for the analysis file. Even under ideal
circumstances, it's challenging to predict future behavior by examining the
past. At the very least, you must work with a "typical" past that is
expected to be similar to the future, and not an unrepresentative
historical period. For example, a fundraising mailing for a gun control
organization or a National Rifle Association promotion that coincided with
the Columbine High School tragedy in Colorado this past April would not be
a good candidate for a predictive model, because response patterns during
this highly emotional time are not likely to sustain themselves into the
future.
It's also important to determine the optimal dependent variable. The most
common factors catalogers try to predict with models are response and
sales, both of which can be further broken down into gross vs. net. And
sometimes profit is the target. But there are no hard-and-fast rules about
which dependent variable to use - the circumstances of your business and
associated goals should determine the appropriate variable. Also, good
old-fashioned testing can help. You might try building two or three types
of models off the same data set using different dependent variables.
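As a rough illustration of what "different dependent variables" can mean in practice, the sketch below derives four candidate targets from the same promotion record. The field names are hypothetical, and it assumes that "net" means gross sales minus returns and cancellations; your own business may define net differently.

```python
from dataclasses import dataclass


@dataclass
class PromotionResult:
    """One mailed customer. All field names are hypothetical."""
    customer_id: str
    gross_sales: float  # dollars ordered in response to the mailing
    returns: float      # dollars later returned or cancelled


def dependent_variables(result: PromotionResult) -> dict:
    """Derive candidate dependent variables from one promotion record."""
    # Assumption: net sales = gross sales minus returns and cancellations.
    net_sales = result.gross_sales - result.returns
    return {
        "gross_response": int(result.gross_sales > 0),
        "net_response": int(net_sales > 0),
        "gross_sales": result.gross_sales,
        "net_sales": net_sales,
    }


# Made-up records, for illustration only.
results = [
    PromotionResult("A", 120.0, 0.0),   # ordered and kept everything
    PromotionResult("B", 80.0, 80.0),   # ordered, then returned everything
    PromotionResult("C", 0.0, 0.0),     # did not respond
]
targets = {r.customer_id: dependent_variables(r) for r in results}
```

Customer B counts as a responder on a gross basis but not on a net basis, which is the kind of difference that can lead two models built off the same data set to rank customers differently.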
Be forewarned that an otherwise well-constructed predictive model can fall
victim to selection bias. Any model built off a heavily prescreened group
of promotions and then put into production without the same prescreen risks
failing. For example, a women's careerwear cataloger that has identified
its target audience as female yuppies may always screen males from prospect
mailings to maximize response, so any resulting model will not evaluate
gender. If the model is then implemented without a gender screen,
there will be nothing to prevent it from identifying male yuppies as
excellent prospects.
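One way to guard against this is to make the prescreen part of the selection logic itself, so the model can never be run without it. The sketch below is only an illustration: the prospect fields, the toy scoring function, and the cutoff are all hypothetical stand-ins.

```python
def passes_prescreen(prospect: dict) -> bool:
    """Hypothetical screen mirroring the one applied to the modeled mailings."""
    return prospect.get("gender") == "F"


def select_for_mailing(prospects, score_fn, cutoff):
    """Keep prospects that pass the prescreen AND score at or above the cutoff."""
    return [p for p in prospects if passes_prescreen(p) and score_fn(p) >= cutoff]


def toy_score(prospect: dict) -> float:
    """Stand-in for a real scoring equation; note that it ignores gender."""
    return prospect["income"] / 100


# Made-up prospects, for illustration only.
prospects = [
    {"id": 1, "gender": "F", "income": 90},
    {"id": 2, "gender": "M", "income": 95},  # scores well, but was never in the modeled universe
    {"id": 3, "gender": "F", "income": 40},
]
mailed = select_for_mailing(prospects, toy_score, cutoff=0.8)
mailed_ids = [p["id"] for p in mailed]
```

Without the screen, prospect 2 would be the highest-scoring name on the list, even though the model never saw anyone like him.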
And finally, you must arrive at the appropriate modeling universe. It often
makes sense to create subset universes and build multiple, specialized
models, such as splitting multibuyers from single buyers. Many of the
variables that are likely to drive a multibuyer model do not apply to
single buyers, such as the length of time between the first and second
orders.
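The split described above might look something like the following sketch, which separates single buyers from multibuyers and computes the first-to-second-order gap only where it exists. The record layout is hypothetical.

```python
from datetime import date


def split_universe(customers):
    """Split customers into (single buyers, multibuyers) by number of orders."""
    single, multi = [], []
    for customer in customers:
        if len(customer["order_dates"]) >= 2:
            multi.append(customer)
        else:
            single.append(customer)
    return single, multi


def days_first_to_second(customer) -> int:
    """Days between the first and second orders; defined for multibuyers only."""
    first, second = sorted(customer["order_dates"])[:2]
    return (second - first).days


# Made-up customers, for illustration only.
customers = [
    {"id": "A", "order_dates": [date(1999, 1, 10)]},
    {"id": "B", "order_dates": [date(1999, 3, 1), date(1998, 11, 1)]},
]
single, multi = split_universe(customers)
gaps = {c["id"]: days_first_to_second(c) for c in multi}
```

Because the gap variable has no meaning for customer A, forcing both groups into one model would require an artificial placeholder value for every single buyer.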
Step #2: Creating analysis files
You must be sure that the analysis file is accurate, because the complex
process of appending response information to the promotion history files
can often render an otherwise perfect research design worthless. It's also
possible that the underlying database can be flawed. For instance, a
catalog/retail client decided to build a point-of-sale retail database
using reverse-phone-number look-up. Checkout counter employees asked
customers to supply their telephone numbers, which were then cross-checked
against a digital directory to identify the name and address corresponding
to the phone number. This way, information about the items purchased could
be tracked to specific individuals and households, and a robust historical
database constructed.
Although the average order size for the retail side of the business was
about $80, the data also showed that a small but significant number of
customers had orders totaling several thousand dollars.
But before the client could envision ways to leverage these superbuyers,
research from the analysts revealed that most were hardly super, and many
were not even buyers. Certain sales clerks resented having to request a
phone number from every customer. Some minimized this obligation by
recording the phone numbers of each day's initial 10-15 customers and then
recycling them for all subsequent customers; other clerks entered their own
phone numbers and those of their friends; and some did the same with random
numbers from the phone book. These strategies generated "pseudobuyers"
rather than superbuyers. A predictive model that included such observations
would be far from optimal. After all, the oldest rule of database marketing
is "garbage in, garbage out."
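A simple screen for this kind of problem is to count how many transactions each phone number accounts for on a single day and to flag the implausible ones for review. The sketch below assumes a hypothetical transaction layout and an arbitrary threshold of three purchases per day; the right threshold would depend on your own business.

```python
from collections import Counter


def suspicious_phones(transactions, max_per_day=3):
    """Return phone numbers with more than max_per_day transactions on any one day."""
    counts = Counter((t["phone"], t["date"]) for t in transactions)
    return sorted({phone for (phone, _day), n in counts.items() if n > max_per_day})


# Made-up transactions, for illustration only.
transactions = (
    [{"phone": "555-0100", "date": "1999-06-01"}] * 5   # one number recycled five times
    + [{"phone": "555-0199", "date": "1999-06-01"}]
)
flagged = suspicious_phones(transactions)
```

A flagged number is a lead for further investigation, not proof of a bad record; a legitimate customer can occasionally make several purchases in a day.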
Step #3: Exploratory data analysis
Even with a sound research design and accurate analysis files, the work
involved in building a model has only just begun. While all of the
predictive modeling software packages on the market are able to recognize
patterns within the data, it's hard to identify those patterns that make
sense and are likely to hold up over time.
Exploratory data analysis will help you determine whether the relationships
among the potential predictors (that is, the historical factors you're
considering including in the model, such as RFM and overlay demographics)
and the dependent variable make ongoing sense. But make sure that only the
relevant potential predictors end up in the final model.
A good analyst will capture the underlying dynamics of the business being
modeled, a process that involves defining potential predictors that are
permutations of fields within the database. An example might be the total
number of orders divided by the number of months on file.
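The example predictor above is simple to compute. The sketch below is one possible version; treating customers with less than one month on file as having one month is an assumption made here to avoid dividing by zero, not a rule from the modeling process itself.

```python
def orders_per_month(total_orders: int, months_on_file: int) -> float:
    """Total number of orders divided by the number of months on file."""
    # Assumption: brand-new customers are treated as having one month on file.
    return total_orders / max(months_on_file, 1)


rate_established = orders_per_month(6, 24)  # six orders over two years
rate_new = orders_per_month(1, 0)           # one order, just added to the file
```

A derived field like this lets the model compare customers with very different tenures, which the raw order count alone cannot do.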
You must also identify and either eliminate or control any errors,
outliers, and anomalies. Errors are data that don't reflect reality;
outliers are real but atypical behaviors, such as a $10,000 order in a
business that otherwise averages $80 orders; and anomalies are real but
unusual behaviors caused by atypical circumstances, such as poor response
due to call center problems. (For more on exploratory data analysis, see
"Data detectives," May 1998 issue.)
Step #5: Deploying the model
A model is worthless if you can't accurately deploy it in a live
environment, but database formats can change between the time that you
build the model and when you deploy it. It's also common that the values
within a given field will be altered. Say you have a department field in a
retail customer model. If the value of "06" corresponds to "sporting goods"
on the analysis file but to "jewelry" on the database, then the model is
not likely to succeed.
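A check for this kind of drift can be as basic as comparing the code table used when the analysis file was built with the one currently in production. In the sketch below, the code tables are hypothetical, apart from the "06" example taken from the text.

```python
def changed_codes(analysis_codes: dict, production_codes: dict) -> dict:
    """Return codes whose meaning differs (or is missing) between the two tables."""
    changed = {}
    for code in analysis_codes.keys() | production_codes.keys():
        before = analysis_codes.get(code)
        after = production_codes.get(code)
        if before != after:
            changed[code] = (before, after)
    return changed


# "07" and its meaning are made up for illustration.
analysis_codes = {"06": "sporting goods", "07": "housewares"}
production_codes = {"06": "jewelry", "07": "housewares"}
mismatches = changed_codes(analysis_codes, production_codes)
```

Any non-empty result is a reason to stop and investigate before scoring the database.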
To gain confidence in a model, you should:
- deploy it on an appropriate universe
- sort the corresponding individuals from highest to lowest performance as
predicted by the model, that is, from highest to lowest model score
- divide these sorted individuals into units of equal size (often 10 units,
called deciles).
[Chart: profile of the model deciles by recency, average number of orders,
and average total dollars]
The chart above makes intuitive sense in that the best-performing decile
displays the most recent buyers on average, as well as the highest average
number of orders and total dollars. Patterns across the other nine deciles
also make sense, with buyers becoming less recent, and average orders as
well as average dollars declining consistently.
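The sort-and-divide procedure can be sketched in a few lines. The example below ranks customers by model score, cuts them into 10 equal units with unit 1 holding the highest scores, and then profiles each unit on recency. The scores and recency values are made up, and numbering the best unit as 1 is a convention assumed here.

```python
def assign_units(scored, n_units=10):
    """Map customer id -> unit number, where unit 1 holds the highest scores.

    scored is a list of (customer_id, model_score) pairs.
    """
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    n = len(ranked)
    return {cid: (i * n_units) // n + 1 for i, (cid, _score) in enumerate(ranked)}


def profile_units(units, values):
    """Average of a field (e.g. months since last order) within each unit."""
    grouped = {}
    for cid, unit in units.items():
        grouped.setdefault(unit, []).append(values[cid])
    return {unit: sum(v) / len(v) for unit, v in sorted(grouped.items())}


# Made-up data: 20 customers whose scores fall as recency worsens.
scored = [(cid, 100 - cid) for cid in range(20)]
months_since_last_order = {cid: cid + 1 for cid in range(20)}

units = assign_units(scored)
recency_profile = profile_units(units, months_since_last_order)
```

In this toy data the top unit averages 1.5 months since the last order and the bottom unit 19.5, the kind of steady progression you would hope to see in a real profile.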
Step #6: Creating ongoing quality-control procedures
This step is an extension of the previous one. A model can be deployed for
as long as several years, so you have to establish quality-control
procedures to ensure that it continues to function the way the analyst
intended.
For effective quality control, you should create profiles of the model
units every time you deploy the model in a live environment, and compare
the profiles with the original analysis file profiles. If you don't see
consistency over time, there may have been changes within the database,
subtle or otherwise, that require further detective work to unravel. Such
inconsistencies raise a red flag that can help identify potential
difficulties before they become problematic.
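The profile comparison described above can be automated in a basic way. The sketch below flags any model unit whose current average differs from the original analysis file average by more than a chosen relative tolerance. The 25% tolerance and the profile values are hypothetical and would need to be tuned to your own data.

```python
def profile_drift(baseline, current, tolerance=0.25):
    """Return units whose current average strays from the baseline.

    baseline and current map unit number -> average value of one field.
    A unit is flagged if it is missing from current, or if it differs from
    the baseline by more than tolerance (as a fraction of the baseline).
    """
    flagged = {}
    for unit, base in baseline.items():
        now = current.get(unit)
        if now is None or abs(now - base) > tolerance * abs(base):
            flagged[unit] = (base, now)
    return flagged


# Made-up average months since last order, by model unit.
analysis_file_profile = {1: 2.0, 2: 4.0, 3: 7.0}
live_profile = {1: 2.1, 2: 4.2, 3: 12.0}
drifted = profile_drift(analysis_file_profile, live_profile)
```

Here only unit 3 is flagged; the small movements in units 1 and 2 fall within the tolerance.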
As an extreme example of the costly mistakes that can occur without quality
control, consider a customer predictive model that was built for a
cataloger by an independent analytic consulting firm. The model was
forwarded to the service bureau that maintained the cataloger's database,
accompanied by the instruction to "pull off the top four deciles." No
quality-control procedures were included. The service bureau was mindful of
industry standards and proceeded to select deciles 1 to 4 for the promotion.
The results were abysmal and came close to putting the client out of
business. A post-mortem uncovered the embarrassing fact that the analytic
shop, contrary to industry standards, had labeled its best-to-worst deciles
from 10 to 1. As a result, the four worst deciles had actually been mailed.
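A labeling mismatch of this kind is the sort of thing a very small check could surface. The sketch below assumes that the best unit should contain the most recent buyers and that the best unit is expected to carry the lowest number; the recency figures are invented for illustration.

```python
def units_look_reversed(months_since_last_order_by_unit) -> bool:
    """True if the lowest-numbered unit has LESS recent buyers than the highest.

    Assumes the best unit should be numbered lowest and should contain the
    most recent buyers (the smallest average months since last order).
    """
    units = sorted(months_since_last_order_by_unit)
    first, last = units[0], units[-1]
    return months_since_last_order_by_unit[first] > months_since_last_order_by_unit[last]


# Made-up profiles: one labeled best-to-worst as 1 to 10, one as 10 to 1.
standard_labels = {unit: 2.0 * unit for unit in range(1, 11)}
reversed_labels = {unit: 2.0 * (11 - unit) for unit in range(1, 11)}

standard_check = units_look_reversed(standard_labels)
reversed_check = units_look_reversed(reversed_labels)
```

The second profile trips the check, which would be a signal to confirm the labeling convention with the analyst before pulling any names.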
With quality-control reports, this disaster would have been avoided. For
example, the service bureau would have known that a problem existed when it
saw the following recency profile:
[Table: recency profile by decile]
It's important to remember that there is no magical shortcut when building
a predictive model. If you want good results, concentrate on the
nonpredictive part of the predictive modeling process: developing a sound
research design, creating accurate analysis files, performing careful
exploratory data analysis, accurately putting the model into production,
and creating ongoing quality-control procedures. With careful attention to
detail in these five areas, you should be able to better target your most
responsive customers and prospects.