
Testing New Direct Marketing Offerings: The Interplay of Management Judgment and Statistical Models
Author(s): Vicki G. Morwitz and David C. Schmittlein
Source: Management Science, Vol. 44, No. 5 (May 1998), pp. 610-628
Published by: INFORMS
Stable URL: http://www.jstor.org/stable/2634468



Testing New Direct Marketing Offerings: The Interplay of Management Judgment and Statistical Models

Vicki G. Morwitz * David C. Schmittlein


Stern School of Business, New York University, New York, New York 10012-1126
The Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania 19104-6371

The launch of a new product or service via direct marketing is nearly always preceded by a test of that offering. Such a "live" test, conducted with a subset of the entire list of customer prospects, can sometimes be useful in a "go/no-go" decision regarding a full-scale launch of the offering. More commonly, the test is used to direct the offering more effectively toward the market segments that appear most promising. Specifically, test results are used and useful to determine whether a particular rental list of customer prospects should indeed be rented, and (for both rental and in-house lists) which specific customer segments should be contacted with the offering. This paper examines the effectiveness of managers' decisions related to designing a test and interpreting test results both conceptually, based on the literature of heuristics and biases in expert judgments, and empirically, for two new direct marketing offers. The paper describes how an interplay of management judgment and statistical models can lead to increased profits for new direct marketing offerings.
(Marketing; Models; Managers; Segmentation)

1. Introduction
Two fundamental characteristics of direct marketing have been keys to its impressive growth as a promotion and distribution medium for products and services. The first is the individualized nature of the customer contact, which, at least in principle, allows the firm to direct its energies toward those individuals most likely to be responsive. Increasingly, this selectivity is necessary to enable the use of high-cost/high-impact marketing programs such as personal selling in the business-to-business area, or distribution of product demonstration and sales-oriented videocassettes, CD-ROMs, etc. Besides this selectivity in whom to approach, the individualized customer contact can also permit customization of what to offer this individual. That is, different offerings or different selling approaches may be used with various customer prospects (Blattberg and Deighton 1991).

The second fundamental characteristic is the individualized nature of the customer response: each purchase can be associated with an identified individual, leading over time and across products to construction of a detailed customer purchase history. These customer histories tend to be far more useful predictors of future customer behavior than are such traditional descriptions as geodemographics and psychographics (Business Week 1994, p. 58; Roberts and Berger 1989, p. 104). As a result, knowledge gained from this individualized customer response tends to activate and direct the individualized customer contact "next time." As long as the cost of creating and using these customer histories is not too great, this cycle of individualized contact, response, customer record updating, contact, ... tends to lead successive direct marketing programs to increasing profitability. The organization's commitment to direct marketing increases accordingly.
0025-1909/ 98/4405/ 0610$05.00
Copyright 1998, Institute for Operations Research and the Management Sciences



MORWITZ AND SCHMITTLEIN
Testing New Direct Marketing Offerings

For a specific new direct marketing offer firms are well justified in creating, a priori, based on observed customer histories, segments believed to differ in their predisposition toward the offering. The state of the art in such a priori segmentation is well summarized in Robert Kestnbaum's FRAT scheme (Stone 1979). Here, the following four specific elements of the customer history are combined via appropriate weights to create segments of varying appeal: (1) Frequency of past purchases by the individual, (2) Recency of the purchasing, (3) Amount spent by the individual in the past, and (4) Type of products/services bought to date. While the benefit to this kind of segmentation is great, the firm typically wishes to improve on it in one significant respect, by letting the above cycle of contact, response, updating, ... operate within a single direct marketing program rather than simply across successive programs. Traditionally, this is accomplished by testing the offering separately for each of the a priori segments, by presenting the offering to a subset of the segment's members and measuring the resultant response rate. The test results are used to decide whether or not to "roll" the offering to the remainder of the segment. An alternative approach would be to select a random sample from the overall population of potential buyers, and to model responsiveness to the test as a function of information describing sample members (e.g., using logistic regression models). Use of such alternatives by direct marketers has been limited by two factors. First, the coding for type of product/service bought to date usually requires significant management judgment and fits more naturally into a categorization scheme (often hierarchical) than into a set of interval or binary measures. Second, nonlinearities and interactions are often thought to be important and to be captured more readily via a priori categorization than by the usual general linear (or nonlinear) models.
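The FRAT weighting idea described above can be sketched in code. This is an illustrative toy scoring function only: the weights, the scaling transforms, and the 0-1 `type_match` input are all invented assumptions, not Kestnbaum's actual scheme or the one used by the firm in this study.

```python
# Toy sketch of a FRAT-style scoring scheme. Weights, scaling
# transforms, and cutoffs are invented for illustration; they are not
# Kestnbaum's weights or any firm's actual scheme.
def frat_score(frequency, recency_months, amount_spent, type_match,
               weights=(0.3, 0.3, 0.2, 0.2)):
    """Combine the four FRAT elements into a single appeal score.

    type_match is a judgmental 0-1 similarity of past purchases to the
    new offering; as the text notes, this element usually requires
    management judgment rather than mechanical coding.
    """
    w_f, w_r, w_a, w_t = weights
    f = min(frequency / 10.0, 1.0)        # Frequency of past purchases
    r = 1.0 / (1.0 + recency_months)      # Recency: recent scores higher
    a = min(amount_spent / 500.0, 1.0)    # Amount spent to date
    return w_f * f + w_r * r + w_a * a + w_t * type_match

# A frequent, recent, in-category buyer outscores a lapsed light buyer.
assert frat_score(8, 1, 400, 0.9) > frat_score(1, 18, 30, 0.1)
```

In practice, as the paragraph above notes, direct marketers tend to prefer a priori categorization over such a linear combination, partly because type-of-product coding resists interval measurement.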
Accordingly, in this paper we focus on the a priori segmentation-then-testing approach prevalent in direct marketing practice. Implementation of such a test strategy requires specific decisions concerning how to design the test and how to interpret the test results. In particular, decisions must be made concerning how to use historical information about customers, i.e., FRAT data, to define segments for the test. These decisions are likely to depend

on historical evidence concerning which of the FRAT variables play a significant role in affecting response rates. After the test is completed, a decision must be made concerning which segments to select for the roll. This decision is likely to depend on the observed test response rate for each segment, the FRAT description of each segment, the size of each segment, and the purchase rate necessary for rolling to a segment to be profitable. At present, these decisions typically are made by "corporate policy" (i.e., management judgment codified for general applicability) or are, for each new offering, assigned to some individual or committee (i.e., case-by-case management judgment). The purpose of this paper is to (1) examine the overall effectiveness of managers' test/roll decisions, both conceptually, based on the literature of heuristics and biases in expert judgments, and empirically, for two new direct marketing offers, (2) identify particular test/roll decisions that managers seem to make well and suggest reasons why expertise is beneficial in these contexts, (3) identify particular test/roll decisions that appear more subject to error and bias and suggest why expertise does not aid decision making in these contexts, and (4) demonstrate how managers' performance could be improved with the help of simple statistical models. In particular, rather than suggesting replacing managers' decisions with those of a model, or combining or averaging managers' and models' predictions, we suggest an interplay of management judgment and models. That is, we recommend relying on managers' judgment for tasks where expertise leads to superior performance, and relying on models where managers are prone to biases. We examine the effectiveness of managers' decisions using information provided by a large direct marketing firm, gathered during the testing and full-scale launch of two new product offerings.
For both products, we are able to examine the segmentation scheme used by the firm's managers, the sampling plan for the test, managers' projections of segment response rates based on test results, and the actual product launch roll results. Profit consequences of the various decisions are assessed using relevant price and cost data. The judgments of experienced managers are compared with both the actual market responses and the predictions of simple statistical models.



The next section describes the details of the specific direct marketing test/roll decisions and provides reasons why we should expect managers to perform some tasks well and be prone to biases for others. The paper then proceeds through the test/roll process: for the two major stages of that process we evaluate the managers' performance and demonstrate how an interplay between management judgment and models will lead to increased profits for new direct marketing offerings. The paper concludes with an overview of findings across the types of decisions.

2. The Hierarchy of Test/Roll Decisions


In considering the market testing of a new product or service it is natural to imagine the central decision of interest as a "go/no-go" determination. In practice, the substantial investment in productive capacity, inventory, and logistical support required for such a test means that this "go" bridge has in most cases already been crossed. The real purpose of most tests is to understand better how, and to whom, to sell the offering, as well as to uncover any unanticipated problems (Wind 1982, p. 402). In direct marketing product tests, the two main purposes of testing are even more specific: (1) to determine to which segments of a population the offer should be rolled, from among a set of segments tested, and (2) to determine which price/message format should be used with the offer, from among a small (e.g., 1-4) set of different offers tested. In this paper we focus on the testing of a single new direct marketing offering, i.e., we consider explicitly only the first of these two decisions. Certainly, the offering price/message decision requires a panoply of management judgments, but our emphasis on targeting as opposed to product/service design stems from the following two considerations. First, in current practice the design and targeting decisions are usually approached independently. Essentially, the factorial nature of the offerings/segments options makes it infeasible to consider them jointly. Second, current testing practice places much more emphasis on segments than offers, e.g., considering a handful of different offers, but dozens or hundreds of different segments. This suggests that managers see greater heterogeneity, not obvious a priori but discoverable via testing, across segments than across offering designs. In other words, there is more money to be made testing segments than testing varied offer designs. Certainly, failure to properly define and test the target market has been said to be among the most common reasons for failure in direct marketing (Bly 1989).

Given information on the past purchase history of customers (i.e., FRAT data), the two key management decisions in direct marketing testing for a given new offering are (1) how to use FRAT data to form segments for the test, and (2) how to interpret test results and select segments for the roll. For this study, then, we will focus on the use of management judgment and statistical models as bases for these two decisions. For both decisions we will present some evidence on the accuracy of managers' judgments, and thus highlight the potential for interplay with simple statistical models.

In assessing the performance of judgments of experienced managers, it is important to note that direct marketing is a field where experienced managers are likely to be quite accurate. This accuracy stems from the basic characteristic of direct marketing responses mentioned at the outset, which makes it much easier to see and learn from "what worked" in previous direct marketing programs than is the case when other kinds of promotional/distribution methods are used. In particular, the testing of new direct marketing offerings is characterized by the two properties essential for learning to occur (Camerer and Johnson 1991, Goldberg 1968): (1) repetition: a similarly structured task is undertaken for each of a succession of offerings, and (2) feedback: the performance of the test, at least in absolute terms, is observable.
On the other hand, decision makers in general are known to be prone to the use of certain heuristics (Kahneman, Slovic, and Tversky 1982) that relate to the specific direct marketing decisions at hand, and sometimes result in biased decision making. Further, those limitations may not disappear with increasing expertise or experience (Bowman 1963, Dawes and Corrigan 1974, Johnson 1988). Whether or not a model can improve on managers' decisions will depend on whether the unusually great opportunity to learn from past direct marketing experience can overcome both the idiosyncrasies of each new offering and managers' inherent inclination to use certain decision heuristics. In motivating the results to follow we next describe in slightly more detail the two main direct marketing test decisions.

2.1. Forming Segments for the Test

For each customer in the firm's database managers know past purchase history information as summarized by FRAT variables. For example, the firm may know what products the customer has purchased in the past, how much money has been spent, and how much time has passed since the last purchase. These FRAT variables are used to create a set of segments. The segments should be designed such that receptiveness to the new product is expected a priori to vary across the segments. The new product offering is then sent to a sample of customers in each segment. The test results are used to determine whether or not it will be profitable to roll to the segment, i.e., offer the new product to the remaining customers in the segment.

In forming these segments, managers have several decisions to make. First, they must decide which of the available FRAT variables are useful for forming segments by identifying those that are predictive of propensity to purchase. Second, they must decide how to combine FRAT variables to form segments. Third, for any variable they use, they must select cutoffs or groups for the variables' values to form segments.

A good segmentation scheme should have several characteristics. First, the results from the test conducted on the segments should clearly identify profitable and unprofitable segments. Therefore, segments should be formed by splitting variables that are most predictive of propensity to buy. Second, segments should be subdivided more finely the closer they are expected to be, a priori, to the profitability hurdle (i.e., the response rate required for a profitable roll to the segment) to clearly differentiate between profitable and unprofitable segments.
Third, when subdividing segments, the objective should be to split off the "poorest" part of "good" segments (i.e., a nonprofitable part of a profitable segment) and to split off the best part of poor segments (i.e., the profitable part of nonprofitable segments). In short, these three desiderata will increase the propensity for the total customer prospect list to be segmented into

either high response segments (relative to the profitability hurdle) or low response segments. Doing so will not only maximize the opportunity to select highly profitable respondents, but will also minimize the likelihood of erroneous conclusions from the test, i.e., due to random chance, concluding that an unprofitable segment would be profitable in the roll, and vice versa.

With the first of these criteria in mind, segments are often formed in a hierarchical manner; that is, customers are first segmented on the variable thought to be most important in discriminating propensity to buy. If further segmentation is desired, additional variables are used to subdivide segments formed from the most important variable. For example, if time since the last purchase is considered to be the most important variable, managers will first form segments based on ranges of that variable. The resultant segments might be:

1. customers who purchased in the last month;
2. customers who purchased in the last six months, but not in the last month;
3. customers who purchased in the last year, but not in the last six months;
4. customers who have not purchased in the last year.

If dollars spent is the second most important variable in explaining propensity to buy, segments expected to be close to the profitability hurdle may be subdivided on dollars spent. The resultant segmentation scheme might then be:

1. customers who purchased in the last month;
2a. customers who purchased in the last six months, but not in the last month, and spent $50 or more;
2b. customers who purchased in the last six months, but not in the last month, and spent less than $50;
3a. customers who purchased in the last year, but not in the last six months, and spent $100 or more;
3b. customers who purchased in the last year, but not in the last six months, and spent less than $100;
4. customers who have not purchased in the last year.
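The two-level hierarchy in this example can be expressed as a small assignment rule; the sketch below simply mirrors the example's recency and dollar cutoffs.

```python
# The example's two-level (recency, then dollars spent) hierarchy,
# expressed as an assignment rule. Cutoffs mirror the text's example.
def assign_segment(months_since_purchase, dollars_spent):
    """Return the segment label from the six-segment example scheme."""
    if months_since_purchase <= 1:
        return "1"
    if months_since_purchase <= 6:
        return "2a" if dollars_spent >= 50 else "2b"
    if months_since_purchase <= 12:
        return "3a" if dollars_spent >= 100 else "3b"
    return "4"

assert assign_segment(0.5, 20) == "1"    # bought in the last month
assert assign_segment(3, 75) == "2a"     # 1-6 months ago, $50 or more
assert assign_segment(9, 40) == "3b"     # 6-12 months ago, under $100
assert assign_segment(24, 500) == "4"    # no purchase in the last year
```

Note how the hierarchy encodes the relative importance of the variables: recency is consulted first, and dollars spent only subdivides the middle segments thought to lie near the profitability hurdle.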
Essentially, the relative weight of these two criteria, i.e., time since last purchase and amount spent, is determined by their respective place in the hierarchy and by how finely each variable is categorized. The development of these types of hierarchies should be a natural task for experts, who tend to use configural rules, i.e., rules in which the impact of one variable depends on another


(Camerer and Johnson 1991, Tversky and Kahneman 1982). One of the FRAT variables, type of product, requires additional managerial judgment. When forming segments based on type of product for a new product test, managers need to determine how "similar" the new product is to products offered in the past and how to group different types of products. For example, if the new product is a nature book about snow leopards, managers might create a hierarchy of segments based on past purchases such as:

1. customers who have bought nature books about leopards,
2. customers who have bought nature books about other wild cats,
3. customers who have bought other nature books,
4. customers who have bought other wild cat-related products,
5. customers who have bought any nature-related products.

Note that these categories are not mutually exclusive. Accordingly, they are used as a hierarchy, whereby each customer is placed in the first (topmost) segment for which he or she qualifies. For any given new product offering, managers will need to identify a specific set of type-of-product segments. For example, while the segmentation scheme shown above would be useful if the new product is a nature book about snow leopards, a different segmentation scheme would be needed if the new product were a computer game for children. The literature on expertise and decision making suggests reasons to expect that managers will perform relatively well at some of these tasks and more poorly at others. Based on this literature, we posit propositions concerning the tasks we expect managers to perform well and the tasks for which managers could benefit from the use of a model. In the segment formation task, managers must identify relevant FRAT variables and then combine and weight them to form segments. One strength of experts is their ability to identify relevant variables (Armstrong 1985, Bunn and Wright 1991, Johnson 1988).
Thus, we should expect experts to perform well at identifying which FRAT variables will help discriminate propensity to purchase. Therefore, we posit:

PROPOSITION 1. Managers will be able to identify which FRAT variables are most useful for forming segments.

One demonstrated weakness of experts is their inability to apply appropriate weights when combining variables (Einhorn 1974, Johnson 1988). Thus, although experts can identify useful variables, they do not appear to apply weights correctly or consistently when combining variables to predict an outcome such as likelihood of purchase (Hoch and Schkade 1996). In contrast, simple linear models appear to perform as well as or better than experts at applying weights to variables (Blattberg and Hoch 1990, Johnson 1988). Therefore, we expect that:
PROPOSITION 2. Managers will not combine variables to form segments as effectively as models.

Finally here, we note that the performance of models has been poor relative to experts when there are dynamic changes in the environment (Braun and Yaniv 1992). In the direct marketing context, the only FRAT variable that changes from one product test to another is type of product. While it would be difficult for a model to determine the similarity of a new product to products launched in the past, managers should be well suited for this task. In particular, prior studies have shown that experts perform well at pattern matching (Hoch and Schkade 1996, Wierenga and van Bruggen 1997). Managers' previous experience with direct marketing products should allow them to match the new product with previous products launched by the firm and to determine the relative similarity of the current product to previous products. Managers should be able to use information on relative product similarity to create a hierarchy of segments based on what types of product consumers have purchased in the past. Accordingly, we expect that:

PROPOSITION 3. Managers will perform well, in both absolute terms and relative to models, at forming segments based on type of product.

These propositions relate to the identification and use of important segmentation variables, i.e., the first of the three segmentation desiderata described at the outset of this section. The other two desiderata above, subdividing segments more finely the closer they are expected



to be to the profitability hurdle, and splitting off the poorest part of good segments and the best part of poor segments, have not to our knowledge been discussed in the direct marketing literature. Accordingly, we have not offered specific propositions regarding managers' effectiveness at these two goals. It is unclear whether either goal is in fact being pursued often in practice. We do note, however, that these two decisions involve the combination of variables and hence the type of activity that decision makers find difficult. Seeing whether managers seem to pursue such goals, and if they do, how effectively, is a descriptive objective of the current research.

2.2. Interpreting Test Results and Selecting Segments for the "Roll"

Once the test has been conducted, managers interpret the test results and use them to select segments for the roll. Consider some segment i having in total Ni members, formed a priori based on FRAT variables. Letting C be the (known) contribution margin per respondent and D be the (known) cost to contact each individual with the offer, the firm should "roll" to a segment, i.e., realize a positive expected profit from that segment, if the segment's overall response rate Pi is greater than D/C. Some subset of the segment of size Mi (chosen at random from the entire segment) was tested, with Xi positive responses resulting. To keep the exposition simple, each respondent is assumed to order one item at a single price and to actually pay for it. Contexts where respondents do not necessarily buy only one item at a single common price, or do not always pay, can generally and easily be incorporated by appropriate revision of C above, and hence of the profitability cutoff D/C. This segment's observed test response rate is Ri = Xi/Mi. To select segments for the roll the manager requires an estimate of the response rate for each segment, Pi, that would be observed in a roll. In this context there are two potentially useful sources of information.
The first is the test result (Xi, Mi), both for this segment i in particular and for all other segments tested. The second is any information contained in the FRAT variables that describe each segment that was not already used to form the segment and thereby already reflected in the segment's test results above. Managers can use these two

sources of information to estimate each segment's roll response rate. The estimate can then be compared to the profitability cutoff to determine whether or not to roll to each segment. Note that this specific direct marketing task is a special case of a general class of problems studied by Einhorn and Hogarth (1978).1 They examined the relationship between learning and experience for the problem of predicting a binary outcome (e.g., profitable/not profitable) using a cutoff on a continuous judgment (e.g., predicted roll response rate). Their framework and findings guide some of the propositions to follow.

Clearly, in an ideal setting, managers would use all diagnostic FRAT information to form segments and have resultant segments large enough to ensure that response rates estimated from a test using only a portion (e.g., 10 percent) of the segment will be highly reliable. In such an ideal world all the useful FRAT information would be captured in the test result variation across segments, and consequently the manager's job in predicting roll response based on test results would involve simply a small statistical adjustment to the test result (accounting for the remaining small instability in that result) and not, at this stage, using the FRAT results at all. In the real world, managers may not have fully or optimally used the FRAT variables to form segments, so those variables can have predictive value for the roll response rates even when the test response rates have been observed. Further, the test response rates may not be highly reliable due to relatively small segment sizes and low response rates, so again the FRAT variables may have incremental predictive value for predicting roll response rates. How accurately managers forecast roll response rates will obviously depend on how well they can combine information from the test, namely, the test response rate and the test sample size, with any FRAT information not already reflected in the test result itself.
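The break-even logic, roll to segment i only when the estimated Pi exceeds D/C, can be sketched as follows. This naive version uses the raw test rate Xi/Mi as the estimate of Pi, which ignores the regression-to-the-mean adjustment the paper goes on to discuss; the dollar figures are hypothetical.

```python
# Break-even roll rule: roll to a segment only if the estimated roll
# response rate exceeds D / C. This naive version uses the raw test
# rate Xi / Mi as the estimate of Pi, with no shrinkage adjustment.
# The contribution and contact-cost figures below are hypothetical.
def should_roll(x_test, m_test, contribution_c, cost_d):
    """Roll if the observed test response rate clears the cutoff D/C."""
    return (x_test / m_test) > (cost_d / contribution_c)

# With a $30 contribution per respondent and $0.60 per contact, the
# cutoff is 0.02: a 2.5% test response clears it and a 1.5% one does not.
assert should_roll(25, 1000, contribution_c=30.0, cost_d=0.60)
assert not should_roll(15, 1000, contribution_c=30.0, cost_d=0.60)
```

As the text notes, multi-item orders or nonpayment can be folded into C, which simply moves the cutoff D/C.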
As previously noted, experts do not perform well at weighting different pieces of information in coming up with an overall forecast (Johnson 1988). Past studies have shown that managers tend to use configural rules

1 We acknowledge the assistance of the journal's associate editor in highlighting this connection.



to combine information rather than linear combination rules, even though linear combination rules often lead to more accurate predictions (Johnson 1988). In addition, the literature on managers' judgmental adjustments of forecasts from statistical models suggests that managers are prone to a double counting bias (Bunn and Salo 1996), that is, managers adjust models for variables whose effects are already captured by the model, for instance due to collinearity with variables included in the model. This suggests that managers, in estimating roll response rates, might adjust test response rates for the effect of FRAT variables, even if the effects of those variables are already represented in the observed test response rates. Furthermore, people tend not to seek or examine disconfirming evidence (Einhorn and Hogarth 1978). This suggests the manager may underweight low test response rates for segments whose FRAT variable values have led to an expectation of high response rates, and vice versa. All these arguments suggest:
PROPOSITION 4. Managers will overweight the effects of FRAT variables when predicting roll response rates.

In addition, research on correlational learning has shown that people tend to code outcomes as frequencies rather than probabilities (Einhorn and Hogarth 1978, Estes 1976). Estes (1976) demonstrated that frequencies have a greater influence on predictions than probabilities, even when high frequencies were associated with low probabilities. This suggests that managers might pay more attention to how many purchases occurred in a segment than to the proportion of segment members tested who purchased (i.e., the test response rate). Therefore, we expect:

PROPOSITION 5. Managers will overweight the number of purchases made by segment members relative to the test response rate.

As a predictor of the segment's roll response Pi, the test response rate is known to suffer from a statistical regression effect, which market researchers and statisticians usually call regression to the mean. Segments with relatively high purchase propensities in the test tend to have lower purchase rates in the product roll, whereas segments with relatively low test rates usually are not selected for the roll, but if they are, tend to have higher purchase rates in the roll than they did in the test

(Shepard 1990, p. 137). This phenomenon is well known to those who routinely analyze direct marketing test data, although they may not refer to it by name (e.g., Reiss 1989). The phenomenon essentially arises as a result of the binomial sampling error in the test results. Even if segments were all identical in roll response rate Pi, the sampling error would induce a range of test response rates. More generally, when the Pi do indeed vary across segments, the additional sampling error in testing will produce more variation in Xi / Mi across segments than in Pi. So high test responses should be interpreted as somewhat indicative of a high roll response Pi, but also as arising somewhat from a spurious "luck" in sampling segment members for the test. As a result, extreme test response estimates should be shrunk toward the center of the distribution in predicting roll responses. How much a segment's response rate should be regressed to the mean depends on how far the segment's test response rate is from the mean test response rate, and also on the segment sample size. That is, segments with smaller sample sizes regress to the mean more than segments with larger sample sizes, holding test response rate constant. How accurately managers will predict roll response rates from test response rates depends on whether their experience and observations from previous tests and rolls allow them to subjectively determine how, and how much, to adjust test market results in estimating roll results. The published literature indicates that the frequent failure to incorporate such an effect accounts for the widespread inferiority of forecasters relative to mathematical models (Bowman 1963, Dawes and Corrigan 1974, Tversky and Kahneman 1982), but experienced observers may sometimes detect regression to the mean, whether they understand its origin or not (Kahneman and Tversky 1973). 
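The source of this regression effect, binomial sampling error in the test, can be illustrated with a small simulation. The segment response rates, test sample size, and replication count below are invented for illustration; the point is only that observed test rates are more dispersed than the true rates that generated them.

```python
# Illustration of the statistical regression effect described above:
# binomial sampling error alone spreads observed test response rates
# more widely than the underlying true rates. All numbers are invented.
import random

random.seed(0)
true_rates = [0.018, 0.020, 0.022, 0.020, 0.019]  # hypothetical true Pi
m = 500                                           # test size per segment
reps = 100
true_spread = max(true_rates) - min(true_rates)   # 0.004

wider = 0
for _ in range(reps):
    # One binomial test draw per segment, then the observed rate Xi/Mi.
    test_rates = [sum(random.random() < p for _ in range(m)) / m
                  for p in true_rates]
    if max(test_rates) - min(test_rates) > true_spread:
        wider += 1

# In nearly every replication the test rates are more dispersed than
# the true rates, so the top test segment typically overstates its Pi.
assert wider > reps * 0.8
```

This is why an extreme test response should be read partly as a high true Pi and partly as sampling "luck", and hence shrunk toward the mean, more so for smaller test samples.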
Nonetheless, in probably the most relevant finding to date, experienced retail merchandise buyers fail to account for regression to the mean when making item sales projections based on initial sales rates (Cox and Summers 1987). There are several reasons, in this direct marketing context, why managers might not sufficiently regress test response rates to the mean when forecasting roll response rates. First, managers are not exposed to complete feedback on how much test response rates regress

to the mean during rolls, because they only observe roll response rates for segments to which they decided to roll. Thus, the feedback they receive is truncated. Second, people have great confidence in extreme judgments and may therefore not regress extreme cases (those that should be most regressive) sufficiently (Einhorn and Hogarth 1978, Kahneman and Tversky 1973). Kahneman and Tversky (1973) further found that people tend to overweight case-specific information such as test response rates and underweight base rate information of the sort represented by the mean test response rate across segments. Thus, one reason managers might not sufficiently regress test response rates to the mean is their using an anchoring and adjustment heuristic to estimate the roll response rates. They may anchor on the test response rate and not sufficiently adjust for the mean test response rate across segments. Third, evidence suggests that managers will not properly adjust for the effect of sample size in making the roll predictions. Experiments conducted by Kahneman and Tversky (1972) demonstrate that people in general do not intuitively understand that sampling error decreases as sample size increases. Combined, these suggest:
PROPOSITION 6. Managers will not sufficiently regress test response rates to the mean.

In contrast, predicting roll response rates from test response rates is a task that is easily handled by a simple statistical model, the beta binomial (Greene 1982, Kalwani 1980). Building on the notation already introduced, imagine that K segments have been formed a priori, with segment i having unknown response rate P_i and known N_i members in total. We wish to roll to the segment if P_i > D/C, and will test M_i segment members with the offering, eventually receiving X_i responses to the test. Under the assumption that individuals are chosen at random for the test and that the segment sizes are "large," the distribution for X_i conditional on segment response rate P_i and test size M_i is the binomial, with probability function, mean, and variance:

P[X_i = x \mid P_i, M_i] = \binom{M_i}{x} P_i^x (1 - P_i)^{M_i - x}, \quad x = 0, 1, \ldots, M_i;  (1)

E[X_i \mid P_i, M_i] = P_i M_i, \quad Var[X_i \mid P_i, M_i] = P_i (1 - P_i) M_i.  (2)

Segments are designed to differ in their response rates P_i. A model for interpreting test results or for designing the test must incorporate this heterogeneity. A parsimonious descriptive model in this respect will thus require a distribution f(P_i) for response rates across segments. A flexible and convenient distribution is the beta, with density function

f(P_i \mid \alpha, \beta) = \frac{1}{B(\alpha, \beta)} P_i^{\alpha - 1} (1 - P_i)^{\beta - 1}, \quad 0 < P_i < 1;  (3)

where B(\alpha, \beta) is the beta function (Abramowitz and Stegun 1965). Across segments, the mean and variance of segment response rates P_i are:

E[P_i \mid \alpha, \beta] = \frac{\alpha}{\alpha + \beta}, \quad Var[P_i \mid \alpha, \beta] = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}.  (4)

The beta distribution is the conjugate prior for X_i's binomial distribution (Schmittlein 1989) and can represent a variety of density function shapes for the distribution of response propensities, including unimodal, U-shaped, and J-shaped (Greene 1982, p. 102). Consider now a segment whose true response rate P_i is unknown. For such a randomly chosen segment, the distribution for the number of test responses, X_i, is the beta binomial (Greene 1982):

P[X_i = x \mid \alpha, \beta, M_i] = \binom{M_i}{x} \frac{B(\alpha + x, \beta + M_i - x)}{B(\alpha, \beta)},  (5)

with mean and variance:

E[X_i \mid \alpha, \beta, M_i] = \frac{\alpha M_i}{\alpha + \beta}, \quad Var[X_i \mid \alpha, \beta, M_i] = \frac{\alpha \beta M_i (\alpha + \beta + M_i)}{(\alpha + \beta)^2 (\alpha + \beta + 1)}.  (6)

For a randomly chosen segment, from whom M_i members have been tested, the variance in the observed test response rate is

Var[X_i / M_i \mid \alpha, \beta, M_i] = \frac{\alpha \beta (\alpha + \beta + M_i)}{M_i (\alpha + \beta)^2 (\alpha + \beta + 1)}.  (7)

The two free parameters \alpha and \beta of the beta distribution must be estimated for each offering being tested, or anticipated prior to the test based on past experience. They can be estimated by fitting the first two moments of test responses as in (6), or by maximum likelihood using Equation (5). The latter approach was used for the empirical results to follow. These MLEs take into account the differing numbers of members tested M_i across segments, as has been discussed by Kalwani (1980). With the test results in hand, the beta binomial model can be used to infer the roll response in segment i. It does so via a weighted average of the observed test response (X_i / M_i) and the average response for segments in general, \alpha / (\alpha + \beta), with the weight placed on the test response determined by the reliability of that measure X_i / M_i. The formula is (Greene 1982, Schmittlein 1989):

E[P_i \mid X_i, M_i, \alpha, \beta] = \left( \frac{\alpha + \beta}{\alpha + \beta + M_i} \right) \frac{\alpha}{\alpha + \beta} + \left( \frac{M_i}{\alpha + \beta + M_i} \right) \frac{X_i}{M_i} = \frac{\alpha + X_i}{\alpha + \beta + M_i}.  (8)

For projecting segment roll results based on the test, this is the quantity that will be compared with actual roll results and managers' forecasts.
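A minimal sketch of Equations (5) and (8) in Python (the alpha, beta values and test outcomes below are hypothetical illustrations, not estimates from the paper's data):

```python
import math

def log_beta(a, b):
    """log B(a, b) via log-gamma, for the beta-binomial likelihood."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binomial_log_pmf(x, m, alpha, beta):
    """Log of Equation (5): P[X = x | alpha, beta, M]."""
    log_choose = (math.lgamma(m + 1) - math.lgamma(x + 1)
                  - math.lgamma(m - x + 1))
    return log_choose + log_beta(alpha + x, beta + m - x) - log_beta(alpha, beta)

def shrunk_roll_rate(x, m, alpha, beta):
    """Equation (8): posterior mean roll response rate, a weighted
    average of the test rate x/m and the prior mean alpha/(alpha+beta)."""
    return (alpha + x) / (alpha + beta + m)

# Hypothetical prior: mean response rate alpha/(alpha+beta) = 2%.
alpha, beta = 1.0, 49.0

# A segment testing high (4% of 500) is shrunk down toward 2%;
# one testing low (0.4% of 500) is pulled up toward 2%.
high = shrunk_roll_rate(20, 500, alpha, beta)   # 21/550, about 0.0382
low = shrunk_roll_rate(2, 500, alpha, beta)     # 3/550, about 0.0055
print(round(high, 4), round(low, 4))
```

In practice alpha and beta would first be estimated by maximizing the sum of `beta_binomial_log_pmf` over all tested segments, as the paper does by maximum likelihood.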

3. Results
In this section we first consider the performance of managers in the segment formation task. We compare the profit consequences of the managers' segments to the estimated profitability had management judgment been combined with the findings from a simple model. Next, we assess the accuracy of the managers' roll forecasts relative to some natural benchmarks.

3.1. Direct Marketing Test Data
The empirical results to follow cover two separate consumer durable direct mail offerings, which we will label A and B. For each product the firm provided the following information: (a) the number of customers in each a priori segment and a brief description (based on FRAT

and gender) of the customers contained in that segment, (b) the number of customers tested in each segment, (c) the number of orders placed in the test, (d) the managers' prediction of the number of orders that would be received in a roll for each segment, and (e) the actual number of orders in the roll (if the segment was rolled). For both products, segments were formed a priori based on the customer's purchase history, using the FRAT (Frequency, Recency, Amount spent, Type of product) criteria. Each FRAT-based segment was then subdivided based on the presumed gender of the household contact person. Product A was more likely to be bought by males; Product B by females. Prior to the test, then, each segment was described by five variables:

1. Type of product: A subjective ranking of the responsiveness based on similarity between products purchased in the past and the current offering. The product types represent groups of previous offers that customers may or may not have purchased. Each individual is placed in the highest-ranked product type for which he/she qualifies based on his or her cumulative purchase history with the firm.
2. Recent: A zero/one variable indicating whether the individual is a recent entrant to the database. Such an individual is a relatively new customer to the firm, i.e., his or her first purchase with this firm for any kind of product occurred approximately within the past year.
3. Hiatus: The number of years since placement of the customer's most recent order. That is, hiatus = 5 denotes a customer who, as of the current decision point in time, made his or her most recent purchase five years ago.
4. Amount spent: Customer's cumulative monetary spending with the firm, in dollars.
5. Sex: The gender of the contact.

Notice, therefore, that among the four conceptual elements of the FRAT scheme (Frequency, Recency, Amount spent, Type of product), this firm uses the latter two (Type, Amount) directly, uses two operationalizations of Recency ("Recent" and "Hiatus"), and does not explicitly use Frequency (except through its correlation with the other variables). Product A was tested on 124 segments and rolled to 52 customer segments, which ranged in size from 2,000 to 37,000 members. It was sold for $295.00, and across segments its roll response rate varied from 0.1% to 4.3%.


Product B was tested on 128 segments and rolled to 73 segments, ranging in size from 1,000 to 56,000 members. It sold for $195.00, and its roll response rate varied across segments from 0.02% to 3.34%. Profitability cutoffs, D/C, were calculated by the firm for each product (for Product A, D/C = 0.00239; for Product B, D/C = 0.00207). The managers at the firm were asked to act in accordance with this profitability criterion in making their roll decisions. We use this information to examine the profit consequences of managers' decisions.

3.2. Using FRAT Data to Form Segments for the Roll
To determine how accurately managers form segments and whether there is potential for an interplay between models and managers, we first estimate a simple linear regression model relating the FRAT variables and Sex to the test response rate. The results from this model, provided in Table 1, will tell us whether managers are using the variables most predictive of propensity to buy in forming segments. Across offerings A and B, Type of product bought previously was the most significant predictor and Hiatus was the second most important predictor of test response rates, based on the overall F statistic for Type III sum of squares accounted for by the variable. None of the other FRAT variables was significant. The variable Sex was significant for Product A but not for Product B.

Tables 2A and 2B show the a priori segmentation scheme used by the managers for each product. In the tables we show the segmentation scheme based on the customer FRAT variables. Most of the subsegments shown in the tables were subsequently split based on presumed gender: male/female. An examination of these two segmentation schemes reveals that managers do give primacy in the hierarchy to the most important variable, i.e., Type of product.
They are also able to select, a priori, those types of products bought previously that will most positively or negatively influence propensity to respond to the offering in question. Indeed, note the significance of the variable Type of product in Table 1. The rank correlation between the managers' judgmental ranking of Type-of-product segments on response rate with actual response rate in the test was 0.405 for Product A (p < 0.001) and 0.364 for Product B (p < 0.001).
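Rank correlations of this kind can be computed with a few lines of code; a sketch with hypothetical judged scores and test rates (not the paper's segment data), using the standard Spearman formula for untied ranks:

```python
def spearman_rho(x, y):
    """Spearman rank correlation for two equal-length lists without ties."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rnk, i in enumerate(order):
            r[i] = rnk + 1
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Hypothetical judged responsiveness scores (higher = judged more
# responsive) and the corresponding observed test response rates.
judged = [6, 5, 4, 3, 2, 1]
test_rate = [0.031, 0.024, 0.027, 0.012, 0.015, 0.008]
print(round(spearman_rho(judged, test_rate), 3))  # 0.886
```

A value near 1 indicates the managers' a priori ordering of segments closely tracks the test outcome, as the reported correlations of 0.405 and 0.364 do more weakly.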

In subdividing the type-of-product segments, among the three remaining FRAT variables managers appropriately relied most heavily on Hiatus, i.e., the remaining variable found most effective in our statistical analysis. Among the 134 Type-of-product segments subdivided in Tables 2A and 2B, 87 percent used Hiatus as a subdivision criterion, 43 percent used Amount spent, and 17 percent used Recent entrant status. Thus, the results provide support for P1: managers are able to identify a priori the two most important FRAT variables for forming segments and also to identify the order of importance among the two.

Looking further at the two segmentation schemes, we see that the subdivision of Type-of-product segments forces managers to select a specific weighting or combination of FRAT variables, and for this task these managers' effectiveness breaks down. Note that 12 percent of the 134 subdivided segments do not use Hiatus at all when it is the only remaining FRAT variable with any real segmentation ability. Across the 38 Type-of-product segments in the two tables, only 55 percent (21) were subdivided on Hiatus. Further, even when Hiatus was used to subdivide Type-of-product segments, in 24 percent of these cases (five of 21), additional FRAT variables having no predictive ability were combined with Hiatus. Even more pathologically, for nine of the 38 Type-of-product segments (i.e., 24 percent), these nonpredictive FRAT variables were used in lieu of Hiatus in forming subsegments. This set of results supports P2: although managers were able to identify the most useful segmentation variables, they were not able to identify the appropriate weights or combinations for those variables.
Also supporting P2 is the observation from Tables 2A and 2B that managers are not systematically subdividing segments more finely in the middle of the Type-of-product rank, nor are they systematically subdividing by splitting off the poorest part of "good" segments and the best part of "poor" segments. Finally here, we note that managers were indeed effective in their use of the most important segmentation variable: Type of product. That is, they accurately ranked the numerous categories for this variable from most responsive to least responsive, thus, supporting P3. Having observed these results, we developed a simple alternative segmentation scheme that represents an


Table 1    Modeling Segment Test Response Rate

Product A (R2 = 0.503)
Variable                        Coefficient   Std. Error   t Statistic   p Value
Intercept                          0.004801     0.002392          2.01    0.0473
Gender (0 = Female; 1 = Male)      0.002302     0.001048          2.20    0.0302
Hiatus (Years)                    -0.001242     0.000369         -3.36    0.0011
Recent (0/1 Indicator)             0.000022     0.001795          0.01    0.9900
Amount (Dollars)                   0.000004     0.000003          1.62    0.1086
Type of Product (Category):
  1                                0.024093     0.002989          8.06    0.0001
  2                                0.011136     0.002726          4.08    0.0001
  3                                0.002488     0.004056          0.61    0.5409
  4                                0.003594     0.002539          1.42    0.1598
  5                                0.003111     0.003104          1.20    0.2345
  6                                0.001571     0.002128          0.74    0.4621
  7                                0.000404     0.002127          0.19    0.8498
  8                               -0.002827     0.004373         -0.65    0.5194
  9                               -0.003749     0.004373         -0.86    0.3932
  10                              -0.002592     0.004373         -0.59    0.5547
  11                               0.000858     0.001629          0.53    0.5997
  12                              -0.001509     0.002991         -0.50    0.6151
  13                               0.000626     0.003019          0.21    0.8362
  14                              -0.000339     0.002544         -0.13    0.8942
  15                               0.000000           NA            NA        NA

Product B (R2 = 0.769)
Variable                        Coefficient   Std. Error   t Statistic   p Value
Intercept                          0.005621     0.004475          1.26    0.2120
Gender (0 = Female; 1 = Male)     -0.000074     0.000972         -0.08    0.9396
Hiatus (Years)                    -0.000934     0.000335         -2.79    0.0064
Recent (0/1 Indicator)             0.002857     0.002176          1.31    0.1923
Amount (Dollars)                   0.000001     0.000004          0.28    0.7799
Type of Product (Category):
  1                                0.006015     0.004927          1.22    0.2249
  2                                0.040852     0.004730          8.64    0.0001
  3                                0.000373     0.005054          0.07    0.9414
  4                               -0.000326     0.006053         -0.05    0.9571
  5                               -0.002116     0.006053         -0.35    0.7273
  6                                0.008395     0.004664          1.80    0.0749
  7                                0.002078     0.004387          0.47    0.6367
  8                                0.001734     0.004664          0.37    0.7108
  9                                0.001861     0.004387          0.42    0.6723
  10                               0.002967     0.004664          0.64    0.5261
  11                               0.001500     0.004410          0.34    0.7344
  12                              -0.000351     0.005640         -0.06    0.9505
  13                              -0.001827     0.006053         -0.30    0.7634
  14                              -0.004390     0.006053         -0.73    0.4700
  15                               0.000897     0.004436          0.20    0.8401
  16                               0.003990     0.004551          0.88    0.3827
  17                               0.003660     0.003660          0.78    0.4345
  18                               0.005618     0.005499          1.02    0.3094
  19                               0.000000     0.005499          0.00    1.0000
  20                               0.001471     0.005499          0.27    0.7897
  21                               0.000000     0.005499          0.00    1.0000
  22                               0.001976     0.005499          0.36    0.7201
  23                               0.000000           NA            NA        NA

interplay between managers and models. We first segment by Type of product. This variable incorporates managers' judgment since it is created using managerial expertise. Whenever we subdivide a Type-of-product segment, we do so using the only other variable that has been shown empirically to be an important predictor, namely Hiatus. Hiatus was the only significant variable in addition to Type of product for both Products A and B. Accordingly, for each of Products A and B, developing a segmentation scheme without benefit of hindsight regarding what actually happened in the test for that given product (and instead using the knowledge from the "other" product) leads for both products to segmentation on Hiatus within Type of product.

The details of how we subdivide on Hiatus for the two products are depicted in Figure 1. In general, we want the segmentation scheme we develop to have the following characteristics. We first create segments using the most important variable (i.e., Type of product). Assuming the profitability cutoff will be somewhere toward the middle of the Type-of-product-based hierarchy, Type-of-product segments should be split more finely toward this midpoint. Accordingly, for Product A in Figure 1, Type-of-product segments that are most promising (#1) and least promising (#15) are split into two subsegments each, while Type-of-product segments "in the middle" (#7, 8, 9) are each split into five subsegments. In addition, at the top of the Type-of-product hierarchy, the second


Table 2A    Managers' a Priori Segmentation Schemes: Product A (Subsegment Definitions)

Rows: Type of Product Bought Previously (ranked category). Key: H = Average Hiatus (years); A = Average Amount Spent (dollars); R: 1 = Recent Entrant, 0 = Not Recent Entrant. Defaults: R = 0, A = $300, H = 3 (if R = 1, H = 1).

Ranked category: 1 2 3 4 5 6 7 8 9 10 11 11 (cont.) 11 (cont.) 11 (cont.) 12 13 14 15 15 (cont.) 15 (cont.)
H=1 H=1 H=1 R=1 H=1 R=1; A=125 R=1 R=1 R=1 R=1; A=250 H=1; A=500 H=4; A=175 H=6; A=175 H=1 H=2 H=2.5 H=1; A=125 H=3; A=125 H=5.5; A=750 H=4 H=3 H=2.5 R=0 H=2 R=1; A=500

H=5 H= 5.5 H =3 H= 1 H= 4 H= 2 H= 5.5 H= 3

H= 5

R= 1; A = H =2; A = H= 4; A = H= 6;A = H=3 H= 5 H= 5 H= 1; A = H= 3; A =

750 125 500 375

R= H= H= H=

1; H= 2; A = 5; A = 6;A =

0.5; A =0 500 50 750

R= H= H= H=

1; H= 0.75; A =0 3; A = 125 5; A = 175 7

R =1; H= H= 3; A = H= 5; A = H= 3;A =

1; A 500 500 0

H= 1; A = 125 H= 4; A = 50 H= 6; A = 50

H = 5.5 375 375 H= 1; A = 750 H= 3; A = 750 H= 2; A = 125 H= 4; A = 125 H =2; A = 375 H= 4; A = 500 H= 2; A = 750 H= 5.5; A = 250

Note: Each of these default values characterizes any subsegment for which that variable is absent from the subsegment's definition. For instance, if a subsegment's definition does not mention Amount spent, the average amount spent in that subsegment is (approximately) the default value of $300.

most important variable (i.e., Hiatus) should be used to split off the least promising portion of each segment, i.e., those individuals with a long Hiatus since previous purchase. For Product A in Figure 1 the most promising Type-of-product segment (#1) has only the least promising Hiatus-based subsegment (Hiatus of five or more years) split off in subsegmentation. This scheme should continue for segments above the average; however, the bottom portion split off should become larger the closer the segment is to the middle of the hierarchy. The mirror image of this strategy should be used for segments below the middle of the hierarchy. Specifically, Hiatus should be used to split off the most promising part of these segments (e.g., the short one-year Hiatus for Type-of-product segment #15 for Product A in Figure 1). Thus for both products, we developed what we call a "trimmed pyramid segmentation" scheme.
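The trimmed pyramid allocation of hiatus subsegments can be sketched as a simple function of a type-of-product segment's distance from the middle of the hierarchy. This is an illustrative interpolation with hypothetical minimum and maximum split counts; it follows the pattern of Figure 1 (more splits near the middle, fewer at the extremes) rather than reproducing its exact assignments:

```python
def hiatus_split_count(rank, n_types, min_splits=2, max_splits=5):
    """Number of hiatus-based subsegments for a type-of-product segment:
    max_splits at the middle of the hierarchy, tapering to min_splits
    at the most and least promising extremes."""
    middle = (n_types + 1) / 2
    # 0.0 at the middle of the hierarchy, 1.0 at either extreme.
    distance = abs(rank - middle) / (middle - 1)
    return round(max_splits - distance * (max_splits - min_splits))

# For a 15-type hierarchy (as for Product A): the extremes get 2
# splits and the middle gets 5.
print([hiatus_split_count(r, 15) for r in (1, 4, 8, 12, 15)])  # [2, 3, 5, 3, 2]
```

Given the split counts, hiatus cut points would then be chosen to shave the least promising tail off segments above the middle and the most promising tip off segments below it, as the text describes.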

To assess the profit consequences of our segmentation scheme and the managers' segmentation scheme, we use the predicted roll response rates from a regression model of test response rate on Type of product, Amount, Recent, Hiatus, and Sex. We use predicted response rates from the model, rather than response rates observed in the test, to characterize the managers' performance in creating segments. We do so for two reasons. First, this provides for an "apples-to-apples" comparison with our proposed segmentation method, since the response rates for the latter must be projected using a model. Second, the observed test response rates suffer from the statistical uncertainty (regression-to-the-mean) effect discussed earlier: the variation observed across subsegments' test response rates will not fully hold up in a roll. It turns out that this effect operates to the detriment of the assessed managerial performance.


Table 2B    Managers' a Priori Segmentation Schemes: Product B (Subsegment Definitions)

Rows: Type of Product Bought Previously (ranked category). Key: H = Average Hiatus (years); A = Average Amount Spent (dollars); R: 1 = Recent Entrant, 0 = Not Recent Entrant. Defaults: R = 0, A = $300, H = 3 (if R = 1, H = 1).

Ranked category: 1 2 3 4 5 6 7 8 9
H=2 H=1; R=1 R=1 R=1 R=1 H=1.5 H=1 H=1.5 H=1 H=1.5 H=1.5 R=1 R=1 R=1; A= H=2; A= H=4; A= H=6; A= H=1.5 H=1.5 H=5 H=1; R=0 R=0

H= 4

10 11
12 13 14 15 15 (cont.) 15 (cont.) 15 (cont.) 16 17 18 19 20 21 22 23

H= H= H= H= H= H=

3.5 2 3.5 2 3.5 3

H= H= H= H= H= H=

5.5 3 5.5 3 5.5 4

H= 4 H= 4 H= 5

H= 5 H= 5 H= 6

H= 6.5 H= 6.5 H= 7.5

50 125 300 175

R= H= H= H= H= H=

1; A = 2; A = 4; A = 6; A = 3.5 3.5

300 500 750 375

R= H= H= H= H= H=

1; A = 3; A = 5; A = 6; A = 5.5 5.5

750 50 50 750

H= H= H= H= H=

1; A = 750 3; A = 300 5; A = 300 7 7.5

H = 1; A = 500 H= 3; A = 750 H = 5; A = 750

H= 1; A = 250 H= 4; A = 50 H = 6; A = 50

Note: Each of these default values characterizes any subsegment for which that variable is absent from the subsegment's definition. For instance, if a subsegment's definition does not mention Amount spent, the average amount spent in that subsegment is (approximately) the default value of $300.

That is, evaluating managers based on actual test response rates would increase substantially the degree to which our proposed segmentation method outperformed managers. We compare the profitability of the trimmed pyramid segmentation scheme to the managers' scheme and to a relatively naive scheme where we segment only on the most influential variable (i.e., Type of product). The results are shown in Table 3. For Product A, the projected profit of the managers' segmentation scheme and roll strategy is $261,760. The projected profit from the trimmed pyramid scheme is 11.85 percent greater ($292,791) than the managers' scheme and 41.15%

greater than segmenting on Type of product alone ($207,434). The results for Product B are similar. The projected profit of the managers' segmentation scheme and roll strategy is $1,125,397. The projected profit from the trimmed pyramid scheme is 10.71 percent greater ($1,245,920) than the managers' scheme and 21.99% greater than segmenting on Type of product alone ($1,021,350). These results highlight the benefit available from a considered "interplay" between models and management judgment. That is, instead of relying fully on one or the other, or on an overall average of the two, our proposed segmentation method relied on management


Figure 1    Trimmed Pyramid Segmentation: Hiatus Within Type of Product

Product A
Product Type    Number of Hiatus Subsegments    Hiatus Ranges (Years)
1               2                               1-4, 5+
2, 3            3                               1-3, 4, 5+
4, 5, 6         4                               1-2, 3, 4, 5+
7, 8, 9         5                               1, 2, 3, 4, 5+
10, 11, 12      4                               1, 2, 3, 4+
13, 14          3                               1, 2, 3+
15              2                               1, 2+

Product B
Product Type      Number of Hiatus Subsegments    Hiatus Ranges (Years)
1, 2, 3           2                               1-5, 6+
6, 7, 8           3                               1-4, 5, 6+
9, 10, 11, 12     4                               1-3, 4, 5, 6+
15, 16            4                               1, 2, 3, 4+
17, 18, 19, 20    3                               1, 2, 3+
21, 22, 23        2                               1, 2+

Note: Product types 4, 5, 13, and 14 were defined to be homogeneous on recency (i.e., "recent entrants"), so they could not be segmented further on hiatus.

judgment where those judgments are likely to be effective and substitutes the systematic benefit of a formal model where judgments are relatively ineffective. Doing so produced a benefit unrealizable without this interplay.

3.3. Interpreting Test Results and Selecting Segments for the Roll
As discussed earlier, after the test results become available the managers judgmentally predicted roll response

rates segment-by-segment. To determine how managers weight different sources of information to forecast roll response rates, we regressed managers' predicted roll response rates on two variables related to test results (the test response rate = Number of orders / Number of mailings, and the Number of orders) and five variables describing the segments (Type of product, Recent, Hiatus, Amount, and Sex). We also regress the actual roll response rates on the same variables. By comparing the coefficients for these two models we can determine


Table 3    Profit Consequences of Segmentation Schemes

                                                      Product A    Product B
Segmentation on Type of Product Alone
  Mailing size                                          978,646      903,782
  Projected Number of Responses                           4,115        8,191
  Projected Profit                                     $207,434   $1,021,350
Managers' Segmentation and Roll Strategy
  Mailing size                                          572,351      870,635
  Projected Number of Responses                           3,607        8,775
  Projected Profit                                     $261,760   $1,125,397
Trimmed Pyramid: Hiatus within Type of Product
  Mailing size                                          781,188    1,285,500
  Projected Number of Responses                           4,372       10,358
  Projected Profit                                     $292,791   $1,245,920
  Percentage Improvement over Managers                   11.85%       10.71%
  Percentage Improvement over Type of Product Alone      41.15%       21.99%

whether managers over- or underweight different types of information. (We include both the test response rate and the number of orders in the model for conceptual reasons. Note, however, that the results from this regression model should be interpreted with caution, since the correlation between these two variables is significant.) Note that managers' predicted roll response rates and the actual roll response rates are only known for those segments to which the product was actually rolled.

Our first observation is that managers appear to double count pieces of information that are already contained in the test response rate. Table 4 reports the results from these two models. For both products, managers' predicted roll response rates were significantly affected by the test response rate (Product A: p < 0.01; Product B: p < 0.01) and the Type of product (Product A: p = 0.01; Product B: p = 0.02), and were marginally affected by the number of orders (Product A: p = 0.09; Product B: p = 0.07), based on two-tailed tests of the significance of the coefficients. For both products, the actual roll response rate was only significantly affected by the test response rate (Product A: p < 0.01; Product B: p < 0.01). These results show that managers overweighted one of the FRAT descriptors (supporting P4) and overweighted the number of purchases (supporting P5).

We also find evidence that managers did not sufficiently regress test response rates to the mean. Table 4 shows how this regression effect "should" be managed (in the "Actual Roll Response" column), and how these
direct marketing managers actually use the test response in forming their forecast ("Managers' Forecast" column). For example, for Product A, a regression of actual roll response on test response has slope 0.33, while the analogous regression for managers' forecasts has slope 0.42. Although managers are shrinking their roll response estimates toward the mean (since the coefficient 0.42 is less than 1), they are overweighting the observed test response (i.e., 0.42 > 0.33). The same pattern occurs for Product B, only more dramatically. Roll response is related to test response with a slope of 0.44, but managers' forecasts exhibit a slope of 0.66. Accordingly, "high" test response segments receive a managerial roll forecast lower than the test result, but not reduced appropriately far. Conversely, managers increase their estimate of roll response for segments that tested poorly, but again not to a sufficient degree. In the face of significant direct marketing experience with previous products, and a direct marketing literature giving significant attention to this regression effect, it is clear that our managers do detect, and anticipate, regression to the mean. However, consistent with P6, we find they just do not do enough of it.

These results suggest there is opportunity for an interplay between managers' judgment and a statistical model. Since managers did not sufficiently regress test response rates and since they double count information already contained in the test response rate, we use an interplay between the model described in §2.2 and the managers' judgments to select segments for the roll. In particular, we use the model's predicted roll response


Table 4    Modeling the Managerial Forecast of Roll Response and the Relation to Test Response

Product A
                             Managers' Forecast (R2 = 0.79)           Actual Roll Response (R2 = 0.58)
Variable                     Coef.      Std. Err.    t       p        Coef.      Std. Err.    t       p
Intercept                    0.003995   0.00155     2.574   0.0135    0.002744   0.00246     1.126   0.2661
Test Response                0.419505   0.05970     7.027   0.0001    0.326318   0.09370     3.483   0.0011
Number of Test Orders        0.000697   0.00040     1.743   0.0883    0.000561   0.00065     0.869   0.3895
Type of Product             -0.000299   0.00011    -2.646   0.0113   -0.000226   0.00017    -1.329   0.1907
Recent                      -0.001129   0.00118    -0.958   0.3435   -0.000831   0.00185    -0.449   0.6557
Hiatus                      -0.000220   0.00034    -0.640   0.5258   -0.000735   0.00054    -1.360   0.1806
Amount                      -0.000000   0.00000    -0.064   0.9494    0.000003   0.00000     0.788   0.4350
Sex (0 = Female; 1 = Male)   0.000739   0.00101     0.733   0.4677    0.002689   0.00158     1.689   0.0965

Product B
                             Managers' Forecast (R2 = 0.96)           Actual Roll Response (R2 = 0.85)
Variable                     Coef.      Std. Err.    t       p        Coef.      Std. Err.    t       p
Intercept                    0.001621   0.00119     1.366   0.1773    0.003443   0.00160     2.148   0.0359
Test Response                0.658081   0.04039    16.293   0.0001    0.435512   0.05455     7.983   0.0001
Number of Test Orders        0.000075   0.00004     1.837   0.0714    0.000043   0.00005     0.783   0.4367
Type of Product             -0.000134   0.00005    -2.459   0.0169   -0.000128   0.00007    -1.730   0.0890
Recent                       0.000274   0.00075     0.365   0.7167    0.000920   0.00101     0.908   0.3678
Hiatus                      -0.000020   0.00018    -0.109   0.9133   -0.000303   0.00025    -1.229   0.2241
Amount                       0.000001   0.00000     0.638   0.5258   -0.000005   0.00000    -1.648   0.1048
Sex (0 = Female; 1 = Male)  -0.000308   0.00058    -0.532   0.5966    0.000259   0.00078     0.331   0.7417

rates to identify segments that the managers would roll to (i.e., their predicted roll response rates are above the profitability cutoff), but the model predicts will not be profitable. Ideally, we would also want to use the model to identify segments that the managers predicted would be unprofitable, but the model predicts will be profitable. Unfortunately, since we only know the roll response rate for segments rolled to by the managers, we can only evaluate the profit consequences of the interplay between models and managers for the former. The results are presented in Table 5. For Product A, the managers rolled to 52 segments. This produced 572,350 individual contacts, 2,452 orders, and a profit of $126,594. The model recommended rolling to only 46 of these 52 segments. This would have produced 513,528 contacts, 2,308 orders, and $126,243 profit. In this case the model recommends contacting almost 59,000 fewer people, and the resultant profit was within $400 (or 0.3 percent) of that generated by the managers. In light of direct marketers' concerns regarding list wearout from repeated contacts, the benefit from "saving" these

59,000 individuals as fresh prospects for another offering can also be substantial. In short, where the manager said "roll" and the model disagreed, following the model led to a savings in number of contacts without substantially reducing profit. The results are even more promising for Product B. Of the 73 segments actually rolled to and hence recommended for roll by the managers, only 55 are selected by the model for pursuit. The model's policy results in 155,000 fewer contacts, and therefore individuals saved for another offering, and $10,000 higher profit ($499,544 versus $489,540). Of course, for each product, the model also recommended rolling to several segments in opposition to the manager's prediction. By definition, the model predicts even further profit from such segments. However, the response that would arise in these segments is unobservable since they were not selected for the roll. Even though we cannot document empirically these potential gains here, we can at least conclude that the model clearly has benefit in highlighting specific segment roll

MANAGEMENT SCIENCE / Vol. 44, No. 5, May 1998

MORWITZ AND SCHMITTLEIN  Testing New Direct Marketing Offerings

Table 5    Outcomes Associated with Manager and Model Roll Policies

                                          Product A                Product B
                                      Manager     Model       Manager     Model
Segments recommended for roll
  (among those rolled)                     52        46            73        55
Percent of segments with actual
  roll response:
    Above cutoff                           62        63            36        42
    Below cutoff                           38        37            64        58
Number of contacts recommended        572,350   513,528       870,635   715,632
Response rate in roll                 0.00428   0.00449       0.00550   0.00633
Orders needed to break even             1,370     1,229         1,802     1,481
Orders received                         2,452     2,308         4,787     4,527
Profit from roll                     $126,594  $126,243      $489,540  $499,544
List members saved from roll               --    58,822            --   155,003
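The profit figures in Table 5 appear consistent with a simple accounting identity: profit from a roll equals the orders received in excess of break-even, times the margin per order. The per-order margins used below (about $117 for Product A and $164 for Product B) are backed out from the table rather than stated in the text, so treat them as inferred assumptions:

```python
def roll_profit(orders_received, breakeven_orders, margin_per_order):
    """Profit from a roll: excess orders over break-even, times unit margin."""
    return (orders_received - breakeven_orders) * margin_per_order

# Product A (per-order margin inferred as $117)
print(roll_profit(2452, 1370, 117))  # manager policy -> 126594
print(roll_profit(2308, 1229, 117))  # model policy   -> 126243

# Product B (per-order margin inferred as $164)
print(roll_profit(4787, 1802, 164))  # manager policy -> 489540
print(roll_profit(4527, 1481, 164))  # model policy   -> 499544
```

The computed values reproduce all four profit entries in the table, which makes the manager-versus-model comparison easy to audit from the reported counts alone.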

recommendations by managers that are of dubious merit. In this direct marketing context, we think it important to stress the benefit represented by a model that can reliably and effectively replace the need for repetitious elicitation of forecasts from experienced managers. While the managers and the model seem to perform equivalently as stand-alone generators of roll decisions based on test results, the managers' expertise is hard-won over time and can be lost through illness or turnover. The model needs no expertise beyond the test results (and, of course, the construction of segments in the first place), and so is immediately and consistently available.

4. Conclusion
This paper has demonstrated the accuracy of management judgment for routine decisions central to the design and analysis of direct marketing tests. For the two product offerings examined here, our results suggest that an interplay between managerial judgment and simple statistical models leads to increased profits. In the test design decision, an interplay increases profits by more than 10 percent. For the decisions related to interpretation of test results, gains from the use of the model's implications are modest, but can be well worth securing. Perhaps equally important, the qualitative nature of optimal policies may be a useful input to managers' judgments.

Most managerial decisions are based on a rich assortment of individual, though not necessarily separate, assessments. Instead of seeking to replace managers with a model, or to average the final decision from a model with that of a manager, we have sought to identify those individual assessments that managers should do relatively well, and add the use of simple, appropriate models for the remaining assessments needed. That is what we mean by an interplay between models and managers. The application reported in this paper indicates that managerial effectiveness for making individual assessments can be evaluated a priori based on both the set of behavioral decision research findings in the literature and the consideration of managers' previous opportunity for learning, i.e., repetition and feedback.

Our two key direct marketing test decisions differ structurally in two ways, and these differences affect the opportunity for model/manager interplay. Consider the second decision (prediction of roll response rates from test results). The first key characteristic is the effectiveness of managers' past experience, and here seeing the actual roll results provides a great opportunity for learning. Second, the decision requires a single numeric forecast for each segment, and when forecasting a single number it is not clear how one can get the best part of the managers' judgments (i.e., intuition, "broken leg cues") without also getting the ineffective parts. The same is true for the model. Pooling the model and managers by a weighted average of their separate forecasts

does not eliminate the limitations or drawbacks in either of those forecasts. As a result, unsurprisingly, we did not find such a weighted average to represent an improvement. (These results are not reported in detail in this paper. For each product, for prediction of segment response rates, the weighted average of model-plus-manager was outperformed by the managers' judgments.)

The first direct marketing test decision (designing a priori segments) differed from the second with respect to both key characteristics just mentioned. That is, managers do not have easy opportunities for learning here, because there is no straightforward mechanism for their seeing, after the test/roll sequence is over, what the consequences of a different segmentation scheme would have been. In addition, by tradition the segmentation process is hierarchical, creating segments based on one variable and then subsegmenting based on one or more additional variables. In such a hierarchical process, use of management judgment at some levels of the hierarchy and a formal model at others is straightforward. As a result, for this segment-formation decision the model/manager interplay substantially outperforms what could be achieved with either the model or manager alone. The two key characteristics of this segment-formation decision are its hierarchical structure and the differential effectiveness of model-versus-manager across levels of the hierarchy. Additional managerial decision settings with the same two characteristics are considered in the rule-based forecasting work of Collopy and Armstrong (1992). Bunn (1996), Bunn and Salo (1996), and Bunn and Wright (1991) offer further discussion of model/manager interplay that goes beyond averaging numerical predictions from each.

Acquisition of good managerial experience can be costly in time and money. The experience can be lost all too quickly with employee turnover, or held "hostage" by an employee with this specialized knowledge.
A model does not share these problems, and further, can sometimes be used to help train new managers in judgment. In addition, direct marketing is an activity uniquely associated with detailed and quick feedback, enabling one to learn "what worked." Accordingly, direct marketers have a history of commitment to this testing/learning cycle. For at least the product offerings examined here, it has resulted in a great deal of specialized managerial knowledge. But when it is time to integrate that knowledge in decision making or forecasting, this base of experience does not appear to eliminate some of the systematic errors most commonly observed in the behavioral decision literature.

Certainly it is possible to construct richer and more complex models for these direct marketing decisions, and we look forward to seeing such work as direct marketing receives greater attention from the marketing science research community. We hope, however, that this research serves as a stimulus to benchmark such models against the actual managerial judgments, policies, and heuristics currently in place. Our results suggest that great improvements in the segment construction and test-to-roll decisions will be difficult to obtain, relative to the current performance of managers and simple models. However, there is a great opportunity to explore the use of model-based feedback to managers as a way to improve their judgments. We hope the results here will help stimulate such work.3
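The weighted-average pooling of forecasts discussed in this section (in the spirit of Blattberg and Hoch's "50% Model + 50% Manager") is a convex combination of the two predictions. A small illustrative sketch; the response rates and weights are hypothetical, not drawn from the study:

```python
def pooled_forecast(model_rate, manager_rate, w_model=0.5):
    """Convex combination of model and manager response-rate forecasts."""
    assert 0.0 <= w_model <= 1.0
    return w_model * model_rate + (1.0 - w_model) * manager_rate

print(pooled_forecast(0.0040, 0.0060))        # equal weights
print(pooled_forecast(0.0040, 0.0060, 0.75))  # lean toward the model
```

As the text notes, such pooling inherits the weaknesses of both inputs: whatever systematic error each forecast carries is simply averaged in, not removed.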
3 The authors thank the Marketing Science Institute for their support of this project, and thank the departmental editor, the associate editor, and the reviewers for their helpful comments throughout the review process.

References
Abramowitz, M. and I. A. Stegun, Handbook of Mathematical Functions, Dover Publications, New York, 1965.
Armstrong, J. S., Long-Range Forecasting, John Wiley, New York, 1985.
Blattberg, R. C. and J. Deighton, "Interactive Marketing: Exploiting the Age of Addressability," Sloan Management Rev., Fall (1991), 5-14.
—— and S. Hoch, "Database Models and Managerial Intuition: 50% Model + 50% Manager," Management Sci., 36 (1990), 887-899.
Bly, R. W., "The Six Most Deadly Causes of Direct Marketing Disaster," Direct Marketing, 52, 3 (1989), 69.
Bowman, E., "Consistency and Optimality in Management Decision Making," Management Sci., 9 (1963), 310-321.
Braun, P. A. and I. Yaniv, "A Case Study of Expert Judgment: Economists' Probabilities Versus Base-rate Model Forecasts," J. Behavioral Decision Making, 5 (1992), 217-231.
Bunn, D., "Non-traditional Methods of Forecasting," European J. Oper. Res., 92 (1996), 528-536.
—— and A. Salo, "Adjustment of Forecasts with Model Consistent Expectations," International J. Forecasting, 12, 1 (1996), 163-170.
—— and G. Wright, "Interaction of Judgemental and Statistical Forecasting Methods: Issues and Analysis," Management Sci., 37, 5 (1991), 501-518.

Business Week, "Database Marketing: A Potent New Tool for Selling," September 7 (1994), 56-62.
Camerer, C. F. and E. J. Johnson, "The Process-Performance Paradox in Expert Judgment: How Can Experts Know So Much and Predict So Badly?" in K. A. Ericsson and J. Smith (Eds.), Toward a General Theory of Expertise: Prospects and Limits, Cambridge University Press, Cambridge, UK, 1991.
Collopy, F. and J. S. Armstrong, "Rule-Based Forecasting: Development and Validation of an Expert Systems Approach to Combining Time Series Extrapolations," Management Sci., 38 (1992), 1394-1414.
Cox, A. D. and J. O. Summers, "Heuristics and Biases in the Intuitive Projection of Retail Sales," J. Marketing Res., 24 (1987), 290-297.
Dawes, R. and B. Corrigan, "Linear Models in Decision Making," Psychological Bulletin, 81 (1974), 95-106.
Einhorn, H. J., "Expert Judgment: Some Necessary Conditions and an Example," J. Applied Psychology, 59, 5 (1974), 562-571.
—— and R. M. Hogarth, "Confidence in Judgment: Persistence of the Illusion of Validity," Psychological Rev., 85 (1978), 395-416.
Estes, W. K., "The Cognitive Side of Probability Learning," Psychological Rev., 83 (1976), 37-64.
Goldberg, L. R., "Simple Models or Simple Processes? Some Research on Clinical Judgments," American Psychologist, 23 (1968), 483-496.
Greene, J., Consumer Behavior Models for Non-Statisticians, Praeger, New York, 1982.
Hoch, S. J. and D. A. Schkade, "A Psychological Approach to Decision Support Systems," Management Sci., 42, January (1996), 51-64.
Johnson, E. J., "Expertise and Decision Under Uncertainty: Performance and Process," in M. T. H. Chi, R. Glaser and M. J. Farr (Eds.), The Nature of Expertise, Lawrence Erlbaum Associates, Hillsdale, NJ, 1988.

Kahneman, D., P. Slovic, and A. Tversky, Judgment Under Uncertainty: Heuristics and Biases, Cambridge University Press, Cambridge, 1982.
—— and A. Tversky, "Subjective Probability: A Judgment of Representativeness," Cognitive Psychology, 3 (1972), 430-454.
—— and ——, "On the Psychology of Prediction," Psychological Rev., 80 (1973), 237-251.
Kalwani, M. U., "Maximum Likelihood Estimation of Zero-Order Models Given Variable Numbers of Purchases per Household," J. Marketing Res., 27 (1980), 547-551.
Reiss, A. J., "Overriding the Test Mailing," Direct Marketing, 52, May (1989), 85-87.
Roberts, M. L. and P. D. Berger, Direct Marketing Management, Prentice Hall, Englewood Cliffs, NJ, 1989.
Schmittlein, D., "Surprising Inferences from Unsurprising Observations: Do Conditional Expectations Really Regress to the Mean?" The American Statistician, 43 (1989), 176-183.
Shepard, D., The New Direct Marketing, Dow Jones-Irwin, Homewood, IL, 1990.
Stone, B., Successful Direct Marketing Methods, Crain Books, Chicago, IL, 1979.
Tversky, A. and D. Kahneman, "Evidential Impact of Base Rates," in D. Kahneman, P. Slovic and A. Tversky (Eds.), Judgment Under Uncertainty: Heuristics and Biases, Cambridge University Press, New York, 1982.
Wierenga, B. and G. H. van Bruggen, "The Integration of Marketing Problem-Solving Modes and Marketing Management Support Systems," J. Marketing, 61, July (1997), 21-37.
Wind, Y., Product Policy: Concepts, Methods and Strategy, Addison-Wesley, Reading, MA, 1982.

Accepted by Jehoshua Eliashberg; received November 4, 1994. This paper has been with the authors 16 months for 3 revisions.

