Computational Methods - Specifying the Criteria for Predictive Accuracy

The goal of classification tree analysis, simply stated, is to obtain the most accurate prediction possible. Unfortunately, an operational definition of accurate prediction is hard to come by. To solve the problem of defining predictive accuracy, the problem is "stood on its head," and the most accurate prediction is operationally defined as the prediction with the minimum costs. The term costs need not seem mystifying. In many typical applications, costs simply correspond to the proportion of misclassified cases. The notion of costs was developed as a way to generalize, to a broader range of prediction situations, the idea that the best prediction has the lowest misclassification rate.

The need for minimizing costs, rather than just the proportion of misclassified cases, arises when some predictions that fail are more catastrophic than others, or when some predictions that fail occur more frequently than others. The costs to a gambler of losing a single bet (or prediction) on which the gambler's whole fortune is at stake are greater than the costs of losing many bets (or predictions) on which a tiny part of the gambler's fortune is at stake. Conversely, the costs of losing many small bets can be larger than the costs of losing just a few bigger bets. One should spend proportionately more effort in minimizing losses on bets where losing (making errors in prediction) costs more.
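The way costs generalize the misclassification rate can be sketched in a few lines. The example below is an illustration only: it assumes that the overall cost is the sum, over each kind of error, of the prior probability of the true class, the rate at which that error occurs within the class, and the cost assigned to the error. The function name `expected_cost` and all numbers are hypothetical.

```python
def expected_cost(priors, error_rates, costs):
    """Sum priors[i] * error_rates[i][j] * costs[i][j] over all errors (i != j).

    error_rates[i][j] is the fraction of true-class-i cases predicted as class j.
    costs[i][j] is the cost of predicting class j when the true class is i.
    """
    total = 0.0
    for i, prior in enumerate(priors):
        for j, rate in enumerate(error_rates[i]):
            if i != j:
                total += prior * rate * costs[i][j]
    return total


priors = [0.5, 0.5]
error_rates = [[0.9, 0.1],
               [0.2, 0.8]]

# With equal costs, the expected cost is simply the misclassification rate.
equal_costs = [[0, 1], [1, 0]]
rate_equal = expected_cost(priors, error_rates, equal_costs)

# Making one kind of error ten times as costly changes the total,
# even though the error rates themselves are unchanged.
unequal_costs = [[0, 1], [10, 0]]
cost_unequal = expected_cost(priors, error_rates, unequal_costs)

print(rate_equal, cost_unequal)
```

With equal costs the result is the ordinary misclassification rate (about 0.15 here); with the unequal costs the same predictions are charged about 1.05, which is the sense in which the rarer but more catastrophic error dominates.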

Priors. Minimizing costs, however, does correspond to minimizing the proportion of misclassified cases when Prior probabilities are taken to be proportional to the class sizes and when Misclassification costs are taken to be equal for every class. We will address Prior probabilities first. Prior probabilities, or a priori probabilities, specify how likely it is, without using any knowledge of the values for the predictor variables in the model, that a case or object will fall into one of the classes. For example, in an educational study of high school drop-outs, it may happen that, overall, there are fewer drop-outs than students who stay in school (i.e., there are different base rates); thus, the a priori probability that a student drops out is lower than the a priori probability that a student remains in school.

The a priori probabilities used in minimizing costs can greatly affect the classification of cases or objects. If differential base rates are not of interest for the study, or if one knows that there are about an equal number of cases in each class, then one would use equal priors. If the differential base rates are reflected in the class sizes (as they would be, if the sample is a probability sample), then one would use Prior probabilities estimated by the class proportions of the sample. Finally, if one has specific knowledge about the base rates (for example, based on previous research), then one would specify Prior probabilities in accordance with that knowledge. For example, a priori probabilities for carriers of a recessive gene could be specified as twice as high as for individuals who display a disorder caused by the recessive gene. The general point is that the relative size of the Prior probabilities assigned to each class can be used to "adjust" the importance of misclassifications for each class. Minimizing costs corresponds to minimizing the overall proportion of misclassified cases when Prior probabilities are taken to be proportional to the class sizes (and Misclassification costs are taken to be equal for every class), because prediction should be better in larger classes to produce an overall lower misclassification rate.
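A small sketch can show how strongly the choice of priors can affect classification. It assumes, in the style of Breiman et al. (1984), that the score for class j in a terminal node is the prior for class j times the fraction of all class-j cases that fall into that node, and that the node is assigned to the class with the highest score. The function name `node_class` and the drop-out counts are hypothetical.

```python
def node_class(priors, node_counts, class_sizes):
    """Return the index of the class a node is assigned to.

    Score for class j (an assumption of this sketch):
        priors[j] * node_counts[j] / class_sizes[j]
    """
    scores = [
        prior * in_node / size
        for prior, in_node, size in zip(priors, node_counts, class_sizes)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical sample: class 0 = stays in school, class 1 = drops out.
class_sizes = [900, 100]
node_counts = [30, 20]          # cases of each class falling into one node

# Priors estimated from the class proportions of the sample.
estimated = [size / sum(class_sizes) for size in class_sizes]
label_estimated = node_class(estimated, node_counts, class_sizes)

# Equal priors: each class counts the same regardless of its size.
equal = [0.5, 0.5]
label_equal = node_class(equal, node_counts, class_sizes)

print(label_estimated, label_equal)
```

With estimated priors the node is labeled with the majority class in the node (class 0). With equal priors the same node is labeled class 1, because it holds 20% of all drop-outs but only about 3% of all students who stay.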

Misclassification costs. Sometimes more accurate classification is desired for some classes than others for reasons unrelated to relative class sizes. Regardless of their relative frequency, carriers of a disease who are contagious to others might need to be more accurately predicted than carriers of the disease who are not contagious to others. If one assumes that little is lost in avoiding a non-contagious person (avoiding a person who is non-contagious does not pose much of a danger or threat, and implies little cost) but much is lost in not avoiding a contagious person (one might acquire the disease by erroneously approaching and interacting with a contagious person), then higher Misclassification costs could be specified for misclassifying a contagious carrier as non-contagious than for misclassifying a non-contagious person as contagious. But to reiterate, minimizing costs corresponds to minimizing the proportion of misclassified cases when Prior probabilities are taken to be proportional to the class sizes and when Misclassification costs are taken to be equal for every class.
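The effect of unequal Misclassification costs on a single prediction can be sketched as follows. The sketch assumes that class probabilities for a case are already available and that the predicted class is the one with the lowest expected cost. The function name `min_cost_class`, the probabilities, and the cost of 10 are hypothetical values chosen for illustration.

```python
def min_cost_class(class_probs, costs):
    """Return the index of the prediction with the lowest expected cost.

    costs[i][j] is the cost of predicting class j when the true class is i.
    """
    n = len(class_probs)
    expected = [
        sum(class_probs[i] * costs[i][j] for i in range(n))
        for j in range(n)
    ]
    return min(range(n), key=expected.__getitem__)


# Class 0 = contagious, class 1 = non-contagious (hypothetical probabilities).
class_probs = [0.2, 0.8]

# Equal costs: the more probable class is predicted.
equal_costs = [[0, 1],
               [1, 0]]
pred_equal = min_cost_class(class_probs, equal_costs)

# Mistaking a contagious carrier for a non-contagious one costs 10 times more.
unequal_costs = [[0, 10],
                 [1, 0]]
pred_unequal = min_cost_class(class_probs, unequal_costs)

print(pred_equal, pred_unequal)
```

Under equal costs the case is predicted to be non-contagious (class 1). Under the unequal costs the expected cost of predicting non-contagious is 0.2 x 10 = 2, against 0.8 x 1 = 0.8 for predicting contagious, so the prediction switches to class 0 even though that class is less probable.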

Case weights. A little less conceptually, the use of case weights on a weighting variable as case multipliers for aggregated data sets is also related to the issue of minimizing costs. Interestingly, as an alternative to using case weights for aggregated data sets, one could specify appropriate Prior probabilities and/or Misclassification costs and produce the same results while avoiding the additional processing required to analyze multiple cases with the same values for all variables. Suppose that in an aggregated data set with two classes having an equal number of cases, there are case weights of 2 for all the cases in the first class, and case weights of 3 for all the cases in the second class. If you specify Prior probabilities of .4 and .6, respectively, specify equal Misclassification costs, and analyze the data without case weights, you will get the same misclassification rates as you would get if you specify priors estimated by the class sizes, specify equal Misclassification costs, and analyze the aggregated data set using the case weights. You would also get the same misclassification rates if you specify Prior probabilities to be equal, specify the costs of misclassifying class 1 cases as class 2 cases to be 2/3 of the costs of misclassifying class 2 cases as class 1 cases, and analyze the data without case weights.
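The arithmetic behind this example can be checked with a short sketch. It assumes that the cost of a tree is the sum over classes of prior times within-class error rate times the cost of misclassifying that class. The helper `cost`, the class size of 50, and the error rates are hypothetical. Exact fractions are used so that the comparison does not depend on floating-point rounding.

```python
from fractions import Fraction

n_per_class = 50                 # equal number of aggregated cases per class
weights = [2, 3]                 # case weights for class 1 and class 2

# Priors implied by treating the weights as case multipliers.
weighted_sizes = [w * n_per_class for w in weights]
priors_from_weights = [Fraction(s, sum(weighted_sizes)) for s in weighted_sizes]


def cost(priors, class_error_rates, class_costs):
    """Sum of prior * within-class error rate * cost of misclassifying that class."""
    return sum(p * e * c for p, e, c in zip(priors, class_error_rates, class_costs))


equal_priors = [Fraction(1, 2), Fraction(1, 2)]
costs_two_thirds = [Fraction(2, 3), Fraction(1)]

# Two hypothetical candidate trees with different within-class error rates.
candidates = [
    [Fraction(1, 10), Fraction(1, 5)],
    [Fraction(3, 10), Fraction(1, 20)],
]

ratios = []
for errors in candidates:
    with_priors = cost(priors_from_weights, errors, [1, 1])
    with_costs = cost(equal_priors, errors, costs_two_thirds)
    ratios.append(with_costs / with_priors)

print(priors_from_weights, ratios)
```

The weights imply priors of 2/5 and 3/5, that is, .4 and .6. The equal-priors specification with a 2/3 cost ratio yields a cost that is always 5/6 of the cost under those priors, so both specifications rank candidate trees identically and therefore lead to the same misclassification rates.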

In the Classification Trees module, case weights are treated strictly as case multipliers. The misclassification rates from an analysis of an aggregated data set using case weights will be identical to the misclassification rates from the same analysis where the cases have been duplicated in the data file the specified number of times. A full range of options is available for specifying priors. Prior probabilities can be specified to be Estimated, Equal, or User-specified. Misclassification costs can be specified to be Equal or User-specified. When Misclassification costs are User-specified, Adjusted priors are computed using the procedures described in Breiman et al. (1984), and the analysis proceeds as if the Adjusted priors and Equal Misclassification costs were specified for the analysis.
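As a rough illustration of how priors can absorb costs, the sketch below assumes the altered-priors formula commonly attributed to Breiman et al. (1984): each prior is multiplied by the cost of misclassifying a case of that class, and the products are rescaled to sum to 1. This is an assumption about the general technique, not a description of the module's actual code, and the function name `adjusted_priors` is hypothetical.

```python
from fractions import Fraction


def adjusted_priors(priors, class_costs):
    """Fold per-class misclassification costs into the priors.

    adjusted[j] = priors[j] * class_costs[j] / sum_i(priors[i] * class_costs[i])
    where class_costs[j] is the cost of misclassifying a class-j case.
    """
    weighted = [p * c for p, c in zip(priors, class_costs)]
    total = sum(weighted)
    return [w / total for w in weighted]


# Equal priors, with misclassifying a class 1 case costing 2/3 as much
# as misclassifying a class 2 case.
adj = adjusted_priors(
    [Fraction(1, 2), Fraction(1, 2)],
    [Fraction(2, 3), Fraction(1)],
)
print(adj)
```

For these inputs the adjusted priors come out as 2/5 and 3/5, the same .4 and .6 used in the case-weights example above, which is consistent with the claim that the two specifications give the same misclassification rates.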

The relationships between Prior probabilities, Misclassification costs, and case weights become quite complex in all but the simplest situations (for discussions, see Breiman et al., 1984; Ripley, 1996). In analyses where minimizing costs corresponds to minimizing the misclassification rate, however, these issues need not cause any concern. Prior probabilities, Misclassification costs, and case weights are brought up here to illustrate the wide variety of prediction situations that can be handled using the concept of minimizing costs, as compared to the rather limited (but probably typical) prediction situations that can be handled using the narrower (but simpler) idea of minimizing misclassification rates. Furthermore, minimizing costs is an underlying goal of classification tree analysis, and is explicitly addressed in the fourth and final basic step in classification tree analysis, where, in trying to select the "right-sized" tree, one chooses the tree with the minimum estimated costs. Depending on the type of prediction problem to be solved, understanding the idea of reduction of estimated costs may be important for understanding the results of the analysis.