Gains Chart

The gains chart provides a visual summary of the usefulness of the information provided by one or more statistical models for predicting a binomial (categorical) outcome variable (dependent variable); for multinomial (multiple-category) outcome variables, gains charts can be computed for each category. Specifically, the chart summarizes the utility that one can expect by using the respective predictive models, as compared to using baseline information only.

The gains chart is applicable to most statistical methods that compute predictions (predicted classifications) for binomial or multinomial responses. In STATISTICA, gains charts can be computed in various modules, including General Classification and Regression Trees (GC&RT), GCHAID, Generalized Linear Models (Logit and Probit models for binomial responses), General Discriminant Analysis (GDA) (for binomial responses), etc. The Rapid Deployment of Predictive Models module will compute simple and overlaid gains charts (for multiple predictive models) based on models trained and deployed via PMML. This and similar summary charts (see Lift Chart) are commonly used in data mining projects when the dependent or outcome variable of interest is binomial or multinomial in nature.

Example. To illustrate how the gains chart is constructed, consider this example. Suppose you have a mailing list of previous customers of your business, and you want to offer to those customers an additional service by mailing an elaborate brochure and other materials describing the service. During previous similar mail-out campaigns, you collected useful information about your customers (e.g., demographic information, previous purchasing patterns) that you could relate to the response rate, i.e., whether the respective customers responded to your mail solicitation and the type of order they placed.

Given the baseline response rate and the cost of the mail-out, sending the offer to all customers would result in a net loss. Hence, you want to use statistical analyses to help you identify the customers who are most likely to respond. Suppose you use STATISTICA General Classification and Regression Trees (GC&RT) to build such a model based on the data collected in the previous mail-out campaign. You can now select only the 10 percent of the customers from the mailing list who, according to the predictions of the GC&RT model, are most likely to respond. Next, you can compute the number of accurately predicted responses, relative to the total number of responses in the sample; this percentage is the gain due to using the model. Put another way, of those customers who responded in the current sample, you can accurately identify ("capture") y percent by selecting from the customer list the top 10% who were predicted by the model with the greatest certainty to respond (where y is the gains value).
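The gains value for a single cutoff can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above, not the computation performed by STATISTICA; the function name gain_at, the model scores, and the response data are all hypothetical.

```python
def gain_at(scores, responded, fraction):
    """Share of all actual responders found in the top `fraction`
    of cases, ranked by predicted score (highest first)."""
    ranked = sorted(zip(scores, responded), key=lambda pair: pair[0], reverse=True)
    n_selected = round(fraction * len(ranked))
    captured = sum(flag for _, flag in ranked[:n_selected])
    return captured / sum(responded)

# Hypothetical sample: 10 customers, 3 of whom responded (1 = responded).
scores = [0.91, 0.85, 0.40, 0.30, 0.77, 0.10, 0.22, 0.05, 0.60, 0.15]
responded = [1, 0, 0, 0, 1, 0, 0, 0, 1, 0]

top10 = gain_at(scores, responded, 0.10)  # gain when mailing only the top 10%
print(f"Gain at top 10%: {top10:.1%}")
```

In this made-up sample, the single highest-scored customer is one of the three responders, so selecting the top 10% captures one third of all responses, compared with the 10% expected from selecting customers at random.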

Analogous values can be computed for each percentile of the population (customers on the mailing list). You could compute separate gains values for selecting the top 20% of customers who are predicted to be among likely responders to the mail campaign, the top 30%, etc. Hence, the gains values for different percentiles can be connected by a line that will typically ascend and merge with the baseline (the gain expected when customers are selected at random) when all customers (100%) are selected.
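Repeating the calculation at each cutoff gives the points of the gains line. The sketch below assumes cutoffs at every 10% of the list and uses hypothetical scores and responses; the baseline at each cutoff is taken to be the selected fraction itself, i.e., what random selection would be expected to capture.

```python
def gains_curve(scores, responded, steps=10):
    """Return (fraction selected, gain) pairs for evenly spaced cutoffs."""
    # Indices of cases ordered from highest to lowest predicted score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total = sum(responded)
    curve = []
    for k in range(1, steps + 1):
        n_selected = round(k * len(order) / steps)
        captured = sum(responded[i] for i in order[:n_selected])
        curve.append((k / steps, captured / total))
    return curve

# Hypothetical sample: 10 customers, 3 of whom responded (1 = responded).
scores = [0.91, 0.85, 0.40, 0.30, 0.77, 0.10, 0.22, 0.05, 0.60, 0.15]
responded = [1, 0, 0, 0, 1, 0, 0, 0, 1, 0]

curve = gains_curve(scores, responded)
for fraction, gain in curve:
    # The baseline gain equals the fraction of customers selected.
    print(f"top {fraction:4.0%}: gain {gain:6.1%}  baseline {fraction:6.1%}")
```

The gain never decreases as more customers are selected, and it equals the baseline at 100%, where every responder is necessarily included.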

If more than one predictive model is used, multiple gains charts can be overlaid (as shown in the illustration above) to provide a graphical summary of the utility of different models.
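The same comparison can be made numerically by computing the gains values of each model over the same cases and listing them side by side. The following sketch assumes two hypothetical models ("model_a" and "model_b") with made-up scores; it prints a table rather than drawing a chart.

```python
def gains_values(scores, responded, steps=10):
    """Gain at each evenly spaced cutoff, ranking cases by score (highest first)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total = sum(responded)
    values = []
    for k in range(1, steps + 1):
        n_selected = round(k * len(order) / steps)
        values.append(sum(responded[i] for i in order[:n_selected]) / total)
    return values

# Hypothetical observed responses and scores from two hypothetical models.
responded = [1, 0, 0, 0, 1, 0, 0, 0, 1, 0]
models = {
    "model_a": [0.91, 0.85, 0.40, 0.30, 0.77, 0.10, 0.22, 0.05, 0.60, 0.15],
    "model_b": [0.70, 0.90, 0.50, 0.80, 0.30, 0.20, 0.10, 0.60, 0.40, 0.05],
}

curves = {name: gains_values(s, responded) for name, s in models.items()}

print("selected  baseline  " + "  ".join(f"{name:>8}" for name in curves))
for k in range(10):
    fraction = (k + 1) / 10
    row = "  ".join(f"{curves[name][k]:8.1%}" for name in curves)
    print(f"{fraction:8.0%}  {fraction:8.1%}  {row}")
```

With these made-up scores, model_a captures all responders within the top 40% of the list, whereas model_b needs the top 70%; at 100% both lines meet the baseline.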

See also Rapid Deployment of Predictive Models and Lift Chart.