Interactive Trees (C&RT, CHAID) Introductory Overview

The STATISTICA Interactive Trees (C&RT, CHAID) module builds ("grows") classification and regression trees as well as CHAID trees based on automatic (algorithmic) methods, on user-defined rules and criteria specified via a highly interactive graphical user interface (brushing tools), or on combinations of both. The purpose of the module is to provide a highly interactive environment for building classification or regression trees (via classic C&RT methods or CHAID), so that users can try various predictors and split criteria in combination with almost all of the automatic tree-building functionality provided in the General Classification and Regression Trees (GC&RT) and General CHAID Models (GCHAID) modules of STATISTICA.

The Interactive Trees (C&RT, CHAID) module can be used to build trees for predicting continuous dependent variables (regression) and categorical dependent variables (classification). The program supports the classic C&RT algorithm popularized by Breiman et al. (Breiman, Friedman, Olshen, & Stone, 1984; see also Ripley, 1996) as well as the CHAID algorithm (Chi-square Automatic Interaction Detector; see Kass, 1980).

Unique Advantages of the Interactive Trees (C&RT, CHAID) Module

While much of the functionality of the Interactive Trees (C&RT, CHAID) module can be found in other tree-building procedures of STATISTICA and STATISTICA Data Miner, there are a number of unique aspects to this program:

  • The program is particularly optimized for very large data sets, and in many cases the raw data do not have to be stored locally for the analyses.

  • Because the Interactive Trees module does not support ANCOVA-like design matrices, it is more flexible in its handling of missing data. For example, in CHAID analyses, the program handles predictors one at a time to determine the best (next) split, whereas in the General CHAID (GCHAID) Models module, observations with missing data for any categorical predictor are eliminated from the analysis. See also Missing Data in GC&RT, GCHAID, and Interactive Trees for additional details.

  • You can perform "what-if" analyses by interactively deleting individual branches, growing other branches, and observing various results statistics for the different trees (tree models).

  • You can grow some parts of the tree automatically but manually specify splits for other branches or nodes. For example, if certain predictor variables cannot, in practice, be measured easily or economically (e.g., information on personal Income is usually difficult to obtain in questionnaire surveys), you can find and specify alternative predictors and splits for nodes to avoid such variables (e.g., replace Income with Number of rooms in primary residence).

  • You can define specific splits. This is useful when you want to build simple and parsimonious solutions that can easily be communicated and implemented (e.g., a split at Income < 20,345 is less "convenient" than a split at Income < 20,000; see the sketch following this list).

  • You can quickly copy trees into new projects to explore alternative splits and methods for growing branches.

  • You can save entire trees (projects) for later use. When you reload the tree projects, the tree will be restored to the exact state as when it was saved.
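
As noted above, the convenience of a rounded split can be checked directly. STATISTICA's brushing tools are GUI-based; the following minimal Python sketch (all data and thresholds are synthetic and hypothetical) merely illustrates the underlying idea of comparing an algorithmically suggested threshold with a hand-rounded one before committing to the simpler rule.

```python
# Sketch: compare an "exact" threshold with a hand-rounded one; if the node
# statistics barely change, the simpler rule is easier to communicate.
# All data and thresholds here are synthetic and hypothetical.
import numpy as np

rng = np.random.default_rng(0)
income = rng.gamma(2.0, 15000, size=1000)
purchase = (income + rng.normal(0, 10000, size=1000) > 30000).astype(int)

for threshold in (20345.0, 20000.0):
    left = income < threshold
    print(f"Income < {threshold:9,.0f}: "
          f"n={left.sum()}, P(buy)={purchase[left].mean():.3f}  |  "
          f"n={(~left).sum()}, P(buy)={purchase[~left].mean():.3f}")
```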

Methods for Building Trees for Regression and Classification

The STATISTICA system includes a very comprehensive selection of algorithms for building trees for regression and classification tasks.

Methods for automatic model building (machine learning). The original purpose and traditional application of tree classification and regression techniques was to provide an alternative to the various linear and nonlinear methods for predictive data mining; this is further described in the topics on Classification Trees Analysis, which implements the QUEST (Quick, Unbiased, Efficient Statistical Trees) algorithm developed by Loh and Shih (1997); General Classification and Regression Trees (GC&RT), which includes a complete implementation of C&RT methods (see also Breiman, Friedman, Olshen, & Stone, 1984); and General CHAID Models (GCHAID). These modules offer complete and in-depth implementations of powerful techniques and are particularly well suited for inclusion in predictive data mining analyses. All of these methods will automatically find a best model (tree) using sophisticated algorithmic methods, and, once the general analysis problem has been defined (e.g., variables have been selected), they require little or no intervention on the part of the user to find good solutions that yield accurate predictions or predicted classifications. In fact, in many instances these techniques will generate models that are superior to any other linear, nonlinear, or Neural Networks-based solutions (see Hastie, Tibshirani, & Friedman, 2001, for an overview; see also Nisbet, Elder, & Miner, 2009).
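
As a concrete illustration of this fully automatic mode, the sketch below uses scikit-learn's C&RT-style decision tree as a stand-in for the STATISTICA modules (which are not scriptable this way); the data are synthetic and all parameter choices are illustrative assumptions.

```python
# Minimal sketch of fully automatic tree building (scikit-learn used as a
# stand-in for the STATISTICA modules; data and parameters are hypothetical).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Once the variables are selected, the algorithm needs no further input.
tree = DecisionTreeClassifier(min_samples_leaf=20, random_state=0)
tree.fit(X_train, y_train)
print(f"Holdout accuracy: {tree.score(X_test, y_test):.3f}")
```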

Building trees interactively. In contrast, another method for building trees that has proven popular in applied research and data exploration is based on experts' knowledge about the domain or area under investigation, and relies on interactive choices (for how to grow the tree) by such experts to arrive at "good" (valid) models for prediction or predictive classification. In other words, instead of building trees automatically, using sophisticated algorithms for choosing good predictors and splits (for growing the branches of the tree), a user may want to determine manually which variables to include in the tree, and how to split those variables to create the branches of the tree. This enables the user to experiment with different variables and scenarios, and ideally to derive a better understanding of the phenomenon under investigation by combining her or his expertise with the analytic capabilities and options for building the tree (see also the next paragraph).

Combining techniques. In practice, it is often most useful to combine the automatic methods for building trees with "educated guesses" and domain-specific expertise. You may want to grow some portions of the tree using automatic methods and then refine and modify the program's choices (for how to grow the branches of the tree) based on your expertise. Another common situation that calls for this type of combined automatic and interactive tree building arises when some variables chosen automatically for certain splits are not easily observable because they cannot be measured reliably or economically (i.e., obtaining such measurements would be too expensive). For example, suppose the automatic analysis at some point selects the variable Income as a good predictor for the next split; however, you may not be able to obtain reliable income data from the new sample to which you want to apply the results of the current analysis (e.g., for predicting some behavior of interest, such as whether or not a person will purchase something from your catalog). In this case, you may want to select a "surrogate" variable, i.e., a variable that you can observe easily and that is likely related to Income with respect to its predictive power. For example, the variable Number of years of education may be related to Income and have similar predictive power; and while most people are reluctant to reveal their level of income, they are more likely to report their level of education, so the latter variable is more easily measured.
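
The following sketch illustrates how such a surrogate might be vetted: it fits a one-split "stump" on each candidate predictor and compares cross-validated accuracy. scikit-learn is used as an analogue of the STATISTICA procedures; the variable names and data are synthetic and hypothetical.

```python
# Sketch: vet a surrogate for a hard-to-measure predictor by comparing the
# cross-validated accuracy of a single-split "stump" on each variable.
# All variable names and data are synthetic/hypothetical.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
income = rng.gamma(2.0, 15000, size=1000)                      # hard to measure
education_years = 8 + income / 10000 + rng.normal(0, 2, 1000)  # easy to measure
buys = (income > 25000).astype(int)

for name, x in [("Income", income), ("Years of education", education_years)]:
    stump = DecisionTreeClassifier(max_depth=1)
    score = cross_val_score(stump, x.reshape(-1, 1), buys, cv=5).mean()
    print(f"{name}: 5-fold CV accuracy of a single split = {score:.3f}")
```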

The STATISTICA Interactive Trees (C&RT, CHAID) module provides a very flexible and easy-to-use environment for growing trees, or portions (branches) of trees, algorithmically (automatically) as well as manually. It adds an extremely powerful tool for interactive data analysis and model building that can supplement and augment the many other techniques available in STATISTICA Data Miner for automatically determining valid models for prediction and predictive classification.

Comparison of Interactive Trees with GC&RT and GCHAID

While the methods available in the Interactive Trees module are very similar to, and in some ways more flexible than, those provided in the General Classification and Regression Trees (GC&RT) and General CHAID (GCHAID) modules, there are some important differences between these procedures, as well as specific advantages and disadvantages.

V-Fold Cross-Validation of Trees and Tree-Sequences

The technique of v-fold cross-validation is used in various analytic procedures of STATISTICA to avoid overfitting the data. In short, when fitting a model to a data set, you can make the model so complex (i.e., include so many model parameters) that it reproduces the data almost perfectly. The problem, of course, is that such models will likely perform much worse when applied to new data, i.e., when used to make predictions or compute predicted classifications for cases or observations that were not used to fit the model. In general, when fitting a complex model to a data set that is small relative to the model's complexity, there is always the danger of fitting esoteric aspects of the specific sample in the analysis (the learning sample, from which parameters are estimated) that will not generalize to the population at large and, hence, are not useful for prediction.

In several modules of STATISTICA, such as the General Classification and Regression Trees (GC&RT) or Classification Trees modules, you can use v-fold cross-validation methods to fit the entire sequence of trees (built by the respective algorithm for building the tree) v times to different subsamples of the data, and evaluate the predictive power (validity) of the respective models. This technique enables you to evaluate the steps of the automatic tree building process, and to select (automatically) a particular tree that appears most stable and valid, i.e., produces the most accurate predictions in observations that were excluded from the model (tree) building procedure itself. V-fold cross-validation of the tree sequence is an extremely powerful and useful tool for predictive data mining because it often yields simple models with optimum predictive validity.
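
The sketch below illustrates this idea using scikit-learn's cost-complexity pruning path as an analogue of the C&RT tree sequence (the STATISTICA implementation itself is not exposed as a Python API); each tree in the nested sequence is scored with v = 10 folds and the best one is selected.

```python
# Sketch: v-fold cross-validation over a tree sequence, with scikit-learn's
# cost-complexity pruning path standing in for the C&RT tree sequence.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)

# The pruning path defines a nested sequence of trees, complex to simple.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Score every tree in the sequence with v = 10 folds; keep the best one.
scores = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                          X, y, cv=10).mean()
          for a in path.ccp_alphas]
best = int(np.argmax(scores))
print(f"Best alpha = {path.ccp_alphas[best]:.5f}, CV accuracy = {scores[best]:.3f}")
```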

In the Interactive Trees (C&RT, CHAID) module, v-fold cross-validation can be applied only to individual trees, not to an entire tree sequence. Since you (the user) can build trees interactively, growing and removing branches without necessarily building a sequence of trees (from simple to more complex), the program does not apply v-fold cross-validation to pick a best tree. Instead, you can only compute v-fold cross-validation error (cost) estimates for the final (chosen) tree, to get a basic idea of how useful the solution might be for predicting new observations.
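
In code terms, this corresponds to cross-validating a single fixed tree specification rather than searching a sequence, as in this minimal sketch (synthetic data, illustrative parameters):

```python
# Sketch: v-fold cross-validation cost estimate for one fixed (final) tree,
# without any search over a tree sequence. Data and parameters hypothetical.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
final_tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=20,
                                    random_state=0)

# A rough idea of how well the chosen tree might predict new observations.
print(f"10-fold CV accuracy: {cross_val_score(final_tree, X, y, cv=10).mean():.3f}")
```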

Interactive trees may or may not generalize to new observations. The inability to apply v-fold cross-validation to an algorithmically (systematically) computed sequence of trees, in order to select the solution with the greatest predictive validity, is a major drawback of all programs for interactive tree building. In short, you cannot be sure whether the final tree model you decided on after detailed interactive analyses will generalize to new samples, i.e., yield accurate predictions or predicted classifications for new cases.

Differences in Computational Procedures

Another issue to be aware of when using the various tree-building methods available in the STATISTICA system is that these techniques may not yield unique solutions. When using, for example, least squares Multiple Regression, there is usually a single best (correct) solution for the parameters of the linear regression equation. This is not necessarily the case for tree-building techniques.

For example, consider the CHAID algorithm (see also Basic Tree-Building Algorithm: CHAID and Exhaustive CHAID). As a first step, the algorithm creates categorical predictors out of any continuous predictors by dividing the respective continuous distributions into a number of categories with an approximately equal number of observations. In the General CHAID module, this step is repeated at each node of the tree to achieve the greatest predictive power at each stage of the automatic tree-building process. In the Interactive Trees (C&RT, CHAID) module, by default, this procedure of "cutting" the range of values for continuous predictors into discrete ranges is performed only once, at the beginning of the analysis. This speeds up the analyses for very large data sets, provides greater responsiveness for interactive analyses, and avoids unnecessary computations, since users will often choose ranges for splits manually anyway. However, with complex data sets that yield increasingly complex trees (split rules), these two procedures (initializing ranges for continuous predictors at each node vs. only once at the beginning of the analysis) can result in completely different trees. (Note that the Interactive Trees module contains various options to control when and how to create ranges from the values of continuous predictors.)
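
The initial "cutting" step itself amounts to quantile-based discretization. A minimal sketch, using pandas as a stand-in (bin count and data are hypothetical):

```python
# Sketch: one-time quantile binning of a continuous predictor into ranges
# holding approximately equal numbers of observations (data hypothetical).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
age = pd.Series(rng.normal(45, 12, size=1000), name="Age")

# 10 quantile-based bins, each with roughly 100 observations; GCHAID would
# recompute such bins at every node, Interactive Trees only once by default.
bins = pd.qcut(age, q=10)
print(bins.value_counts().sort_index())
```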

Another example in the context of CHAID where slight differences in the data or computational procedures can yield very different results arises when the program, during automatic operation (growing of the tree), encounters two predictor variables that would produce equal improvements to the overall fit, and the question is which one to choose. At some point this decision is arbitrary, but how it is resolved (which variable is chosen) can greatly affect the subsequent choices and, hence, the overall tree (in fact, it is not uncommon for some programs to produce different results for different orderings of the predictor variables).

Another instance where CHAID or C&RT analyses performed with GCHAID or GC&RT vs. Interactive Trees may yield different results is when the input data contain many missing values in the predictor variables. These differences are explained in Missing Data in GC&RT, GCHAID, and Interactive Trees.

A fourth example of how similar tree-building methods can produce quite different results arises when you use surrogate splits: surrogate split variables can be chosen (manually or automatically) for a split when the predictor values for some observations are missing. The program can then simply use a similar surrogate variable to perform the split. Depending on the number of surrogates allowed, you may again obtain very different results.
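
A minimal sketch of the surrogate idea follows: candidate surrogates are scored by how often they send complete cases to the same side as the primary split, and the best-agreeing one stands in when the primary predictor is missing. All data and cut points below are synthetic and hypothetical.

```python
# Sketch: scoring candidate surrogate splits by their agreement with the
# primary split on complete cases (all data and cut points hypothetical).
import numpy as np

rng = np.random.default_rng(0)
income = rng.gamma(2.0, 15000, size=500)
education = 8 + income / 10000 + rng.normal(0, 2, 500)
rooms = 2 + income / 20000 + rng.normal(0, 1, 500)

primary_goes_left = income < 20000  # the primary split on Income

# A surrogate is good if it routes cases to the same side as the primary.
for name, x, cut in [("Education < 11", education, 11), ("Rooms < 3", rooms, 3)]:
    agreement = np.mean((x < cut) == primary_goes_left)
    print(f"{name}: agreement with primary split = {agreement:.3f}")
```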

Tree-building techniques are heuristic algorithms, not exact estimation methods. The general point to remember is that the methods for building trees are based on algorithms rather than precise analytic solutions to equations. These algorithms are usually very sophisticated and in practice can often yield useful and interpretable results. However, the specific solutions are not "exact" in the sense that other analysts or researchers will always obtain identical, or nearly identical, results given similar data and software (algorithms). In that respect, these techniques are not unlike other Machine Learning algorithms that often produce very useful results (for prediction or predictive classification) even though those results are not necessarily unique (e.g., other, equally useful trees may exist), are not always interpretable, and do not support inferences about general population "parameters."

Program Overview

The Interactive Trees (C&RT, CHAID) module contains complete implementations of classification and regression trees (C&RT) algorithms popularized by Breiman et al. (Breiman, Friedman, Olshen, & Stone, 1984; see also Ripley, 1996) as well as CHAID and Exhaustive CHAID methods (Chi-square Automatic Interaction Detector; see Kass, 1980). The general computational methods are mostly identical to those implemented in the General Classification and Regression Trees (GC&RT) and General CHAID (GCHAID) Models modules of STATISTICA, and they are described there in greater detail.

The Interactive Trees (C&RT, CHAID) module provides a large number of options to enable users to interactively determine all aspects of the tree building process. You can select the variables to use for each split (branch) from a list of suggested variables, determine how and where to split a variable, interactively grow the tree branch by branch or level by level, grow the entire tree automatically, delete ("prune back") individual branches of trees, and more. All of these options are provided in an efficient graphical user interface, where you can "brush" the current tree, i.e., select a specific node to grow a branch, delete a branch, etc.

A large number of results options are provided for displaying and reviewing the tree, similar to those available in the General Classification and Regression Trees (GC&RT) and General CHAID (GCHAID) Models modules. Trees can be reviewed in tree diagrams (graphs) or in the unique Tree Browser user interface (see General Computation Issues and Unique Solutions of STATISTICA GCHAID), which is similar to the browser (hierarchical folder structure) available in STATISTICA Workbooks. In addition, various auxiliary results tables and graphs are available so that users can examine all details of the results.

As in all modules for predictive data mining, the decision rules contained in the final tree built for regression or classification can optionally be saved in a variety of ways for deployment in data mining projects, including C/C++, STATISTICA Visual Basic, and PMML. Hence, final trees computed via this module can quickly and efficiently be turned into solutions for predicting or classifying new observations.
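
STATISTICA's own deployment formats are generated by the module itself; as a rough analogue, the sketch below exports a fitted tree's decision rules as readable text with scikit-learn, which could then be transcribed into deployment code.

```python
# Sketch: exporting a fitted tree's decision rules as readable text
# (scikit-learn analogue; STATISTICA emits C/C++, Visual Basic, or PMML).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Human-readable if/else rules, ready for transcription into deployment code.
print(export_text(tree, feature_names=list(iris.feature_names)))
```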