Missing Data in GC&RT, GCHAID, and Interactive Trees

When the predictor variables for a CHAID and/or C&RT analysis contain many missing data values, the results obtained via the General Classification and Regression Trees (GC&RT) Models and General CHAID (GCHAID) Models options may differ from those computed by Interactive Trees. These differences arise because the modules handle missing data in different ways.

Missing Data in GC&RT and GCHAID. Both GC&RT and GCHAID were designed to support ANCOVA-like predictor designs, i.e., combinations of continuous and/or categorical predictor variables. In some instances, these facilities are very useful, for example, to automatically code (and possibly detect) continuous-by-categorical predictor interactions, or other custom (ANCOVA-like) designs as they can be specified in GLM, GRM, etc. However, supporting such designs constrains the way in which missing data in the predictor variables can be handled.

Missing data in GCHAID. In GCHAID, missing data values in continuous and categorical predictor variables are generally deleted casewise. In other words, observations are excluded from the analyses if they have missing data in any of the predictor variables. If you do want to include missing data values (codes) explicitly in your analysis, you can always assign to them a particular value or code prior to the analyses, such as a unique code for categorical predictors or a distinct numeric value for continuous predictors. (Substituting the mean for a continuous predictor also retains the observations, but it does not set them apart from observed values.) By assigning a distinct value to missing data, these data can be treated as valid observations in the analyses, and hence, such missing data values may emerge as important for the prediction of the outcome variable of interest.
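The two options described above (casewise deletion versus assigning an explicit code to missing values) can be sketched outside the program in a few lines of Python. This is only an illustration under stated assumptions: the toy data, the variable names, the functions casewise_delete and recode_missing, and the codes "Missing" and -999.0 are all hypothetical and are not part of GCHAID.

```python
# Hypothetical toy data: None marks a missing value.
rows = [
    {"income": "High", "age": 34.0, "y": 1},
    {"income": None,   "age": 51.0, "y": 1},
    {"income": "Low",  "age": None, "y": 0},
    {"income": "Low",  "age": 28.0, "y": 0},
]
predictors = ["income", "age"]


def casewise_delete(rows, predictors):
    """Keep only observations with valid data in every predictor."""
    return [r for r in rows if all(r[p] is not None for p in predictors)]


def recode_missing(rows, codes):
    """Return copies of the rows with missing values replaced by explicit codes.

    `codes` maps a predictor name to the value that should stand for "missing".
    """
    return [
        {k: (codes[k] if v is None and k in codes else v) for k, v in r.items()}
        for r in rows
    ]


complete = casewise_delete(rows, predictors)

# Assumed codes: a new category for income, an out-of-range number for age.
recoded = recode_missing(rows, {"income": "Missing", "age": -999.0})

print(len(complete))                              # observations left after casewise deletion
print(len(casewise_delete(recoded, predictors)))  # observations left after recoding
```

With these toy data, casewise deletion keeps two of the four observations, while recoding keeps all four and lets "Missing" (or the out-of-range numeric code) act as an ordinary value that a split can use.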

Missing data in GC&RT. In GC&RT, missing data are handled in essentially the same manner as in GCHAID, but additional options for identifying surrogate split variables are also provided. Specifically, when an observation has missing data for the predictor variable chosen for a split, it can still be classified (predicted) by using a "similar" continuous predictor (surrogate) for which it has valid data; to request this, choose the Surrogate sample option on the GC&RT Results dialog - Observational tab. Note, however, that observations with missing data in any predictor are not included in the tree-building process itself. This is unlike Interactive Trees, which handles missing data on a variable-by-variable basis, i.e., observations are excluded from the tree-building analysis only if they have missing data (and no surrogates) for a variable chosen for a particular split.
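The idea of predicting an observation through a surrogate can be sketched as follows. This is a simplified illustration, not the GC&RT implementation: the function route, the (variable, threshold) rule format, and the thresholds are hypothetical, and the sketch takes the surrogate as given rather than showing how a surrogate that mimics the primary split would be selected.

```python
def route(obs, primary, surrogate=None):
    """Send one observation 'left' or 'right' at a single split.

    `primary` and `surrogate` are (variable, threshold) pairs; values less
    than or equal to the threshold go left. The surrogate is consulted only
    when the primary split variable is missing (None). Returns None when the
    observation cannot be routed by either rule.
    """
    for rule in (primary, surrogate):
        if rule is None:
            continue
        variable, threshold = rule
        value = obs.get(variable)
        if value is not None:
            return "left" if value <= threshold else "right"
    return None


primary_split = ("X1", 10.0)    # assumed primary split: X1 <= 10
surrogate_split = ("X2", 4.0)   # assumed surrogate split: X2 <= 4

print(route({"X1": 12.0, "X2": 3.0}, primary_split, surrogate_split))  # uses X1
print(route({"X1": None, "X2": 3.0}, primary_split, surrogate_split))  # falls back to X2
print(route({"X1": None, "X2": 3.0}, primary_split))                   # no surrogate available
```

In this sketch the surrogate affects only how an observation is routed for prediction, which matches the description above: the observation is still not used to build the tree.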

Missing data in Interactive Trees. In Interactive Trees, ANCOVA-like designs are not supported. Instead, variables can be "considered" by the respective tree-building algorithm one by one. For example, suppose you have two predictor variables X1 and X2, with many missing values for variable X2. At the root node (prior to the first split), all valid observations for each variable are considered to determine the best (next) split. If that split is performed based on the values in X1, then all observations will remain in the analysis; if that split is performed on X2 (and no surrogates are supported or requested in the respective analysis), then only those observations with valid data for X2 will remain in the analysis for subsequent splits. This is different from the way in which missing data are handled in GCHAID and GC&RT (as described above), where observations with missing data in any predictor are excluded at the root node level (although predictions may still be reported for such observations, if surrogate splits are supported and requested in the respective analysis).
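The X1/X2 example can be made concrete by counting which observations remain under each approach. The following is a minimal sketch with hypothetical data and function names (root_casewise, remaining_after_split); it only counts observations and does not build a tree.

```python
# Hypothetical data: X1 is complete, X2 has many missing values (None).
data = [
    {"X1": 1.0, "X2": None},
    {"X1": 2.0, "X2": 5.0},
    {"X1": 3.0, "X2": None},
    {"X1": 4.0, "X2": 7.0},
    {"X1": 5.0, "X2": None},
]


def root_casewise(data, predictors):
    """GCHAID / GC&RT style: drop any observation with a missing predictor
    before the first split is considered."""
    return [r for r in data if all(r[p] is not None for p in predictors)]


def remaining_after_split(data, split_variable):
    """Interactive Trees style (no surrogates): after a split, only the
    observations with valid data for the chosen split variable remain."""
    return [r for r in data if r[split_variable] is not None]


n_casewise = len(root_casewise(data, ["X1", "X2"]))
n_after_x1 = len(remaining_after_split(data, "X1"))
n_after_x2 = len(remaining_after_split(data, "X2"))

print(n_casewise, n_after_x1, n_after_x2)
```

With these data, casewise deletion leaves two observations at the root node regardless of which variable is later chosen, whereas the variable-by-variable approach keeps all five observations if the first split is on X1 and two if it is on X2.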

Alternative Ways of Handling Missing Data. The upshot of all this is that you can sometimes expect very different results from GCHAID and GC&RT when compared to Interactive Trees for equivalent analyses, if the input data contain many missing values for the predictor variables. When this happens, it suggests that the pattern of missing data over the predictor variables may itself be an important predictor for the dependent (outcome) variable of interest, and is worth investigating. For example, for categorical predictor variables it would be easy to specify a unique (and valid) data code to indicate missing data. Such values could then be included in all analyses, i.e., they may become important (diagnostic) values for splitting at particular nodes and, hence, for building the tree (e.g., "if Income=High or Missing then..."). In general, when an input data set contains many missing data in the predictor variables, you may want to apply some initial data cleaning (see also Data Mining) and transformations to turn the "lack of observation" (missing data) into meaningful information.
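One simple way to check whether missingness carries information about the outcome is to compare the outcome between observations with and without a value for a given predictor. The sketch below assumes a 0/1 outcome; the data, the variable names, and the function outcome_rate_by_missingness are hypothetical and serve only as an illustration of the idea.

```python
def outcome_rate_by_missingness(rows, predictor, outcome):
    """Mean of a 0/1 outcome, computed separately for observations where
    `predictor` is missing (None) and where it is observed.

    A group with no observations gets a rate of None.
    """
    groups = {"missing": [], "observed": []}
    for r in rows:
        key = "missing" if r[predictor] is None else "observed"
        groups[key].append(r[outcome])
    return {k: (sum(v) / len(v) if v else None) for k, v in groups.items()}


# Hypothetical data: Income is missing for three of seven observations.
sample = [
    {"Income": None,   "y": 1},
    {"Income": None,   "y": 1},
    {"Income": None,   "y": 0},
    {"Income": "High", "y": 1},
    {"Income": "Low",  "y": 0},
    {"Income": "Low",  "y": 0},
    {"Income": "Low",  "y": 0},
]

rates = outcome_rate_by_missingness(sample, "Income", "y")
print(rates)
```

A large gap between the two rates, as in these toy data, would be a reason to keep missing values in the analysis under their own code rather than deleting the affected observations.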