MARSplines Example

Data file. This example is based on the data file Poverty.sta. Open this data file via the File - Open Examples menu; it is in the Datasets folder. The data are based on a comparison of 1960 and 1970 Census figures for a random selection of 30 counties. The names of the counties were entered as case names. This example data file is also discussed in Multiple Regression - Example 1: Standard Regression Analysis.

Research question. Now, analyze the correlates of poverty, that is, the variables that best predict the percent of families below the poverty line in a county. Thus, you will treat variable 3 - PT_POOR as the dependent or response variable, and all other variables as the independent or predictor variables.

Starting the analysis. Select MARSplines (Multivariate Adaptive Regression Splines) from the Data Mining menu to display the Multivariate Adaptive Regression Splines Startup Panel. Click the Variables button to display a standard variable selection dialog. Select PT_POOR as the dependent variable and all of the other variables in the data file as the independent variable list, and then click the OK button.

At this stage, you can change the model factors, e.g. Maximum number of basis functions, Penalty, etc., located on the Startup Panel - Options tab. For example, you may want to adjust the maximum number of basis functions that can be added during model building, or you can decrease the penalty for adding basis functions. It is recommended that you set the maximum number of basis functions to as high a number as possible so that STATISTICA MARSplines can search for as many combinations as possible. To build even more complex models, you may want to increase the Degree of interactions among the input variables.

It is also recommended that you always prune your model by selecting the Apply pruning check box. This will reduce the chance of overfitting the data. Refer to the Introductory Overview for further details concerning the MARSplines method.

Reviewing results. Now click the OK button. This will start MARSplines training and display the Results dialog when a model is built. Here you can select options to review your results in the form of spreadsheets, reports and graphs.

In the Summary box at the top of the Results dialog, you can see the specifications of the MARSplines model you just built, including the number of terms and basis functions present in the model, and the generalized cross validation error after training is complete (see also the Technical Notes for computational details). You will also see the choices you made in the Startup Panel displayed here for your reference including the dependent and independent variable list. When independent variables are highlighted in red, it means they have been selected in the formation of the basis functions for the final model. This is useful information since it identifies the important (relevant) predictor variables in the model. In this example, note that the predictor AGE is not identified as significant or important by the algorithm and, thus, does not participate in the formation of the basis functions.
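As a rough illustration of how a generalized cross validation (GCV) error relates to the penalty setting on the Options tab, the sketch below uses the commonly cited form of the GCV criterion for MARS models: the mean squared error divided by (1 - C/N)^2, where C = 1 + penalty * (number of basis functions). This is an assumption about the general method, not a statement of the exact computation STATISTICA performs (see the Technical Notes for that). The function name gcv_error and its arguments are hypothetical.

```python
def gcv_error(observed, predicted, n_basis_functions, penalty):
    """Generalized cross validation error in its commonly cited MARS form.

    Assumption: C = 1 + penalty * n_basis_functions is the effective
    number of parameters; the program's own definition may differ.
    """
    n_cases = len(observed)
    if n_cases == 0 or n_cases != len(predicted):
        raise ValueError("observed and predicted must be non-empty and equal in length")

    # Mean squared residual over all cases.
    mse = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n_cases

    # Effective number of parameters; grows with each added basis function.
    effective_params = 1 + penalty * n_basis_functions
    if effective_params >= n_cases:
        raise ValueError("model is too complex for the number of cases")

    # Inflate the error as model complexity approaches the number of cases.
    return mse / (1 - effective_params / n_cases) ** 2
```

Under this form, a larger penalty or more basis functions inflates the error for the same residuals, which is why lowering the penalty tends to admit more basis functions.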

Coefficients. MARSplines constructs the regression function(s) through the weighted sum of terms involving products of basis functions (see also the Introductory Overview). Click the Coefficients button.

The coefficients spreadsheet provides the full details of the MARSplines terms together with the corresponding model coefficients. It also indicates the type of each basis function and the order of interaction in each term. In the spreadsheet above, the bias term (a constant) consists of the intercept coefficient. The first term consists of the basis function (POP_CHNG - 7.1). Note that this basis function is highlighted in red, indicating that it is of type (x-t)+. The third term contains the product of (TAX_RATE - 0.4) and (75 - PT_PHONE). The last term consists of the product of three basis functions, and therefore its degree of interaction is three. In summary, the MARSplines model is:

PT_POOR = 23.211 - 0.197*max(0, POP_CHNG - 7.1) - 0.435*max(0, PT_PHONE - 75) + 1.14*max(0, TAX_RATE - 0.4)*max(0, 75 - PT_PHONE) + 0.0003*max(0, N_EMPLD - 1070)*max(0, 75 - PT_PHONE)*max(0, PR_RURAL - 5.9)

You can place the equation above into a standard STATISTICA Report by clicking the Equation button on the MARSplines Results dialog - Quick tab.
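To make the structure of the model equation concrete, here is a minimal Python sketch that evaluates it for a single county. The coefficients and knots are copied from the equation in this example; the third term is read as the product max(0, TAX_RATE - 0.4)*max(0, 75 - PT_PHONE). The function names hinge and predict_pt_poor are hypothetical and are not part of STATISTICA.

```python
def hinge(value):
    """Truncated-linear basis function: max(0, value)."""
    return max(0.0, value)


def predict_pt_poor(pop_chng, n_empld, tax_rate, pt_phone, pr_rural):
    """Evaluate the MARSplines equation from this example for one case.

    Each term is a coefficient times a product of hinge functions; the
    number of factors in the product is the term's degree of interaction.
    """
    phone_below_75 = hinge(75 - pt_phone)  # shared by the last two terms
    return (
        23.211                                      # bias (intercept)
        - 0.197 * hinge(pop_chng - 7.1)             # degree 1
        - 0.435 * hinge(pt_phone - 75)              # degree 1
        + 1.14 * hinge(tax_rate - 0.4) * phone_below_75   # degree 2
        + 0.0003 * hinge(n_empld - 1070) * phone_below_75
        * hinge(pr_rural - 5.9)                     # degree 3
    )
```

Because every basis function is zero on one side of its knot, a predictor contributes to the prediction only over part of its range. For instance, when PT_PHONE equals 75 and every other predictor lies below its knot, all hinge terms vanish and the prediction reduces to the intercept.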

Regression statistics. Further information can be obtained by clicking the Statistics button on the MARSplines Results dialog - Quick tab, which will display a spreadsheet containing various regression statistics, including R-square and Adjusted R-square. These statistics are similar to those computed for multiple regression and other linear models, and you can use them to assess the quality of the fit of the model to the data and to compare it with other models.
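For reference, the following sketch shows the textbook definitions of R-square and Adjusted R-square as they are used for linear models. It is an illustration only: the function names are hypothetical, and counting the number of model terms (excluding the intercept) as the number of parameters is an assumption that may not match how the program computes Adjusted R-square for a MARSplines model.

```python
def r_square(observed, predicted):
    """Proportion of variance in the observed values explained by the model."""
    n_cases = len(observed)
    mean_observed = sum(observed) / n_cases
    ss_residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_total = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - ss_residual / ss_total


def adjusted_r_square(r2, n_cases, n_terms):
    """R-square penalized for model size.

    Assumption: n_terms counts the model terms excluding the intercept.
    """
    return 1.0 - (1.0 - r2) * (n_cases - 1) / (n_cases - n_terms - 1)
```

Adjusted R-square is the more useful of the two when comparing models with different numbers of terms, because plain R-square never decreases as terms are added.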

Reviewing the predicted and observed values. To further view results, you can display the Plots tab where you can create two- and three-dimensional plots of the variables, predictions, and their residuals. Note that you can display more than one variable in two-dimensional scatterplots.

For example, shown above is a scatterplot of the predicted and observed values plotted against the values in variable N_EMPLD. In general, this type of plot will give you an effective way of comparing model predictions with the observed data. To produce the graph shown above, select N_EMPLD from the X-axis list and Obsd. (Observed) and Pred. (Predicted) from the Y-axis list. Then click the Graphs of X and Y button.

The plot shown here was further customized using the graphics brushing tools; specifically, the point for the apparent outlier case (County) Shelby was labeled, to illustrate a major advantage of the MARSplines method over simple multiple linear regression or other parameterized regression models. Even though Shelby is a clear outlier, the overall fit of the model does not appear to be seriously degraded by this case. Unlike in multiple regression (see also the Multiple Regression Example 1: Standard Regression Analysis), where outliers can greatly affect the overall fit, outliers can easily be accommodated in MARSplines because of the "piecewise-regression-like" nature of the method (see also the Introductory Overview for details).