Multivariate Adaptive Regression Splines
(MARSplines) Introductory Overview
The STATISTICA Multivariate Adaptive
Regression Splines (MARSplines) module is a generalization of techniques
popularized by Friedman (1991) for solving regression (see also, Multiple Regression) and classification
type problems, with the purpose of predicting the values of a set of dependent
or outcome variables from a set of independent or predictor variables.
MARSplines can handle both categorical
and continuous variables (whether response or predictors). In the case
of categorical responses, MARSplines
will treat the problem as a classification problem. This is in contrast
to the case of continuous dependent variables where the task is treated
as a regression problem. MARSplines
will automatically determine that for you.
MARSplines is a nonparametric
procedure that makes no assumption about the underlying functional relationship
between the dependent and independent variables. Instead, MARSplines
constructs this relation from a set of coefficients and basis functions
that are derived entirely from the data. In a sense, the method
is based on the "divide and conquer" strategy, which partitions
the input space into regions, each with its own regression or classification
equation. This makes MARSplines
particularly suitable for problems with higher input dimensions (i.e.,
with more than 2 variables), where the curse
of dimensionality would likely create problems for other techniques.
The MARSplines technique has
become particularly popular in the area of data
mining because it does not assume or impose any particular type or
class of relationship (e.g., linear, logistic, etc.) between the predictor
variables and the dependent (outcome) variable of interest. Instead, useful
models (i.e., models that yield accurate predictions) can be derived even
in situations where the relationships between the predictors and the dependent
variables are non-monotone and difficult to approximate with parametric
models. For more information about this technique and how it compares
to other methods for nonlinear regression (or regression trees), see Hastie,
Tibshirani, and Friedman (2001) and Nisbet, Elder, and Miner (2009).
Regression and Classification
Problems
Regression is used to determine the relationship between a
set of dependent variables (also called output, outcome, or response variables)
and one or more independent variables (also known as input or predictor
variables). The dependent variable is the one whose values you want to
predict, based on the values of the independent (predictor) variables.
For instance, one might be interested in the number of car accidents on
the roads, which can be caused by 1) bad weather and 2) drunk driving.
In this case, one might write, for example,
Number_of_Accidents = Some
Constant + 0.5*Bad_Weather + 2.0*Drunk_Driving
The variable Number of Accidents
is the dependent variable which is thought to be caused by (among other
variables) Bad Weather and Drunk Driving (hence the name dependent
variable). Note that the independent variables are multiplied by factors
0.5 and 2.0. These are known as regression coefficients. The larger these
coefficients, the stronger the influence of the independent variables
on the dependent variable. If the two predictors in this simple (fictitious)
example were measured on the same scale (e.g., if the variables were standardized
to a mean of 0.0 and standard deviation 1.0), then Drunk
Driving could be inferred to contribute four times as much to car accidents
as Bad Weather. (If the variables
are not measured on the same scale, then direct comparisons between these
coefficients are not meaningful, and, usually, some other standardized
measure of predictor "importance" is included in the results.)
For additional details regarding these types of statistical models,
refer also to the Overviews for Multiple
Regression and General
Linear Models (GLM), as well as General Regression Models (GRM).
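To make the role of the regression coefficients concrete, here is a small Python sketch (illustrative only, not STATISTICA code) that generates fictitious data consistent with the accident equation above and recovers the coefficients by ordinary least squares; the variable names and data are invented for the example.

```python
import numpy as np

# Fictitious data, invented purely to illustrate the accident equation.
rng = np.random.default_rng(0)
n = 200
bad_weather = rng.normal(size=n)
drunk_driving = rng.normal(size=n)
accidents = 10.0 + 0.5 * bad_weather + 2.0 * drunk_driving \
            + rng.normal(scale=0.5, size=n)

# Design matrix with an intercept column ("Some Constant" in the equation).
X = np.column_stack([np.ones(n), bad_weather, drunk_driving])

# Ordinary least squares estimates of (constant, weather, drunk-driving) coefficients.
coefs, *_ = np.linalg.lstsq(X, accidents, rcond=None)
print(coefs)  # approximately [10.0, 0.5, 2.0]
```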
In general, in the social and natural sciences, regression procedures
are very widely used in research. Regression enables the researcher to
ask (and hopefully answer) the general question "what is the best
predictor of ..." For example, educational researchers might want
to learn what are the best predictors of success in high school. Psychologists
may want to determine which personality variable best predicts social
adjustment. Sociologists may want to find out which of the multiple social
indicators best predict whether a new immigrant group will adapt and be
absorbed into society.
On the other hand, classification problems are concerned with predicting
the value of discrete (categorical) response variables from a set of predictor
variables. Classification is used to predict membership of cases or objects
in the classes of a categorical dependent variable from their measurements
on one or more predictor variables. Classification analysis is one of
the main techniques used in data
mining. The goal of classification is to predict or explain responses
on a categorical dependent variable, and as such, the available techniques
have much in common with the techniques used in the more traditional methods
of Discriminant
Analysis, Cluster
Analysis, Nonparametric Statistics, and Nonlinear
Estimation. Imagine that you want to devise a system for sorting
a collection of coins into different classes (perhaps pennies, nickels,
dimes, quarters). Suppose that there is a measurement on which the coins
differ, say width, which can be used to devise a hierarchical system for
sorting coins. You might roll the coins edge down on a narrow track in
which a slot the width of a dime is cut. If the coin falls through the
slot, it is classified as a dime; otherwise, it continues down the track
to where a slot the width of a penny is cut. If the coin falls through
the slot, it is classified as a penny; otherwise, it continues down the
track to where a slot the width of a nickel is cut, and so on. You have
just constructed a classification model. The decision process used by
your classification model provides an efficient method for sorting a pile
of coins and, more generally, can be applied to a wide variety of classification
problems.
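The decision process of this coin-sorting model can also be written directly as a sequence of threshold tests. The Python sketch below is purely illustrative; the slot widths are approximate coin diameters in millimeters.

```python
def classify_coin(width_mm: float) -> str:
    # Each test corresponds to one slot cut into the track; a coin is
    # classified by the first slot it falls through.
    # Thresholds are approximate coin diameters, used only for illustration.
    if width_mm <= 17.91:        # dime slot
        return "dime"
    elif width_mm <= 19.05:      # penny slot
        return "penny"
    elif width_mm <= 21.21:      # nickel slot
        return "nickel"
    else:                        # anything larger is treated as a quarter
        return "quarter"

print(classify_coin(18.9))  # -> "penny"
```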
Multivariate Adaptive Regression
Splines
The car accident example we considered previously is a typical application
for linear regression, where the response variable is hypothesized to
depend linearly on the predictor variables. Linear regression also falls
into the category of so-called parametric methods, which assume that
the nature of the relationships (but not the specific parameters) between
the dependent and independent variables is known a
priori (e.g., is linear). By contrast, nonparametric methods (see
also, Nonparametric Statistics) do not make
any such assumption as to how the dependent variables are related to the
predictors. Instead, they allow the form of the model function to be derived
directly from the data.
Multivariate Adaptive Regression Splines
(MARSplines) is a nonparametric procedure which makes no assumption
about the underlying functional relationship between the dependent and
independent variables. Instead, MARSplines
constructs this relation from a set of coefficients and so-called basis
functions that are entirely determined from the data. You can think of
the general "mechanism" by which the MARSplines
algorithm operates as multiple piecewise linear regression (see also,
Nonlinear
Estimation), where each breakpoint (estimated from the data)
defines the "region of application" for a particular (very simple)
linear equation.
Basis functions.
Specifically, MARSplines uses
two-sided truncated functions of the form (as shown below) as basis functions
for linear or nonlinear expansion, which approximates the relationships
between the response and predictor variables.
[Figure: the two basis functions (t-x)+ and (x-t)+ plotted against x, meeting at the knot t]
Shown above is a simple example of two basis functions (t-x)+
and (x-t)+
(adapted from Hastie et al., 2001, Figure 9.9). Parameter t
is the knot of the basis functions (defining the "pieces" of
the piecewise linear regression); these knots (t
parameters) are also determined from the data. The "+"
signs next to the terms (t-x)
and (x-t)
simply denote that only positive results of the respective equations are
considered; otherwise the respective functions evaluate to zero. This
can also be seen in the illustration.
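For readers who prefer code to formulas, the two truncated ("hinge") basis functions can be sketched in a few lines of Python; the knot t = 0.5 and the evaluation points are arbitrary and chosen only to show the mirror-image shape of the two pieces.

```python
import numpy as np

def hinge_pos(x, t):
    # (x - t)+ : zero to the left of the knot, linear to the right
    return np.maximum(0.0, x - t)

def hinge_neg(x, t):
    # (t - x)+ : linear to the left of the knot, zero to the right
    return np.maximum(0.0, t - x)

x = np.linspace(0.0, 1.0, 5)
t = 0.5  # the knot
print(hinge_neg(x, t))  # [0.5, 0.25, 0., 0., 0.]
print(hinge_pos(x, t))  # [0., 0., 0., 0.25, 0.5]
```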
The MARSplines model.
The basis functions together with the model parameters (estimated via
least
squares estimation) are combined to produce the predictions given
the inputs. The general MARSplines
model equation (see Hastie et al., 2001, equation 9.19) is given as:
y = f(X) = β0 + Σ βm*hm(X),    summed over m = 1, ..., M

where the summation is over the M
nonconstant terms in the model (further details regarding
the model are also provided in Technical
Notes). To summarize, y is
predicted as a function of the predictor variables X
(and their interactions); this function consists of an intercept parameter
(β0) and the weighted (by βm) sum of one or more basis functions hm(X),
of the kind illustrated earlier. You can also think of this model as "selecting"
a weighted sum of basis functions from the set of (a large number of)
basis functions that span all values of each predictor (i.e., that set
would consist of one basis function, and parameter t,
for each distinct value for each predictor variable). The MARSplines
algorithm then searches over the space of all inputs and predictor values
(knot locations t) as well as
interactions between variables. During this search, a progressively larger
number of basis functions is added to the model (selected from the set
of possible basis functions) so as to maximize an overall least squares goodness-of-fit
criterion (i.e., to minimize the residual sum of squares). As a result of these operations, MARSplines
automatically determines the most important independent variables as well
as the most significant interactions among them. The details of this algorithm
are further described in Technical
Notes, as well as in Hastie et al. (2001).
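The following Python sketch illustrates how a fitted model of this form is evaluated: an intercept plus a weighted sum of basis functions, where an interaction term is simply a product of hinge functions. All knots and coefficients here are invented for illustration; they are not output of the MARSplines algorithm.

```python
import numpy as np

def h(x, t, sign):
    # One truncated basis function: (x - t)+ if sign = +1, (t - x)+ if sign = -1
    return np.maximum(0.0, sign * (x - t))

def mars_predict(X, intercept, terms):
    """Evaluate y = b0 + sum over m of b_m * h_m(X).

    Each term is (coefficient, list of (column, knot, sign)); a term with
    more than one factor is an interaction (product of hinge functions)."""
    y = np.full(X.shape[0], intercept)
    for coef, factors in terms:
        basis = np.ones(X.shape[0])
        for col, knot, sign in factors:
            basis *= h(X[:, col], knot, sign)
        y += coef * basis
    return y

# Invented model: intercept 1.0, two main-effect terms on column 0,
# and one interaction between columns 0 and 1.
terms = [
    ( 2.0, [(0, 0.3, +1)]),                # 2.0 * (x1 - 0.3)+
    (-1.5, [(0, 0.3, -1)]),                # -1.5 * (0.3 - x1)+
    ( 0.8, [(0, 0.3, +1), (1, 0.6, +1)]),  # interaction term
]
X = np.array([[0.1, 0.2], [0.5, 0.9]])
print(mars_predict(X, 1.0, terms))  # [0.7, 1.448]
```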
Categorical predictors.
MARSplines is well suited for
tasks involving categorical predictor variables. Different basis functions
are computed for each distinct value for each predictor, and the usual
techniques for handling categorical variables are applied. Therefore, categorical
variables (with class codes rather than continuous or ordered data values)
can be accommodated by this algorithm without requiring any further modifications.
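As a rough illustration of one such usual technique, the sketch below expands a categorical predictor's class codes into 0/1 indicator columns, each of which can then receive its own basis function; the coding shown is illustrative and not a description of the module's internal representation.

```python
import numpy as np

def indicator_columns(codes):
    # Expand a vector of class codes into one 0/1 indicator column per
    # distinct value (one common way of handling categorical data).
    levels = sorted(set(codes))
    return levels, np.column_stack([[1.0 if c == lev else 0.0 for c in codes]
                                    for lev in levels])

region = ["north", "south", "south", "west", "north"]
levels, X_cat = indicator_columns(region)
print(levels)   # ['north', 'south', 'west']
print(X_cat)    # 5 x 3 matrix of 0/1 indicators
```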
Multiple dependent
(outcome) variables. The MARSplines
algorithm can be applied to multiple dependent (outcome) variables, whether
continuous or categorical. When the dependent variables are continuous,
MARSplines will treat the task
as regression; otherwise, it is a classification problem. When there are
multiple outputs, the algorithm will determine a common set of basis functions
in the predictors, but estimate different coefficients for each dependent
variable. This method of treating multiple outcome variables is not unlike
some neural
network architectures, where multiple outcome variables can be predicted
from common neurons and hidden layers; in the case of MARSplines,
multiple outcome variables are predicted from common basis functions,
with different coefficients.
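This idea of a common set of basis functions with outcome-specific coefficients can be sketched as follows: a single basis matrix is formed from (invented) hinge terms, and one least squares solve yields a separate coefficient column for each continuous outcome. The data, knot, and weights below are fabricated for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=100)

# Common basis matrix: intercept plus two hinge functions with an invented knot.
t = 0.4
H = np.column_stack([np.ones_like(x),
                     np.maximum(0.0, x - t),
                     np.maximum(0.0, t - x)])

# Two continuous outcomes built from the same basis, with different weights.
Y = np.column_stack([1.0 + 3.0 * H[:, 1] - 1.0 * H[:, 2],
                     0.5 - 2.0 * H[:, 1] + 4.0 * H[:, 2]])
Y += rng.normal(scale=0.05, size=Y.shape)

# One least-squares solve yields a separate coefficient column per outcome.
B, *_ = np.linalg.lstsq(H, Y, rcond=None)
print(B)  # column 0 ~ [1.0, 3.0, -1.0], column 1 ~ [0.5, -2.0, 4.0]
```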
MARSplines and classification
problems. Because MARSplines
can handle multiple dependent variables, it is easy to apply the algorithm
to classification problems as well. First, MARSplines
will code the classes in the categorical response variable into multiple
indicator variables (e.g., 1 = observation belongs to class k, 0 = observation
does not belong to class k); then MARSplines
will fit a model and compute predicted (continuous) values or scores;
and finally, for prediction, will assign each case to the class for which
the highest score is predicted (see also Hastie, Tibshirani, and Friedman,
2001, for a description of this procedure). The above procedure is handled
by MARSplines automatically for
you.
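A bare-bones illustration of this indicator-coding strategy is sketched below: the categorical response is coded into 0/1 columns, continuous scores are fitted for each class on a shared basis matrix, and new cases are assigned to the class with the highest predicted score. The basis functions, knots, and data are invented; this is not a reproduction of the module's internal procedure.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=300)
classes = np.where(x < -0.2, 0, np.where(x < 0.4, 1, 2))  # 3 hypothetical classes

# Shared basis matrix: intercept plus two hinge functions (knots invented).
H = np.column_stack([np.ones_like(x),
                     np.maximum(0.0, x + 0.2),
                     np.maximum(0.0, x - 0.4)])

# Step 1: code the categorical response into indicator (0/1) columns.
Y = np.column_stack([(classes == k).astype(float) for k in range(3)])

# Step 2: fit continuous scores for every class with the same basis functions.
B, *_ = np.linalg.lstsq(H, Y, rcond=None)

# Step 3: score new cases and pick the class with the highest predicted score.
x_new = np.array([-0.8, 0.2, 0.9])
H_new = np.column_stack([np.ones_like(x_new),
                         np.maximum(0.0, x_new + 0.2),
                         np.maximum(0.0, x_new - 0.4)])
print(np.argmax(H_new @ B, axis=1))  # typically [0, 1, 2] for these points
```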
Model Selection and Pruning
In general, nonparametric models are adaptive and can exhibit a high
degree of flexibility that may ultimately result in overfitting
if no measures are taken to counteract it. Although such models can achieve
zero error on training data (provided they have a sufficiently large number
of parameters), they have the tendency to perform poorly when presented
with new observations or instances (i.e., they do not generalize well
to the prediction of "new" cases). MARSplines,
such as most methods of this kind, tend to overfit the data as well. To
combat this problem, MARSplines
uses a pruning technique (similar to pruning
in classification trees) to limit the complexity of the model by reducing
the number of its basis functions.
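The sketch below outlines the general flavor of such backward pruning with a generalized cross-validation (GCV) style criterion: basis functions are removed one at a time, and the subset with the best criterion value is kept. The exact criterion and penalty used by the module are described in the Technical Notes, so the penalty value and formula here are assumptions made purely for illustration.

```python
import numpy as np

def rss_of_fit(H, y):
    # Residual sum of squares after a least-squares fit on basis matrix H.
    coef, *_ = np.linalg.lstsq(H, y, rcond=None)
    resid = y - H @ coef
    return float(resid @ resid)

def gcv(rss, n_obs, n_terms, penalty=3.0):
    # GCV-style criterion: RSS inflated by an effective number of parameters.
    # The penalty of 3.0 per basis function is an assumed illustrative value.
    effective = n_terms + penalty * (n_terms - 1) / 2.0
    return rss / (n_obs * (1.0 - effective / n_obs) ** 2)

def backward_prune(H, y):
    """Drop one basis function at a time (never the intercept, column 0),
    always removing the column whose deletion yields the lowest GCV, and
    return the column subset with the best GCV seen along the way."""
    n = len(y)
    keep = list(range(H.shape[1]))
    best_keep = keep[:]
    best_gcv = gcv(rss_of_fit(H[:, keep], y), n, len(keep))
    while len(keep) > 1:
        candidates = []
        for col in keep[1:]:  # never drop the intercept
            trial = [c for c in keep if c != col]
            candidates.append((gcv(rss_of_fit(H[:, trial], y), n, len(trial)), trial))
        current_gcv, keep = min(candidates, key=lambda c: c[0])
        if current_gcv < best_gcv:
            best_gcv, best_keep = current_gcv, keep[:]
    return best_keep, best_gcv

# Tiny demonstration: the third basis column is pure noise and should be pruned.
rng = np.random.default_rng(3)
x = rng.uniform(0.0, 1.0, 200)
H = np.column_stack([np.ones_like(x),
                     np.maximum(0.0, x - 0.5),
                     rng.normal(size=x.size)])
y = 1.0 + 2.0 * H[:, 1] + rng.normal(scale=0.1, size=x.size)
print(backward_prune(H, y))  # typically keeps columns [0, 1]
```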
MARSplines as a predictor
(feature) selection method. This feature - the selection and
pruning of basis functions - makes this method a very powerful tool for
predictor selection. The MARSplines
algorithm will pick up only those basis functions (and those predictor
variables) that make a "sizeable" contribution to the prediction
(refer to Technical
Notes for details). The Results
dialog of the Multivariate Adaptive
Regression Splines (MARSplines) module will clearly identify (highlight)
only those variables associated with basis functions that were retained
for the final solution (model).
Applications
Multivariate Adaptive Regression Splines
(MARSplines) has become very popular for finding predictive
models for "difficult" data
mining problems, i.e., when the predictor variables do not exhibit
simple and/or monotone relationships to the dependent variable of interest.
Alternative models or approaches that you can consider for such cases
are CHAID,
Classification
and Regression Trees, or any of the many Neural Networks architectures available
in STATISTICA. Because of the
specific manner in which MARSplines
selects predictors (basis
functions) for the model, it generally does "well" in situations
where regression-tree models are also appropriate, i.e., where hierarchically
organized successive splits on the predictor variables yield good (accurate)
predictions. In fact, instead of considering this technique as a generalization
of Multiple
Regression (as it was presented in this introduction), you
may consider MARSplines as a
generalization of Regression
Trees, where the "hard" binary splits are replaced
by "smooth" basis functions. Refer to Hastie, Tibshirani, and
Friedman (2001) for additional details.
Program Overview
The STATISTICA Multivariate Adaptive
Regression Splines (MARSplines) module is an implementation of
techniques popularized by Friedman (1991) for solving regression and classification
type problems (see also Multiple Regression), with the main purpose
of predicting the values of a set of continuous dependent or outcome variables
from a set of independent or predictor variables. There are a large number
of methods available in STATISTICA
for fitting models to continuous variables, such as linear regression
[e.g., Multiple Regression, General Linear Model (GLM)], nonlinear
regression (Generalized Linear/Nonlinear Models),
regression trees (see Classification
and Regression Trees), CHAID, Neural Networks,
etc. (see also Hastie, Tibshirani, and Friedman, 2001, for an overview).
The program will automatically select the best set of predictor variables
or their interactions, and report all model parameters required for interpreting
the model. Options are available to address the problem of overfitting
by restricting the model complexity (maximum number of basis functions),
or by applying pruning after a model of maximum complexity has been fitted
to the data.
A large number of graphs can be computed to evaluate the quality of
the fit and to aid with the interpretation of results. Various code generator
options are available for saving estimated (fully parameterized) models
for deployment
in C/C++/C#, Visual
Basic, PMML
or STATISTICA Enterprise. (See
also, Using
C/C++/C# Code for Deployment.)