Overfitting

In general, the term overfitting refers to the condition where a predictive model (e.g., for predictive data mining) is so "specific" that it reproduces various idiosyncrasies (random "noise" variation) of the particular data from which the parameters of the model were estimated; as a result, such models often may not yield accurate predictions for new observations (e.g., during deployment of a predictive data mining project). Often, various techniques such as cross-validation and v-fold cross-validation are applied to avoid overfitting.

Overfitting a model with one predictor. To demonstrate the concept of overfitting in statistics, consider the case of a data set with a single predictor variable. This is shown in the illustration below.

Although the relationship between the predictor and dependent variable is a smooth function (blue circles), in reality what we see in a statistical experiment is the original data "corrupted" with noise (red squares). Thus our task is to discover the true relationship (the U-shaped curve) between the predictor and the dependent variable given the noisy data set.

Underfitting the data. First we try to fit the red square points using a function that has too few parameters (a weak model), a straight line in this case. This is shown in the next illustration.

Note that, being too simple (inflexible), the model predictions (green line) cannot fit the data well (i.e. join the data cases together), thus missing the underlying structure in the data (note the differences between the blue circles and the green line). Such a model would not perform well on new data since it missed capturing the relationship between the dependent and predictor variables given its insufficient complexity. A model that is too simple for a given task is known to underfit. Such models will usually yield poor generalization ability, i.e., cannot predict well when presented with new data.

Overfitting the data. Let's now greatly increase the complexity of our model such that it will include as many parameters as we have observations in the data, i.e., fit a model with too many parameters. The result of fitting such a model is shown in the following illustration.

Note that the line depicting the model predictions passes through each data point, thus "discovering" the idiosyncratic (random noise) patterns in the training data that are not part of the true model in the population to which we want to generalize. This is known as overfitting, which is typical of overly complex models. Although such models can achieve very low (even zero) error on training data (since they are flexible enough to pass through each data point), they will perform poorly when presented with unseen (test) data (i.e., when we attempt to apply the model to a new sample of observations drawn from the same population).

Fitting a model of the "right" complexity. At this point it is clear that if we are to discover the true function (a U-shaped relationship) from what is presented to us by the sample data (red squares), then we need a model in between, i.e., a model that is more complex (with more parameters) than a straight line, but not so complex that it will not detect patterns in the data that are due to noise.

Such a model, as shown above, will capture the true relationship in the data best and, thus, can achieve the best accuracy for prediction.

Some of the techniques used in predictive data mining (e.g., neural networks, Classification and Regression Trees, etc.) to control model complexity (flexibility) and hence avoid overfitting are based on cross-validation, v-fold cross-validation and regularization (see STATISTICA Automated Neural Networks).