Intrinsically Nonlinear Regression Models - Models for Binary Responses: Probit & Logit

It is not uncommon that a dependent or response variable is binary in nature, that is, that it can have only two possible values. For example, patients either do or do not recover from an injury; job applicants either succeed or fail at an employment test, subscribers to a journal either do or do not renew a subscription, coupons may or may not be returned, etc. In all of these cases, you may be interested in estimating a model that describes the relationship between one or more continuous independent variable(s) to the binary dependent variable.

Using linear regression. Of course, you could use standard multiple regression procedures to compute standard regression coefficients. For example, if you studied the renewal of journal subscriptions, you could create a y variable with 1's and 0's, where 1 indicates that the respective subscriber renewed, and 0 indicates that the subscriber did not renew. However, there is a problem: Multiple Regression does not "know" that the response variable is binary in nature. Therefore, it will inevitably fit a model that leads to predicted values that are greater than 1 or less than 0. However, predicted values that are greater than 1 or less than 0 are not valid; thus, the restriction in the range of the binary variable (e.g., between 0 and 1) is ignored if one uses the standard multiple regression procedure.

Continuous response functions. We could rephrase the regression problem so that, rather than predicting a binary variable, we are predicting a continuous variable that naturally stays within the 0-1 bounds. The two most common regression models that accomplish exactly this are the logit and the probit regression models.

Logit regression. In the logit regression model, the predicted values for the dependent variable will never be less than (or equal to) 0, or greater than (or equal to) 1, regardless of the values of the independent variables. This is accomplished by applying the following regression equation, which actually has some "deeper meaning" as we will see shortly (the term logit was first used by Berkson, 1944):

y = exp(b0 + b1*x1 + ... + bn*xn)/{1 + exp(b0 + b1*x1 + ... + bn*xn)}

You can easily recognize that, regardless of the regression coefficients or the magnitude of the x values, this model will always produce predicted values (predicted y's) in the range of 0 to 1.

The name logit stems from the fact that you can easily linearize this model via the logit transformation. Suppose we think of the binary dependent variable y in terms of an underlying continuous probability p, ranging from 0 to 1. We can then transform that probability p as:

p' = loge{p/(1-p)}

This transformation is referred to as the logit or logistic transformation. Note that p can theoretically assume any value between minus and plus infinity. Since the logit transform solves the issue of the 0/1 boundaries for the original dependent variable (probability), we could use those (logit transformed) values in an ordinary linear regression equation. In fact, if we perform the logit transform on both sides of the logit regression equation stated earlier, we obtain the standard linear regression model:

p' = b0 + b1*x1 + b2*x2 + ... + bn*xn

Probit regression. You can consider the binary response variable to be the result of a normally distributed underlying variable that actually ranges from minus infinity to positive infinity. For example, a subscriber to a journal can feel very strongly about not renewing a subscription, be almost undecided, "tend toward" renewing the subscription, or feel very much in favor of renewing the subscription. In any event, all that we (the publisher of the journal) will see is the binary response of renewal or failure to renew the subscription. However, if we set up the standard linear regression equation based on the underlying "feeling" or attitude we could write:

feeling... = b0 + b1*x1 + ...

which is, of course, the standard regression model. It is reasonable to assume that these feelings are normally distributed, and that the probability p of renewing the subscription is about equal to the relative space under the normal curve. Therefore, if we transform each side of the equation so as to reflect normal probabilities, we obtain:

NP(feeling...) = NP(b0 + b1*x1 + ...)

where NP stands for normal probability (space under the normal curve), as tabulated in practically all statistics texts. The equation shown above is also referred to as the probit regression model. (The term probit was first used by Bliss, 1934).

Note: Generalized Linear/Nonlinear Model (GLZ). You can also use the Generalized Linear/Nonlinear Model (GLZ) module to analyze binary response variables. GLZ is an implementation of the generalized linear model and allows you to compute a standard, stepwise, or best subset multiple regression analysis with continuous as well as categorical predictors, and for binomial or multinomial dependent variables (probit regression, binomial and multinomial logit regression; see also Link Functions). In general, the estimation algorithms implemented in the Generalized Linear Models (GLZ) module are more efficient, and STATISTICA only includes these models here for compatibility purposes.