Response Optimization Example - Regression

Applying the Simplex Algorithm

To demonstrate the use of STATISTICA Response Surface Optimization, we start with a step-by-step example involving a regression problem in which the predicted (dependent) variable is the median house price in housing tracts in the Boston area. We use the BostonHousing.sta data set, which was originally collected by Harrison and Rubinfeld (1978). The variables are described in the table below.

Variable                     Description

Crime Rate                   Per capita crime rate by town
Residential Land Zone        Proportion of residential land zoned for lots over 25,000 sq. ft.
Non-retail Business acres    Proportion of non-retail business acres per town
Charles River                Charles River dummy variable (1 if tract bounds river; 0 otherwise)
Nitric Oxide                 Nitric oxide concentration (parts per 10 million)
Average Rooms                Average number of rooms per dwelling
Owner Occupied Units         Proportion of owner-occupied units built prior to 1940
Distance to Employment       Weighted distances to five Boston employment centers
Accessibility to Highways    Index of accessibility to radial highways
Property Tax Rate            Full-value property tax rate per $10,000
Pupil-Teacher Ratio          Pupil-teacher ratio by town
% of Lower Status            Percentage of the population of lower status
Value of occupied homes      Median value of owner-occupied homes in $1,000s

Number of cases = 506

Several neural network models were trained for predicting the observed house prices in this data set, and the models were saved in PMML format. Our task is to use these models to apply a Response Optimization analysis to house prices in the Boston area.

We will consider a scenario where a real estate agent is in possession of this data set together with several neural network models that were trained to predict house prices. Our task is to find a house in the Boston area for a customer with a fixed budget. The customer might have preferences regarding certain attributes of the house, e.g., Crime Rate, Nitric Oxide level, Distance to Employment, etc. These attributes are some of the independent variables of the data set on which house prices in the Boston area are based. In other words, the task is to find a set of independent values, i.e., house attributes, that the customer can expect in the purchased property given a fixed budget, i.e., the desired value of the response variable.

To start, open the BostonHousing.sta file via the File - Open Examples menu; it is in the Datasets folder.

Next, select Response Optimization for Data Mining Models from the Data Mining - Process Optimization submenu to display the Response Optimization Startup Panel.

Click the Load models button to load the following XML files from the Examples/Datasets folder:

  1. GRNN_BostonHousing1.xml

  2. GRNN_BostonHousing2.xml

  3. GRNN_BostonHousing3.xml

  4. GRNN_BostonHousing4.xml.

These *.xml files contain trained neural networks of type GRNN (generalized regression neural network), stored in PMML format. These networks are particularly suited for predicting house prices given our data set.

Upon loading the predictive models, the Response Optimization module will automatically set the options to default values. These values are calculated from the statistics of the appropriate variables in the data set, such as mean, variance, minimum, and maximum. For example, the desired response (Quick tab) is set equal to the mean of the response (dependent) variable in the data set. Similarly, the descriptive statistics of the independent variables are used to initialize the starting, step size, minimum and maximum values of the Simplex algorithm (Simplex tab and Model exploration tab).
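As a rough illustration of how such defaults could be derived from descriptive statistics, here is a minimal Python sketch for a single independent variable. It is not STATISTICA code. The function name default_search_settings and the sample values are hypothetical, and the choice of the standard deviation as the step size is an assumption; the text above only states that the defaults are computed from statistics such as the mean, variance, minimum, and maximum.

```python
import statistics


def default_search_settings(values):
    """Derive illustrative search defaults for one independent variable."""
    return {
        "start": statistics.mean(values),    # starting value: sample mean
        "step": statistics.pstdev(values),   # ASSUMPTION: step size = standard deviation
        "min": min(values),                  # lower search bound: observed minimum
        "max": max(values),                  # upper search bound: observed maximum
    }


# Made-up values for illustration only (not taken from BostonHousing.sta).
rooms = [5.0, 6.0, 6.5, 7.0, 8.0]
settings = default_search_settings(rooms)
print(settings)
```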

However, you may need to modify some of these default values as they might not be the best choice for your particular analysis. For instance, you may need to change the desired response (Quick tab) value to a setting of your choice. Some particularly useful information you can display in spreadsheet format is the descriptive statistics of the variables. To display this spreadsheet, click the Variables button in the Response Optimization Startup Panel.

This information helps you to select sensible option settings for your analysis. For example, the minimum and maximum of the variables can help you with the settings of the Simplex, Grid, and Random algorithms (see the documentation for the Search settings button on the Simplex, Grid, and Random tabs). By setting the minimum and the maximum of the Simplex algorithm, for instance, equal to the minimum and the maximum of the appropriate variables in the data set, you confine the Simplex search to the region of the independent variable space that was covered by the original data set. This confinement is important because the predictive models were trained only on that region, so optimization conducted far outside the boundaries of the original data set may lead to unreliable results.
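One simple way to picture this confinement is to clamp every candidate point to the observed range of each variable before the model is evaluated. The Python sketch below shows that idea only; how STATISTICA enforces the bounds internally is not described here, and the function name clamp_to_data_range and the numbers are hypothetical.

```python
def clamp_to_data_range(point, bounds):
    """Pull each coordinate back into its (minimum, maximum) interval."""
    return [min(max(x, lo), hi) for x, (lo, hi) in zip(point, bounds)]


# Illustrative (minimum, maximum) pairs for two independent variables.
bounds = [(0.0, 10.0), (3.0, 9.0)]

# The first coordinate lies outside the observed range and is clamped to 10.0.
candidate = [12.5, 4.0]
clamped = clamp_to_data_range(candidate, bounds)
print(clamped)
```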

Going back to the real estate scenario described previously, our task now is to find what kind of house our customer can get given a fixed budget. Let's assume that the budget is $20K. Because the response variable is recorded in $1,000s, this means we have to set the value of the Seek target value option (in the Optimization type group box of the Quick tab) to 20.
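Seeking a target value can be thought of as minimizing the distance between the model prediction and the desired response. The following Python sketch illustrates this with a squared difference. Both the squared-error form and the toy linear model are assumptions made for illustration; they stand in for the GRNN models and for whatever distance measure the module actually uses.

```python
def target_objective(model, target):
    """Wrap a model so that minimizing the result seeks the target response."""
    def objective(x):
        # ASSUMPTION: squared distance between prediction and target.
        return (model(x) - target) ** 2
    return objective


def toy_price_model(x):
    """Hypothetical stand-in for a trained model (price in $1,000s)."""
    rooms, crime = x
    return 4.0 * rooms - 2.0 * crime


objective = target_objective(toy_price_model, target=20.0)

on_target = objective([5.0, 0.0])    # prediction is exactly 20
off_target = objective([6.0, 0.5])   # prediction is 23, i.e. 3 above the target
print(on_target, off_target)
```

An optimizer that drives this objective toward zero ends up at a set of house attributes for which the model predicts the desired price.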

Next, click the Optimize button in the Startup Panel to initiate the Simplex algorithm. While the algorithm is in progress, a progress bar will be displayed showing the predictive model undergoing optimization and the iteration number. When the search is complete, output in the form of spreadsheets and a graph will be displayed, the last of which is the plot of model predictions (y-axis) against iteration number (x-axis).

By reviewing this graph (which contains one plot per model) you can tell if the algorithm has succeeded in finding the desired value and how many iterations it took to converge. Note that the same information displayed in this graph can also be viewed in the form of a spreadsheet (Iterations, Simplex search spreadsheet).

The Results spreadsheet is perhaps the most important output.

Here you can view the final solutions found by the algorithm for each predictive model, i.e., the set of independent values (house attributes) for which the predictive models yielded the desired response (desired house price). Note that each individual model has its own solution, which may vary from one model to another. This is mainly a "finite data size" effect, i.e., a consequence of having a limited number of cases in the training data set. Since data sets are always finite in size, such variations among predictive models will always exist to some degree. To alleviate this problem, we can combine the existing predictive models (should there be more than one). Models combined to cooperate on making predictions are called ensembles, which are known to have a better generalization ability (i.e., to predict unseen data more accurately).

With STATISTICA Response Optimization, you can form ensembles out of existing models by selecting the Combine models check box on the Simplex tab.

So, instead of making house price predictions based on one model, let's combine our models. To do so, select the Combine models check box and click the Optimize button once again.

When the search is complete, a number of spreadsheets and graphs will be displayed. One particularly useful bit of information is the variance displayed by the ensemble. A large value should be a cause for concern since it indicates that the predictive models yielded substantially different response values given the same set of independent values. Note that the agreement among the ensemble members can also be viewed in the iterations plot of the ensemble, where this information is conveniently displayed as error bars.
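The idea of an ensemble prediction and its variance can be sketched in a few lines of Python. The sketch assumes that the ensemble response is the plain average of the member predictions, which is a common choice but is not confirmed here for STATISTICA. The three toy models are hypothetical stand-ins for the loaded GRNN models.

```python
import statistics


def ensemble_prediction(models, x):
    """Return the mean and the variance of the member predictions at x."""
    predictions = [model(x) for model in models]
    # ASSUMPTION: members are combined by simple averaging.
    return statistics.mean(predictions), statistics.pvariance(predictions)


# Hypothetical member models that disagree slightly (prices in $1,000s).
models = [
    lambda x: 3.0 * x[0] + 1.0,
    lambda x: 3.0 * x[0] + 3.0,
    lambda x: 3.0 * x[0] + 2.0,
]

mean_price, variance = ensemble_prediction(models, [6.0])
print(mean_price, variance)
```

A small variance means the members agree at that point; a large variance signals that the combined prediction there should be treated with caution.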

Note that you can repeat the same optimization for any house price the customer is willing to pay. In our next search we can, for example, double the price by setting the Seek target value option to 40, and then observe what kind of improvement the target property might have over those sold for a mere $20K.

A glance at the results spreadsheets for houses priced at $20K and $40K shows significant improvements in many property attributes such as Crime Rate and Average Rooms.

Finally, let's assume that the customer has a strong preference for living in an area where the crime rate is low while still being willing to pay only $20K. Thus, we want to find what kind of housing one can purchase given the price and subject to the condition Crime Rate = 0.1%. To do this, select the Simplex tab and click the Search settings button to display a standard STATISTICA general user entry spreadsheet.

On the spreadsheet, locate the Crime Rate entry and set the Starting value to 0.1 and the Step size to 0.

Click the OK button to make your modifications permanent.

Note that, by setting the step size of the independent variable Crime Rate to 0, you force the Simplex algorithm to exclude this particular variable from the optimization process, so it stays fixed at its starting value. On the Quick tab, change the value specified in the Seek target value option back to 20. Next, click the Optimize button once again, and compare the results with the previous ones.
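To see why a zero step size freezes a variable, consider how a Nelder-Mead-style simplex search is typically initialized: the starting point is one vertex, and each further vertex offsets one variable by its step size. The Python sketch below builds such an initial simplex. This construction is an assumption made for illustration and is not taken from the STATISTICA implementation; the function name and values are hypothetical.

```python
def initial_simplex(start, steps):
    """Build starting vertices by offsetting one variable per vertex."""
    vertices = [list(start)]
    for i, step in enumerate(steps):
        if step == 0:
            continue  # zero step: no vertex ever moves along this variable
        vertex = list(start)
        vertex[i] += step
        vertices.append(vertex)
    return vertices


# Illustrative order of variables: Crime Rate, Average Rooms, Distance to Employment.
start = [0.1, 6.0, 4.0]
steps = [0.0, 0.5, 1.0]   # Crime Rate is fixed by its zero step size

simplex = initial_simplex(start, steps)
for vertex in simplex:
    print(vertex)
```

Every vertex shares the same Crime Rate. The usual simplex moves (reflection, expansion, contraction) combine existing vertices with weights that sum to one, so a coordinate that is identical in all vertices remains at that value throughout the search.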

What if the Simplex fails?

The Simplex technique is a guided optimization algorithm that can find the desired solution in a finite number of steps. However, like any other algorithm, it may sometimes fail to find the desired solution. In such cases, you can use the Grid or Random algorithms as alternatives. These algorithms are implementations of simple techniques based on brute computing power. In this section, we will use the Random algorithm to search for houses in the Boston area. For the application of the Grid algorithm to optimization tasks, see the step-by-step example for classification.

As before, we want to find the attributes of a $20K property located in an area with a crime rate of 0.1%, this time using the Random algorithm.

Select the Random option button on the Response Optimization Startup Panel Quick tab. As before, set the house price, using the Seek target value option, to 20.

Also, make sure that the mean and variance of the distributions for sampling from the independent variables are set to suitable values. The larger the variance of a sampling distribution, the wider the range of the corresponding independent variable that will be explored by the algorithm. You can check these settings by clicking the Search settings button on the Random tab to display a STATISTICA user input spreadsheet, which you can use to modify the means and variances. To fix the Crime Rate at 0.1, set the mean for the sampling distribution of this variable to 0.1 and its variance to 0.

Also, make sure that the Sample size is set to a suitable value (on the Random tab). The larger the variances of the sampling distributions, the more samples you need in order to produce accurate results.

Next, click the Optimize button. When random sampling is complete, two spreadsheets will be displayed showing the settings of the Random algorithm and the search results.
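The Random approach described in this section can be sketched in Python as follows. The sketch assumes normal sampling distributions, which the text above does not specify, and uses a hypothetical toy model in place of the GRNN models. It draws a fixed number of samples and keeps the one whose prediction lies closest to the target; a variance of 0 pins a variable to its mean.

```python
import math
import random


def random_search(model, target, means, variances, sample_size, seed=0):
    """Sample candidate points and keep the one closest to the target response."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    best_x, best_error = None, math.inf
    for _ in range(sample_size):
        # ASSUMPTION: each variable is drawn from a normal distribution.
        # A variance of 0 gives sigma = 0, so the draw always equals the mean.
        x = [rng.gauss(mu, math.sqrt(var)) for mu, var in zip(means, variances)]
        error = abs(model(x) - target)
        if error < best_error:
            best_x, best_error = x, error
    return best_x, best_error


def toy_price_model(x):
    """Hypothetical stand-in for a trained model (price in $1,000s)."""
    crime, rooms = x
    return 4.0 * rooms - 2.0 * crime


best_x, best_error = random_search(
    toy_price_model,
    target=20.0,
    means=[0.1, 6.0],       # Crime Rate fixed at 0.1
    variances=[0.0, 1.0],   # zero variance excludes Crime Rate from the search
    sample_size=2000,
)
print(best_x, best_error)
```

Widening a variance spreads the samples over a larger range, which is why larger variances call for a larger sample size to land near the target.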