Working with Text Variables and Text Values: Ensuring Consistent Coding

The analytic and data management procedures of Statistica and Statistica Data Miner recognize various numeric data (variable) types as well as variables of type text. In addition, text labels can be specified for the values of numeric variables, and those labels are used to label the results of various analyses. For example, a variable Gender can be of type text, with the two values Male and Female; Gender could also be specified as a numeric variable with the integer codes 1 and 2 and the text labels 1=Male and 2=Female. When using such variables in predictive data mining projects, it is important to use consistent variable types and (if the numeric data type is used) consistent coding for all data sources throughout the project. Otherwise, misleading results may be computed (see also Using text variables or text values in data miner projects). Note that Statistica Data Miner includes specific options and tools (described in this section) to ensure consistency of coding; these tools are particularly useful and flexible when working with text variables, which may occur frequently when data sources specify connections to remote databases (with Streaming Database Connectors).

Numeric variables with text labels

All analyses and data management operations performed on numeric variables will operate on numbers, regardless of whether or not text labels exist. This general rule applies to all interactive analyses performed via Statistica as well. For example, suppose you performed a classification analysis where one of the categorical predictor variables of interest is numeric with the two values 0 and 1, and labels 0=No and 1=Yes. You used one of the nodes marked with deployment, and thus automatically generated deployment information using this categorical predictor. When using the deployment information for a new data source, to compute predicted classifications, it is necessary that the same coding (0=No, 1=Yes) is used for the categorical predictor; otherwise, the predicted values will not be correctly computed.
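
The requirement above can be illustrated with a check that compares the value labels used during training with those found in a new data source before scoring. This is a minimal Python sketch, not Statistica code: the function name coding_mismatches and the representation of value labels as dictionaries are assumptions made for the example.

```python
def coding_mismatches(training_labels, deployment_labels):
    """Return the numeric codes whose text labels differ between two codings.

    Both arguments map numeric codes to text labels, e.g. {0: "No", 1: "Yes"}.
    A code that is missing on one side is reported with None for that side.
    """
    mismatches = {}
    for code in sorted(set(training_labels) | set(deployment_labels)):
        trained = training_labels.get(code)
        deployed = deployment_labels.get(code)
        if trained != deployed:
            mismatches[code] = (trained, deployed)
    return mismatches


# Hypothetical codings for the categorical predictor described above.
training = {0: "No", 1: "Yes"}
same_coding = {0: "No", 1: "Yes"}
swapped_coding = {0: "Yes", 1: "No"}

print(coding_mismatches(training, same_coding))     # {}
print(coding_mismatches(training, swapped_coding))  # {0: ('No', 'Yes'), 1: ('Yes', 'No')}
```

An empty result means the numbers carry the same meaning in both data sources; any other result indicates that predictions computed from the numeric codes would be misleading.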

When the scenario described above occurs in a data mining project, Statistica issues a warning to remind you that you must ensure consistent coding for all categorical variables (selected for the analyses) that are numeric but contain text labels for specific values.

Text variables with text values: the text values manager

In most cases, when connecting to an external database via streaming database connectors, variables containing information about categories or classes will be recognized by Statistica as text variables; Statistica data spreadsheets also support variables of type text. All text values processed as categorical predictor or dependent variables (or selected as learning/testing or censoring indicator) in a data miner project will be coded consistently inside that project. However, note that the coding may be different in different data miner projects, or when such data sources are used for interactive analyses! Therefore, when developing your own custom nodes for predictive data mining, you will want to make use of the specific Statistica Visual Basic functions available for managing the coding of text values into numeric codes.

For example, linear discriminant function analysis will basically generate results for predicting group membership, so that you can classify new observations into group 1, 2, 3, etc. The translation of 1 (2, 3, ...) to the appropriate text labels (e.g., High Income) is performed by the program at the point of generating the final results spreadsheets or graphs. If you want to design a custom node that will perform the same translation (of group 1 to High Income), you can use the properties and methods of the TextValuesManager object to guarantee that your results are compatible with those generated by other nodes. This is particularly important when categorical predictor variables (of type text) are used for computing predicted values or classifications. For example, the programmer must make sure that if the text value Male was coded as 1 (group 1) during training, then it is also coded in this manner (treated as group 1) during deployment, when computing predicted values.
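
The idea of one shared coding for training and deployment can be sketched as follows. This is a conceptual Python illustration only; the class TextCodeTable and its methods are hypothetical and do not reproduce the properties or methods of the actual TextValuesManager object.

```python
class TextCodeTable:
    """Hypothetical per-project table of text/value pairs (not the SVB API)."""

    def __init__(self):
        self._codes = {}  # variable name -> {text value: numeric code}

    def code_for(self, variable, text):
        """Return the code for a text value, assigning the next free code if new."""
        codes = self._codes.setdefault(variable, {})
        if text not in codes:
            codes[text] = len(codes) + 1  # codes start at 1 (group 1, 2, 3, ...)
        return codes[text]

    def text_for(self, variable, code):
        """Translate a numeric code back to its text value."""
        for text, known_code in self._codes.get(variable, {}).items():
            if known_code == code:
                return text
        raise KeyError(f"no text value with code {code} for {variable!r}")


# One table is shared by training and deployment, so Male stays group 1.
project_table = TextCodeTable()
training_codes = [project_table.code_for("Gender", v) for v in ["Male", "Female", "Male"]]
deployment_codes = [project_table.code_for("Gender", v) for v in ["Female", "Male"]]

# A fresh (ad-hoc) table codes values in order of appearance instead.
ad_hoc_table = TextCodeTable()
ad_hoc_codes = [ad_hoc_table.code_for("Gender", v) for v in ["Female", "Male"]]

print(training_codes)    # [1, 2, 1]
print(deployment_codes)  # [2, 1]
print(ad_hoc_codes)      # [1, 2]
print(project_table.text_for("Gender", 1))  # Male
```

With the shared table, Female and Male keep the codes assigned during training; with the ad-hoc table, the same deployment data would be coded the other way around, which is exactly the inconsistency described above.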

The TextValuesManager Object

The properties and methods of the TextValuesManager object (interface) can be viewed in the SVB Object Browser.

The TextValuesManager object has functions for adding label/value pairs and retrieving them, label/value lookup, etc. Both the DataMiner and Spreadsheet objects (interfaces) have properties of type TextValuesManager, and methods for setting and getting (retrieving) the TextValuesManager. The data miner project will create a TextValuesManager object and persist it when the project is saved. The DataMiner object sets this instance of the TextValuesManager into each source spreadsheet before a Run, and before variable selections are made; the DataMiner object will otherwise de-activate this TextValuesManager, so that interactive analyses or other data miner projects running on the same data source can implement their own (consistent) coding for text variables (values). Because the in-place database connection also implements the Spreadsheet interface, this will work in the same manner for database connections for streaming database connectors.

While a Spreadsheet/Streaming Database Connector has a TextValuesManager set, it will forward text value related calls to the TextValuesManager, instead of handling those calls itself. Note that TextValuesManager only handles a specific subset of the Spreadsheet's variables, not all of them: only categorical variables (including the learning/testing indicator, and indicator for complete/censored observations; see also the Select dependent variables and predictors dialog) are handled by the TextValuesManager; for variables not handled by the TextValuesManager the normal processing will occur.  
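
The forwarding behavior described above can be pictured with a small delegation sketch. It is written in Python for illustration only; the classes SharedCoder and Sheet, and all of their members, are hypothetical stand-ins and are not the Statistica Spreadsheet or TextValuesManager interfaces.

```python
class SharedCoder:
    """Hypothetical shared coder that handles only selected categorical variables."""

    def __init__(self, handled_variables):
        self.handled_variables = set(handled_variables)
        self._codes = {}  # variable name -> {text value: numeric code}

    def handles(self, variable):
        return variable in self.handled_variables

    def code_for(self, variable, text):
        codes = self._codes.setdefault(variable, {})
        return codes.setdefault(text, len(codes) + 1)


class Sheet:
    """Hypothetical data source that forwards text-value calls when a coder is set."""

    def __init__(self):
        self.coder = None       # no shared coder set by default
        self._local_codes = {}  # the sheet's own (ad-hoc) coding

    def code_for(self, variable, text):
        # Forward only the variables the shared coder is responsible for.
        if self.coder is not None and self.coder.handles(variable):
            return self.coder.code_for(variable, text)
        # Normal processing for all other variables.
        codes = self._local_codes.setdefault(variable, {})
        return codes.setdefault(text, len(codes) + 1)


shared = SharedCoder(handled_variables=["Gender"])
sheet_a, sheet_b = Sheet(), Sheet()
sheet_a.coder = shared
sheet_b.coder = shared

# Gender is handled by the shared coder: both sheets agree on the codes.
a_male = sheet_a.code_for("Gender", "Male")
b_female = sheet_b.code_for("Gender", "Female")
b_male = sheet_b.code_for("Gender", "Male")

# Region is not handled: each sheet codes it on its own.
a_west = sheet_a.code_for("Region", "West")
b_east = sheet_b.code_for("Region", "East")
b_west = sheet_b.code_for("Region", "West")

print(a_male, b_female, b_male)  # 1 2 1
print(a_west, b_east, b_west)    # 1 1 2
```

In this sketch, Male receives the same code in both data sources because the call is forwarded, while West is coded differently in each sheet because it falls back to the sheet's own processing.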

The DataMiner object has various functions that allow you to fully customize the TextValuesManager, including functions to set and unset the TextValuesManager in all data sources, and to retrieve (get) the TextValuesManager from a data source. You can also make entries into the TextValuesManager (from inside a node), clear it, or set the per-variable maxima, etc. (see below). Another function of the DataMiner object lets you disable the TextValuesManager so that this mechanism (for ensuring consistent coding of categorical variables of type text) will not be used; in that case, the default (ad-hoc) coding for text values on a per-analysis basis will be applied.

Managing the size of (number of entries in) the TextValuesManager

The maximum number of text/value associations per text variable is set by default to 1000, but can be changed programmatically from inside node scripts. The absolute maximum number of entries in the TextValuesManager is 10 million, and can also be changed programmatically.

As you apply the same project to different data sources with many text variables and values, the TextValuesManager may slowly fill up. Remember that the TextValuesManager persists when a data mining project is saved (and is restored when the project is loaded again), so in unusual situations where your analyses involve very many categorical text variables with many different text values, it may be useful to clear out all entries in the TextValuesManager from time to time. This can be done programmatically, from inside a node, or via the Clear Text Values option on the data miner workspace Tools menu.
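
The two size limits and the clearing operation can be sketched as follows. This Python illustration is not Statistica code: the class BoundedTextTable and its members are hypothetical, and raising an error when a limit is reached is an assumption of the sketch (the source does not state what Statistica does in that case). Only the default values of 1000 and 10 million are taken from the text above.

```python
class BoundedTextTable:
    """Hypothetical table of text/value pairs with per-variable and total limits."""

    def __init__(self, per_variable_max=1000, total_max=10_000_000):
        self.per_variable_max = per_variable_max
        self.total_max = total_max
        self._codes = {}  # variable name -> {text value: numeric code}

    def entry_count(self):
        """Total number of text/value associations across all variables."""
        return sum(len(codes) for codes in self._codes.values())

    def add(self, variable, text):
        """Return the code for a text value, adding it if the limits allow."""
        codes = self._codes.setdefault(variable, {})
        if text in codes:
            return codes[text]
        # Assumption: exceeding a limit raises an error in this sketch.
        if len(codes) >= self.per_variable_max:
            raise OverflowError(f"per-variable maximum reached for {variable!r}")
        if self.entry_count() >= self.total_max:
            raise OverflowError("absolute maximum number of entries reached")
        codes[text] = len(codes) + 1
        return codes[text]

    def clear(self):
        """Remove all entries, as one might do when the table has filled up."""
        self._codes.clear()


# Small limits are used here so that the behavior is easy to follow.
table = BoundedTextTable(per_variable_max=2, total_max=3)
table.add("City", "Tulsa")
table.add("City", "Austin")
table.add("State", "OK")
print(table.entry_count())  # 3

table.clear()
print(table.entry_count())  # 0
```

Note that after clearing, text values are assigned codes afresh, so codes assigned before and after a clear are not guaranteed to match in this sketch.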

See also Data Mining Definition, Data Mining with Statistica Data Miner, Structure and User Interface of Statistica Data Miner, Statistica Data Miner Summary, and Getting Started with Statistica Data Miner.