Controlling the Flow of Data through Nodes: Some Special Issues and Considerations

The Statistica workspace environment is extremely flexible and can be customized to accept a variety of data sources, with or without case selection conditions, case weights, etc. When writing a general-purpose Statistica Visual Basic script for managing a node, it is important to be aware of the different types of data sources that can be connected to the InputDescriptors, and of the types of operations that can be performed on them. For example, if the data source is a query to a remote database for in-place processing (see Streaming Database Connector Technology), you usually cannot write data directly to the data source (since you will likely have no write permissions for the input database). Moreover, Statistica Data Miner supports so-called mappings of variable lists into data sources, for efficiently managing data files with many thousands, hundreds of thousands, or millions of variables (columns). In order to design nodes that are general, i.e., that can handle all of these data sources, it is important to understand the exact nature of the relationship between the InputDescriptor, the data source, and the analyses to be performed on the data source.

The DataSource property of the InputDescriptor

The DataSource property of the InputDescriptor is of type InputSpreadsheet. The InputSpreadsheet can be a Statistica data spreadsheet or a (remote) database connection via the Streaming Database Connector (see also Spreadsheets and Queries: IsDataBaseConnection below). If the input data source is a data spreadsheet and you write directly to the cells of that spreadsheet, you will change the original input data; if the data source is a database connection, then attempting to write to the original data will usually cause an error.

When transforming or recoding input data, use the InputDescriptor Clone method and its options (see below) to first make a copy of the InputDescriptor and its attached data, and then transform the copy to be passed on for further analyses on output. Here is an example:

Private Sub SubsetNode( _
    DataIn() As InputDescriptor, _
    DataOut() As InputDescriptor)
    ReDim DataOut(LBound(DataIn()) To _
        UBound(DataIn())) As InputDescriptor
    For i = LBound(DataIn()) To UBound(DataIn())
        ' Clone the input InputDescriptor, and attach a "real" data
        ' spreadsheet to the clone
        Set DataOut(i) = DataIn(i).Clone(, scCloneAndMakeSpreadsheet)
        Dim ss As Spreadsheet
        Set ss = DataOut(i).DataSource
        ' Add a new variable to the data source (spreadsheet)
        ss.AddVariables("Random Numbers", ss.NumberOfVariables, 1)
        ' Fill the new variable with random numbers
        ss.Variable(ss.NumberOfVariables).FillRandomValues
    Next i
End Sub

In this example, a new variable (data column) with random numbers is added to the input data and passed on for further analyses by other nodes. Note that the Clone method has various options for dealing with case selection conditions, different data sources, etc.; these are described in Cloning the InputDescriptor: Managing the Flow of Data below.

Spreadsheets and Queries: IsDataBaseConnection

Statistica and Statistica Data Miner can analyze data in standard data spreadsheets or in (remote) databases via queries using Streaming Database Connector. You can determine the nature of the input data at run time by using the IsDataBaseConnection method for the InputSpreadsheet object (note that the DataSource property is of type InputSpreadsheet). For example:

Private Sub SubsetNode( _
    DataIn() As InputDescriptor, _
    DataOut() As InputDescriptor)
    For i = LBound(DataIn()) To UBound(DataIn())
        If DataIn(i).DataSource.IsDataBaseConnection Then
            ' Input data is a database connection
            ...
        Else
            ' Input data is a data spreadsheet
        End If
    Next i
    ...
End Sub

The distinction between data spreadsheets and database connections is important mostly for one reason: you cannot (usually) write data into a database. So, for example, if you want to transform or recode values in the input data, change missing data codes, delete cases or variables, etc., you will first have to copy the data to a spreadsheet before you can perform those operations. The most convenient way to guarantee that the data source of an InputDescriptor is a data spreadsheet (and not a database connection) is to use the Clone method and its options for the InputDescriptor object. This is usually good practice anyway, because if you change the values in a data source attached to an InputDescriptor directly, you may inadvertently change the original input data, i.e., the data produced by the node that "feeds into" the current analysis or data cleaning/filtering node.

While changing the original input data in this way can be useful for special-purpose applications, it is usually not desirable behavior, since data should flow from the input to the output without changing the input.

Case Selection Conditions and Case Weights

In Statistica workspaces, case selection conditions and case weights "belong" to the InputDescriptor, and not to the data file (DataSource). This is an important point to remember: If you want to use case selection conditions or case weights in your node script, you need to use the SelectionCondition and CaseWeight properties of the InputDescriptor object, and not of the data source connected to the InputDescriptor. In fact, as described in Spreadsheets and Queries: IsDataBaseConnection, the DataSource property of the InputDescriptor can be a query to a remote database, and in that case, selection conditions or case weights may not be defined for the data source at all.
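As a minimal sketch of this distinction (only the two property names mentioned above are used; reading them into Variant variables is an illustrative assumption, not the documented signatures), a node script would query the InputDescriptor itself, never its DataSource:

```
' Hedged sketch: read case selection and weighting settings from the
' InputDescriptor itself, not from DataIn(i).DataSource (which, for a
' database connection, may not define them at all).
Dim sel As Variant, wgt As Variant
sel = DataIn(i).SelectionCondition   ' NOT DataIn(i).DataSource
wgt = DataIn(i).CaseWeight           ' NOT DataIn(i).DataSource
```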

Case selection conditions, case weights, and Statistica analyses

If your node scripts rely on any of the standard Statistica analysis modules to perform certain data analyses (e.g., to apply neural network analyses), then you can pass the respective InputDescriptor object directly to the Statistica analysis (instead of the DataSource connected to the InputDescriptor). All Statistica analyses will properly handle the case selection conditions and weights by applying the respective properties specified for the InputDescriptor to the analyses, ignoring any selection conditions or weights specified for the DataSource. For example, in almost all standard node scripts supplied with Statistica workspaces you will see code such as the following:

Private Sub SubsetNode( _
    DataIn() As InputDescriptor, _
    DataOut() As InputDescriptor)
    ...
    For i = LBound(DataIn()) To UBound(DataIn())
        ' Note: Here we pass the InputDescriptor itself, and not
        ' the DataSource property of the InputDescriptor
        Set newanalysis = Analysis(scNeuralNetworks, DataIn(i))
    ...

Applying case selection conditions and weights to the data source. In some instances, you may need to apply case selection conditions or weights directly to the data source. The InputDescriptor has the methods ApplySelectionConditions, ApplyCaseWeights, RestoreSelectionConditions, and RestoreCaseWeights to apply case selection conditions or weights to the data source, or to restore the data source to its original condition.
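A sketch of the intended pairing of these calls (only the four methods named above are used; the elided middle is whatever direct work on the data source the node performs):

```
' Temporarily push the InputDescriptor's selection conditions and
' weights down into the attached data source...
DataIn(i).ApplySelectionConditions
DataIn(i).ApplyCaseWeights
' ... operate directly on DataIn(i).DataSource here ...
' ...then restore the data source to its original condition.
DataIn(i).RestoreCaseWeights
DataIn(i).RestoreSelectionConditions
```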

Mapped Variable Selections: Map and UnMap

Statistica and Statistica Data Miner contain unique tools for managing and analyzing huge data sources with, for example, hundreds of thousands of variables (data columns; see also Streaming Database Connector and Feature Selection and Variable Screening). Managing such data sources efficiently as they flow through the data miner nodes (node scripts) requires some additional care. Specifically, the variable selections in the InputDescriptor can be mapped to a subset of all variables available in the DataSource attached to the InputDescriptor. The mapping allows you to make available to subsequent nodes only a subset of original variables, without the need to copy the actual data (from the input InputDescriptor data source to the output InputDescriptor data source). This is particularly important for nodes that implement custom feature selection/screening methods to be applied to huge input data.

Mapped data sources and user interface for variable selection

When a node creates on output (for subsequent analysis) an InputDescriptor with a mapped data source, only those variables that were included in the map are available to the interactive user interface; the complete (original) data file itself is left intact. For example, suppose the input data source is a database connection via Streaming Database Connector, and the query makes available 200,000 variables (data columns) on input. If a feature selection node creates a mapped data source on output (for subsequent analyses), and the map contains a selection of 100 variables (data columns), then the variable selection dialog box (see Select Dependent Variables and Predictors) will only make available (show to the user) those 100 variables to choose from (even though the underlying data source contains 200,000 variables). Also, the variables selected by the user (e.g., variables 4-6) will then reference the positions of variables in the map, and not in the original data source.

Managing mapped data sources

Because the variable lists selected by users (or programmatically selected in a node script) reference the relative positions of variables in the map, and not in the original data source, special functions are provided for mapping and un-mapping data sources, to make it easier to manage those maps. For example, consider the following prototype feature selection script:

Private Sub SubsetNode( _
    DataIn() As InputDescriptor, _
    DataOut() As InputDescriptor)
    ReDim DataOut(LBound(DataIn()) To _
        UBound(DataIn())) As InputDescriptor
    For i = LBound(DataIn()) To UBound(DataIn())
        Dim MySubsetList() As Long
        Dim ss As Spreadsheet
        Set ss = DataIn(i).DataSource
        ' Call a subroutine to select the variables to map into the output;
        ' on output, array MySubsetList should contain all variables (variable
        ' numbers) to be selected (mapped) from the original data source.
        DetermineFeatureSelectionSubset MySubsetList
        ' Clone the input InputDescriptor and attach the mapping to the
        ' variables in the input. The Clone function guarantees that
        ' any variable selections that exist in DataIn(i) are properly
        ' translated into the respective variable numbers to reference
        ' the mapped variable list (provided that all selected variables
        ' are part of MySubsetList).
        Set DataOut(i) = DataIn(i).Clone(ss.MappedSubset(MySubsetList))
    Next i
End Sub

 

The Clone function (see Cloning the InputDescriptor: Managing the Flow of Data below) in this example will create a new InputDescriptor object for output with a mapping of variables. No actual data are copied, so even if the original data source is huge (contains several terabytes of data), this operation is practically instantaneous. Also, all variable selections (e.g., of dependent variables or predictor variables) will be properly translated to reference the location of the variables in the map instead of the original data source; of course, this translation can only succeed if the selected variables are contained in the variable list that is mapped.

Mapped data sources in the data miner project workspace. To reiterate, as a result of the mapping operation, the InputDescriptor that is created has attached to it the original data source and a map to a subset of variables in that data source. For example, the Feature Selection and Variable Screening node of Statistica Data Miner will create such a mapping.

In the illustration above, note how the icon to the right of the Feature Selection and Variable Screening node indicates a database connection. In other words, no actual data were copied to the local computer, and the data source itself has not changed, only the map of variables available for subsequent analyses.

Mapping and unmapping variable lists

Of course, the programmer (developer) of Statistica Data Miner nodes (node scripts) has full control over the mapping and unmapping of data sources, and over the corresponding variable lists selected for the analyses. The InputDescriptor object has the two methods Map and UnMap to either map the attached DataSource to the variables used in the InputDescriptor (so that the map only contains those variables selected for the analyses) or unmap the DataSource. For example, when writing a general-purpose node script where you want to be sure that the InputDescriptor variable lists contain references to the variables (data columns) in the original data source, you can use UnMap in a While...Wend loop to undo any mapping that may be present.

...
While DataOut(i).IsMappedSubset
    DataOut(i).UnMap
Wend
...

As illustrated earlier in this section, you can use the spreadsheet's MappedSubset(SubsetList) method to create mappings of arbitrary variable lists.

Cloning the InputDescriptor: Managing the Flow of Data

One of the most important methods of the InputDescriptor object for controlling the flow of data through the project is the Clone function. You can review a brief description of this function in the Statistica Visual Basic Object Browser.

The Clone function copies all selections and specifications contained in the respective InputDescriptor to a new InputDescriptor. Hence, this is the most efficient method for creating InputDescriptor objects for subsequent analyses with other nodes (i.e., InputDescriptors for output from a data cleaning/filtering or analysis node). The Clone function can be called with several arguments and options:

pSS as Variant

This optional argument enables you to clone an InputDescriptor and attach to the result a different data source; the data source can be a Spreadsheet or an InputSpreadsheet (e.g., a DataSource specifying a database connection).

cloneType as InputDescriptorCloneType

This optional argument enables you to specify whether the data source attached to the InputDescriptor after cloning should be identical to the input DataSource, a spreadsheet copy of the DataSource, etc. This option is very useful if you want to guarantee that the DataSource for the resulting InputDescriptor is a spreadsheet (and not a database connection). As discussed earlier in Spreadsheets and Queries: IsDataBaseConnection, when a node transforms or changes the input data in some way, it is usually necessary to first make a spreadsheet copy of the input DataSource to make sure that it is not a database connection. The following constants of type InputDescriptorCloneType are available as arguments:

  • scClone = 0. Clone the DataSource as is; for example, if the DataSource is a remote database connection, then the resulting InputDescriptor will have the same remote database connection assigned to its DataSource.

  • scCloneAndMakeSpreadsheet = 1. Clone the DataSource as a spreadsheet; if the DataSource is a mapped subset or database connection (query), then the DataSource for the resulting InputDescriptor will be a spreadsheet.

  • scCloneSubset = 2. This option has the same effect as scCloneAndMakeSpreadsheet, except that only those variables that were selected into the variable lists of the source InputDescriptor are transferred into the resulting spreadsheet.

  • scCloneSelect = 3. This option has the same effect as scCloneAndMakeSpreadsheet, except that only those cases (observations) selected by the current case selection conditions (specified for the source InputDescriptor) are transferred into the resulting spreadsheet.

  • scCloneSubsetAndSelect = 4. This option has the same effect as scCloneSubset, except that only those cases (observations) selected by the current case selection conditions (specified for the source InputDescriptor) are transferred into the resulting spreadsheet.

In general, when designing nodes for general-purpose applications, you should make a cloned InputDescriptor with a spreadsheet DataSource whenever you intend to transform or recode data, delete cases (observations), add or delete variables, etc. If the node that you are designing strictly analyzes the data (i.e., does not change it in any way), then it is not necessary to use any of the options that create a spreadsheet DataSource. In fact, in that case it is not desirable, because creating a spreadsheet involves copying data from the original DataSource to the new spreadsheet; with very large data sets, this can require unnecessary computing time and storage resources.
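The choice can be sketched as follows (a hedged illustration using only the Clone options documented above; inside a node's For loop over DataIn/DataOut):

```
' Analysis-only node: clone "as is"; no data are copied, and a database
' connection remains a database connection.
Set DataOut(i) = DataIn(i).Clone(, scClone)

' Transformation node: force a spreadsheet copy so that writing to the
' DataSource is safe and cannot alter the original input data.
Set DataOut(i) = DataIn(i).Clone(, scCloneAndMakeSpreadsheet)
```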

 

See also, Data Mining Definition, Data Mining with Statistica Data Miner, Structure and User Interface of Statistica Data Miner, Statistica Data Miner Summary, and Getting Started with Statistica Data Miner.