Structure and user interface of Dell Statistica Data Miner
Dell Statistica Data Miner is based on libraries
of more than 250 different nodes that contain the complete functionality
of Statistica, as well as specialized methods and functions for data mining.
The Statistica Data Miner user interface is
structured in the following manner:
Input data
First, variables are selected from a standard
variable selection dialog box. You can specify continuous and categorical
dependent and predictor variables, codes, case selection conditions, case
weights, etc. Thus, the description of the input, referred to in the subsequent text as the Data Input Descriptor, is handled by a common dialog box.
A note on variable names
In Statistica (e.g., in macros and spreadsheet formulas), you can refer to variables by their names or by their numbers (v1, v2, v3, ...); v0 is the case number. For some design-syntax-based modules (e.g., GLM), the Vx convention is ambiguous because it is used generically to reference variables by number. Hence, variable names such as V1, V2, etc., are not suitable when those modules will be referenced in the Data Miner analysis. Additionally, repeating variable names (within one spreadsheet) is not recommended for syntax-based modules, either.
The Node Browser: selecting analyses
Next, select the desired type of analysis from the library of available scripts. Data Miner uses a flexible node
browser for this purpose that is fully customizable for particular
projects or jobs.
The Parameters dialog box
Statistica Data Miner
also contains dialog boxes for communicating with the analysis scripts
and property files, for example, in order to modify the parameters of
the analysis. Those dialog boxes use the information in the
.dmi files regarding defaults, types, enumerated constants, and
user access (full access, read-only access, or hidden/no access).
Nodes. Nodes are the individual icons that connect the input (data) to the output (results). Data flows through the nodes, where it is transformed, analyzed, etc. In Statistica Data Miner, the nodes are classified according to how they function: Data Acquisition nodes (specification of input data); Data preparation, cleaning, and transformation nodes; and Data analysis, modeling, classification, and forecasting nodes.
The general purpose of Statistica Data Miner
is to connect to one or more data sources, select variables or cases (observations)
of interest, apply various data verification and cleaning methods, and
then to perform analyses on those data sources and produce final results
tables or graphs that extract essential information from the input data.
All this can be accomplished in a very efficient and convenient user interface
that provides the means to quickly change analytic options, switch input
data, or move from a project that estimates the best parameters (e.g., rules for the classification of some observed data) to one that deploys those estimates in order to, for example, classify new data.
The general architecture of Statistica Data Miner
Data
acquisition. Each analysis starts with the definition of the input
data. Click the New Data Source
button to select the data for the analyses. You can specify Statistica
input data spreadsheets or data sources representing connections to databases
on remote servers (Streaming Database Connector). To specify connections
to external databases, in the Create New Document
dialog box, select the Streaming
DB Connector tab,
and click OK to display the StreamingDB spreadsheet/interface, where
you can specify queries to the data.
The data InputDescriptor object and external databases
What is placed into the Data
Acquisition box is actually a descriptor of the input data, and
not necessarily the data spreadsheet itself. This distinction is very
important, as it holds one of the keys to the power and versatility of
the Statistica Data Miner system. Any input data source that can be mapped
into the data InputDescriptor
object can be used in Data Miner. The data
InputDescriptor object is further described in How
to Write .svx Scripts for Data Miner, for example, in the context
of analysis nodes.
Dirty nodes: specifying variables, case selections, etc.
In addition to the actual data, the InputDescriptor contains information
about the types and nature of the variables that will be used in subsequent
analyses. Initially, when you first specify an input data file, those
variables are not known yet. Hence, and this is a convention used throughout
Statistica Data Miner, the input to the analysis is not fully specified,
and the respective icon in the project workspace is marked as dirty by
showing a red box around it.
[Illustration: a dirty input data icon (red frame) shown to the left of a clean input data icon.]
The icon shown to the left is a dirty input
data icon; no variables have been specified for the analyses yet. When
you double-click on the dirty input data icon, a variable selection dialog
box is displayed in which you can specify various lists of variables and
codes for the analyses. Once the variable selection is complete, the red
box around the icon will be removed (see the icon to the right in the
illustration), and you have a clean icon that is updated and ready to
be connected to subsequent analysis nodes.
Data preparation, cleaning, transformation
Data cleaning is an often neglected but extremely
important step in the data mining process. The old adage "garbage in, garbage out" is particularly applicable to typical data mining projects, where large data sets collected via some automatic methods (e.g., via the web) serve as the input into the analyses. Often, the method by which the data were gathered was not tightly controlled, and so the data may contain out-of-range values (e.g., Income: -100), impossible data combinations (e.g., Gender: Male, Pregnant: Yes), etc. Analyzing data that has not been carefully screened
for such problems can produce highly misleading results. You can access
numerous nodes on the Data tab
or in the Data folder in the Node Browser for variable transformations,
filtering, recoding, subsets, sampling, etc. The Data
Health Check Summary node is also a useful tool to examine the
data.
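To make the kinds of checks involved concrete, here is a minimal sketch in Python using the pandas library (a generic illustration, not one of the Data Miner nodes; the column names Income, Gender, and Pregnant are hypothetical):

    import pandas as pd

    # Hypothetical raw data exhibiting the problems described above.
    df = pd.DataFrame({
        "Income":   [52000, -100, 61000],
        "Gender":   ["Female", "Male", "Male"],
        "Pregnant": ["No", "No", "Yes"],
    })

    # Flag out-of-range values (e.g., Income: -100).
    out_of_range = df["Income"] < 0

    # Flag impossible combinations (e.g., Gender: Male, Pregnant: Yes).
    impossible = (df["Gender"] == "Male") & (df["Pregnant"] == "Yes")

    # Keep only the records that pass both checks.
    clean = df[~(out_of_range | impossible)]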
Statistica Feature Selection and Variable Screening
This tool is indispensable for very large input data sets (containing hundreds of thousands of potential predictor variables). The interactive Statistica Feature Selection and Variable Screening facility is a unique tool for mining huge, terabyte-sized databases, which are typically connected to Statistica via the streaming database connector; the data do not have to be copied onto your local machine, and all queries of the large database to retrieve individual records can be performed on the server, using the database-specific optimized tools. The Feature Selection/Screening module quickly searches through thousands or hundreds of thousands of predictors for regression or classification problems to find those that are likely best suited for the task at hand. The algorithms implemented in the program are general: they do not assume any particular type or nature of relationship (e.g., linear, quadratic, monotone, non-monotone).
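Statistica's own screening algorithm is not reproduced here, but the general idea of ranking a very large set of candidate predictors without presupposing any specific form of relationship can be sketched with a stand-in technique, mutual information, as implemented in scikit-learn (a conceptual illustration only):

    import numpy as np
    from sklearn.feature_selection import mutual_info_regression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 1000))   # 1,000 candidate predictors
    # The response depends non-linearly on only two of them.
    y = np.sin(X[:, 3]) + X[:, 7] ** 2 + rng.normal(scale=0.1, size=500)

    # Mutual information detects non-linear, non-monotone dependence,
    # so the screen does not assume any particular relationship.
    scores = mutual_info_regression(X, y)
    top10 = np.argsort(scores)[::-1][:10]   # the 10 most promising predictors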
Data analysis, modeling, classification, forecasting
This constitutes the meat of the analysis:
Input data from any source, after appropriate data cleaning and transformations
have been applied, are used as the input into subsequent analysis nodes,
which extract the nuggets of information contained in the data. All Statistica
analytic procedures can be used for this purpose, from simple descriptive
statistics, tabulation, or graphical analyses, to complex neural network
algorithms, general linear, generalized linear, generalized additive models,
etc. Even survival analysis techniques for censored observations can be
incorporated, as can quality control charting procedures for monitoring
ongoing active data streams.
Data analysis nodes can be selected from
among the large library of analytic routines contained in the Data Miner
directories of your Statistica installation. They connect to a complete
data InputDescriptor, and produce
either results nuggets, or data InputDescriptors
that can serve as the source for subsequent analyses. For example, you
can generate predicted and residual values via multiple regression, and
those predicted values can then be connected to subsequent data cleaning
or analytic nodes.
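The idea of a node that emits a new data descriptor, here predicted and residual values from a regression that can feed a downstream node, can be sketched as follows (a generic scikit-learn/pandas illustration, not the Data Miner node itself):

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 3))
    y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.2, size=200)

    model = LinearRegression().fit(X, y)

    # The "output descriptor": predicted and residual values appended as
    # new columns, ready to serve as input for a subsequent node.
    out = pd.DataFrame(X, columns=["x1", "x2", "x3"])
    out["predicted"] = model.predict(X)
    out["residual"] = y - out["predicted"]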
Analysis nodes with automatic deployment
Specialized
analytic nodes are available that will automatically generate information
for deployment. After these nodes have estimated a model, they make available
to all other nodes in the current Data Miner project the information necessary
to produce predicted values for new data. Nodes are also available that combine the deployment information to compute, for example, a single predicted classification via voting (e.g., bagging or averaging).
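Combining the predictions of several deployed models by majority vote can be sketched generically (three scikit-learn classifiers stand in for nodes carrying deployment information; this is not Statistica's implementation):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=300, random_state=0)
    models = [LogisticRegression(max_iter=1000).fit(X, y),
              DecisionTreeClassifier(random_state=0).fit(X, y),
              GaussianNB().fit(X, y)]

    # Stack each model's predicted class and take the majority vote,
    # analogous to combining deployment information across nodes.
    votes = np.stack([m.predict(X) for m in models])
    majority = np.apply_along_axis(
        lambda col: np.bincount(col).argmax(), 0, votes)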
How deployment information is stored
The deployment information, for the nodes located
in the Deployment folder of the
Node Browser, is stored in various
forms locally along with each node, as well as globally, visible to other
nodes in the same project. This is an important point to remember, because
for Classification and Regression, the Node Browser contains
a Compute Prediction from All Models node. This node computes predictions based on all deployment information currently available in the global dictionary, which can be reviewed via the Edit Global Dictionary Parameters dialog box. Therefore, when building models for deployment using these options, ensure that all deployment information is up to date (i.e., based on models trained on the most current set of data). See Examples
3 and 4
for illustrations on how to deploy projects.
Predicting new observations when observed values are not (yet) available
One of the main purposes of predictive data mining (see Concepts
in Data Mining) is to allow for accurate prediction (predicted classification)
of new observations, for which observed values or classifications are
not (yet) available. An example of such an application is presented in
Example
3 (see also Example
4). When connecting data for deployment (prediction or predicted classification),
ensure that the structure of the input file for deployment is the same
as that used for building the models (see also option Data
for deployed project; do not re-estimate models in the Select dependent variables and predictors
dialog box). Specifically, ensure that the same numbers and types of predictor
variables are specified, that a (continuous or categorical) dependent
variable is specified (even if all values for that variable are missing),
and that the variable names match those in the data file used to build
the models (this is particularly important for the deployment of neural
networks, which will rely on this information).
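A quick way to guard against such a structural mismatch, outside of Statistica, is to compare the deployment file's layout against the training file's before computing predictions; a minimal pandas sketch (with hypothetical column names) follows:

    import pandas as pd

    train = pd.DataFrame({"Gender": ["Male", "Female"], "Age": [34, 29],
                          "Outcome": [1, 0]})
    # Deployment data: same predictors and the same dependent variable
    # column (its values may all be missing), with matching names.
    deploy = pd.DataFrame({"Gender": ["Female"], "Age": [41],
                           "Outcome": [pd.NA]})

    # The structures must line up before predictions are computed.
    assert list(train.columns) == list(deploy.columns), "column mismatch"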
Using text variables or text values in data miner projects
When using text variables or variables with text values in data miner
projects with deployment, or projects that compare predicted classifications
from different nodes via Goodness of Fit nodes, you should be careful
to review the results to ensure that the coding of categorical variables
(with text values) is consistent across nodes. Generally, the program
will automatically ensure that identical coding is used across all nodes
in the same data miner project. However, when using numeric variables
with text labels as input data sources marked for deployment (see the
Select
dependent variables and predictors topic), or in some special data
analysis scenarios, it is important that you understand how such values
are handled in Statistica Data Miner. For additional details, refer to
Working
with Text Variables and Text Values: Ensuring Consistent Coding (see
also, How to Write .svx Scripts for
Data Miner).
Numeric variables with text values
With Statistica data spreadsheets, you can specify numeric variables (of type integer, double, etc.) and attach text labels to specific values. All analyses will be performed based on the numeric representations, not on the text representations (which are only used to label results). Therefore, when using numeric variables with text labels as categorical predictors (or dependent variables) in input data sources marked for deployment, you must use the same coding (number-label associations) as was used when the analysis nodes (modules) generated the respective deployment information (i.e., the same coding as in the training data). For example, suppose a training data set contained a categorical predictor variable Gender, with Male coded as 1 and Female coded as 2, and you computed a linear model based on that coding. When applying the linear model to a new data set to compute predicted values, the same coding must be used; otherwise, misleading results may be computed.
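The consequence of inconsistent number-label associations can be demonstrated with a tiny, self-contained Python example (generic scikit-learn code, with Gender as the hypothetical predictor):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Training coding: Male = 1, Female = 2.
    gender_train = np.array([[1], [2], [1], [2]])
    y = np.array([10.0, 20.0, 11.0, 19.0])
    model = LinearRegression().fit(gender_train, y)

    model.predict(np.array([[1]]))  # a Male case, correctly coded as 1
    model.predict(np.array([[2]]))  # the same case mis-coded as 2 yields
                                    # a very different, misleading prediction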
Text variables (containing text values only)
may exist in Statistica data spreadsheets, and they commonly occur in
streaming
database connectors. Any module (or spreadsheet function) executing inside a data miner workspace will use a program-generated coding scheme that is applied consistently to all nodes in the same project.
Therefore, when using variables of type text as categorical predictors
(or dependent variable) in data mining projects that generate deployment
information, any input data sources from which you may want to compute
predicted values (deployment) also have to use text variables in the respective
places of the categorical predictor list. For example, if you computed
a linear model based on a list of predictors that included a text variable
Gender (Male, Female), then, when
applying this model (during deployment), the program also expects a variable
of type text, with the text values Male
or Female.
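Outside of Statistica, the same discipline, fixing a coding scheme at training time and reusing it at deployment, can be sketched with scikit-learn's OrdinalEncoder (an illustration of the principle, not of Statistica's internal mechanism):

    from sklearn.preprocessing import OrdinalEncoder

    # The coding of the text values is fixed once, at training time.
    enc = OrdinalEncoder()
    enc.fit([["Male"], ["Female"]])

    train_codes = enc.transform([["Male"], ["Female"], ["Male"]])
    deploy_codes = enc.transform([["Female"]])  # same mapping at deployment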
To summarize these rules: numeric variables with text labels are always treated consistently as numeric variables by all Data Miner nodes and Statistica modules. When using variables of type text, the coding of individual text values (as levels of a categorical predictor variable) is consistent for all nodes inside a particular Data Miner project; note, however, that this coding might be different when you perform interactive analyses using any of the Statistica modules.
Again, for additional detailed information about these issues, refer to
Working
with Text Variables and Text Values: Ensuring Consistent Coding (see
also, How to Write .svx Scripts for
Data Miner).
Reports
Finally, the Data Miner project reveals the
nuggets of information that heretofore lay undetected in the data. Reports
are produced by practically all analytic nodes of Statistica Data Miner: parameter estimates, classification statistics, descriptive statistics and graphics, graphical summaries, etc. These results are placed into the Reporting Documents folder in the workspace. You can use the options available on the Data Miner tab of the Options dialog box to customize the program.
See Statistica
Data Miner Summary, Data
Mining with Statistica Data Miner, and Getting
Started with Statistica Data Miner. See also, Using
Statistica Data Miner with Extremely Large Data Sets, How
to Write .svx Scripts for Data Miner, and Global
Dictionary.