Using Dell Statistica
Data Miner with Extremely Large Data Sets
The entire Statistica family of products, and
Statistica Data Miner in particular, is specifically optimized to
efficiently process extremely large data sets, with millions of
observations (records) and millions of variables (fields).
Processing databases that are larger than the local storage device
Statistica Data Miner (and optionally other
Statistica products) can process data in (remote) databases "in-place"
via its highly optimized Streaming
Database Connector technology, which combines the processing resources
of the database server and the local computer: (a) the queries are executed
using the database server's CPU while, simultaneously, (b) the fetched
records are processed "on-the-fly" on the local machine using the local
(client) computer's CPU. In this way, databases that are too large to fit
on the local machine can still be processed, and significant performance
gains are achieved because the data do not have to be imported to the local
device before they can be analyzed. Practically all common database formats
are supported, and powerful tools are provided for defining the database
connection (query).
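To illustrate the general idea of streaming, in-place processing (this is a
conceptual sketch only, and does not use Statistica's Streaming Database
Connector or its API), the database engine executes the query while the
client consumes the result set in batches and keeps only running summaries,
so the full table is never imported locally. An in-memory SQLite table
stands in here for a remote database server:

```python
# Conceptual sketch of "streaming" processing: the database engine runs the
# query, and the client consumes the result set in batches, updating running
# statistics without materializing the full table locally. An in-memory
# SQLite database stands in for a remote server; with a real server the same
# pattern applies to any DB-API connection (e.g., via ODBC).
import sqlite3
import random

# --- Stand-in for a remote database (hypothetical table and column names) ---
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (amount REAL)")
con.executemany("INSERT INTO sales VALUES (?)",
                [(random.gauss(100.0, 15.0),) for _ in range(100_000)])
con.commit()

# --- Client-side streaming pass: fetch in batches, keep only running sums ---
BATCH = 10_000
cur = con.execute("SELECT amount FROM sales")   # query runs on the "server"
n = 0
total = 0.0
total_sq = 0.0
while True:
    rows = cur.fetchmany(BATCH)                 # only one batch in memory
    if not rows:
        break
    for (amount,) in rows:
        n += 1
        total += amount
        total_sq += amount * amount

mean = total / n
variance = total_sq / n - mean * mean
print(f"records={n}  mean={mean:.2f}  variance={variance:.2f}")
```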
Processing databases with extremely large numbers of variables (fields):
The unique Feature Selection and Variable Screening Facilities
When the number of variables (fields) in the input
data file is extremely large, Statistica Data Miner can automatically
select subsets of variables from among even millions of candidates
for predictive data mining. The extremely fast and efficient algorithm
selects the variables (features) that are likely to be the most relevant
predictors in the current data set, without introducing biases into
subsequent model building. See Feature
Selection and Variable Screening for details.
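The referenced topic describes the actual screening algorithm; purely as a
rough illustration of the general idea (not Statistica's method), a simple
univariate screen ranks each candidate predictor by the strength of its
marginal relationship to the outcome and keeps only the top k candidates
for subsequent modeling:

```python
# Illustrative univariate screening (not Statistica's algorithm): rank each
# candidate predictor by the strength of its marginal relationship to the
# outcome and keep only the top k for subsequent model building.
import numpy as np

rng = np.random.default_rng(0)
n_cases, n_vars = 5_000, 2_000            # many candidate predictors
X = rng.normal(size=(n_cases, n_vars))
# Synthetic outcome driven by a handful of the candidates plus noise
y = 2.0 * X[:, 10] - 1.5 * X[:, 500] + 0.8 * X[:, 1250] + rng.normal(size=n_cases)

# Absolute Pearson correlation of every candidate with the outcome,
# computed column-by-column so only one variable is examined at a time.
scores = np.empty(n_vars)
y_c = y - y.mean()
for j in range(n_vars):
    x_c = X[:, j] - X[:, j].mean()
    scores[j] = abs(x_c @ y_c) / (np.linalg.norm(x_c) * np.linalg.norm(y_c))

k = 10
selected = np.argsort(scores)[::-1][:k]   # indices of the k best candidates
print("selected predictors:", sorted(selected.tolist()))
```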
Processing data files with extremely large numbers of cases (records):
Flexible and efficient random sampling
Statistica products (including Statistica
Data Miner) can process data files with practically unlimited numbers
of cases (records), and Statistica's data access procedures are highly
optimized. However, when the number of records is extremely large, including
all of them in the analyses is entirely unnecessary, time-consuming, and often
impractical or impossible (in extreme cases, it could take hours merely to
read all of the records). In order to speed up the analytic process,
Statistica Data Miner includes sophisticated tools for drawing representative,
perfectly random
samples from huge data sets (databases). You can quickly extract simple
or systematic random samples of appropriate sizes, with or without replacement,
from data sets with many millions of records for further analyses with
sophisticated modeling tools that may require multiple passes through
the data (Statistica Automated Neural Networks (SANN),
Generalized Linear Models, etc.). The
random sub-sampling is based on Statistica's
validated random
number generator. Note that Statistica is one of only a few commercially
available software products that have passed the most advanced and most
widely recognized tests for randomness (the DIEHARD
suite of tests).
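As a rough illustration of the sampling schemes described above (this sketch
is not Statistica's implementation and does not use its validated random
number generator), a simple random sample without replacement can be drawn
from a data stream in a single pass with reservoir sampling, and a systematic
sample can be taken as every k-th record after a random start:

```python
# Illustrative sampling schemes (not Statistica's implementation or RNG):
# a reservoir sample yields a simple random sample without replacement from
# a stream of unknown length in one pass; a systematic sample takes every
# k-th record after a random starting offset.
import random


def reservoir_sample(records, sample_size, rng=random):
    """Simple random sample (without replacement) in a single pass."""
    reservoir = []
    for i, rec in enumerate(records):
        if i < sample_size:
            reservoir.append(rec)
        else:
            j = rng.randint(0, i)      # keep new record with decreasing probability
            if j < sample_size:
                reservoir[j] = rec
    return reservoir


def systematic_sample(records, step, rng=random):
    """Every step-th record, starting at a random offset."""
    start = rng.randrange(step)
    return [rec for i, rec in enumerate(records)
            if i >= start and (i - start) % step == 0]


if __name__ == "__main__":
    stream = range(1_000_000)                        # stands in for millions of records
    print(len(reservoir_sample(stream, 10_000)))     # -> 10000
    print(len(systematic_sample(stream, 100)))       # -> 10000
```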
See also, Data Mining
Definition, Data Mining
with Statistica Data Miner, Structure
and User Interface of Statistica Data Miner, Statistica
Data Miner Summary, and Getting
Started with Statistica Data Miner.