Streaming database connector technology (technical overview)

The purpose and advantages of streaming database connections

Summary. A streaming database connector is an advanced database access technology that supports a high-performance, direct interface between remote-server data and the analytic functionality of Statistica products. This technology accesses large databases using a one-step process that does not require creating local copies of the data. Streaming DB connectors significantly increase the performance of Statistica and are particularly well suited for large data mining and exploratory data analysis tasks

The source of streaming database connector performance gains. These performance gains are a result of not having to import the data into a local copy, as well as Statistica's asynchronous and distributed-processing architecture. Specifically, a streaming database connector uses the processing resources (multiple CPUs) of the database servers to execute the query, extract the requested data records, and send them to the Statistica workstation. Finally, Statistica simultaneously processes these records as they arrive.

Writing information back to an external database. With Statistica, you can also write certain information computed by the program back to the original input data file or database and, thus, integrate computed statistics into an existing database or data warehouse. Specifically, the Rapid Deployment of Models module enables you to write computed statistics (predictions, predicted classifications, classification probabilities, residuals) back into the current input data file; this capability to, for example, merge classification probabilities computed by various models into an existing database or data warehouse is extremely useful in the context of data mining applications to deploy models for extremely large data sets. See the Query Options dialog box topic for additional details regarding the specific settings of options to enable Statistica to write results to fields in an external database.

Compatibility with Statistica products

The streaming database connector technology can be used with both desktop and enterprise versions of Statistica products, and it is fully compatible with the Client/Server architecture of Statistica Enterprise Server (the requests can be made over the Web and data processed asynchronously by Statistica Enterprise Server computers connected to the (next-tier) database server computers that will execute the queries). Streaming DB connector is also optimized to seamlessly integrate with Statistica Data Miner, which supports multiple streaming DB connector data input channels.

Architecture and programmability

The streaming database connector technology is implemented around a COM object, which wraps an instance of a Microsoft ActiveX Data Object (ADO) Recordset object and implements a subset of the Spreadsheet COM interface in the Statistica Object Library. This works because all Statistica analyses access the source spreadsheet data via the spreadsheet interface. (Actually the InputSpreadsheet interface, which has a subset of the spreadsheet interface methods. This InputSpreadsheet interface is normally hidden in the Object Browser but can be seen by right-clicking in Object Browser and selecting "Show Hidden Members.") Therefore, to a Statistica analysis, the streaming DB connector looks just like a spreadsheet. Indeed, advanced users of Statistica could wrap an InputSpreadsheet interface around any data source at all, and perform Statistica analyses on it programmatically via the Statistica Object Model.

Behind the scenes, certain steps must be taken by the spreadsheet wrapper object to make analyses work seamlessly. For instance, if an analysis requires the number of cases in a recordset before that information is known, then either a separate count query will be executed synchronously (The analysis must wait until the count query returns before continuing) and the result returned to the analysis, or some arbitrary upper bound on the case count will be returned immediately. This behavior is configurable on the Streaming DB Connector tab of the Statistica Options dialog box. Also, if using a forward only cursor and the analysis must make multiple passes through the data or access the data in random order, then any request for a previous case (row) forces the streaming DB connector to re-query the database and advance the cursor forward to the requested case, since the cursor cannot be scrolled backward. The analysis would simply wait until this process is completed and the requested data were provided to it.

In the streaming DB connector Type Library, there are two main interfaces. DBTable provides programmatic access to the streaming DB connector Document, much as the Macro, Graph, and Spreadsheet interfaces provide access to Statistica Macros, Graphs, and Spreadsheets. In addition to the standard document methods and properties (Visible, Activate, Close, etc.) it provides access to all streaming DB connector specific options (cursor type, location, query string, etc.) Its read-only property Spreadsheet returns the Spreadsheet wrapper around the ADO Recordset.

The second interface is DBSpreadsheet. This interface is used internally by the streaming DB connector to create the Spreadsheet wrapper object, and could also be used by users writing their own macros or programs, although in most cases the DBTable interface is sufficient and will itself use a DBSpreadsheet object. This interface has two methods, Open and CreateNew. Open executes the supplied query and opens an ADO Recordset. It creates a Spreadsheet wrapper object and attaches the ADO Recordset to it, and returns this Spreadsheet object. CreateNew creates a Spreadsheet wrapper object that is not attached to any Recordset and therefore is not usable until you call its SetRecordset method to attach an ADO Recordset object of your own creation.

See also, the Cursor definition, Statistica Query - Streaming of Data on Remote Servers, and Streaming Database Connector FAQs.