Quality Control and Process Optimization Introductory Overview

The revolutionary changes in technology over the past 30 years have dramatically affected the engineering of manufacturing and service delivery processes and the monitoring of those processes to improve and verify the quality of final products. Many, if not most, manufacturing processes are entirely automated and automatically monitored, with sensors and measurement devices yielding a large amount of information relevant to final product quality and overall process yield. The purpose of STATISTICA Process Optimization is to provide tools and methods for analyzing the information generated by the production or service delivery process, and to identify trends, patterns, and cause-effect relationships that can be exploited to further improve the quality and yield of the processes under investigation.

Data mining. The STATISTICA Data Miner topics provide a comprehensive overview of many of the issues and solutions for traditional data mining applications. You may want to review these topics for a general introduction to the various techniques and the architecture of STATISTICA Data Miner.

How Data Mining Can Help in Quality Improvement, Yield Optimization, and Root Cause Analysis

Data mining is an analytic process designed to explore typically large amounts of data in search of consistent patterns and/or systematic relationships among variables. STATISTICA Data Miner contains a variety of analytic tools to address the types of problems to which data mining techniques are usually applied. In general, the goal of data mining projects is to identify predictive models for the data so that predictions can be made for new observations where the outcome variable(s) of interest have not yet been observed, or so that the predictor variables of interest can be studied and adjusted to obtain the optimum outcome(s).

Data mining techniques are popular mostly in the areas of customer relationship management (CRM), marketing, risk management, and fraud detection (e.g., see Rud, 2001, for various applications of this kind). However, the general tools that have been developed for these purposes are applicable to any large data set where one wants to predict continuous or categorical outcome variables, or "understand" the mechanisms responsible for particular outcomes. Hence, the large number of powerful tools available in STATISTICA Data Miner can provide tremendously useful insights into the complex mechanisms involved in manufacturing or services delivery to improve the quality and yield of the final products. Below are brief descriptions of typical types or "classes" of problems that can be addressed with data mining techniques.

Predictive classification. Suppose the outcome of a process is categorical in nature. In the simplest case, imagine that a complex production process culminates in a simple designation of the final product as "acceptable" vs. "not acceptable."  In the terminology of quality control (see STATISTICA Quality Control Charts), the outcome variable of interest would be an attribute. If some (or a large number of) predictor measurements are available that might potentially be related to those outcomes, then data mining techniques can be applied to find the particular variables that have the greatest impact on the final quality of the product, and to build a model for how those predictors affect final quality (the designation of a product as acceptable vs. not acceptable). This knowledge can then be used to improve the process to maximize the percentage of products that are of acceptable quality. STATISTICA Process Optimization contains a large number of advanced algorithms that will automatically detect the important predictors of categorical outcomes of interest, and build models that can be used to optimize the process to yield a larger proportion of desirable outcomes.
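To make the idea concrete, the simplest possible version of this search — finding the single measurement and cut-off that best separate acceptable from unacceptable products (a one-level decision tree, or "decision stump") — can be sketched in a few lines of Python. This is only an illustration, not STATISTICA functionality; all variable names and data values below are hypothetical.

```python
def best_stump(predictors, labels):
    """Return (name, threshold, accuracy) of the best single-variable split."""
    best = (None, None, 0.0)
    n = len(labels)
    for name, values in predictors.items():
        for threshold in sorted(set(values)):
            # Predict "acceptable" when the measurement is <= threshold.
            correct = sum(
                (v <= threshold) == (lab == "acceptable")
                for v, lab in zip(values, labels)
            )
            acc = max(correct, n - correct) / n  # allow either direction
            if acc > best[2]:
                best = (name, threshold, acc)
    return best

# Hypothetical process data: temperature clearly drives the outcome.
predictors = {
    "temperature": [210, 215, 212, 240, 238, 242, 211, 239],
    "pressure":    [5.0, 5.1, 4.9, 5.0, 5.2, 4.8, 5.1, 5.0],
}
labels = ["acceptable", "acceptable", "acceptable",
          "not acceptable", "not acceptable", "not acceptable",
          "acceptable", "not acceptable"]

name, threshold, acc = best_stump(predictors, labels)
print(name, threshold, acc)   # temperature 215 1.0
```

Real data mining tools go far beyond a single split, of course, but the principle — scoring candidate predictors by how well they separate the outcome classes — is the same.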

Prediction of continuous outcomes (e.g., yield). Suppose the outcome of a process is continuous in nature. Probably the most common type of outcome variable of this type would be process yield, i.e., the amount, total quantity, or value of the final product(s). As described in the previous paragraph, an understanding of, or valid predictive model for, final yield would be of obvious utility because it could enable engineers to optimize the final return on investment (ROI) in the machines and personnel necessary to establish and operate the production or service delivery process under investigation.
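For a continuous outcome such as yield, the most elementary predictive model is a least-squares line relating one process measurement to the yield. The sketch below is a hypothetical illustration (not STATISTICA code), with deliberately clean data so the fitted slope and intercept are easy to verify by hand.

```python
def fit_line(x, y):
    """Return slope and intercept of the least-squares line y = a*x + b."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    a = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    b = my - a * mx
    return a, b

# Hypothetical measurements: yield rises 2 points per 10 degrees.
temperature = [200.0, 210.0, 220.0, 230.0, 240.0]
yield_pct   = [81.0,  83.0,  85.0,  87.0,  89.0]

a, b = fit_line(temperature, yield_pct)
print(a, b)                     # slope 0.2, intercept 41.0
predicted = a * 225.0 + b       # predict yield at an unseen temperature
print(predicted)                # 86.0
```

Once such a model is validated, it can be inverted: instead of predicting yield from the settings, the settings are adjusted to maximize predicted yield.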

Identifying clusters of similar objects. Another typical application for general data mining tools is sometimes referred to as "unsupervised learning."  This term denotes that the learning or model building process is not guided by a particular continuous or categorical outcome variable, but instead is aimed at uncovering some kind of structure in the data. For example, suppose a process yields a complex product or outcome with numerous aspects that can be measured to summarize the final quality. It can be informative to identify clusters of typical patterns (of measurements, e.g., defects) that are found, which could then be related back to some continuous or categorical predictor factors in the production process. One might also be interested in similarities between variables or measurements, to determine common underlying factors that are responsible for the values observed in those variables. STATISTICA and STATISTICA Data Miner offer numerous methods for clustering of observations or variables.
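A minimal sketch of one such clustering method, k-means, is shown below on two hypothetical defect measurements per item (real applications would involve many more variables). This is an illustration of the algorithm, not STATISTICA code; the centers are seeded deterministically from the first and last points for reproducibility, which suffices for k = 2 here.

```python
def kmeans(points, k, iters=20):
    """Cluster 2-D points into k groups; return the final cluster centers."""
    centers = [points[0], points[-1]][:k]   # deterministic seed (k = 2 demo)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest current center.
            nearest = min(range(k),
                          key=lambda c: (p[0] - centers[c][0]) ** 2 +
                                        (p[1] - centers[c][1]) ** 2)
            groups[nearest].append(p)
        # Move each center to the mean of its assigned points.
        centers = [(sum(p[0] for p in g) / len(g),
                    sum(p[1] for p in g) / len(g)) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers

# Two obvious clusters of hypothetical defect patterns.
points = [(1.0, 1.1), (0.9, 1.0), (1.1, 0.9),
          (8.0, 8.1), (7.9, 8.0), (8.1, 7.9)]
centers = sorted(kmeans(points, 2))
print(centers)   # centers near (1, 1) and (8, 8)
```

In a process optimization setting, the two recovered centers might correspond to two distinct defect patterns, each traceable to a different root cause.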

Special Issues in Process Optimization

So far, this introduction has centered mostly on the common data analysis problems that occur in manufacturing and service delivery and other domains more typically addressed with data mining methods. There are, however, some differences and issues that are more commonly found in production environments as compared to other domains such as risk analysis, fraud detection, customer relationship management, etc.

Large numbers of predictors and interactions: Feature selection and root cause analysis. Many types of manufacturing processes that only a few years ago were extremely labor intensive are now entirely automated. For example, the manufacture of silicon chips is a highly automated process involving detailed and sophisticated measurements at each step. As a result, a very large amount of information is collected, sometimes thousands of variables are measured for each wafer that is produced, and the challenge is to quickly and automatically determine the variables (measurements) that are related to final outcomes (e.g., yield) or problems. In other words, the goal is to quickly select the "features" of the production process that represent the key "roots" of the final quality and quality problems. The methods used for these purposes are often referred to as feature selection and root cause analysis.

STATISTICA Process Optimization contains several designated tools for automatic feature selection (from even hundreds of thousands of potential predictors) and root cause analysis that will not only determine the individual factors that cause particular observed outcomes, but will also check for possible interactions among factors, and automatically identify the best types of models that can account for the observed relationships between the causes and effects of quality problems. See also, Feature Selection and Root Cause Analysis and Feature selection with interaction effects for additional details.
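The core of the simplest feature selection strategy — ranking candidate predictors by the strength of their association with the outcome — can be sketched as follows. This illustration (not STATISTICA code, and with hypothetical variable names and data) uses the absolute Pearson correlation; dedicated tools also screen for nonlinear links and interactions, as noted above.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical wafer data: only etch_time actually tracks yield.
yield_pct = [80.0, 82.0, 84.0, 86.0, 88.0, 90.0]
candidates = {
    "etch_time":   [10.0, 11.0, 12.0, 13.0, 14.0, 15.0],        # tracks yield
    "gas_flow":    [5.0, 4.0, 5.0, 4.0, 5.0, 4.0],              # unrelated
    "temperature": [300.0, 300.5, 301.0, 300.2, 300.8, 300.1],  # noise
}

ranked = sorted(candidates,
                key=lambda k: -abs(pearson(candidates[k], yield_pct)))
print(ranked)   # etch_time ranked first
```

With thousands of measured variables per wafer, such a screening pass narrows the field to a short list of candidate root causes that merit detailed modeling.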

Highly nonlinear, complex, and changing models. A particular challenge of data mining in automated production or complex service delivery environments is that the relationships between the predictors (measurements of factors that potentially affect the final quality or yield) and the final outcome are often nonlinear, or even non-monotone. For example, it is the exception rather than the rule when simple relationships emerge such as "the higher the temperature of machine X, the better the quality of product Y." Instead, the mechanisms and, hence, the best models that link the root causes of quality or yield to the final outcome are often highly nonlinear and difficult to describe with standard statistical models (such as linear regression, logistic regression, etc.). In addition, the relationships that emerge are often specific to the particular production process in question, and different plants following the same manufacturing process may identify entirely different problems, and different root causes for those problems.

STATISTICA Process Optimization contains several designated automatic tools that can scan a large number of measurements (features) for general (linear, monotone, nonlinear, and non-monotone) relationships, as well as tools to automatically scan different classes of predictive models [e.g., linear models, neural networks, tree-based models, Boosted Trees, and Multivariate Adaptive Regression Splines (MARSplines)] to identify the types of analyses that are most likely to identify the specific relationships between the chosen predictors and the outcome variable of interest.
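The logic of such model scanning — score several candidate model classes on the same data and keep the best — can be illustrated with two deliberately tiny "model classes" (a least-squares line and a one-nearest-neighbor predictor) compared by leave-one-out error on a hypothetical non-monotone, U-shaped relation. A real tool would scan neural networks, trees, MARSplines, and so on, with proper cross-validation; this sketch is not STATISTICA code.

```python
def loo_error(x, y, predict):
    """Mean squared leave-one-out error of a model-fitting function."""
    err = 0.0
    for i in range(len(x)):
        xs = x[:i] + x[i + 1:]
        ys = y[:i] + y[i + 1:]
        err += (predict(xs, ys, x[i]) - y[i]) ** 2
    return err / len(x)

def linear_predict(xs, ys, x0):
    """Fit a least-squares line to (xs, ys) and evaluate it at x0."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / \
            sum((a - mx) ** 2 for a in xs)
    return slope * (x0 - mx) + my

def nn_predict(xs, ys, x0):
    """Predict with the single nearest neighbor in x."""
    i = min(range(len(xs)), key=lambda j: abs(xs[j] - x0))
    return ys[i]

# Hypothetical non-monotone relation: defect rate is lowest mid-range.
x = [float(v) for v in range(11)]
y = [(v - 5.0) ** 2 for v in x]

scores = {"linear": loo_error(x, y, linear_predict),
          "nearest": loo_error(x, y, nn_predict)}
best = min(scores, key=scores.get)
print(best)   # the flexible model wins on U-shaped data
```

On this U-shaped relation the linear model fails badly while the flexible model tracks it, which is exactly why scanning multiple model classes matters when relationships may be non-monotone.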

Predictive quality control: Projecting measured quality forward in time. STATISTICA Process Optimization also contains specifically designed tools to identify recurrent patterns and autocorrelations in the quality or yield over time, i.e., over successively measured items. Interest in these types of techniques has only recently emerged as traditional time series methods are introduced to manufacturing and process control (see, for example, Firmin, 2002, and his notion of "the fab as a time machine"). STATISTICA Process Optimization does not (solely) rely on traditional time series modeling methods (which are also available and described in detail in the Time Series Overviews), but can instead apply more general neural networks-based techniques adapted to the time domain.
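The starting point for projecting quality forward in time is the autocorrelation of the quality characteristic over successive samples. The sketch below (an illustration with hypothetical sample means, not STATISTICA code) estimates the autocorrelation at several lags for a series with a built-in period of 4; the spike at lag 4 is the signature a predictive model would exploit.

```python
def autocorr(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    m = sum(series) / n
    num = sum((series[t] - m) * (series[t - lag] - m) for t in range(lag, n))
    den = sum((v - m) ** 2 for v in series)
    return num / den

# Hypothetical sample means with a repeating pattern of period 4.
samples = [10, 12, 14, 12] * 6

acf = {lag: round(autocorr(samples, lag), 2) for lag in (1, 2, 3, 4)}
print(acf)   # strong positive autocorrelation at lag 4
```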

In standard quality control charting, the engineer typically evaluates a chart or set of charts that shows the quality characteristic of interest over successively drawn samples from the production process.

STATISTICA Process Optimization provides tools to automatically analyze the sequence of measurements (and ranges) over successive samples for one or more variables, and to identify autocorrelations, cross-lagged correlations, and recurrent patterns or periodicity (using spectral or Fourier analysis).
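The spectral approach mentioned above can be illustrated with a discrete Fourier transform: the frequency with the largest magnitude reveals the dominant cycle length in the measurements. The series below is hypothetical (a clean sine wave with period 8), and the code is an illustration of the technique rather than STATISTICA functionality.

```python
import cmath
import math

def dft_magnitudes(x):
    """Magnitudes of the DFT of x, for frequencies 0 .. n/2."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]

# Hypothetical sample means oscillating with a period of 8 samples.
n = 64
series = [10.0 + 2.0 * math.sin(2 * math.pi * t / 8) for t in range(n)]

mags = dft_magnitudes(series)
k = max(range(1, len(mags)), key=mags.__getitem__)  # skip k=0 (the mean)
period = n / k
print(k, period)   # dominant frequency index 8 -> period of 8 samples
```

Once such periodicity is identified, it can be fed into a forecasting model — or traced back to a physical cause, such as a tool that drifts on an 8-sample maintenance cycle.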

For example, consider the following chart created with STATISTICA Process Optimization:

It is apparent that there are clear "waves" or periodicity over successive samples, which are well represented by the model (the smoother line); moreover, the prediction or forecast at the point of the last observation is clearly for a further downward trend in variable Max. Temperature in samples that have yet to be drawn. STATISTICA Process Optimization will automatically review several neural network architectures to find the best predictive model for one or more quality characteristics. Technically, the program will perform a multivariate time series analysis using different neural network architectures, and taking into account (when applicable) not only the measurements themselves (e.g., sample means) but also their ranges (e.g., values of the range chart; see also Quality Control Charts for details).

Other Features of Process Optimization

In addition to tools for feature selection, automatic model selection and root cause analysis, and predictive quality control charting, STATISTICA Process Optimization also provides enhanced tools for quality control charting of multiple variables (see also, Quality Control Charts for a review and description of standard quality control charting procedures).  Specifically, the program offers options for generating multiple charts for lists of variables, as well as summary charts for multiple variables, called multiple stream charts or group charts.

Multiple stream or group charts are a convenient way to summarize multiple process measurements in a single chart. The chart is constructed as follows: Suppose you have 4 machines producing identical products. Hence, there are 4 different process streams that, if the overall process is in control, should generate identical measurements, apart from some small random error. As consecutive samples are taken from each stream, the program computes the minimum and maximum values of the quality characteristic of interest (e.g., the mean, range, or defect count) across the streams. These minima and maxima over consecutive samples are then plotted as a multiple line chart (one line for the sample minima, the other for the sample maxima).
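The construction just described reduces to a min/max computation per sample. The sketch below (hypothetical readings, not STATISTICA code) computes the two plotted lines and also records which stream produced each maximum, since a stream that is persistently the maximum is the classic out-of-control signal discussed next.

```python
# One row per consecutive sample; columns are streams (machines) 1-4.
samples = [
    [10.1,  9.9, 10.0, 10.2],
    [10.0, 10.1,  9.8, 10.1],
    [ 9.9, 10.2, 10.1, 10.0],
]

# The two lines plotted on the group chart:
minima = [min(row) for row in samples]
maxima = [max(row) for row in samples]

# Which stream produced each maximum (1-based stream numbers)?
max_stream = [row.index(max(row)) + 1 for row in samples]

print(minima)      # [9.9, 9.8, 9.9]
print(maxima)      # [10.2, 10.1, 10.2]
print(max_stream)  # [4, 2, 2]
```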

Now suppose that one of the "streams" (e.g., machines) is systematically producing items yielding measurements that are larger than those produced by any other stream. This situation is depicted in the following illustration:

Evidently, process stream 1 generated 10 successive sample measurements that were larger than any of those obtained from the other process streams. This indicates that "something is wrong": if all process streams produced items and sample means that are identical (plus or minus some small random error), then a run of 10 successive sample measurements from the same machine or stream that are all larger than those from the other streams would be extremely unlikely. Hence, process stream 1 (and the process overall) is out of control.
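The probability behind this rule is easy to work out. If the four streams are identical and in control (and ties can be ignored for continuous measurements), each stream is equally likely to produce the largest measurement in any sample, so a run of 10 maxima from one particular stream has probability (1/4)^10, and from some stream 4 × (1/4)^10 (the four events are mutually exclusive):

```python
k = 4        # number of streams
run = 10     # length of the observed run of maxima

# Probability that one particular stream is the maximum 10 times in a row,
# assuming identical in-control streams with no ties.
p_particular = (1 / k) ** run
# Probability that *some* stream does so (the 4 events are disjoint).
p_any = k * p_particular

print(p_particular)   # about 9.5e-07
print(p_any)          # about 3.8e-06
```

At roughly 4 chances in a million, such a run is far too improbable to attribute to chance, which is why it signals an out-of-control stream.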

This type of chart can be extremely useful in cases where the engineer wants to monitor multiple processes simultaneously in a single chart. Note that this chart can be generated by parts as well to yield a short-run multiple stream (group) chart.