Singular Value Decomposition in Statistica Text Mining and Document Retrieval

The use of singular value decomposition (SVD) for feature extraction and latent semantic indexing in text mining is described in the Statistica Text Mining and Document Retrieval Introductory Overview. This program uses a particularly efficient algorithm for singular value decomposition that can handle even very large input matrices (of word counts and documents).

Assume matrix A represents an m x n word occurrence matrix, where m is the number of input documents (files) and n is the number of words selected for analysis. SVD computes the m x r matrix U and the n x r matrix V, both with orthonormal columns, and the r x r diagonal matrix D of singular values, so that A = UDV', where r is the number of nonzero eigenvalues of A'A (that is, the rank of A).
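The decomposition described above can be illustrated with a minimal NumPy sketch. The small count matrix is hypothetical, and `np.linalg.svd` is a dense, general-purpose routine used here only for illustration; it is not the algorithm used by the program.

```python
import numpy as np

# Hypothetical 4-document x 5-word count matrix (illustrative data only).
A = np.array([
    [2, 0, 1, 0, 0],
    [0, 3, 0, 1, 0],
    [1, 0, 2, 0, 1],
    [0, 1, 0, 2, 0],
], dtype=float)

# Thin SVD: U is m x k, s holds k singular values, Vt is k x n, k = min(m, n).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the numerically nonzero singular values; r is then the rank of A.
tol = s.max() * max(A.shape) * np.finfo(float).eps
r = int(np.sum(s > tol))
U, s, Vt = U[:, :r], s[:r], Vt[:r, :]

D = np.diag(s)   # r x r diagonal matrix of singular values
V = Vt.T         # n x r matrix with orthonormal columns

# A = UDV' should hold up to floating-point rounding.
reconstruction_error = np.linalg.norm(A - U @ D @ V.T)
print(r, reconstruction_error)
```

The squared singular values in `s` are the nonzero eigenvalues of A'A, which is why r counts those eigenvalues.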

Statistica Text Mining and Document Retrieval employs an efficient iterative method for computing SVD in order to handle the matrix A, which is usually very large and sparse. This method produces accurate values for relatively large singular values, but may result in lower accuracy for small singular values. This degraded accuracy for very small singular values is of little practical consequence in text mining, where SVD is used for dimensionality reduction and feature extraction, and relatively small singular values are therefore not of interest. Note also that the maximum number of singular values computed in Statistica Text Mining and Document Retrieval is limited to 82.
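The idea of computing only the largest singular values of a sparse matrix can be sketched with SciPy. The specific iterative method used by the program is not named here, so `scipy.sparse.linalg.svds` serves only as a stand-in; the random sparse matrix and the choice k = 5 are assumptions made for illustration.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# Hypothetical sparse 200-document x 50-word matrix (about 5% nonzero entries).
A_sparse = sparse_random(200, 50, density=0.05, format="csr", random_state=0)

# Iteratively compute only the k largest singular triplets.
k = 5
U_k, s_k, Vt_k = svds(A_sparse, k=k)

# Sort explicitly so the largest singular value comes first.
order = np.argsort(s_k)[::-1]
U_k, s_k, Vt_k = U_k[:, order], s_k[order], Vt_k[order, :]

# Compare against a dense SVD, which is feasible only because this matrix is small.
s_dense = np.linalg.svd(A_sparse.toarray(), compute_uv=False)
max_abs_diff = np.max(np.abs(s_k - s_dense[:k]))
print(s_k, max_abs_diff)
```

Because only k singular triplets are requested, the full decomposition is never formed, which is what makes this kind of approach practical for large document collections.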

Scree plot. To decide how many singular values are useful and informative, and should be retained for subsequent analyses, one typically creates a scree plot displaying the relative sizes of the singular values (the diagonal elements of D). Usually, the number of "informative" dimensions to retain for subsequent analysis is determined by locating the elbow in this plot, to the right of which one presumably finds only the "factorial scree" due to random noise.
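The quantities shown in a scree plot can be computed as in the following sketch. The singular values are hypothetical, and the "largest drop" rule is only one crude stand-in for locating the elbow by eye; it is not a rule used by the program.

```python
import numpy as np

# Hypothetical singular values, sorted from largest to smallest.
s = np.array([12.0, 9.5, 7.0, 2.1, 1.9, 1.8, 1.6])

# Relative sizes: the values that would be plotted against the component number.
relative = s / s.sum()

# Crude elbow heuristic (an assumption): retain everything before the largest drop.
drops = -np.diff(s)
n_retain = int(np.argmax(drops)) + 1

for i, (value, share) in enumerate(zip(s, relative), start=1):
    marker = "retain" if i <= n_retain else "scree"
    print(f"{i:2d}  {value:6.2f}  {share:6.3f}  {marker}")
```

In practice the elbow is usually judged visually, and a heuristic like this one can disagree with that judgment when the singular values decline gradually.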

Word coefficients. The word coefficients reported by the program are computed as the matrix W such that AW = U.
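Since A = UDV' and the columns of V are orthonormal, one matrix satisfying AW = U is W = VD^-1 (inverting only nonzero singular values). The following sketch checks this on a hypothetical count matrix; the choice of k = 2 retained components is an assumption made for illustration.

```python
import numpy as np

# Hypothetical 4-document x 5-word count matrix (illustrative data only).
A = np.array([
    [2, 0, 1, 0, 0],
    [0, 3, 0, 1, 0],
    [1, 0, 2, 0, 1],
    [0, 1, 0, 2, 0],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Retain the k largest singular values (k = 2 is an arbitrary choice).
k = 2
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T

# W = V D^-1: broadcasting divides column j of V_k by the j-th singular value.
W = V_k / s_k

# Multiplying the word counts by W reproduces the document scores U.
max_abs_error = np.max(np.abs(A @ W - U_k))
print(W.shape, max_abs_error)
```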

Document scores. The document score matrix reported on the Text Mining Results dialog box - SVD tab is matrix U from the singular value decomposition.

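Sum of squares of word residuals. The results spreadsheet reporting the word residuals (available from the Text Mining Results dialog box - SVD tab) shows the diagonal values of (A-UDV')'(A-UDV'), where UDV' is computed from the retained singular values. Each diagonal value is thus the sum, over all documents, of the squared differences between the observed values for a word and the values reproduced by the retained dimensions.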
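The diagonal of (A-UDV')'(A-UDV') is simply the column-wise sum of squares of the residual matrix, as the following sketch shows. The count matrix and the number of retained components (k = 2) are hypothetical.

```python
import numpy as np

# Hypothetical 4-document x 5-word count matrix (illustrative data only).
A = np.array([
    [2, 0, 1, 0, 0],
    [0, 3, 0, 1, 0],
    [1, 0, 2, 0, 1],
    [0, 1, 0, 2, 0],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Reconstruct A from the k largest singular values only (k = 2 is an assumption).
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Residual matrix and its per-word (per-column) sum of squares.
R = A - A_k
word_residual_ss = np.sum(R**2, axis=0)

# Equivalent to the diagonal of R'R, without forming the full n x n product.
diag_check = np.diag(R.T @ R)
print(word_residual_ss)
```

Summed over all words, these residuals equal the sum of the squared singular values that were not retained, so words with large residuals are those poorly represented by the retained dimensions.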

Word importance. The results spreadsheet reporting the word importance (available on the Text Mining Results dialog box - SVD tab) shows the relative sizes of the square roots of the diagonal values in VDU'UDV' = VDDV'. These indices can be interpreted as the extent to which the individual words are represented or reproduced by the extracted singular values and, hence, how important the words are for defining the (latent semantic) space extracted by SVD.
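The square roots of the diagonal of VDDV' can be computed row by row from V and the singular values, as sketched below. The count matrix and k = 2 are hypothetical, and scaling by the largest value is only one possible reading of "relative sizes"; the exact scaling used by the program is not stated here.

```python
import numpy as np

# Hypothetical 4-document x 5-word count matrix (illustrative data only).
A = np.array([
    [2, 0, 1, 0, 0],
    [0, 3, 0, 1, 0],
    [1, 0, 2, 0, 1],
    [0, 1, 0, 2, 0],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Retain the k largest singular values (k = 2 is an arbitrary choice).
k = 2
s_k, V_k = s[:k], Vt[:k, :].T

# sqrt(diag(V D D V')): the length of each word's row in V_k scaled by D_k.
importance = np.sqrt(np.sum((V_k * s_k) ** 2, axis=1))

# Assumed scaling: express each index relative to the most important word.
relative_importance = importance / importance.max()
print(relative_importance)
```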