Latent Semantic Indexing

In the context of text mining, the process of latent semantic indexing is concerned with the derivation of underlying dimensions of "meaning" from the words (terms) extracted from a collection of documents.

The most basic result of text mining is an initial indexing of words found in the input documents, and the computation of a frequency table with simple counts enumerating the number of times that each word occurs in each input document. Also, in practice, you can further transform those raw counts to indices that better reflect the (relative) "importance" of words and/or their semantic specificity in the context of the set of input documents (see, for example, inverse document frequencies).
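The counting and weighting steps described above can be sketched as follows. This is a minimal illustration with a made-up toy corpus (the documents, terms, and the particular count-times-idf weighting are assumptions for the example, not STATISTICA's exact formulas):

```python
import math

# Toy corpus: each document is a list of extracted terms (illustrative).
docs = [
    ["gas-mileage", "economy", "economy"],
    ["reliability", "defects"],
    ["gas-mileage", "economy", "reliability", "defects"],
]

# Index the words found in the input documents.
terms = sorted({t for d in docs for t in d})

# Frequency table: counts[i][j] = number of times terms[j] occurs in docs[i].
counts = [[d.count(t) for t in terms] for d in docs]

# Inverse document frequency: terms appearing in fewer documents
# (i.e., more semantically specific terms) receive larger weights.
n_docs = len(docs)
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(terms))]
idf = [math.log(n_docs / df_j) for df_j in df]

# Transformed table: raw count weighted by idf (one common variant).
tfidf = [[counts[i][j] * idf[j] for j in range(len(terms))]
         for i in range(n_docs)]
```

Here every term happens to appear in two of the three documents, so all terms get the same idf weight; in a realistic corpus the weights would differ and would down-weight ubiquitous terms.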

Next, a common analytic approach for interpreting the "meaning" or "semantic space" described by the extracted words, and hence by the analyzed documents, is to map both the words and the documents into a common space computed from the word frequencies or transformed word frequencies (e.g., inverse document frequencies). In general, here is how it works:

Suppose you index a collection of customer reviews of new automobiles (e.g., for different makes and models). You may find that every time a review includes the word "gas-mileage," it also includes the term "economy." Further, whenever a review includes the word "reliability" it also includes the term "defects" (e.g., makes reference to "no defects"). However, there is no consistent pattern regarding the joint use of the terms "economy" and "reliability": some documents include either one, both, or neither. In other words, these four words describe two independent dimensions: "gas-mileage" and "economy" relate to the overall operating cost of the vehicle, while "reliability" and "defects" relate to quality and workmanship.

The idea of latent semantic indexing is to identify such underlying dimensions of "meaning," into which the words and documents can be mapped. As a result, you can identify the underlying (latent) themes described or discussed in the input documents, and also identify the documents that deal mostly with each dimension (e.g., economy, reliability, or both).

In practice (e.g., in STATISTICA Text Mining and Document Retrieval), singular value decomposition is often used to extract the underlying semantic dimensions from the matrix of (transformed) word counts across documents.
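A minimal sketch of this decomposition using NumPy's standard SVD routine is shown below. The matrix mirrors the automobile-review example (counts are illustrative, and the plain-count weighting is an assumption; it does not reproduce STATISTICA's exact preprocessing):

```python
import numpy as np

# Rows = documents, columns = terms:
# ["gas-mileage", "economy", "reliability", "defects"] (illustrative counts).
X = np.array([
    [2, 2, 0, 0],   # review about operating cost
    [1, 1, 0, 0],   # review about operating cost
    [0, 0, 2, 2],   # review about quality/workmanship
    [0, 0, 1, 1],   # review about quality/workmanship
    [1, 1, 1, 1],   # review touching on both themes
], dtype=float)

# Singular value decomposition: X = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep the two largest singular values: a 2-D latent "semantic space".
k = 2
doc_coords = U[:, :k] * s[:k]   # documents mapped into the common space
term_coords = Vt[:k, :].T       # terms mapped into the same space
```

Because "gas-mileage" always co-occurs with "economy" and "reliability" with "defects," the matrix has rank two: the remaining singular values are (numerically) zero, and the two retained dimensions recover the cost and quality themes. Documents and terms now live in one space, so a document's coordinates reveal which latent theme(s) it deals with.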

For more information, see Manning and Schütze (2002) and Miner, Elder, Hill, Nisbet, Delen, and Fast (2012); see also the STATISTICA Text Mining and Document Retrieval Introductory Overview.