Text Mining

While data mining is typically concerned with the detection of patterns in numeric data, text mining generally refers to the process of automatically extracting "meaning" from a collection of documents. This is typically accomplished by algorithms that enumerate the words, terms, and structure of the documents. Once the documents and their contents have been enumerated and "numericized" in that manner, they can be submitted to further numerical analyses to derive concepts, significant terms, etc.
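The enumeration step described above can be illustrated with a minimal sketch that turns a few documents into a table of word counts (often called a document-term matrix). The function names, the simple regular-expression tokenizer, and the sample sentences are assumptions made for illustration only; practical systems usually add steps such as stemming and stop-word removal.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def document_term_matrix(documents):
    """Return (vocabulary, matrix) where matrix[i][j] is the number of
    times vocabulary[j] occurs in documents[i]."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    # The vocabulary is the sorted set of all words seen in any document.
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix

# Hypothetical sample documents (e.g., warranty claim narratives).
docs = [
    "The engine failed after the warranty expired.",
    "Engine noise reported; warranty claim filed.",
]
vocabulary, matrix = document_term_matrix(docs)

for doc_id, row in enumerate(matrix):
    print(doc_id, dict(zip(vocabulary, row)))
```

Each row of the resulting matrix is a purely numeric description of one document, which is the form that further numerical analyses can work with.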

Typical applications for text mining include surveys or data analysis projects where some responses are unstructured and textual (e.g., email messages, open-ended comments on a questionnaire or suggestion form, patients' or physicians' descriptions of symptoms, narratives accompanying warranty claims, etc.), and where such information needs to be incorporated into the overall analyses. It is also common to use these techniques to derive predictive models that automatically classify text, e.g., to route emails to the most appropriate party for further processing, or to distinguish between "junk" email and important messages so that the former can be screened out.
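As one possible illustration of such a predictive model, the sketch below trains a small multinomial naive Bayes classifier on word counts to separate "junk" from important messages. Naive Bayes is only one of many suitable techniques and is chosen here for brevity; the function names, labels, and training messages are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def train(labeled_docs):
    """Count word frequencies per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocabulary = {w for c in word_counts.values() for w in c}
    return word_counts, doc_counts, vocabulary

def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        # Start from the prior: the share of training documents with this label.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # words never seen in training carry no evidence
            # Add-one (Laplace) smoothing avoids taking log(0).
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training messages.
training = [
    ("Win a free prize now", "junk"),
    ("Free offer, click now", "junk"),
    ("Meeting agenda for Monday", "important"),
    ("Please review the warranty claim", "important"),
]
model = train(training)
label = classify("Claim your free prize", model)
print(label)
```

With only four training messages the result is a toy; a usable filter would need a much larger labeled collection and an evaluation on messages held out from training.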

A detailed discussion of text mining methods, as well as the history of different approaches, can be found in Manning and Schütze (2002) and Miner, Elder, Hill, Nisbet, Delen, and Fast (2012); see also the STATISTICA Text Mining and Document Retrieval Introductory Overview for additional details.