Inverse Document Frequency

The inverse document frequency is a transformation of raw word frequency counts used in text mining. It expresses at once how frequently specific terms or words are used in a collection of documents and how semantically specific they are, i.e., the extent to which particular words are used only in specific documents in the collection.

Suppose you index a collection of text documents and compute the word frequencies (wf), i.e., the number of times that each word or term is used in each document. An issue that you may want to consider more carefully, and reflect in the indices used in further analyses, is the relative document frequency (df) of different words. For example, a term such as "guess" may occur frequently in all documents, while another term such as "software" may occur in only a few. The reason is that one might make "guesses" in various contexts, regardless of the specific topic, while "software" is a more semantically focused term that is likely to occur only in documents that deal with computer software. A common and very useful transformation that reflects both the semantic specificity of words (document frequencies) and the overall frequencies of their occurrences (word frequencies) is the so-called inverse document frequency (for word i and document j):

\[
\mathrm{idf}(i,j) =
\begin{cases}
0 & \text{if } wf_{ij} = 0 \\
\left(1 + \log(wf_{ij})\right)\,\log\dfrac{N}{df_i} & \text{if } wf_{ij} \geq 1
\end{cases}
\]

In this formula (see also formula 15.5 in Manning and Schütze, 2002), N is the total number of documents, and df_i is the document frequency for the i-th word (the number of documents that include this word). The formula thus combines a dampening of the simple word frequencies via a log function with a weighting factor that evaluates to 0 if the word occurs in all documents (log(N/N) = log(1) = 0) and to its maximum value when the word occurs in only a single document (log(N/1) = log(N)). The resulting indices reflect both the relative frequencies of occurrence of words and their semantic specificities over the documents included in the analysis.
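
The transformation can be illustrated with a minimal Python sketch. It assumes the weight for word i in document j is (1 + log(wf_ij)) * log(N / df_i) when the word occurs in the document and 0 otherwise, with natural logarithms; the function name, the whitespace tokenization, and the three sample documents are hypothetical choices made only for illustration.

```python
import math
from collections import Counter


def inverse_document_frequencies(documents):
    """Return one dict per document, mapping each word to its weight.

    Assumed weight: (1 + log(wf_ij)) * log(N / df_i) for words that occur
    in the document. Words that do not occur are simply left out of the
    dict, which corresponds to a weight of 0.
    """
    # Word frequencies (wf): how often each word occurs in each document.
    # Splitting on whitespace is a simplification, not real tokenization.
    word_freqs = [Counter(doc.lower().split()) for doc in documents]
    n_docs = len(documents)

    # Document frequencies (df): number of documents containing each word.
    doc_freqs = Counter()
    for wf in word_freqs:
        doc_freqs.update(wf.keys())

    weights = []
    for wf in word_freqs:
        weights.append({
            word: (1 + math.log(count)) * math.log(n_docs / doc_freqs[word])
            for word, count in wf.items()
        })
    return weights


# Hypothetical sample collection: "guess" occurs in every document,
# "software" occurs in only one.
docs = [
    "guess the software software bug",
    "guess the weather",
    "guess the score",
]
weights = inverse_document_frequencies(docs)
```

In this sample, "guess" receives a weight of 0 in every document because it occurs in all three, while "software" receives a positive weight in the first document only, where the log function dampens the effect of its repeated occurrence.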

For more information, see Manning and Schütze (2002) and Miner, Elder, Hill, Nisbet, Delen, and Fast (2012); see also the STATISTICA Text Mining and Document Retrieval Introductory Overview.