Text Mining & TF-IDF

TF-IDF stands for Term Frequency-Inverse Document Frequency.

Information embedded in a document is used to classify that document.

TF-IDF assigns each word a weight that signifies its importance to a document within a corpus.

Term Frequency refers to the frequency with which a term appears in a document.

\[TF(t, d) = \frac{\text{number of times word } t \text{ occurs in document } d}{\text{number of words in document } d}\]

The relative frequency of word occurrence will determine the importance of the word within that document.
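
As a minimal sketch of the definition above, term frequency can be computed in a few lines of Python. The function name `term_frequency`, the whitespace tokenization, and the sample sentence are assumptions made for illustration only.

```python
from collections import Counter


def term_frequency(term, document_tokens):
    """Relative frequency of `term` in a tokenized document (hypothetical helper)."""
    if not document_tokens:
        # An empty document contains no terms; avoid dividing by zero.
        return 0.0
    counts = Counter(document_tokens)
    return counts[term] / len(document_tokens)


# Assumed toy document, split on whitespace.
tokens = "the cat sat on the mat".split()

tf_the = term_frequency("the", tokens)  # 2 occurrences out of 6 words
tf_cat = term_frequency("cat", tokens)  # 1 occurrence out of 6 words
```

Dividing by the document length keeps long documents from dominating simply because they contain more words.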

Document Frequency, DF(t, D), is defined as the number of documents within the corpus D that contain term t.

Inverse Document Frequency is computed as:

\[IDF(t, D) = \log \frac{|D| + 1}{DF(t, D) + 1}\]

As a term appears in more documents, its IDF decreases. A term that appears in every document has an IDF of log(1) = 0.
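
The smoothed IDF formula can be sketched the same way. The helper names, the three-document toy corpus, and the use of the natural logarithm are assumptions here; the notes do not state which log base is intended.

```python
import math


def document_frequency(term, corpus):
    """Number of documents in `corpus` that contain `term` (hypothetical helper)."""
    return sum(1 for doc in corpus if term in doc)


def inverse_document_frequency(term, corpus):
    """IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1)), natural log assumed."""
    df = document_frequency(term, corpus)
    return math.log((len(corpus) + 1) / (df + 1))


# Assumed toy corpus: each document is a list of tokens.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]

idf_the = inverse_document_frequency("the", corpus)    # in all 3 documents
idf_cat = inverse_document_frequency("cat", corpus)    # in 2 documents
idf_bird = inverse_document_frequency("bird", corpus)  # in 1 document
```

The `+ 1` in the numerator and denominator avoids division by zero for a term that appears in no document.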

The TF-IDF weight is the product of the two measures:

\[TFIDF(t, d, D) = TF(t, d) \cdot IDF(t, D)\]
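
Putting the two measures together, the following is one possible sketch of TF-IDF weighting over a small corpus. The function name `tf_idf`, the toy corpus, and the natural logarithm are assumptions, not a reference implementation.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {term: weight} dict per tokenized document (hypothetical helper)."""
    n_docs = len(corpus)

    # Document frequency: count each term once per document it appears in.
    df = Counter()
    for doc in corpus:
        df.update(set(doc))

    weights = []
    for doc in corpus:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            # TF(t, d) * IDF(t, D), using the smoothed IDF from the notes.
            term: (count / total) * math.log((n_docs + 1) / (df[term] + 1))
            for term, count in counts.items()
        })
    return weights


# Assumed toy corpus: each document is a list of tokens.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]

weights = tf_idf(corpus)
```

In this toy corpus "the" appears in every document, so its weight is zero everywhere, while a term unique to one document, such as "bird", keeps a positive weight.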

Feature Hashing -> "A space efficient way to vectorize raw features."
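
A rough sketch of the idea: each term is hashed to an index in a fixed-length vector, so no vocabulary dictionary has to be stored. The function names, the vector size of 16, and the choice of MD5 (used here only because it is deterministic across runs, unlike Python's built-in `hash` for strings) are assumptions for illustration.

```python
import hashlib


def hash_index(term, num_features):
    """Map a term to a bucket in [0, num_features) (hypothetical helper)."""
    digest = hashlib.md5(term.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer, then fold into the vector size.
    return int.from_bytes(digest[:8], "big") % num_features


def hashed_term_counts(tokens, num_features=16):
    """Fixed-length vector of term counts built with the hashing trick."""
    vector = [0] * num_features
    for token in tokens:
        vector[hash_index(token, num_features)] += 1
    return vector


# Assumed toy document, split on whitespace.
vector = hashed_term_counts("the cat sat on the mat".split(), num_features=16)
```

The vector length is fixed up front, which is where the space saving comes from. The trade-off is that different terms can collide in the same bucket, and collisions become more likely as `num_features` shrinks.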