Here are some notes on TF-IDF as a measure for identifying keywords in a document. To index documents efficiently for future search, it is important to identify the keywords of each document. For a word $latex w$ to be a good keyword, it should first be relatively rare across the corpus. This can be measured by the inverse document frequency (IDF), which is equal to
$latex \log_2 \frac{N}{N_w}$,
where $latex N$ is the total number of known documents and $latex N_w$ is the number of documents containing the word $latex w$. Second, a useful keyword is likely to appear frequently in the target document, so we should multiply the IDF by the term frequency (TF), denoted $latex n_w^d$ for the word $latex w$ and a document $latex d$. Therefore, TF-IDF is defined as
TF-IDF = TF $latex \times$ IDF = $latex n_w^d \log_2 \frac{N}{N_w}$.
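As a concrete sketch, here is a minimal Python implementation of the formula above on a toy tokenized corpus (the corpus, the function name `tf_idf`, and the tokenization are all illustrative assumptions, not part of the original notes):

```python
import math

def tf_idf(word, doc, docs):
    """Compute the TF-IDF of `word` in `doc` given the corpus `docs`.

    Each document is represented as a list of word tokens.
    """
    n_w_d = doc.count(word)                             # term frequency n_w^d
    n_docs_with_w = sum(1 for d in docs if word in d)   # N_w
    idf = math.log2(len(docs) / n_docs_with_w)          # log2(N / N_w)
    return n_w_d * idf

docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "barked", "at", "the", "cat"],
    ["the", "dog", "chased", "the", "ball"],
]
tf_idf("cat", docs[0], docs)  # 1 * log2(3/2), roughly 0.585
tf_idf("the", docs[0], docs)  # 2 * log2(3/3) = 0
```

Note that a word appearing in every document gets a TF-IDF of zero no matter how frequent it is, which matches the intuition that such a word cannot discriminate between documents.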
Now, let’s try to interpret TF-IDF in the context of information theory. In particular, the mutual information between two random variables $latex X$ and $latex Y$ measures the degree of dependence between them. Consider the expression for the mutual information,
$latex I(X;Y)=\sum_{x,y} p(x,y) \log_2\frac{p(x,y)}{p(x)p(y)}\triangleq\sum_{x,y} i(x;y)$.
We can see that $latex I(X;Y)$ is large when the individual terms $latex i(x;y)$ are large. This suggests that $latex i(x;y)$ can be used as a measure of how relevant the two outcomes $latex x$ and $latex y$ are to each other.
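A small numerical example may help. For a toy joint distribution over two binary variables (the probabilities below are made up purely for illustration), each term $latex i(x;y)$ can be computed directly and summed to obtain $latex I(X;Y)$:

```python
import math

# Toy joint distribution p(x, y); the probabilities are illustrative.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Marginals p(x) and p(y), obtained by summing out the other variable.
p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yy), p in p_xy.items() if yy == y) for y in (0, 1)}

def i_term(x, y):
    """Pointwise term i(x;y) = p(x,y) * log2(p(x,y) / (p(x) p(y)))."""
    return p_xy[(x, y)] * math.log2(p_xy[(x, y)] / (p_x[x] * p_y[y]))

mutual_info = sum(i_term(x, y) for x in (0, 1) for y in (0, 1))
# The diagonal outcomes (0,0) and (1,1) are over-represented relative to
# independence, so their i(x;y) terms are positive and dominate the sum.
```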
Now, let $latex D$ denote a random document sampled uniformly from the $latex N$ known documents, and let $latex W$ denote a word sampled uniformly from the words of $latex D$. What will $latex i(d;w)$ be for a document $latex d$ and a word $latex w$? Assuming further that, given $latex w$, every document containing $latex w$ is equally likely (so that $latex p(d|w)=1/N_w$), we have
$latex i(d;w) = p(d,w) \log_2 \frac{p(d|w)}{p(d)}$
$latex = p(d) p(w|d) \log_2 \frac{1/N_w}{1/N}$
$latex = \frac{1}{N}\frac{n_w^d}{n^d} \log_2 \frac{N}{N_w}$
$latex = \frac{1}{N} \frac{\text{TF-IDF}(d,w)}{n^d},$
where $latex n^d$ is the number of words in the document $latex d$.
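This identity can be sanity-checked numerically. The sketch below (an illustrative toy corpus, not from the original notes) uses documents of equal length in which the word appears at most once per document, so that $latex p(d|w)=1/N_w$ holds exactly under the sampling model:

```python
import math

# Toy corpus: every document has length 4 and "cat" appears at most once
# per document, so p(d|w) = 1/N_w exactly under the sampling model.
docs = [
    ["cat", "sat", "on", "mat"],
    ["dog", "saw", "the", "cat"],
    ["dog", "ate", "the", "ball"],
]
word, d = "cat", 0
N = len(docs)
n_d = len(docs[d])
n_w_d = docs[d].count(word)
N_w = sum(1 for doc in docs if word in doc)

# Exact i(d;w): pick D uniformly, then a word position uniformly within D.
p_dw = (1 / N) * (n_w_d / n_d)                                   # p(d, w)
p_w = sum((1 / N) * (doc.count(word) / len(doc)) for doc in docs)
i_dw = p_dw * math.log2(p_dw / ((1 / N) * p_w))

# TF-IDF-based expression: (1/N) * TF-IDF(d, w) / n^d.
tfidf = n_w_d * math.log2(N / N_w)
assert abs(i_dw - tfidf / (N * n_d)) < 1e-12
```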
Now, if we want to find the best keyword of a document $latex d$, we would try to find the $latex w$ that has the maximum $latex i(d;w)$. That is,
$latex w^*=\arg\max_w i(d;w)$
$latex =\arg\max_w \frac{1}{N} \frac{\text{TF-IDF}(d,w)}{n^d}$
$latex =\arg\max_w \text{TF-IDF}(d,w)$.
Note that the last equality holds since $latex N$ and $latex n^d$ do not depend on $latex w$. This indeed suggests that TF-IDF is a good measure of the relevance between a document and a word.
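The argmax above translates directly into code. Here is a minimal sketch (the toy corpus and the helper name `best_keyword` are illustrative assumptions):

```python
import math

def best_keyword(d_index, docs):
    """Return the word in docs[d_index] with the maximum TF-IDF."""
    N = len(docs)
    doc = docs[d_index]

    def tfidf(w):
        N_w = sum(1 for d in docs if w in d)    # documents containing w
        return doc.count(w) * math.log2(N / N_w)

    # N and n^d are constant over w, so maximizing TF-IDF suffices.
    return max(set(doc), key=tfidf)

docs = [
    ["cat", "cat", "cat", "sat", "the"],
    ["the", "dog", "saw", "the", "cat"],
    ["the", "dog", "ate", "the", "ball"],
]
best_keyword(0, docs)  # "cat": frequent in document 0, yet rare in the corpus
```

Note that "the", despite being the most frequent word overall, can never win, because appearing in every document makes its IDF zero.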