Term frequency and inverse document frequency are computed from raw counts. The term frequency of a term \(v\) in a document \(d\) is its raw count in \(d\), and the document frequency \(D(v)\) is the number of documents containing \(v\). If \(|D| = N\) then the inverse document frequency is \[ \text{idf}(v, D) = \log \frac{N}{D(v)}. \]
Computed with the bind_tf_idf function from tidytext.
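As an illustration of the raw-count definition above (a minimal Python sketch, not the tidytext implementation):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Raw-count tf-idf for a list of tokenized documents."""
    N = len(docs)
    # Document frequency D(v): number of documents containing term v.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    # For each document, tf-idf(v, d) = count(v in d) * log(N / D(v)).
    return [
        {v: tf * math.log(N / df[v]) for v, tf in Counter(doc).items()}
        for doc in docs
    ]

docs = [["a", "b", "a"], ["b", "c"], ["c", "c", "d"]]
scores = tf_idf(docs)
# "a" appears twice in the first document and in no other,
# so its score there is 2 * log(3 / 1).
```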
The perplexity estimates how well the learned probability distribution of topics over documents fits new, held-out documents. The perplexity of a distribution \(q\), estimated from data and evaluated on a holdout set of \(N\) events \(x_i\) drawn from the true distribution \(p\), is
\[
\text{Perplexity}(p, q) = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log q(x_i) \right).
\] For texts, this is often normalized to a per-word perplexity.
Computed with the perplexity function from topicmodels.
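The formula above can be sketched directly in Python (an illustration of the definition, not the topicmodels implementation):

```python
import math

def perplexity(log_probs):
    """Perplexity from a model's log-probabilities of N held-out events."""
    N = len(log_probs)
    return math.exp(-sum(log_probs) / N)

# A model assigning probability 1/4 to each of 8 held-out words
# has a per-word perplexity of about 4.
lp = [math.log(0.25)] * 8
perplexity(lp)  # ≈ 4.0
```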
(UMass coherence; Mimno et al.)
The topic coherence measures the internal consistency of each topic. The measure is computed by the SpeedReader package, and the background is described in Optimizing Semantic Coherence in Topic Models. This is an ‘internal’ score: unlike pointwise mutual information, for example, there is no need to compute word co-occurrences on an external corpus.
The coherence \(C\) of topic \(t\), whose top \(M\) most probable words are denoted \(V^{(t)} = \{v_1^{(t)}, \dots, v_M^{(t)}\}\), is defined as \[ C(t; V^{(t)}) = \sum_{m=2}^{M} \sum_{l=1}^{m-1} \log \frac{D(v_m^{(t)}, v_l^{(t)}) + 1}{D(v_l^{(t)})}. \]
The measure is asymmetric and depends on the ordering of the top \(M\) terms in topic \(t\). Each term is a smoothed empirical log conditional probability, \(\log p(v_m \mid v_l) = \log \frac{p(v_l, v_m)}{p(v_l)}\), with zero probabilities eliminated by adding 1 to the co-document frequency \(D(v_l, v_m)\).
Computed with the coherence function from SpeedReader.
For a coherence pipeline design see this paper and the Python gensim library.
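A minimal Python sketch of the UMass coherence formula, where co-document frequencies are counted on the training corpus itself (an illustration, not the SpeedReader or gensim implementation):

```python
import math

def umass_coherence(top_words, docs):
    """UMass coherence: sum of smoothed log conditional co-document
    frequencies over ordered pairs of a topic's top words."""
    docsets = [set(d) for d in docs]
    def D(*words):
        # Number of documents containing all of the given words.
        return sum(all(w in ds for w in words) for ds in docsets)
    score = 0.0
    # v_l ranges over words earlier in the ordering than v_m.
    for m in range(1, len(top_words)):
        for l in range(m):
            score += math.log(
                (D(top_words[m], top_words[l]) + 1) / D(top_words[l])
            )
    return score

docs = [["cat", "dog"], ["cat", "dog", "fish"], ["cat"]]
umass_coherence(["cat", "dog"], docs)
# Single pair: D(dog, cat) = 2, D(cat) = 3, so log((2 + 1) / 3) = 0.
```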
The Rand index and adjusted Rand index measure how well a clustering matches existing labeled classes. The Rand indices measure the agreement of the clusterings on pairs of elements: we can think of putting a pair in the same cluster as a type of classification. Represented as a contingency table, the true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) are
| Reality \(\to\) | T | F |
|---|---|---|
| T | a = TP | b = FP |
| F | c = FN | d = TN |
For \(n\) labeled elements the total number of pairs is \({n \choose 2}\).
For clusterings \(c_i, c_j\) the Rand index is \[ \text{Rand}(c_i, c_j) = \frac{a + d}{{n \choose 2}}. \]
The adjusted Rand index corrects for the agreement expected from random clusterings.
Computed with the adjustedRand function from clues.
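The pair-counting definition of the Rand index can be sketched in a few lines of Python (an illustration of the formula, not the clues implementation):

```python
from math import comb

def rand_index(labels_true, labels_pred):
    """Rand index: fraction of element pairs on which the two
    clusterings agree (both together, or both apart)."""
    n = len(labels_true)
    agree = 0
    for i in range(n):
        for j in range(i + 1, n):
            same_true = labels_true[i] == labels_true[j]
            same_pred = labels_pred[i] == labels_pred[j]
            agree += same_true == same_pred  # a TP or TN pair
    return agree / comb(n, 2)

# Of the 6 pairs, only (2, 3) is classified differently: 5/6.
rand_index([0, 0, 1, 1], [0, 0, 1, 2])  # → 5/6 ≈ 0.833
```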
The other indices computed are:
FM: Fowlkes-Mallows \[ \text{FM} = \frac{a}{\sqrt{(a + b)(a + c)}} \]
JA: Jaccard \[ \text{JA} = \frac{a}{a + b + c} \]
HA: Hubert and Arabie’s adjusted Rand index. (Equivalent to Cohen’s kappa computed on the contingency table.)
MA: Morey and Agresti’s adjusted Rand index.
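The Fowlkes-Mallows and Jaccard indices above depend only on the pair counts \(a\), \(b\), \(c\); a hedged Python sketch of counting them and applying the formulas (not the clues implementation):

```python
import math

def pair_counts(labels_true, labels_pred):
    """Count pair classifications over all element pairs:
    a (TP), b (FP), c (FN)."""
    n = len(labels_true)
    a = b = c = 0
    for i in range(n):
        for j in range(i + 1, n):
            same_true = labels_true[i] == labels_true[j]
            same_pred = labels_pred[i] == labels_pred[j]
            if same_pred and same_true:
                a += 1          # together in both clusterings
            elif same_pred:
                b += 1          # together in prediction only
            elif same_true:
                c += 1          # together in reality only
    return a, b, c

a, b, c = pair_counts([0, 0, 1, 1], [0, 0, 1, 2])
fm = a / math.sqrt((a + b) * (a + c))  # Fowlkes-Mallows
ja = a / (a + b + c)                   # Jaccard
```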