How to Measure Topic Coherence

Tags: LDA

in data-science · Tue 26 February 2019

Unsupervised topic modeling algorithms like LDA and NMF produce a list of vocabularies for each topic after the training. These vocabs help humans to assign the subject information of the topic model. So how we measure the quality of these topic words ?, this problem has to be addressed in unsupervised topic clustering algorithms like LDA / NMF to understand models are improving or not.

It's always a challenge to qualitatively measure the goodness of the words produced by each topic. Usually, we take the top 10 words ( It's recommended to keep the top n-word count up to 7+/-2 ie; 5 to 9 words appropriate for human to judge and come up with a topic name using these words)

The methods discussed here are the standard coherence evaluation metrics, based on Frequentist probabilistic evaluation, TF-IDF, Word2Vec, and SVD based methods, over the top-n words of each topic and the input corpus given into the LDA model.

The probabilistic models generally measure the co-occurrence of the top-n topic words in the actual input corpus and the coherence value will be good if the co-occurrence measure from the top-n words will be higher. See more details of each method and its conventions used.

All unsupervised topic clustering algorithms have to address this point before going into production, ie; how much usable the topics produced by a given method, a human can interpret the meanings of a topic and describe the topic using top N words ( eg N = 10 ).

Based on this paper - coherence evaluation can be structured into 4 stages,

Segmentation of word subsets
Probability Estimation
Confirmation Measure
Aggregation

1. Segmentation of word subsets

In this stage we split the top-n words into pairs, we can do this in multiple ways to support the Boolean document counting or sliding window-based counting discussed in the next sections.

2. Probability Estimation methods

$$ \begin{array}{l} \mathcal{P}_{bd} \ \ \ \ \ \ \ \ \ \ =\ \ \ Boolean\ document\ estimation,\ count\ only\ onces\ in\ a\ given\ document.\\ \mathcal{P}_{bp} \ \ \ \ \ \ \ \ \ \ =\ \ Boolean\ paragraph\ estimation,\ counts\ the\ occurrences\ on\ every\ paragraph.\\ \mathcal{P}_{bs} \ \ \ \ \ \ \ \ \ \ =\ \ Boolean\ sentence\ estimation,\ counts\ on\ occurrences\ on\ every\ sentences\ wise.\\ \mathcal{P}_{sw} \ \ \ \ \ \ \ \ \ =\ \ Sliding\ window\ estimation,\ here\ a\ window\ of\ size\ N\ has\ been\ used\ to\ move\\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ over\ the\ document\ and\ each\ window\ is\ considered\ a\ virtual\ document,\ and\\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ do\ \mathcal{P}_{bd} \ counts\ on\ each\ windows.\\ \\ \ P( w_{j} ,\ w_{i}) =\ \ \ \ \#documents\ ( w_{j} ,\ w_{i}) \ co-occures\ /\ \#\ total\ number\ of\ documents.\\ \ P( w_{j}) \ \ \ \ \ \ =\ \ \ \ \ \#documents\ w_{j} \ occures\ /\ \#\ total\ number\ of\ documents\\ \\ The\ denominators\ /\ normalization\ term\ of\ the\ joint\ and\ marginal\ probability\ can\ be\ different\\ based\ on\ the\ method\ used\ for\ estimate\ the\ same.\ Above\ one\ is\ used\ if\ \mathcal{P}_{bd} \ is\ the\ estimation\ used.\\ \\ For\ other\ estimation\ types\ like\ \mathcal{P}_{np} \ we\ use\ \#\ total\ number\ of\ paragraph\ as\ the\ normalisation\ term. \end{array} $$

3. Confirmation Measures

In this phase, the actual scoring function is defined, which makes uses of any of the segmentation or probabilistic measuring methods described above. Let’s see some of the widely used coherence measuring methods.

3.1 UMass

UMass is the simplest method of all mentioned bellow and computes time is least among all others.

$$ \begin{array}{l} C_{UMass} \ =\dfrac{1}{^{N} C_{2}} \ \sum ^{N}_{j=2} \ \sum ^{j\ -1}_{i=1} \ log\left(\dfrac{P( w_{j} ,\ w_{i}) \ +\ \epsilon }{P( w_{i})}\right)\\ \\ N=\ \#Top\ words\ from\ a\ Topic.\ eg;\ N=10\\ \\ Here\ the\ P( w_{j} ,\ w_{i}) \ co-occurrence\ is\ calculated\ by\ using\ \ \mathcal{P}_{bd} \ \ method\ mentioned\ above.\ \\ \ \ \end{array} $$

3.2 UCI

Slightly improved version of UMass, and applying the sliding window based probabilistic measure.

$$ \begin{array}{l} C_{UCI} \ =\dfrac{1}{^{N} C_{2}} \ \sum ^{N}_{j=2} \ \sum ^{j\ -1}_{i=1} \ log\left(\dfrac{P( w_{j} ,\ w_{i}) \ +\ \epsilon }{P( w_{i}) \ P( w_{j})}\right)\\ \\ N=\ \#Top\ words\ from\ a\ Topic.\ eg;\ N=10\\ \\ Here\ the\ co-occurrence\ is\ calculated\ by\ applying\ the\ sliding\ window\ over\ the\\ text\ document. \end{array} $$

3.3 NPMI - Normalized Pointwise Mutual Information

Measuring the co-occurrence of words as the name suggests. This one is an improved version of PMI, by applying an added normalization to PMI.

$$ \begin{array}{l} N=\ \#Top\ words\ from\ a\ Topic.\ eg;\ N=10\\ \\ C_{NPMI} \ =\ \dfrac{1}{^{N} C_{2}} \ \sum ^{N}_{j=2} \ \sum ^{j\ -1}_{i=1}\left( \ \dfrac{\left(\dfrac{log\ ( P( w_{j} ,w_{i})) \ +\ \epsilon }{P( w_{j}) \ P( w_{i})}\right)}{-\ log( P( w_{j} ,\ w_{i})) \ +\ \epsilon }\right)\\ \\ P( w_{j} ,\ w_{i}) \ =\ Frequency\ of\ these\ two\ tokens\ co-occurrence\ on\ the\ input\ datasets.\ \\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ Here\ we\ apply\ \mathcal{P}_{sw} \ method. \end{array} $$

Outside this 4 stage framework, there are multiple coherence measures available, we can fit those measures along with the above-mentioned coherence measuring framework; even though some parts aren't relevant in some cases. Few of those measures are listed below,

3.4 Word2Vec based similarity score

Here we are making use of the semantic meanings of each word on the word2vec vector space and finding the cosine similarity between two word vectors.

$$ \displaystyle \dfrac{1}{^{N} C_{2}} \ \sum ^{N}_{j=2} \ \sum ^{j\ -1}_{i=1} \ similarity( wvi,\ wvj) $$

3.5 TF-IDF based improvement for UMass method

Here instead of measuring co-occurrence of two words across the documents, measure its relevance using tf-idf calculation.

$$ \begin{array}{l} c( t,\ W_{t}) \ -\ Topic\ t\ is\ characterized\ by\ its\ set\ of\ top\ W_{t} \ words.\\ \\ c_{tf-idf}( t,\ W_{t}) \ =\ \sum _{w_{1} ,w_{2} \ \in \ W_{t}} log\ \left(\dfrac{\sum _{d:\ w_{1} ,w_{w} \ \in \ d}( tf-idf( w_{1} ,\ d) \ \times \ tf-idf( w_{2} ,\ d) \ +\ \epsilon )}{\sum _{d:\ w_{1} \ \in \ d} tf-idf( w_{1} ,\ d)}\right)\\ Where;\\ \\ a) \ tf-idf( w_{1} ,\ d) \ =\ tf( w,\ d) \ \times \ idf( w) \ \\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ =\ \dfrac{1}{2} \ \dfrac{f( w,\ d)}{max\ _{w^{*} \ \in \ d} \ f\left( \ w^{*} ,\ d\right)} \ \ \times \ log\ (\dfrac{|D|}{|\{d\ \in \ D:\ w\ \in \ d\} |}\\ b) \ f( w,\ d) \ =\ Frequency\ of\ word\ w\ in\ document\ d.\\ c) \ max\ _{w^{*} \ \in \ d} \ f\left( \ w^{*} ,\ d\right) \ =\ Normalise\ it\ with\ max\ frequency\ of\ word\ on\ that\ same\ document.\\ d) \ w^{*} \ \ \neq \ \ w \end{array} $$

3.6 TBuckets

"we aim at measuring the quality of a single topic and propose a novel approach - TBuckets, which groups a topic’s words into thematic groups (which we call buckets). The intuition is that if a single large bucket is obtained from a topic, the topic carries a single coherent theme"

Refer paper

$$ \begin{array}{l} A\ =\ U{\textstyle \sum V^{T} \ \ \ ;\ SVD\ of\ the\ Matrix\ A}\\ \\ \\ A=\ N\ \times \ D\\ \\ Where:\ N\ \rightarrow \ top\ N\ words\ from\ topic\\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ D\ \rightarrow \ Dimention\ of\ word\ embeddings.\\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ A\ \rightarrow \ Word2Vec\ vectors\ for\ top\ N\ words\ of\ a\ topic.\\ \\ \sum \ \rightarrow \ Captures\ the\ different\ theme\ on\ same\ cluster\ with\ top\ N\ words.\\ \ \ \ \ \ \ \ \ \ \ \ \ \ These\ are\ the\ Buckets\ in\ TBucket\ algorithm.\\ \\ Eg;\ If\ we\ take\ top\ 10\ words\ from\ a\ topic,\ we\ can\ bucket\ them\ under\ major\ categories\ like\\ \ \ \ \ \ \ T_{1} \ =\ \{\ aircraft( 5) ,\ footwear\ ( \ 3) ,\ beverage( 2) \ \} \ \ \\ \ \ \ \ \ \ Higher\ the\ size\ of\ the\ bucket\ compared\ to\ others,\ model\ done\ better\ on\ topic\ identification.\\ \\ To\ allocate\ the\ word\ vectors\ under\ these\ buckets\ ( \ eigenvectors\ ) ,\ instead\ of\ naive\ assignment,\\ use\ Interger\ Linear\ Programming\ or\ Linear\ optimization\ to\ get\ better\ allocation\ and\ reduce\ the\\ fragmentation. \end{array} $$

4. Aggregation strategies

All the coherence measures discussed till now mainly deal with per topic level, to aggregate the measure for the entire model we need to aggregate all the topic level scores into one. The common method applied here is the arithmetic means of the topic level coherence score. Or other types of statistical summary like std or median etc.

Jaccard Similarity Measure for Model

All the above mentioned measuring mechanisms discuss coherence of individual topics and then we are applying standard aggregation over these scores via simple arithmetic mean. Is there any measure of the quality of all the topics or relationships between them?

The Jaccard Similarity between topics helps to understand how the topics are dependent on each other. If the similarity is higher means topics are very dependent on each other, otherwise topics are discussing similar domain ( eg; automobile ). The key thing to care about is, there is know good setting for this value, it's purely set based on the business requirement and nature of the data.

$$ \begin{array}{l} K\ =\ \#Topics\\ \\ MPJ_{m,\ k} \ =\ \dfrac{1}{^{K} C_{2}} \ \sum ^{K}_{j=2} \ \sum ^{j\ -1}_{i=1} \ \left( \ \dfrac{TD_{i} \ \cap \ TD_{j}}{TD_{i} \ \cup \ TD_{j}}\right) \end{array} $$

Thank you, That's all for now. Hope these heuristics helped you to understand your models well.

NOTE: This blog entry was originally published at https://labs.imaginea.com/post/how-to-measure-topic-coherence/
Location, and republished here with permission.

References

1. https://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf

2. TBuckets paper http://www.aclweb.org/anthology/E17-2070

3. http://www.cs.loyola.edu/~binkley/papers/icpc14-lda.pdf

4. Gensim Implementation - https://radimrehurek.com/gensim/models/coherencemodel.html

5. Math equations are built using this site - https://www.mathcha.io/editor/w6PBH0GSWNtDpir3

Go Top

Show/Post Comments