Use Metrics to Determine LDA Topic Model Size

A More Than 3-Minute Topic Modeling Article

Dan Robinson
Towards Data Science



Topic modeling is the automated discovery of semantically meaningful topics within a body of text. Topic models produce categories, expressed as lists of words, that can be used to divide a body of text into useful groupings. The most common algorithm currently used for topic modeling is Latent Dirichlet Allocation (Blei et al. 2003).

There is a plethora of articles, blog posts, tutorials, and videos that cover the basics of LDA topic modeling. They detail the preparation of the corpus and the creation of the model, and often conclude by visualizing the model in pyLDAvis. This article is not another primer on LDA basics; in fact, it assumes that the reader is already familiar with these steps. Rather than return to that well-worn ground, it focuses on one key, less discussed part of the process: using metrics to assist in choosing the right number of topics for a model.

Readers are warned that even in an article this long it is only possible to scratch the surface of the subject. The article covers only the use of metrics to guide the selection of an optimal number of topics. As will become clear, at some point in the process metrics cease to provide sufficient information to evaluate topic models. The article ends before the task is entirely complete, at the point where automated metrics have to give way to the more fine-grained, less automated process of evaluating the semantic content of the topic lists themselves and their relationship to the text being modeled. Despite these caveats, it is hoped that serious practitioners will benefit from reading it to the end.

Although this is not a tutorial, the data and code for every step covered in this article are published and available on Kaggle and GitHub. There is a Jupyter notebook that will run in Google Colab. The notebook, as well as configuration and implementation details, can be found in the GitHub repository. The dataset used for this article is a randomly selected 30,000-article subset of a larger, publicly licensed dataset called News Articles.

Metrics Won’t Solve Your Problem

(But they will make your life easier)

One of the key characteristics of LDA models is that the developer must choose the number of topics to be modeled in order to achieve optimal results. Yet little is written about how to determine that value. One reason is that the answer will always be: it depends. The optimal topic size for a given model is relative both to the characteristics of the corpus itself and to the use to which the model will be put. Are the documents being modeled long or short? Do they contain many different ideas, subjects, and themes? Or are the documents carefully prepared academic paper summaries covering a particular, small set of topics? Perhaps the documents are random tweets, with no particular unifying rhyme or reason?

This article walks the reader through the process of using industry-standard metrics to identify a small number of candidate topic sizes for further evaluation. Yet, as will be shown, these metrics are not a substitute for human judgement. Ultimately a modeler will have to apply subjective criteria, getting down to the level of how well the topics fit the text, to decide which model to use. While this time-consuming task is unavoidable, using metrics to reduce its size is well worth the effort.

The Toolkit

The remainder of this article is concerned with the practical application of metrics to determine a reasonable set of candidate topic sizes for a 30,000-article news clip database. The objective is to establish a set of topics that reflects the diversity of subjects within the documents while ideally limiting the topics to those a general reader might consider important or meaningful.

Datasets

The corpus dataset provided contains both the original news articles and a processed, lemmatized, and trimmed version of the text suitable for LDA topic model processing. Code is provided for readers who wish to generate this output for themselves (or experiment with different text preparation strategies).
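For readers who want a starting point, the sketch below shows one plausible way such a lemmatized, trimmed version could be produced. It is not the pipeline behind the published dataset; the spaCy model name, stop-word handling, and length threshold are all illustrative assumptions.

# A minimal sketch of one possible preparation pipeline, not the exact
# pipeline used for the published dataset. Assumes spaCy's "en_core_web_sm"
# model is installed; the thresholds are illustrative.
import spacy
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def prepare(raw_text, extra_stopwords=frozenset()):
    """Tokenize, lemmatize, and trim a single article."""
    tokens = simple_preprocess(raw_text, deacc=True)      # lowercase, strip punctuation
    doc = nlp(" ".join(tokens))
    return [
        tok.lemma_.lower()
        for tok in doc
        if tok.lemma_.lower() not in STOPWORDS
        and tok.lemma_.lower() not in extra_stopwords      # supplementary stop-word list
        and len(tok.lemma_) > 2
    ]

docs = [prepare(article) for article in raw_articles]      # raw_articles: original article strings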

In addition to the core dataset, a copy of the metrics produced by over ninety runs of the dataset is provided. Lastly, a supplementary stop-word list, introduced after the initial creation of the data, is provided as well.

Google Colab

While a “pro” account was used for this article, everything was run in standard mode without taking advantage of additional GPU processors or memory.

Gensim

The Gensim LDA model implementation was used throughout.
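As a point of reference, a single model build with Gensim looks roughly like the sketch below. The hyperparameters shown (passes, dictionary trimming thresholds) are illustrative assumptions rather than the configuration used for the article's runs; docs is the list of prepared token lists from the previous step.

# A minimal sketch of building one LDA model with Gensim; hyperparameters
# are illustrative, not the article's actual configuration.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

dictionary = Dictionary(docs)                          # docs: prepared token lists
dictionary.filter_extremes(no_below=20, no_above=0.5)  # trim rare and ubiquitous terms
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(
    corpus=bow_corpus,
    id2word=dictionary,
    num_topics=20,                                     # one candidate topic size
    passes=10,
    random_state=42,                                   # LDA is stochastic; fix a seed per run
)

# Top-ten word lists per topic, the form evaluated later in the article.
topics = [[word for word, _ in lda.show_topic(t, topn=10)]
          for t in range(lda.num_topics)]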

OCTIS

Optimizing and Comparing Topic Models (OCTIS) is used for its extensive collection of topic model evaluation metrics. OCTIS provides an end-to-end environment for model building and evaluation, including a visualization interface. For this article only the OCTIS Gensim LDA wrapper and the OCTIS metrics were used.
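Invoked directly, the metric classes look roughly like the sketch below. The module paths and constructor arguments reflect recent OCTIS releases and may differ by version, so treat this as an assumption rather than a verified recipe.

# Rough sketch of scoring one trained model with two of the OCTIS metrics.
# Module paths and arguments are assumptions based on recent OCTIS versions.
from octis.evaluation_metrics.coherence_metrics import Coherence
from octis.evaluation_metrics.diversity_metrics import TopicDiversity

model_output = {"topics": topics}                      # top-N word lists per topic

npmi = Coherence(texts=docs, topk=10, measure="c_npmi")
diversity = TopicDiversity(topk=10)

scores = {
    "npmi": npmi.score(model_output),
    "topic_diversity": diversity.score(model_output),
}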

kneed

Knee-point detection in Python (kneed) is used to identify the point of maximum curvature in the results of the various metrics and thus to choose particular model builds as candidate models.
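A toy example of knee detection is sketched below. The curve shape and direction ("concave", "increasing") depend on the metric being analyzed, and the values here are synthetic stand-ins for an averaged metric series.

# Toy sketch of knee detection with kneed; the metric series is synthetic.
from kneed import KneeLocator

topic_sizes = list(range(5, 155, 5))                   # 5, 10, ..., 150
mean_score = [1 - 1 / (k / 5) for k in topic_sizes]    # rises quickly, then flattens
knee = KneeLocator(topic_sizes, mean_score,
                   curve="concave", direction="increasing")
print(knee.knee)                                       # the candidate topic size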

Plotly

To produce graphs.

Evaluating the Models

Two common coherence metrics, NPMI and C_V, are evaluated (Lau et al. 2014; Röder et al. 2015). In addition, a series of metrics that measure topic diversity and similarity are profiled. The coherence metrics are computationally expensive; two common variations are run here to determine whether they produce different results. The topic metrics, which analyze the model's topic output rather than its internal structure, are much less expensive to run than the PMI-based coherence metrics. The three diversity metrics are Topic Diversity (Dieng et al. 2019), Inverted Rank-Biased Overlap (Webber et al. 2010), and Kullback-Leibler divergence. A similarity metric, Pairwise Jaccard Similarity, and three topic significance metrics, KL Uniform, KL Vacuous, and KL Background, are also run (AlSumait et al. 2009).

It is important to remember that LDA models are stochastic: each run of the model will produce different results. Therefore, in this evaluation, multiple models were run at each selected topic size, along with their complete suite of metrics. A total of ninety experiments were run: LDA topic models were created for topic sizes 5 to 150 in increments of 5 (5, 10, 15, … 150), with three runs at each size, and all nine metrics were captured for each run. A sketch of the experiment loop is shown below, followed by plots of the metrics for all ninety runs.
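The sketch reuses the objects from the earlier snippets (bow_corpus, dictionary, and the OCTIS metric instances) and is a simplification of the published notebook, not a copy of it.

# Sketch of the experiment grid: three runs per topic size, metrics per run.
# Reuses bow_corpus, dictionary, docs, npmi and diversity from the sketches above.
from gensim.models import LdaModel

results = []                                           # one row per (size, run)
for num_topics in range(5, 155, 5):
    for run in range(3):
        lda = LdaModel(corpus=bow_corpus, id2word=dictionary,
                       num_topics=num_topics, passes=10, random_state=run)
        topics = [[w for w, _ in lda.show_topic(t, topn=10)]
                  for t in range(num_topics)]
        model_output = {"topics": topics}
        results.append({
            "num_topics": num_topics,
            "run": run,
            "npmi": npmi.score(model_output),
            "topic_diversity": diversity.score(model_output),
            # ...the remaining metrics are scored the same way
        })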

[Figure: all nine metrics plotted for each of the ninety runs. Images by author.]

Averaging the three runs for each of the topic model sizes results in:

[Figure: metric values averaged over the three runs at each topic size. Images by author.]

Unfortunately, interpreting these results is not straightforward. NPMI and C_V produce an optimal result, but most of the others seem to continue improving beyond the range of topic sizes sampled. Before continuing, an intuition borrowed from other areas of machine learning will help: determining the knee (or elbow) of these curves. Fortunately, there is a package to do exactly that: kneed.

The output from kneed reveals six candidate topic sizes: 5, 10, 20, 35, 50 and 80.

Jaccard Similarity: 5
Topic Diversity: 10
NPMI: 20
C_V: 20
Inverted RBO: 35
KLBac: 50
KLDiv: 50
KLVac: 50
KLUni: 80

Six is a lot, but better than thirty. At some point it will be necessary to sample each topic and compare it to a set of documents to judge fit. These six models represent two hundred topics; if we sampled just ten documents per topic, that would mean reviewing, rating, and comparing two thousand documents. It would be nice to narrow the candidate list even further. A profile of the distribution of topics may point to models that can be eliminated from consideration.

Likelihood Distributions

Since topics are created across the entire corpus, documents will contain more than one topic. The LDA model computes the likelihood that each topic is present in a given document. For example, one document may be evaluated to contain a dozen topics, none with a likelihood of more than 10%. Another document might be associated with four topics, one evaluated at a 90% likelihood and the remaining 10% distributed across the other three. This reality complicates the job at hand, producing a large number of variations (different mixtures of topics per document) to evaluate.
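In Gensim this per-document distribution is available directly from the trained model, as sketched below; the variable names follow the earlier sketches and the minimum_probability cutoff is an illustrative choice.

# Sketch: per-document topic likelihoods for a single article.
bow = dictionary.doc2bow(docs[0])
doc_topics = lda.get_document_topics(bow, minimum_probability=0.01)
# e.g. [(3, 0.62), (1, 0.33), (7, 0.04), (0, 0.01)]
for topic_id, likelihood in sorted(doc_topics, key=lambda t: -t[1]):
    print(topic_id, round(likelihood, 3))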

To simplify this problem it is possible to apply another widely accepted intuition: the dominant topic. The dominant topic is the topic with the largest likelihood of being present in a document. For example, for a document with a likelihood distribution of 62%, 33%, 4%, and 1%, the dominant topic is the one with the largest likely contribution, in this case 62%. Graphing the number of times each topic is the dominant topic for a document, our dataset produces the following:

[Figure: number of documents for which each topic is dominant, by model size. Images by author.]

Note that the ratio between the most represented and least represented topics changes dramatically in the larger models. In the five-topic model the most dominant topic is represented in 28% of the documents and the least in 15%. In the eighty-topic model the range is from 7% to 0.001%. In the smaller models, even the least represented topics can be found in hundreds of articles; in the larger models the long tails contain very small numbers of documents. This implies that even if these topics are “good”, they represent a very small portion of the overall documents and may well be just noise relative to our stated goal of a good general chunking of the documents. Further profiling of how the model distributes documents within a topic may reveal important information.

The chart above shows that within a given model some topics are dominant for many documents and others for fewer. But it doesn't reveal how dominant a given topic is across its documents. For example, a topic that is 20% dominant in one document is counted the same as a topic that is 80% dominant in another. Fortunately, it is straightforward to profile the distributions within a given topic:

[Figure: box plots of dominant-topic likelihoods within each topic, by model size. Images by author.]

Note that in the five- and ten-topic models the ceiling and floor for each topic fall roughly between 20% and 100%. In each of the subsequent models this window moves lower. Also note that in the two smaller models the median values are roughly above 60% and the number of statistical outliers is very small.

At this level of analysis the numbers point to the small models being very strong indeed. They are statistically confident overall: the average confidence for each topic is well over 50%, and the low end of the confidence scores is better than in all the other models. Not so with the larger models. As topic size increases, the data becomes increasingly noisy with more outliers, the averages decrease, fewer topics predict at high confidence, and the number of low-predicting topics increases.
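For reference, per-topic dominance statistics along these lines can be assembled as in the sketch below; pandas and the variable names are assumptions, and the published notebook may compute them differently.

# Sketch: profile how confidently each topic dominates its documents.
# Reuses lda and bow_corpus from the earlier sketches.
import pandas as pd

rows = []
for doc_id, bow in enumerate(bow_corpus):
    topic_id, likelihood = max(lda.get_document_topics(bow), key=lambda t: t[1])
    rows.append({"doc": doc_id, "topic": topic_id, "likelihood": likelihood})

profile = (pd.DataFrame(rows)
           .groupby("topic")["likelihood"]
           .describe()[["count", "mean", "50%", "min", "max"]])
print(profile.sort_values("count", ascending=False))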

Based on these statistics it is possible to make a case for eliminating the two largest models from the evaluation. Remember that the task to which this model is being put is to arrive at a set of topics that will allow the corpus to be divided into manageable chunks. The arguments for eliminating the fifty- and eighty-topic models are:

  • The larger numbers of topics are unwieldy for any kind of general chunking. Many of the larger models' topics represent less than two percent of the corpus. Even if these topics represent semantically meaningful categories, they don't serve the overall objective.
  • In the larger models the average likelihood of topic dominance is almost always less than 50%. The larger models are not very confident about their predictions.
  • The data is very noisy in the two largest models.

At this point it is necessary to cross over from using metrics and statistics to evaluating the model based on more subjective measures of the semantics of the model itself. However, before ending the article it is worth taking a quick look at the model output to determine whether it is possible to reasonably reduce the number of candidate topic sizes further and make the job easier.

Evaluating the Semantics of the Topic Model

Topic models themselves can be evaluated distinct from the documents they categorize. In fact, the non-PMI-based metrics used above do exactly this by evaluating the most common words for each topic. We can evaluate a topic based on its internal cohesion (whether or not the words work together) and its comprehensibility (whether or not the words together form semantically meaningful concepts or categories).

Topic lists are evaluated for cohesion and comprehensibility. Cohesion is a judgement about whether the words work well together. For example, “car, boat, train” or “man, woman” are cohesive lists, whereas “car, boat, train, man, woman” is less cohesive. Comprehensibility measures whether or not some idea, subject, or activity clearly emerges from a topic list.

These are the first ten words of each topic in the five topic model (the first number is the number of documents for which the topic is dominant and the second number is the topic id):

============== Five ===============
8609 3 family old leave man know life mother see woman call
6614 4 first win club come back leave second world england match
5706 1 president official call country law former cnn party leader include
4570 2 company school uk business include cent service high money come
4501 0 woman world study many know health first see life even

The topics for the five-topic model seem mixed in both their cohesion and comprehensibility. In topic 3, for example, family, man, mother, woman work semantically with old, life; leave, know, see, call are not as clear. Topic 4 seems to be the most cohesive: none of the words in that group seem out of place. Topic 4 is also the most comprehensible and is clearly about soccer. Topic 2 seems related to commerce (presumably in the UK), but the words include, cent, come reduce its overall cohesion. Likewise, topic 1 may be about politics, but again there are words in its list that are hard to reconcile with the others. Topic 0 is the hardest to interpret and has the least cohesion.

This is interesting, but doesn't help much in our task of quickly eliminating this model. However, looking at how well topics fit examples of the text might. A random sampling of documents produces the following (the topic list appears above the beginning of each text sample; “Contribution:” is the likelihood that the topic is represented in the text):

************************
Model: Five
Document ID: 19866
Topic: 1
Contribution: 0.5211269855499268
president official call country law former cnn party leader include
Daily Mail Reporter . PUBLISHED: . 14:04 EST, 6 June 2013 . | . UPDATED: . 14:04 EST, 6 June 2013 . The government says one in 10 youths at juvenile detention facilities around the country reported having been sexually victimized by staff or by other youths. The study by the Bureau of Justice Statistics found that among the more than
************************
Model: Five
Document ID: 12506
Topic: 4
Contribution: 0.4131307005882263
first win club come back leave second world england match
1 June 2016 Last updated at 16:20 BST The Gotthard base tunnel is 57km (35-miles) long and took seventeen years to build. Engineers dug deep under the Swiss Alps mountains to make it and links northern Europe to Italy in the South. The tunnel will be used for freight trains transporting goods and passenger trains. It's estimated around
************************
Model: Five
Document ID: 12890
Topic: 3
Contribution: 0.5673818588256836
family old leave man know life mother see woman call
By . Daily Mail Reporter . PUBLISHED: . 07:25 EST, 29 November 2013. UPDATED: . 07:25 EST, 29 November 2013 . He had his hind legs amputated at just a few weeks old after being born with a severe deformity. But not only has the Boxer puppy overcome his disability by running on his front paws, he also has a specially adapted wheelchair
************************
Model: Five
Document ID: 11310
Topic: 4
Contribution: 0.573677122592926
first win club come back leave second world england match
(CNN) -- "Glee" will likely end its run after season 6 the final year in the drama's current deal on Fox. "I would not anticipate it goes beyond two more seasons," Fox entertainment chairman Kevin Reilly told reporters on Thursday. "Never say never, but there's two very clear [story] arcs to get to that end and conclude. If we discover a
************************
Model: Five
Document ID: 4728
Topic: 1
Contribution: 0.580642819404602
president official call country law former cnn party leader include
By . Simon Walters, Glen Owen and Brendan Carlin . PUBLISHED: . 18:25 EST, 27 April 2013 . | . UPDATED: . 18:38 EST, 27 April 2013 . David Cameron's election guru believes that Tory chairman Grant Shapps and Chancellor George Osborne are ‘liabilities’ who will cost the party votes in this week’s crucial town hall polls, it was claimed last

These examples show that the documents are broadly misaligned with their assigned topics. Based on this, it is reasonable to remove the five-topic model from consideration.

Turning to the ten-topic model, we find that topic cohesion is mixed:

=========== Ten =============
4684 3 old leave family man miss car officer see back come
3928 8 club first win match england leave score back side come
3924 7 film see first star come think know world even well
3304 5 world water high china country large area first see many
3053 6 official country military attack security cnn president call leader american
2749 4 party uk bbc service election vote council public company labour
2618 9 woman school student know family parent life call girl want
1979 0 charge judge sentence prison trial murder arrest prosecutor drug month
1900 1 president trump republican obama race campaign first come run car
1861 2 hospital health patient doctor medical dr care treatment die risk

And five randomly chosen document examples reveal:

************************
Model: Ten
Document ID: 13787
Topic: 7
Contribution: 0.4396437108516693
film see first star come think know world even well
LOS ANGELES, California (CNN) -- When director Antoine Fuqua rolls into a community to shoot a movie, he becomes part of that community. Filmmaker Antoine Fuqua began a program to foster young moviemakers in poor communities. This isn't the case of a Hollywood filmmaker cherry-picking glamorous locations like Beverly Hills or Manhattan. Fuqua's
************************
Model: Ten
Document ID: 19146
Topic: 7
Contribution: 0.4848936200141907
film see first star come think know world even well
Shinjuku has a population density of about 17,000 people per square kilometre but undeterred by this it has granted citizenship to a new resident, who only goes by one name - Godzilla. Name: Godzilla Address: Shinjuku-ku, Kabuki-cho, 1-19-1 Date of birth: April 9, 1954 Reason for special residency: Promoting the entertainment of and
************************
Model: Ten
Document ID: 1482
Topic: 1
Contribution: 0.3362347483634949
president trump republican obama race campaign first come run car
(CNN) -- "An unconditional right to say what one pleases about public affairs is what I consider to be the minimum guarantee of the First Amendment." -- U.S. Supreme Court Justice Hugo L. Black, New York Times Co. vs. Sullivan, 1964 . It's downright disgusting to listen to conservative and Republican lawmakers, presidential candidates,
************************
Model: Ten
Document ID: 28462
Topic: 5
Contribution: 0.5035414695739746
world water high china country large area first see many
The emergency services were called out at about 10:00, and the CHC helicopter landed at about 10:15. A CHC spokesperson said: "In accordance with operating procedures, the crew requested priority landing from air traffic control. "This is normal procedure, a light illuminated in the cockpit." The spokesperson added: "The aircraft
************************
Model: Ten
Document ID: 16179
Topic: 3
Contribution: 0.5338488221168518
old leave family man miss car officer see back come
The 31-year-old right-armer joined from Hampshire ahead of the 2014 campaign, but missed most of the 2015 season with triceps and back injuries. Griffiths was Kent's leading wicket-taker in the T20 Blast this season, with 13 at an average of 33.61. He also played three times in the One-Day Cup, but did not feature in the Count

Overall the topics seem better aligned with their representative documents than in the five-topic model. While there are clearly issues, it seems premature to drop the ten-topic model without further evidence.

Summary

This article demonstrated how metrics can be used to assist in determining LDA topic model size. A number of models of different sizes were generated, along with their attendant metrics and profile statistics. The knee points of the metric output were calculated, which identified six candidate topic model sizes. The relatively high statistical noisiness of the two largest models, coupled with the judgement that their sheer size would be ill-suited to the task at hand, led to their exclusion from consideration. The analysis then turned to the five- and ten-topic models, which seemed statistically viable. At this point the purely metric and statistical tools for evaluation were exhausted. The article covered an initial pass at judging the semantics of the models: the five-topic model was clearly inadequate for the task and was easily excluded, while the ten-topic model showed some signs of weakness but, absent a more thorough analysis, could not reasonably be dropped from consideration.

Most writing on the web that deals with LDA topic model creation is either a basic tutorial or a dense, theoretical paper on the mathematics of LDA and its evaluation. This article sought to provide a template for developers looking beyond the basics but grounded in practical application. As should be clear, this is not a simple “3-minute” task. The article described the process of using industry-standard metrics and statistics to provide guideposts for the task. Hopefully, practitioners seeking to use LDA in the real world will find this information useful in deepening their understanding of the subject and improving their own work.

Bibliography

AlSumait, L., Barbará, D., Gentle, J., & Domeniconi, C. (2009). Topic Significance Ranking of LDA Generative Models. Machine Learning and Knowledge Discovery in Databases, 67–82.

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993–1022. https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf

Lau, J. H., Newman, D., & Baldwin, T. (2014). Machine Reading Tea Leaves: Automatically Evaluating Topic Coherence and Topic Model Quality. Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, 530–539.

Röder, M., Both, A., & Hinneburg, A. (2015). Exploring the Space of Topic Coherence Measures. Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, 399–408.

Terragni, S., Fersini, E., Galuzzi, B. G., Tropeano, P., & Candelieri, A. (2021). OCTIS: Comparing and Optimizing Topic Models is Simple! Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, 263–270.

Webber, W., Moffat, A., & Zobel, J. (2010). A similarity measure for indefinite rankings. ACM Transactions on Information Systems, 28(4), 1–38.

Additional References

This article was inspired by:

Selva Prabhakaran (2018) Topic Modeling with Gensim (Python)

Shashank Kapadia (2019) Evaluate Topic Models: Latent Dirichlet Allocation (LDA)


I'm a thirty-year+ technology consultant who gave up tech for a while and am now edging back in. Connect with me at https://www.linkedin.com/in/daninberkeley/