Topic Modeling for Evolving Textual Data Using LDA, HDP, NMF, BERTOPIC, and DTM With a Focus on Research Papers
DOI:
https://doi.org/10.37802/joti.v5i2.618
Keywords:
BERTopic, Dynamic Topic Modeling (DTM), Evolving Textual Data, Hierarchical Dirichlet Process (HDP), Latent Dirichlet Allocation (LDA), Non-negative Matrix Factorization (NMF), Topic Modeling, Research Papers and Academic Literature
Abstract
As the volume of academic literature grows, advanced tools are increasingly needed to track evolving research trends. This study applies five topic modeling techniques, Latent Dirichlet Allocation (LDA), Hierarchical Dirichlet Process (HDP), Non-negative Matrix Factorization (NMF), BERTopic, and Dynamic Topic Modeling (DTM), to a dynamic corpus of research papers. It addresses the challenges of capturing temporal dynamics, evolving terminology, and interdisciplinary themes in academic literature. Through a comparative investigation of these models, we assess how effectively each extracts and tracks research topics over time. NMF, HDP, LDA, and BERTopic showed comparable performance in topic extraction. DTM exhibited the highest term-topic probability, although the non-meaningful words in its topics hindered its suitability. Despite this drawback, DTM emerged as the most effective model in our study for following evolving research trends.