Word2Vec has made a splash in the NLP world, as it is an elegant method for learning word embeddings, or word representations. Its skip-gram model and neural-network training made a big impact too. It has been my favorite toy indeed. However, even though words do correlate across a small segment of text, that is still only local coherence. Topic models such as latent Dirichlet allocation (LDA), on the other hand, capture the distribution of words within a topic, and of topics within a document, and they provide a representation of a new document in terms of topics.
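As a minimal sketch of what "a representation of a new document in terms of topics" means, here is a toy example using scikit-learn's LDA implementation (the post does not prescribe a library, so this choice, the tiny corpus, and the two-topic setting are all my own assumptions):

```python
# Toy illustration: LDA turns each document into a topic distribution.
# The corpus and topic count here are arbitrary stand-ins.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "cats and dogs are pets",
    "dogs chase cats",
    "stocks and bonds are investments",
    "investors trade stocks",
]

# Bag-of-words counts, the input LDA expects
counts = CountVectorizer().fit_transform(docs)

# Fit a 2-topic model; each document becomes a 2-dim topic distribution
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

print(doc_topics.shape)  # (4, 2); each row sums to 1
```

Each row of `doc_topics` is the document's mixture over topics, which is exactly the kind of compact representation discussed below.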
In my previous blog entry, I introduced Chris Moody's LDA2Vec algorithm (see his SlideShare). Unfortunately, despite its potential, not many papers or blogs have covered this new algorithm. The API is not yet fully documented, although you can see examples in its source code on GitHub. In lda2vec/lda2vec.py, the documentation gives an example of deriving topics from an array of random numbers:
```python
import numpy as np
from lda2vec import LDA2Vec

n_words = 10
n_docs = 15
n_hidden = 8
n_topics = 2
n_obs = 300

words = np.random.randint(n_words, size=(n_obs))
_, counts = np.unique(words, return_counts=True)

model = LDA2Vec(n_words, n_hidden, counts)
model.add_categorical_feature(n_docs, n_topics, name='document id')
model.finalize()

doc_ids = np.arange(n_obs) % n_docs
loss = model.fit_partial(words, 1.0, categorical_features=doc_ids)
```
A more comprehensive example is in examples/twenty_newsgroup/lda.py.
Besides LDA2Vec, there is related research on topical word embeddings. A group of Australian and American scientists studied topic modeling with pre-trained Word2Vec (or GloVe) embeddings before performing LDA (see their paper and code). Another group of Chinese and Singaporean scientists performs LDA first, and then trains a Word2Vec model (see their paper and code). LDA2Vec itself concatenates the Word2Vec and LDA representations, like an early fusion.
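To make the "early fusion" idea concrete, here is a toy sketch of concatenating a word vector with a document's topic distribution; the vectors below are random stand-ins (not trained embeddings), and the dimensions (300 for Word2Vec, 20 topics for LDA) are arbitrary assumptions of mine:

```python
# Toy "early fusion": concatenate a word's embedding with its document's
# topic distribution into one combined representation.
import numpy as np

rng = np.random.default_rng(0)

word_vec = rng.normal(size=300)          # stand-in for a 300-dim Word2Vec embedding
topic_dist = rng.dirichlet(np.ones(20))  # stand-in for a 20-topic LDA distribution

fused = np.concatenate([word_vec, topic_dist])
print(fused.shape)  # (320,)
```

The fused vector carries both the local (embedding) and global (topic) signal; downstream models then consume this single representation.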
In any case, representations from LDA models (or related topic models such as the correlated topic model (CTM)) can be useful even outside NLP. I have lately found them useful as an intermediate layer of computation.
- Kwan-Yuet Ho, “Toying with Word2Vec,” WordPress (2015).
- Kwan-Yuet Ho, “LDA2Vec: a Hybrid of LDA and Word2Vec,” WordPress (2016).
- Christopher Moody, “word2vec, LDA, and introducing a new hybrid algorithm: lda2vec” (2016). [SlideShare]
- lda2vec on GitHub.
- Christopher Moody @ CrunchBase.
- Nikita Nikitinsky, “A tale about LDA2Vec: when LDA meets word2vec,” NLPx (2016).
- StackOverflow: Using Word2Vec for Topic Modeling.
- J. Pennington, R. Socher, C. D. Manning, “GloVe: Global Vectors for Word Representation”, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014). [link, pdf]
- Dat Quoc Nguyen, Richard Billingsley, Lan Du and Mark Johnson, “Improving Topic Models with Latent Feature Word Representations,” Transactions of the Association for Computational Linguistics, vol. 3, pp. 299-313 (2015). [link]
- Yang Liu, Zhiyuan Liu, Tat-Seng Chua, Maosong Sun, “Topical Word Embeddings,” Proceedings of the 29th AAAI Conference on Artificial Intelligence, 2418-2424 (2015). [link]
- T. Mikolov, K. Chen, G. Corrado, J. Dean, “Efficient Estimation of Word Representations in Vector Space,” Proceedings of Workshop at ICLR (2013). [arXiv:1301.3781]
- D. M. Blei, A. Y. Ng, M. I. Jordan, “Latent Dirichlet Allocation,” J. Machine Learning Research 3, 993 (2003). [PDF]
- D. M. Blei, J. D. Lafferty, “Correlated Topic Models,” in Weiss, Y., Schölkopf, B., and Platt, J., editors, Advances in Neural Information Processing Systems 18. [PDF]
- Radim Řehůřek, “Making sense of word2vec,” RaRe Technologies (2014).