LDA (latent Dirichlet allocation) and Word2Vec are two important algorithms in natural language processing (NLP). LDA is a widely used topic modeling algorithm: it infers the topic distribution of each document in a corpus, and the word distribution within each topic, with a Dirichlet prior on the topic mixtures. Word2Vec is a word embedding model, trained with a shallow neural network, that learns a continuous vector representation for each word.
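For concreteness, here is a minimal sketch of training both models with gensim (my choice of library here; the toy corpus and all parameter values are purely illustrative, and the parameter names follow gensim 4.x):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel, Word2Vec

# A toy corpus: each document is a list of tokens.
docs = [
    ["topic", "modeling", "finds", "latent", "themes"],
    ["word", "vectors", "capture", "local", "context"],
    ["latent", "themes", "emerge", "from", "word", "cooccurrence"],
]

# LDA: a global bag-of-words view with a Dirichlet prior on topic mixtures.
dictionary = Dictionary(docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]
lda = LdaModel(bow_corpus, num_topics=2, id2word=dictionary, passes=10)
print(lda.get_document_topics(bow_corpus[0]))  # sparse topic proportions

# Word2Vec: a local view, learning vectors from neighboring words.
w2v = Word2Vec(docs, vector_size=50, window=2, min_count=1, epochs=50)
print(w2v.wv["latent"])  # a dense 50-dimensional vector
```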
Both are very useful, but they differ in scope: LDA models words and documents globally (over the whole corpus), while Word2Vec works locally (from adjacent words in the training data). An LDA document vector is sparse, so users can interpret its topics easily, but it is inflexible; Word2Vec's dense representation is not human-interpretable, but it is easy to use downstream. In his slides, Chris Moody recently devised a topic modeling algorithm called lda2vec, a hybrid of the two that tries to get the best of both algorithms.
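Based on my reading of Moody's slides, the core idea is that a word's context vector is the sum of its word vector and a document vector, where the document vector is a mixture over topic vectors constrained to be sparse, Dirichlet-style, so the topics stay interpretable. Here is a conceptual numpy sketch of that composition (not Moody's actual implementation, which is written in Chainer; all values below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n_topics, dim = 5, 50

# Topic vectors live in the same space as word vectors.
topic_matrix = rng.normal(size=(n_topics, dim))
word_vector = rng.normal(size=dim)  # stand-in for a trained word vector

# Unnormalized per-document weights; during training a Dirichlet-like
# penalty pushes the softmax toward sparsity, as in LDA.
doc_weights = np.array([4.0, 0.1, 0.1, 2.0, 0.1])
doc_proportions = np.exp(doc_weights) / np.exp(doc_weights).sum()  # softmax

# Document vector = sparse mixture of topic vectors;
# context vector = word vector + document vector.
doc_vector = doc_proportions @ topic_matrix
context_vector = word_vector + doc_vector
```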
Honestly, I have never used this algorithm, and I rarely write about something I haven't tried, but I want to raise awareness so that more people know about it by the time I come to use it. At first glance it looks like merely concatenating the two kinds of vectors with some hyperparameters, but the source code refutes that impression: it is a topic model in its own right.
There are not many blog posts or papers about lda2vec yet. I look forward to learning more about it as awareness grows.
- Christopher Moody, "word2vec, LDA, and introducing a new hybrid algorithm: lda2vec," 2016. [SlideShare]
- Christopher Moody's Twitter.
- lda2vec examples on Jupyter Notebook.
- lda2vec on GitHub.
- Nikita Nikitinsky, "A tale about LDA2Vec: when LDA meets word2vec," NLPx (2016).
- D. M. Blei, A. Y. Ng, M. I. Jordan, “Latent Dirichlet Allocation,” J. Machine Learning Research 3, 993 (2003). [PDF]
- T. Mikolov, K. Chen, G. Corrado, J. Dean, “Efficient Estimation of Word Representations in Vector Space,” ICLR 2013 (2013). [arXiv:1301.3781].
- Kwan-Yuet Ho, “Toying with Word2Vec,” WordPress (2015).
- Radim Řehůřek, “Making sense of word2vec,” RaRe Technologies (2014).