Local and Global: Words and Topics

Word2Vec has hit the NLP world for a while, as it is a nice method for word embeddings or word representations. Its use of skip-gram model and deep learning made a big impact too. It has been my favorite toy indeed. However, even though the words do have a correlation across a small segment of text, it is still a local coherence. On the other hand, topic models such as latent Dirichlet allocation (LDA) capture the distribution of words within a topic, and that of topics within a document etc. And it provides a representation of a new document in terms of a topic.

In my previous blog entry, I introduced Chris Moody’s LDA2Vec algorithm (see: his SlideShare). Unfortunately, not many papers or blogs have covered this new algorithm too much despite its potential. The API is not completely well documented yet, although you can see its example from its source code on its Github. In its documentation, it gives an example of deriving topics from an array of random numbers, in its lda2vec/lda2vec.py code:

from lda2vec import LDA2Vec
n_words = 10
n_docs = 15
n_hidden = 8
n_topics = 2
n_obs = 300
words = np.random.randint(n_words, size=(n_obs))
_, counts = np.unique(words, return_counts=True)
model = LDA2Vec(n_words, n_hidden, counts)
model.add_categorical_feature(n_docs, n_topics, name='document id')
model.finalize()
doc_ids = np.arange(n_obs) % n_docs
loss = model.fit_partial(words, 1.0, categorical_features=doc_ids)

A more comprehensive example is in examples/twenty_newsgroup/lda.py .

Besides, LDA2Vec, there are some related research work on topical word embeddings too. A group of Australian and American scientists studied about the topic modeling with pre-trained Word2Vec (or GloVe) before performing LDA. (See: their paper and code) On the other hand, another group with Chinese and Singaporean scientists performs LDA, and then trains a Word2Vec model. (See: their paper and code) And LDA2Vec concatenates the Word2Vec and LDA representation, like an early fusion.

No matter what, representations with LDA models (or related topic modeling such as correlated topic models (CTM)) can be useful even outside NLP. I have found it useful at some intermediate layer of calculation lately.

proj3_lda20structure
Graphical Representations of LDA model. Source: http://salsahpc.indiana.edu/b649proj/images/proj3_LDA%20structure.png

Continue reading “Local and Global: Words and Topics”

LDA2Vec: a hybrid of LDA and Word2Vec

Both LDA (latent Dirichlet allocation) and Word2Vec are two important algorithms in natural language processing (NLP). LDA is a widely used topic modeling algorithm, which seeks to find the topic distribution in a corpus, and the corresponding word distributions within each topic, with a prior Dirichlet distribution. Word2Vec is a vector-representation model, trained from RNN (recurrent neural network), to seek a continuous representation for words.

They are both very useful, but LDA deals with words and documents globally, and Word2Vec locally (depending on adjacent words in the training data). A LDA vector is so sparse that the users can interpret the topic easily, but it is inflexible. Word2Vec’s representation is not human-interpretable, but it is easy to use. In his slides, Chris Moody recently devises a topic modeling algorithm, called LDA2Vec, which is a hybrid of the two, to get the best out of the two algorithms.

Honestly, I never used this algorithm. I rarely talk about something I didn’t even try, but I want to raise awareness so that more people know about it when I come to use it. To me, it looks like concatenating two vectors with some hyperparameters, but  the source codes rejects this claim. It is a topic model algorithm.

There are not many blogs or papers talking about LDA2Vec yet. I am looking forward to learning more about it when there are more awareness.

lda2vec
Jupyter Notebook for LDA2Vec Demonstration [link]
Continue reading “LDA2Vec: a hybrid of LDA and Word2Vec”

Create a free website or blog at WordPress.com.

Up ↑