Probabilistic Theory of Word Embeddings: GloVe

Word embedding algorithms have been a recurring interest of this blog, as in this entry, with Word2Vec [Mikolov et al. 2013] as one of the main examples. It is a great tool for text mining (for example, see [Czerny 2015]), as it greatly reduces the number of dimensions needed compared to the bag-of-words model. As algorithms borrowed from computer vision, many of these models are trained with deep learning methods, even though it was never entirely clear why they work. Despite that, there are many articles discussing how to train such models. [Goldberg & Levy 2014, Rong 2014, etc.] Addition and subtraction of the word vectors reveal amazing relationships that carry semantic value, as I have shown in my previous blog entry. [Ho 2015]

However, Tomas Mikolov no longer works at Google, and the development of that algorithm has been discontinued. As a follow-up to this line of work, the Stanford NLP group later proposed a model, called GloVe (Global Vectors), that embeds word vectors using probabilistic methods. [Pennington, Socher & Manning 2014] Implementations are available in the package glove-python in Python, and text2vec in R (or see its CRAN page). Their paper is neatly written and a highly recommended read.
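For a quick start in Python, the workflow with glove-python looks roughly like the sketch below. Note that the class and argument names (Corpus, Glove, window, no_components, learning_rate, epochs, no_threads) are quoted from memory and may differ between package versions, so check the package documentation before relying on them.

```python
# A rough sketch of the glove-python workflow; class and argument names are
# assumed from memory of the package's documentation and may differ by version.
from glove import Corpus, Glove

# a toy corpus: an iterable of tokenized sentences
sentences = [["ice", "is", "a", "solid"],
             ["steam", "is", "a", "gas"]]

corpus = Corpus()
corpus.fit(sentences, window=10)         # build the word-word co-occurrence matrix

model = Glove(no_components=50, learning_rate=0.05)
model.fit(corpus.matrix, epochs=30, no_threads=2)
model.add_dictionary(corpus.dictionary)  # attach the word-to-index mapping

print(model.most_similar("ice"))
```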

To explain the theory of GloVe, we start with a basic probabilistic picture from natural language processing (NLP). We suppose the relations between words occur within certain text windows of a corpus, but the details are not important here. Assume that i, j, and k are three words, and define the conditional probability P_{ij} as

P_{ij} = P(j | i) = \frac{X_{ij}}{X_i},

where the X's are the co-occurrence counts, and similarly for P_{ik} and P_{jk}. We are interested in the following ratio:

F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}.

The tilde means “context,” but we will later assume it is also a word. Citing the example from their paper, take i as ice and j as steam. If k is solid, the ratio is expected to be large; if k is gas, it is expected to be small. But if k is water, which is related to both, or fashion, which is related to neither, then the ratio is expected to be approximately 1.
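To make the ratio concrete, here is a small numpy sketch; the co-occurrence counts below are invented purely for illustration and do not come from any corpus.

```python
import numpy as np

# Hypothetical word-word co-occurrence counts X[i, k]
# (row: target word i, column: context word k); the numbers are made up.
words = ["ice", "steam", "solid", "gas", "water", "fashion"]
idx = {w: n for n, w in enumerate(words)}

X = np.zeros((len(words), len(words)))
X[idx["ice"],   [idx["solid"], idx["gas"], idx["water"], idx["fashion"]]] = [120,   6, 300, 2]
X[idx["steam"], [idx["solid"], idx["gas"], idx["water"], idx["fashion"]]] = [  4, 150, 280, 2]

def P(i, k):
    """Conditional probability P(k | i) = X_ik / X_i, with X_i = sum_k X_ik."""
    return X[idx[i], idx[k]] / X[idx[i]].sum()

for k in ["solid", "gas", "water", "fashion"]:
    print(f"k = {k:8s}  P(k|ice) / P(k|steam) = {P('ice', k) / P('steam', k):.2f}")
# Expect: large for 'solid', small for 'gas', close to 1 for 'water' and 'fashion'.
```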

The addition and subtraction of word vectors in Word2Vec behave similarly. Since we want this ratio to correspond to a subtraction of word vectors (and a product of probabilities to correspond to an addition), we modify the function F such that,

F(w_i - w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}.

On the other hand, the input arguments of F are vectors, but the output is a scalar. We avoid this mismatch by making the input argument a dot product,

F( (w_i - w_j)^T \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}.

In NLP, word-word co-occurrence matrices are symmetric, so our function F should also be invariant under switching the labels. We first require F to be a homomorphism from the additive group (\mathbb{R}, +) to the multiplicative group (\mathbb{R}_{>0}, \times),

F((w_i - w_j)^T \tilde{w}_k) = \frac{F(w_i^T \tilde{w}_k) }{ F(w_j^T \tilde{w}_k)},

where we define,

F(w_i^T \tilde{w}_k) = P_{ik} = \frac{X_{ik}}{X_i}.

It is clear that F is an exponential function, but to ensure symmetry, we further define:

w_i^T \tilde{w}_k + b_i + \tilde{b}_k = \log X_{ik}.
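To spell out the intermediate step (as in the paper): taking F = \exp and the logarithm of F(w_i^T \tilde{w}_k) = P_{ik} gives

w_i^T \tilde{w}_k = \log P_{ik} = \log X_{ik} - \log X_i.

Since \log X_i does not depend on k, it can be absorbed into a bias b_i, and a second bias \tilde{b}_k is added to restore the symmetry between i and k, which yields the equation above.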

Based on this equation, the authors defined the following cost function to optimize for the GloVe model:

J = \sum_{i, j=1}^V f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,

where w_i, \tilde{w}_j, b_i, and \tilde{b}_j are the parameters to learn, and f(x) is a weighting function. (Refer to the paper for the details.) [Pennington, Socher & Manning 2014]
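As a rough illustration of this cost function (not the reference implementation, which uses AdaGrad), here is a minimal numpy sketch that minimizes J by stochastic gradient descent over the nonzero entries of X. The weighting function f(x) = (x / x_{max})^{\alpha} with x_{max} = 100 and \alpha = 3/4 follows the paper; the remaining hyperparameters are arbitrary choices for a toy example.

```python
import numpy as np

def glove_train(X, dim=50, epochs=50, lr=0.05, x_max=100.0, alpha=0.75, seed=0):
    """Minimal GloVe sketch: minimize
        J = sum_{i,j} f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2
    with plain SGD over the nonzero entries of the co-occurrence matrix X.
    """
    rng = np.random.default_rng(seed)
    V = X.shape[0]
    W  = rng.normal(scale=0.1, size=(V, dim))   # word vectors w_i
    Wt = rng.normal(scale=0.1, size=(V, dim))   # context vectors w~_j
    b  = np.zeros(V)                            # biases b_i
    bt = np.zeros(V)                            # context biases b~_j

    def f(x):
        # weighting function from the paper, capped at 1
        return min((x / x_max) ** alpha, 1.0)

    pairs = np.array(np.nonzero(X)).T           # (i, j) index pairs with X_ij > 0
    for _ in range(epochs):
        rng.shuffle(pairs)                      # shuffle the pairs in place
        for i, j in pairs:
            diff = W[i] @ Wt[j] + b[i] + bt[j] - np.log(X[i, j])
            g = 2.0 * f(X[i, j]) * diff         # common factor of all gradients
            W[i], Wt[j] = W[i] - lr * g * Wt[j], Wt[j] - lr * g * W[i]
            b[i]  -= lr * g
            bt[j] -= lr * g
    # the paper reports results with the sum of the word and context vectors
    return W + Wt
```

With the toy co-occurrence matrix from the earlier sketch, one could simply call glove_train(X); meaningful embeddings, of course, require counts from a real corpus.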

As Radim Řehůřek said in his blog entry [Řehůřek 2014], it is a neat paper, but their evaluation is crappy.

This theory explains why certain relational similarities hold, such as Paris – France being roughly equal to Beijing – China, as both differences can be transformed into the ratio in the definition of F above.
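To sketch the reasoning with the log-bilinear equation above (glossing over the bias terms),

(w_{\text{Paris}} - w_{\text{France}})^T \tilde{w}_k \approx \log \frac{P(k | \text{Paris})}{P(k | \text{France})},

and similarly for Beijing and China. If the two probability ratios agree for most context words k (contexts typical of a capital city relative to its country), then the vector differences w_{\text{Paris}} - w_{\text{France}} and w_{\text{Beijing}} - w_{\text{China}} must be approximately equal as well.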

It is a neat paper, as it employs optimization theory and probability theory, without any black-box deep learning.

9 thoughts on “Probabilistic Theory of Word Embeddings: GloVe”


  1. Hi, thank you for this nice article!
    You conclude by saying, “It is a neat paper, as it employs optimization theory and probability theory, without any black-box deep learning.” Which algorithm do you refer to? Because I know Word2vec is said to be shallow, not deep.
    Also, would you say that GloVe is more probabilistic than Word2vec? If yes, why? How does it help GloVe perform better?
    Thank you so much!


    1. For the conclusion, I refer to the GloVe paper as the neat paper. If you go through the paper, there are no neural networks involved, and the cost function can be described using probability theory. The explanation is clear.

      Yes, Word2Vec is a shallow network.

      If by probabilistic you mean that the algorithm is stochastic, I would agree that the gradient optimization is quite likely a stochastic one; if you mean that it is a theory derived from Bayesian probability, I would agree too. But from what I know, it performs as well as Word2Vec.


  2. We first require F to be a homomorphism,

    F((w_i - w_j)^T \tilde{w}_k) = \frac{F(w_i^T \tilde{w}_k) }{ F(w_j^T \tilde{w}_k)} (*)

    Hi, I don’t understand this one. Can you explain in more detail why we have to require F to be a homomorphism between the two groups (R, +) and (R>0, ×)? Also, why from this equation:

    F( (w_i - w_j)^T \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}

    can we derive the above equation (*)?

    Thank you so much



    1. Thank you for your question. Because we want the function to be invariant under relabeling, we want it to be a homomorphism. Then why should it be a multiplication instead of an addition? I think it is because of the context (k in the equation), which we want to cancel out when relabeling i and j.

