Word Embedding Algorithms

Embedding has been hot in recent years, partly due to the success of Word2Vec (see the demo in my previous entry), although the idea has been around in academia for more than a decade. The idea is to transform a vector of integers into continuous, or embedded, representations. Keras, a Python package that implements neural network models (including ANN, RNN, CNN, etc.) by wrapping Theano or TensorFlow, provides an Embedding layer for this, as shown in the example below (which maps each of 200 possible integer features to a continuous vector of dimension 10):

from keras.layers import Embedding
from keras.models import Sequential

# define and compile the embedding model
model = Sequential()
model.add(Embedding(200, 10, input_length=1))
model.compile('rmsprop', 'mse')  # optimizer: rmsprop; loss function: mean-squared error

We can then convert any feature index from 0 to 199 into a vector of dimension 10, as shown below:

import numpy as np

model.predict(np.array([10, 90, 151]))

It outputs:

array([[[ 0.02915354,  0.03084954, -0.04160764, -0.01752155, -0.00056815,
         -0.02512387, -0.02073313, -0.01154278, -0.00389587, -0.04596512]],

       [[ 0.02981793, -0.02618774,  0.04137352, -0.04249889,  0.00456919,
          0.04393572,  0.04139435,  0.04415271,  0.02636364, -0.04997493]],

       [[ 0.00947296, -0.01643104, -0.03241419, -0.01145032,  0.03437041,
          0.00386361, -0.03124221, -0.03837727, -0.04804075, -0.01442516]]])
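Conceptually, an embedding layer behaves like a trainable lookup table: each integer index selects one row of a weight matrix. The NumPy sketch below illustrates only that idea. The names `embedding_matrix` and `embed` are hypothetical and not part of Keras, and the random values merely stand in for untrained weights.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embedding_dim = 200, 10  # same sizes as Embedding(200, 10)

# hypothetical stand-in for the layer's weights: one row per integer index
embedding_matrix = rng.uniform(-0.05, 0.05, size=(vocab_size, embedding_dim))


def embed(indices):
    """Return the embedding row for each integer index."""
    indices = np.asarray(indices)
    return embedding_matrix[indices]


vectors = embed([10, 90, 151])
print(vectors.shape)  # (3, 10)
```

Training an embedding layer then amounts to adjusting the rows of this matrix by gradient descent, just like any other weight in the network.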

Of course, one must not omit a similar algorithm called GloVe, developed by the Stanford NLP Group. Their code has been wrapped in both Python (a package called glove) and R (a library called text2vec).

Besides Word2Vec, there are other word embedding algorithms that try to complement Word2Vec, although many of them are more computationally costly. In a previous entry, I introduced LDA2Vec, an algorithm that combines the locality of words and their global distribution in the corpus. In fact, word embedding algorithms with similar ideas have also been invented by other scientists, as I introduced in another entry.

However, new word embedding algorithms keep coming out. Since most English words carry more than a single sense, different senses of a word might be best represented by different embedded vectors. Incorporating word sense disambiguation, a method called sense2vec has been introduced by Trask, Michalak, and Liu (arXiv:1511.06388). Matthew Honnibal wrote a nice blog entry demonstrating its use.

There is also other related work, such as wang2vec, which is more sensitive to word order.

Big Bang Theory (Season 2, Episode 5): Euclid Alternative

DMV staff: Application?
Sheldon: I’m actually more of a theorist.

Note: feature image taken from Big Bang Theory (CBS).


Generative-Discriminative Pairs

To briefly state the difference between generative models and discriminative models, I would say a generative model concerns the specification of the joint probability p(\mathbf{x}, y), and a discriminative model that of the conditional probability p(y | \mathbf{x}). Andrew Ng and Michael Jordan wrote a paper on this with logistic regression and naive Bayes as an example. [Ng, Jordan, 2001]

Assume that we have a set of features f_k(y, \mathbf{x}), indexed by k. Then a naive Bayes model, as a generative model, is given by

p(y, \mathbf{x}) = \frac{\exp \left[ \sum_k \lambda_k f_k (y, \mathbf{x}) \right]}{\sum_{\tilde{\mathbf{x}}, \tilde{y}} \exp \left[ \sum_k \lambda_k f_k (\tilde{y}, \tilde{\mathbf{x}}) \right]}.

A naive Bayes model is trained by maximizing the joint likelihood of the training data.

Logistic regression, along with its discrete form, the maximum entropy classifier, is a discriminative model. But Ng and Jordan argued that the two are theoretically related, as the probability of y given \mathbf{x} can be derived from the naive Bayes model above, i.e.,

p( y | \mathbf{x}) =\frac{\exp \left[ \sum_k \lambda_k f_k (y, \mathbf{x}) \right]}{\sum_{\tilde{y}} \exp \left[ \sum_k \lambda_k f_k (\tilde{y}, \mathbf{x}) \right]}.
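The step from the joint to the conditional form can be checked numerically: dividing p(y, \mathbf{x}) by its sum over y cancels the global normalization, leaving a normalization over y only. The sketch below assumes a tiny discrete problem with made-up sizes, and the array `scores` is a hypothetical stand-in for \sum_k \lambda_k f_k(y, \mathbf{x}).

```python
import numpy as np

rng = np.random.default_rng(0)

n_y, n_x = 3, 4  # hypothetical: 3 labels, 4 possible discrete inputs

# scores[y, x] stands for sum_k lambda_k * f_k(y, x); random values for illustration
scores = rng.normal(size=(n_y, n_x))

# generative form: normalize over all (y, x) pairs
joint = np.exp(scores) / np.exp(scores).sum()

# conditional obtained from the joint: p(y | x) = p(y, x) / sum_y p(y, x)
cond_from_joint = joint / joint.sum(axis=0, keepdims=True)

# discriminative form: normalize over y only, separately for each x
cond_direct = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)

print(np.allclose(cond_from_joint, cond_direct))  # True
```

The two models share the same functional form for p(y | \mathbf{x}); what differs is the objective used to fit the weights \lambda_k.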

Logistic regression is usually trained by maximizing the conditional likelihood, i.e., minimizing the cross-entropy (log) loss rather than a mean-squared difference. Maximum entropy classifiers are fit with the same objective, which is equivalent to maximizing the entropy subject to constraints on the feature expectations.

In application, there is one more difference: the class prior p(y) plays an important role in classification with the generative model, but is absent from the discriminative model.

Another example of a generative-discriminative pair is the hidden Markov model (HMM) and the conditional random field (CRF). [Sutton, McCallum, 2010]
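To illustrate the role of p(y) in a generative classifier, here is a minimal sketch of Bayes' rule for a single observation and two classes. The likelihood and prior values are made up for illustration, and `posterior` is a hypothetical helper, not taken from any library.

```python
import numpy as np


def posterior(likelihood, prior):
    """Bayes' rule: p(y | x) is proportional to p(x | y) * p(y)."""
    unnormalized = np.asarray(likelihood) * np.asarray(prior)
    return unnormalized / unnormalized.sum()


likelihood = [0.6, 0.4]  # hypothetical p(x | y) for y = 0 and y = 1

uniform = posterior(likelihood, [0.5, 0.5])  # classes equally likely a priori
skewed = posterior(likelihood, [0.1, 0.9])   # class 1 far more common a priori

print(uniform.argmax(), skewed.argmax())  # 0 1
```

With the same likelihood, changing the prior alone flips the predicted class, which is why the generative model needs an explicit estimate of p(y).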
