Much about the use of word-embedding models such as Word2Vec and GloVe have been covered. However, how to measure the similarity between phrases or documents? One natural choice is the cosine similarity, as I have toyed with in a previous post. However, it smoothed out the influence of each word. Two years ago, a group in Washington University in St. Louis proposed the Word Mover’s Distance (WMD) in a PMLR paper that captures the relations between words, not simply by distance, but also the “transportation” from one phrase to another conveyed by each word. This Word Mover’s Distance (WMD) can be seen as a special case of Earth Mover’s Distance (EMD), or Wasserstein distance, the one people talked about in Wasserstein GAN. This is better than bag-of-words (BOW) model in a way that the word vectors capture the semantic similarities between words.

# Word Mover’s Distance (WMD)

The formulation of WMD is beautiful. Consider the embedded word vectors $\mathbf{X} \in R^{d \times n}$, where $d$ is the dimension of the embeddings, and $n$ is the number of words. For each phrase, there is a normalized BOW vector $d \in R^n$, and $d_i = \frac{c_i}{\sum_i c_i}$, where $i$‘s denote the word tokens. The distance between words are the Euclidean distance of their embedded word vectors, denoted by $c(i, j) = || \mathbf{x}_i - \mathbf{x}_j ||_2$, where $i$ and $j$ denote word tokens. The document distance, which is WMD here, is defined by $\sum_{i, j} \mathbf{T}_{i j} c(i, j)$, where $\mathbf{T}$ is a $n \times n$ matrix. Each element $\mathbf{T}_{ij} \geq 0$ denote how nuch of word $i$ in the first document (denoted by $\mathbf{d}$) travels to word $j$ in the new document (denoted by $\mathbf{d}'$).

Then the problem becomes the minimization of the document distance, or the WMD, and is formulated as:

$\text{min}_{\mathbf{T} \geq 0} \sum_{i, j=1}^n \mathbf{T}_{ij} c(i, j)$,

given the constraints:

$\sum_{j=1}^n \mathbf{T}_{ij} = d_i$, and

$\sum_{i=1}^n \mathbf{T}_{ij} = d_j'$.

This is essentially a simplified case of the Earth Mover’s distance (EMD), or the Wasserstein distance. (See the review by Gibbs and Su.)

# Using PuLP

The WMD is essentially a linear optimization problem. There are many optimization packages on the market, and my stance is that, for those common ones, there are no packages that are superior than others. In my job, I happened to handle a missing data problem, in turn becoming a non-linear optimization problem with linear constraints, and I chose limSolve, after I shop around. But I actually like a lot of other packages too. For WMD problem, I first tried out cvxopt first, which should actually solve the exact same problem, but the indexing is hard to maintain. Because I am dealing with words, it is good to have a direct hash map, or a dictionary. I can use the Dictionary class in gensim. But I later found out I should use PuLP, as it allows indices with words as a hash map (dict in Python), and WMD is a linear programming problem, making PuLP is a perfect choice, considering code efficiency.

An example of using PuLP can be demonstrated by the British 1997 UG Exam, as in the first problem of this link, with the Jupyter Notebook demonstrating this.

# Implementation of WMD using PuLP

The demonstration can be found in the Jupyter Notebook.

from itertools import product
from collections import defaultdict

import numpy as np
from scipy.spatial.distance import euclidean
import pulp
import gensim

Then define the functions the gives the BOW document vectors:

def tokens_to_fracdict(tokens):
cntdict = defaultdict(lambda : 0)
for token in tokens:
cntdict[token] += 1
totalcnt = sum(cntdict.values())
return {token: float(cnt)/totalcnt for token, cnt in cntdict.items()}

Then implement the core calculation. Note that PuLP is actually a symbolic computing package. This function return a pulp.LpProblem class:

def word_mover_distance_probspec(first_sent_tokens, second_sent_tokens, wvmodel, lpFile=None):
all_tokens = list(set(first_sent_tokens+second_sent_tokens))
wordvecs = {token: wvmodel[token] for token in all_tokens}

first_sent_buckets = tokens_to_fracdict(first_sent_tokens)
second_sent_buckets = tokens_to_fracdict(second_sent_tokens)

T = pulp.LpVariable.dicts('T_matrix', list(product(all_tokens, all_tokens)), lowBound=0)

prob = pulp.LpProblem('WMD', sense=pulp.LpMinimize)
prob += pulp.lpSum([T[token1, token2]*euclidean(wordvecs[token1], wordvecs[token2])
for token1, token2 in product(all_tokens, all_tokens)])
for token2 in second_sent_buckets:
prob += pulp.lpSum([T[token1, token2] for token1 in first_sent_buckets])==second_sent_buckets[token2]
for token1 in first_sent_buckets:
prob += pulp.lpSum([T[token1, token2] for token2 in second_sent_buckets])==first_sent_buckets[token1]

if lpFile!=None:
prob.writeLP(lpFile)

prob.solve()

return prob

To extract the value, just run pulp.value(prob.objective)

We use Google Word2Vec. Refer the $\mathbf{T}$ matrices in the Jupyter Notebook. Running this by a few examples:

1. document1 = President, talk, Chicago
document2 = President, speech, Illinois
WMD = 2.88587622936
2. document1 = physician, assistant
document2 = doctor
WMD = 2.8760048151
3. document1 = physician, assistant
document2 = doctor, assistant
WMD = 1.00465738773
(compare with example 2!)
4. document1 = doctors, assistant
document2 = doctor, assistant
WMD = 1.02825379372
(compare with example 3!)
5. document1 = doctor, assistant
document2 = doctor, assistant
WMD = 0.0
(totally identical; compare with example 3!)

There are more examples in the notebook.

# Conclusion

WMD is a good metric comparing two documents or sentences, by capturing the semantic meanings of the words. It is more powerful than BOW model as it captures the meaning similarities; it is more powerful than the cosine distance between average word vectors, as the transfer of meaning using words from one document to another is considered. But it is not immune to the problem of misspelling.

This algorithm works well for short texts. However, when the documents become large, this formulation will be computationally expensive. The author actually suggested a few modifications, such as the removal of constraints, and word centroid distances.

Example codes can be found in my Github repository: stephenhky/PyWMD.

Embedding algorithms, especially word-embedding algorithms, have been one of the recurrent themes of this blog. Word2Vec has been mentioned in a few entries (see this); LDA2Vec has been covered (see this); the mathematical principle of GloVe has been elaborated (see this); I haven’t even covered Facebook’s fasttext; and I have not explained the widely used t-SNE and Kohonen’s map (self-organizing map, SOM).

I have also described the algorithm of Sammon Embedding, (see this) which attempts to capture the likeliness of pairwise Euclidean distances, and I implemented it using Theano. This blog entry is about its implementation in Tensorflow as a demonstration.

Let’s recall the formalism of Sammon Embedding, as outlined in the previous entry:

Assume there are high dimensional data described by $d$-dimensional vectors, $X_i$ where $i=1, 2, \ldots, N$. And they will be mapped into vectors $Y_i$, with dimensions 2 or 3. Denote the distances to be $d_{ij}^{*} = \sqrt{| X_i - X_j|^2}$ and $d_{ij} = \sqrt{| Y_i - Y_j|^2}$. In this problem, $Y_i$ are the variables to be learned. The cost function to minimize is

$E = \frac{1}{c} \sum_{i,

where $c = \sum_{i.

Unlike in previous entry and original paper, I am going to optimize it using first-order gradient optimizer. If you are not familiar with Tensorflow, take a look at some online articles, for example, “Tensorflow demystified.” This demonstration can be found in this Jupyter Notebook in Github.

First of all, import all the libraries required:

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf


Like previously, we want to use the points clustered around at the four nodes of a tetrahedron as an illustration, which is expected to give equidistant clusters. We sample points around them, as shown:

tetrahedron_points = [np.array([0., 0., 0.]), np.array([1., 0., 0.]), np.array([np.cos(np.pi/3), np.sin(np.pi/3), 0.]), np.array([0.5, 0.5/np.sqrt(3), np.sqrt(2./3.)])]

sampled_points = np.concatenate([np.random.multivariate_normal(point, np.eye(3)*0.0001, 10) for point in tetrahedron_points])

init_points = np.concatenate([np.random.multivariate_normal(point[:2], np.eye(2)*0.0001, 10) for point in tetrahedron_points])


Retrieve the number of points, N, and the resulting dimension, d:

N = sampled_points.shape[0]
d = sampled_points.shape[1]


One of the most challenging technical difficulties is to calculate the pairwise distance. Inspired by this StackOverflow thread and Travis Hoppe’s entry on Thomson’s problem, we know it can be computed. Assuming Einstein’s convention of summation over repeated indices, given vectors $a_{ik}$, the distance matrix is:

$D_{ij} = (a_{ik}-a_{jk}) (a_{ik} - a_{jk})^T = a_{ik} a_{ik} + a_{jk} a_{jk} - 2 a_{ik} a_{jk}$,

where the first and last terms are simply the norms of the vectors. After computing the matrix, we will flatten it to vectors, for technical reasons omitted to avoid gradient overflow:

X = tf.placeholder('float')
Xshape = tf.shape(X)

sqX = tf.reduce_sum(X*X, 1)
sqX = tf.reshape(sqX, [-1, 1])
sqDX = sqX - 2*tf.matmul(X, tf.transpose(X)) + tf.transpose(sqX)
sqDXarray = tf.stack([sqDX[i, j] for i in range(N) for j in range(i+1, N)])
DXarray = tf.sqrt(sqDXarray)

Y = tf.Variable(init_points, dtype='float')
sqY = tf.reduce_sum(Y*Y, 1)
sqY = tf.reshape(sqY, [-1, 1])
sqDY = sqY - 2*tf.matmul(Y, tf.transpose(Y)) + tf.transpose(sqY)
sqDYarray = tf.stack([sqDY[i, j] for i in range(N) for j in range(i+1, N)])
DYarray = tf.sqrt(sqDYarray)


And DXarray and DYarray are the vectorized pairwise distances. Then we defined the cost function according to the definition:

Z = tf.reduce_sum(DXarray)*0.5
numerator = tf.reduce_sum(tf.divide(tf.square(DXarray-DYarray), DXarray))*0.5
cost = tf.divide(numerator, Z)


update_rule = tf.assign(Y, Y-0.01*grad_cost/lapl_cost)
init = tf.global_variables_initializer()


The last line initializes all variables in the Tensorflow session when it is run. Then start a Tensorflow session, and initialize all variables globally:

sess = tf.Session()
sess.run(init)


Then run the algorithm:

nbsteps = 1000
c = sess.run(cost, feed_dict={X: sampled_points})
print "epoch: ", -1, " cost = ", c
for i in range(nbsteps):
sess.run(train, feed_dict={X: sampled_points})
c = sess.run(cost, feed_dict={X: sampled_points})
print "epoch: ", i, " cost =


Then extract the points and close the Tensorflow session:

calculated_Y = sess.run(Y, feed_dict={X: sampled_points})
sess.close()


Plot it using matplotlib:

embed1, embed2 = calculated_Y.transpose()
plt.plot(embed1, embed2, 'ro')


This gives, as expected,

This code for Sammon Embedding has been incorporated into the Python package mogu, which is a collection of numerical routines. You can install it, and call:

from mogu.embed import sammon_embedding
calculated_Y = sammon_embedding(sampled_points, init_points)


The topic of word embedding algorithms has been one of the interests of this blog, as in this entry, with Word2Vec [Mikilov et. al. 2013] as one of the main examples. It is a great tool for text mining, (for example, see [Czerny 2015],) as it reduces the dimensions needed (compared to bag-of-words model). As an algorithm borrowed from computer vision, a lot of these algorithms use deep learning methods to train the model, while it was not exactly sure why it works. Despite that, there are many articles talking about how to train the model. [Goldberg & Levy 2014, Rong 2014 etc.] Addition and subtraction of the word vectors show amazing relationships that carry semantic values, as I have shown in my previous blog entry. [Ho 2015]

However, Tomas Mikolov is no longer working in Google, making the development of this algorithm discontinued. As a follow-up of their work, Stanford NLP group later proposed a model, called GloVe (Global Vectors), that embeds word vectors using probabilistic methods. [Pennington, Socher & Manning 2014] It can be implemented in the package glove-python in python, and text2vec in R (or see their CRAN post).  Their paper is neatly written, and a highly recommended read.

To explain the theory of GloVe, we start with some basic probabilistic picture in basic natural language processing (NLP). We suppose the relation between the words occur in certain text windows within a corpus, but the details are not important here. Assume that $i$, $j$, and $k$ are three words, and the conditional probability $P_{ik}$ is defined as

$P_{ij} = P(j | i) = \frac{X_{ij}}{X_i}$,

where $X$‘s are the counts, and similarly for $P_{jk}$. And we are interested in the following ratio:

$F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$.

The tilde means “context,” but we will later assume it is also a word. Citing the example from their paper, take $i$ as ice, and $j$ as steam. if $k$ is solid, then the ratio is expected to be large; or if $k$ is gas, then it is expected to be low. But if $k$ is water, which are related to both, or fashion, which is related to none, then the ratio is expected to be approximately 1.

And the addition and subtraction in Word2Vec is similar to this. We want the ratio to be like the subtraction as in Word2Vec (and multiplication as in addition), then we should modify the function $F$ such that,

$F(w_i - w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$.

On the other hand, the input arguments of $F$ are vectors, but the output is a scalar. We avoid the issue by making the input argument as a dot product,

$F( (w_i - w_j)^T \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$.

In NLP, the word-word co-occurrence matrices are symmetric, and our function $F$ should also be invariant under switching the labeling. We first require $F$ is be a homomorphism,

$F((w_i - w_j)^T \tilde{w}_k) = \frac{F(w_i^T \tilde{w}_k) }{ F(w_j^T \tilde{w}_k)}$,

where we define,

$F(w_i^T \tilde{w}_k) = P_{ik} = \frac{X_{ik}}{X_i}$.

It is clear that $F$ is an exponential function, but to ensure symmetry, we further define:

$w_i^T \tilde{w}_k + b_i + \tilde{b}_k = \log X_{ik}$.

As a result of this equation, the authors defined the following cost function to optimize for GloVe model:

$J = \sum_{i, j=1}^V f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ik} \right)^2$,

where $w_j$, $\tilde{w}_j$, $b_i$, and $\tilde{b}_j$ are parameters to learn. $f(x)$ is a weighting function. (Refer the details to the paper.) [Pennington, Socher & Manning 2014]

As Radim Řehůřek said in his blog entry, [Řehůřek 2014] it is a neat paper, but their evaluation is crappy.

This theory explained why certain similar relations can be achieved, such as Paris – France is roughly equal to Beijing – China, as both can be transformed to the ratio in the definition of $F$ above.

It is a neat paper, as it employs optimization theory and probability theory, without any dark box deep learning.

Justin Trudeau, the Canadian Prime Minister, rocked the world recently by talking about quantum computers in front of reporters at Perimeter Institute…

No fault can be found in his answer, although his answer cannot be more layman. A few physicists were challenged to give a layman introduction about quantum computers, and they are equally well. (See MacLean’s article.)

The research about quantum computers has been for decades, and it is not until recently some commercial products are available in the market.

### Qubits and Quantum Entanglement

Trudeau is right that in quantum computers, a bit, called a qubit, is more than just 0 and 1, but a complicated combination of them. It can be characterized by a quantum state $|\psi \rangle = \alpha |0\rangle + \beta |1\rangle$, where $\alpha$ and $\beta$ are complex numbers. Such a state can be geometrically demonstrated as a Bloch sphere, where the possibility of a state is infinitely more than simply up and down.

Besides qubits, quantum entanglement makes quantum computing unique, because with entanglement it allows algorithms to be calculated in parallel at the same time by its nature. Quantum entanglement concerns the correlation between two qubits in a way that the quantum state of the two qubits cannot be separated into two independent quantum states, each for one qubit. There are many ways to quantify the entanglement of a bipartite state (consisting of two qubits), such as entanglement entropy (similar to Shannon entropy, see this, where the probability is on the partial density matrix calculated by Schmidt decomposition), and negativity. [Peres, 1996]

The qubits and entanglement make quantum algorithms possible. It is a big research field, bordering physics and computer science. See “Quantum Algorithm Zoo” cataloged by NIST for them. Shor’s algorithm and Grover’s algorithm are some of the famous ones.

There are two types of quantum computers in general, namely, the universal quantum computer (UQC), and non-universal adiabatic quantum computer (AQC).

### Universal Quantum Computer (UQC)

UQC, the quantum computers that exploit quantum properties to perform computations that are classically infeasible, has been studied for decades in academia. Not only the quantum algorithms, the realization of hardware has always been one of the hot topics of scientific research. Although much time and money has been invested, it is still far from industrial and commercial use.

The first UQC on earth was made in MIT in 1998, by Isaac Chuang and his colleagues, using NMR techniques. [Chuang et. al., 1998] There are also other experimental realization of qubits, such as optical cavity, Josephson junction, quantum dots, nanomechanical resonator etc. Representations depend on the physics systems.

If this is successfully realized in industrialized use, it speeds up a lot of computations by fully applying the quantum algorithms.

AQC does not fully exploit the quantum properties. While there are qubits, the algorithms may not be fully applied. Instead, it partially exploits quantum properties to accelerate current algorithms. It takes advantage of adiabatic theorem, which states that a slow perturbation to a quantum system remains the instantaneous eigenstate. This means, if the quantum system starts at a ground state, it stays at the ground state after the slow perturbation due to some external field. This process possibly involves quantum tunneling. AQC is actually a type of quantum annealing.

There are already some commercial applications of AQC. D-Wave Systems made their first AQC in the world, and a number of big companies such as Lockheed Martin and Google purchased them. D-Wave systems are based on Josephson junctions. It is programmable, and most suited to optimization problems. One can write their own optimization problem in terms of the Ising model:

$H = \sum_i h_i x_i + \sum_{\langle i, j \rangle} J_{ij} x_i x_j$,

The system will start at an initial system with its ground state, and slowly changing the interactions to this Hamiltonian. The state will be the ground state, or the optimized solution, of the problem.

We can immediately see that, in the era of big data, this is very useful as machine learning problems are optimization problems. For example, QxBranch, a startup based on Washington, DC, developed tools using D-Wave Systems to perform some machine learning algorithms.

### Topological Quantum Computer

In some sense, an AQC is only a partial quantum computer, which does not employ full quantum properties. However, a UQC is hard to achieve because of quantum decoherence. A state can exist for only a very short time. There exist ways to avoid this, for example, optimal dynamic decoupling. [Fu et. al., 2009] Another fascinating idea to overcome this is topological quantum computing.

A topological quantum computer is a theoretical quantum computer that is based on anyons, a two-dimensional quasi-particles with world lines crossing over in three-dimensional world. It was first introduced in 1997 by Alexei Kitaev. The idea is to exploit the ideas of topology in anyons to preserve the state, because it is extremely costly to change the topology of a system. In 2005, Das Sarma, Freedman, and Nayak proposed using fractional quantum Hall system to achieve this. [Das Sarma et. al., 2006] Some ideas such as Majorana fermions have been proposed too. All in all, topological quantum computing is still a theoretical idea.

Entropy is one of the most fascinating ideas in the history of mathematical sciences.

In Phenomenological Thermodynamics…

Entropy was introduced into thermodynamics in the 19th century. Like the free energies, it describes the state of a thermodynamic system. At the beginning, entropy is merely phenomenological. The physicists found it useful to incorporate the description using entropy in the second law of thermodynamics with clarity and simplicity, instead of describing it as convoluted heat flow (which is what it is originally about) among macroscopic systems (say, the heat flow from the hotter pot of water to the air of the room). It did not carry any statistical meaning at all until 1870s.

In Statistical Physics…

Ludwig Boltzmann (1844-1906)

The statistical meaning of entropy was developed by Ludwig Boltzmann, a pioneer of statistical physics, who studied the connection of the macroscopic thermodynamic behavior to the microscopic components of the system. For example, he described the temperature to be the average of the fluctuating kinetic energy of the particles. And he formulated the entropy to be

$S = - k_B \sum_i p_i \log p_i$,

where i is the label for each microstate, and $k_B$ is the Boltzmann’s constant. And in a closed system, the total entropy never decreases.

Information Theory and Statistical Physics United

In statistical physics, Boltzmann’s assumption of equal a priori equilibrium properties is an important assumption. However, in 1957, E. T. Jaynes published a paper relating information theory and statistical physics in Physical Review indicating that merely the principle of maximum entropy is sufficient to describe equilibrium statistical system. [Jaynes 1957] In statistical physics, we are aware that systems can be described as canonical ensemble, or a softmax function (normalized exponential), i.e., $p_i \propto \exp(-\beta E_i)$. This can be easily derived by the principle of maximum entropy and the conservation of energy. Or mathematically, the probabilities for all states i with energies $E_i$ can be obtained by maximizing the entropy

$S = -\sum_i p_i \log p_i$,

under the constraints

$\sum_i p_i = 1$, and
$\sum_i p_i E_i = E$,

where E is a constant. The softmax distribution can be obtained by this simple optimization problem, using basic variational calculus (Euler-Lagrange equation) and Lagrange’s multipliers.

The principle of maximum entropy can be found in statistics too. For example, the form of Gaussian distribution can be obtained by maximizing the entropy

$S = - \int dx \cdot p(x) \log p(x)$,

with the knowledge of the mean $\mu$ and the variance $\sigma^2$, or mathematically speaking, under the constraints,

$\int dx \cdot p(x) = 1$,
$\int dx \cdot x p(x) = \mu$, and
$\int dx \cdot (x-\mu)^2 p(x) = \sigma^2$.

In any statistical systems, the probability distributions can be computed with the principle of maximum entropy, as Jaynes put it [Jaynes 1957]

It is the least biased estimate possible on the given information; i.e., it is maximally noncommittal with regard to missing information.

In statistical physics, entropy is roughly a measure how “chaotic” a system is. In information theory, entropy is a measure how surprising the information is. The smaller the entropy is, the more surprising the information is. And it assumes no additional information. Without constraints other than the normalization, the probability distribution is that all $p_i$‘s are equal, which is equivalent to the least surprise. Lê Nguyên Hoang, a scientist at Massachusetts Institute of Technology, wrote a good blog post about the meaning of entropy in information theory. [Hoang 2013] In information theory, the entropy is given by

$S = -\sum_i p_i \log_2 p_i$,

which is different from the thermodynamic entropy by the constant $k_B$ and the coefficient $\log 2$. The entropies in information theory and statistical physics are equivalent.

Entropy in Natural Language Processing (NLP)

The principle of maximum entropy assumes nothing other than the given information to compute the most optimized probability distribution, which makes it a desirable algorithm in machine learning. It can be regarded as a supervised learning algorithm, with the features being ${p, c}$, where p is the property calculated, and c is the class. The probability for ${p, c}$ is proportional to $\exp(- \alpha \text{\#}({p, c}))$, where $\alpha$ is the coefficient to be found during training. There are some technical note to compute all these coefficients, which essentially involves solving a system of algebraic equations numerically using techniques such as generalized iterative scaling (GIS).

Does it really assume no additional information? No. The way you construct the features is how you add information. But once the features are defined, the calculation depends on the training data only.

The classifier based on maximum entropy has found its application in part-of-speech (POS) tagging, machine translation (ML), speech recognition, and text mining. A good review was written by Berger and Della Pietra’s. [Berger, Della Pietra, Della Pietra 1996] A lot of open-source softwares provide maximum entropy classifiers, such as Python NLTK and Apache OpenNLP.

In Quantum Computation…

One last word, entropy is used to describe quantum entanglement. A composite bipartite quantum system is said to be entangled if its subsystems must be described in a mixed state, i.e., it must be statistical if one of the subsystems is only considered. Then the entanglement entropy is given by [Nielssen, Chuang 2011]

$S = -\sum_i p_i \log p_i$,

which is essentially the same formula. The more entangled the system is, the larger the entanglement entropy. However, composite quantum systems tend to decrease their entropy over time though.