Much about the use of word-embedding models such as Word2Vec and GloVe have been covered. However, how to measure the similarity between phrases or documents? One natural choice is the cosine similarity, as I have toyed with in a previous post. However, it smoothed out the influence of each word. Two years ago, a group in Washington University in St. Louis proposed the Word Mover’s Distance (WMD) in a PMLR paper that captures the relations between words, not simply by distance, but also the “transportation” from one phrase to another conveyed by each word. This Word Mover’s Distance (WMD) can be seen as a special case of Earth Mover’s Distance (EMD), or Wasserstein distance, the one people talked about in Wasserstein GAN. This is better than bag-of-words (BOW) model in a way that the word vectors capture the semantic similarities between words.

# Word Mover’s Distance (WMD)

The formulation of WMD is beautiful. Consider the embedded word vectors $\mathbf{X} \in R^{d \times n}$, where $d$ is the dimension of the embeddings, and $n$ is the number of words. For each phrase, there is a normalized BOW vector $d \in R^n$, and $d_i = \frac{c_i}{\sum_i c_i}$, where $i$‘s denote the word tokens. The distance between words are the Euclidean distance of their embedded word vectors, denoted by $c(i, j) = || \mathbf{x}_i - \mathbf{x}_j ||_2$, where $i$ and $j$ denote word tokens. The document distance, which is WMD here, is defined by $\sum_{i, j} \mathbf{T}_{i j} c(i, j)$, where $\mathbf{T}$ is a $n \times n$ matrix. Each element $\mathbf{T}_{ij} \geq 0$ denote how nuch of word $i$ in the first document (denoted by $\mathbf{d}$) travels to word $j$ in the new document (denoted by $\mathbf{d}'$).

Then the problem becomes the minimization of the document distance, or the WMD, and is formulated as:

$\text{min}_{\mathbf{T} \geq 0} \sum_{i, j=1}^n \mathbf{T}_{ij} c(i, j)$,

given the constraints:

$\sum_{j=1}^n \mathbf{T}_{ij} = d_i$, and

$\sum_{i=1}^n \mathbf{T}_{ij} = d_j'$.

This is essentially a simplified case of the Earth Mover’s distance (EMD), or the Wasserstein distance. (See the review by Gibbs and Su.)

# Using PuLP

The WMD is essentially a linear optimization problem. There are many optimization packages on the market, and my stance is that, for those common ones, there are no packages that are superior than others. In my job, I happened to handle a missing data problem, in turn becoming a non-linear optimization problem with linear constraints, and I chose limSolve, after I shop around. But I actually like a lot of other packages too. For WMD problem, I first tried out cvxopt first, which should actually solve the exact same problem, but the indexing is hard to maintain. Because I am dealing with words, it is good to have a direct hash map, or a dictionary. I can use the Dictionary class in gensim. But I later found out I should use PuLP, as it allows indices with words as a hash map (dict in Python), and WMD is a linear programming problem, making PuLP is a perfect choice, considering code efficiency.

An example of using PuLP can be demonstrated by the British 1997 UG Exam, as in the first problem of this link, with the Jupyter Notebook demonstrating this.

# Implementation of WMD using PuLP

The demonstration can be found in the Jupyter Notebook.

from itertools import product
from collections import defaultdict

import numpy as np
from scipy.spatial.distance import euclidean
import pulp
import gensim

Then define the functions the gives the BOW document vectors:

def tokens_to_fracdict(tokens):
cntdict = defaultdict(lambda : 0)
for token in tokens:
cntdict[token] += 1
totalcnt = sum(cntdict.values())
return {token: float(cnt)/totalcnt for token, cnt in cntdict.items()}

Then implement the core calculation. Note that PuLP is actually a symbolic computing package. This function return a pulp.LpProblem class:

def word_mover_distance_probspec(first_sent_tokens, second_sent_tokens, wvmodel, lpFile=None):
all_tokens = list(set(first_sent_tokens+second_sent_tokens))
wordvecs = {token: wvmodel[token] for token in all_tokens}

first_sent_buckets = tokens_to_fracdict(first_sent_tokens)
second_sent_buckets = tokens_to_fracdict(second_sent_tokens)

T = pulp.LpVariable.dicts('T_matrix', list(product(all_tokens, all_tokens)), lowBound=0)

prob = pulp.LpProblem('WMD', sense=pulp.LpMinimize)
prob += pulp.lpSum([T[token1, token2]*euclidean(wordvecs[token1], wordvecs[token2])
for token1, token2 in product(all_tokens, all_tokens)])
for token2 in second_sent_buckets:
prob += pulp.lpSum([T[token1, token2] for token1 in first_sent_buckets])==second_sent_buckets[token2]
for token1 in first_sent_buckets:
prob += pulp.lpSum([T[token1, token2] for token2 in second_sent_buckets])==first_sent_buckets[token1]

if lpFile!=None:
prob.writeLP(lpFile)

prob.solve()

return prob

To extract the value, just run pulp.value(prob.objective)

We use Google Word2Vec. Refer the $\mathbf{T}$ matrices in the Jupyter Notebook. Running this by a few examples:

1. document1 = President, talk, Chicago
document2 = President, speech, Illinois
WMD = 2.88587622936
2. document1 = physician, assistant
document2 = doctor
WMD = 2.8760048151
3. document1 = physician, assistant
document2 = doctor, assistant
WMD = 1.00465738773
(compare with example 2!)
4. document1 = doctors, assistant
document2 = doctor, assistant
WMD = 1.02825379372
(compare with example 3!)
5. document1 = doctor, assistant
document2 = doctor, assistant
WMD = 0.0
(totally identical; compare with example 3!)

There are more examples in the notebook.

# Conclusion

WMD is a good metric comparing two documents or sentences, by capturing the semantic meanings of the words. It is more powerful than BOW model as it captures the meaning similarities; it is more powerful than the cosine distance between average word vectors, as the transfer of meaning using words from one document to another is considered. But it is not immune to the problem of misspelling.

This algorithm works well for short texts. However, when the documents become large, this formulation will be computationally expensive. The author actually suggested a few modifications, such as the removal of constraints, and word centroid distances.

Example codes can be found in my Github repository: stephenhky/PyWMD.

In my previous blog post, I introduced the newly emerged topological data analysis (TDA). Unlike most of the other data analytic algorithms, TDA, concerning the topology as its name tells, cares for the connectivity of points, instead of the distance (according to a metric, whether it is Euclidean, Manhattan, Minkowski or any other). What is the best tools to describe topology?

Physicists use a lot homotopy. But for the sake of computation, it is better to use a scheme that are suited for discrete computation. It turns out that there are useful tools in algebraic topology: homology. But to understand homology, we need to understand what a simplicial complex is.

Gunnar Carlsson [Carlsson 2009] and Afra Zomorodian [Zomorodian 2011] wrote good reviews about them, although from a different path in introducing the concept. I first followed Zomorodian’s review [Zomorodian 2011], then his book [Zomorodian 2009] that filled in a lot of missing links in his review, to a certain point. I recently started reading Carlsson’s review.

One must first understand what a simplicial complex is. Without giving too much technical details, simplicial complex is basically a shape connecting points together. A line is a 1-simplex, connecting two points. A triangle is a 2-simplex. A tetrahedron is a 3-complex. There are other more complicated and unnamed complexes. Any subsets of a simplicial complex are faces. For example, the sides of the triangle are faces. The faces and the sides are the faces of the tetrahedron. (Refer to Wolfram MathWorld for more details. There are a lot of good tutorials online.)

Implementing Simplicial Complex

We can easily encoded this into a python code. I wrote a class SimplicialComplex in Python to implement this. We first import necessary libraries:

import numpy as np
from itertools import combinations
from scipy.sparse import dok_matrix


The first line imports the numpy library, the second the iteration tools necessary for extracting the faces for simplicial complex, the third the sparse matrix implementation in the scipy library (applied on something that I will not go over in this blog entry), and the fourth for some reduce operation.

We want to describe the simplicial complexes in the order of some labels (which can be anything, such as integers or strings). If it is a point, then it can be represented as tuples, as below:

 (1,)

Or if it is a line (a 1-simplex), then

 (1, 2)

Or a 2-simplex as a triangle, then

 (1, 2, 3)

I think you get the gist. The integers 1, 2, or 3 here are simply labels. We can easily store this in the class:

class SimplicialComplex:
def __init__(self, simplices=[]):
self.import_simplices(simplices=simplices)

def import_simplices(self, simplices=[]):
self.simplices = map(lambda simplex: tuple(sorted(simplex)), simplices)
self.face_set = self.faces()


You might observe the last line of the codes above. And it is for calculating all the faces of this complex, and it is implemented in this way:

  def faces(self):
faceset = set()
for simplex in self.simplices:
numnodes = len(simplex)
for r in range(numnodes, 0, -1):
for face in combinations(simplex, r):
return faceset


The faces are intuitively sides of a 2D shape (2-simplex), or planes of a 3D shape (3-simplex). But the faces of a 3-simplex includes the faces of all its faces. All the faces are saved in a field called faceset. If the user wants to retrieve the faces in a particular dimension, they can call this method:

  def n_faces(self, n):
return filter(lambda face: len(face)==n+1, self.face_set)


There are other methods that I am not going over in this blog entry. Now let us demonstrate how to use the class by implementing a tetrahedron.

sc = SimplicialComplex([('a', 'b', 'c', 'd')])


If we want to extract the faces, then enter:

sc.faces()


which outputs:

{('a',),
('a', 'b'),
('a', 'b', 'c'),
('a', 'b', 'c', 'd'),
('a', 'b', 'd'),
('a', 'c'),
('a', 'c', 'd'),
('a', 'd'),
('b',),
('b', 'c'),
('b', 'c', 'd'),
('b', 'd'),
('c',),
('c', 'd'),
('d',)}


We have gone over the basis of simplicial complex, which is the foundation of TDA. We appreciate that the simplicial complex deals only with the connectivity of points instead of the distances between the points. And the homology groups will be calculated based on this. However, how do we obtain the simplicial complex from the discrete data we have? Zomorodian’s review [Zomorodian 2011] gave a number of examples, but I will only go through two of them only. And from this, you can see that to establish the connectivity between points, we still need to apply some sort of distance metrics.

Alpha Complex

An alpha complex is the nerve of the cover of the restricted Voronoi regions. (Refer the details to Zomorodian’s review [Zomorodian 2011], this Wolfram MathWorld entry, or this Wolfram Demonstration.) We can extend the class SimplicialComplex to get a class AlphaComplex:

from scipy.spatial import Delaunay, distance
from operator import or_
from functools import partial

def facesiter(simplex):
for i in range(len(simplex)):
yield simplex[:i]+simplex[(i+1):]

def flattening_simplex(simplices):
for simplex in simplices:
for point in simplex:
yield point

def get_allpoints(simplices):
return set(flattening_simplex(simplices))

def contain_detachededges(simplex, distdict, epsilon):
if len(simplex)==2:
return (distdict[simplex[0], simplex[1]] &gt; 2*epsilon)
else:
return reduce(or_, map(partial(contain_detachededges, distdict=distdict, epsilon=epsilon), facesiter(simplex)))

class AlphaComplex(SimplicialComplex):
def __init__(self, points, epsilon, labels=None, distfcn=distance.euclidean):
self.pts = points
self.labels = range(len(self.pts)) if labels==None or len(labels)!=len(self.pts) else labels
self.epsilon = epsilon
self.distfcn = distfcn
self.import_simplices(self.construct_simplices(self.pts, self.labels, self.epsilon, self.distfcn))

def calculate_distmatrix(self, points, labels, distfcn):
distdict = {}
for i in range(len(labels)):
for j in range(len(labels)):
distdict[(labels[i], labels[j])] = distfcn(points[i], points[j])
return distdict

def construct_simplices(self, points, labels, epsilon, distfcn):
delaunay = Delaunay(points)
delaunay_simplices = map(tuple, delaunay.simplices)
distdict = self.calculate_distmatrix(points, labels, distfcn)

simplices = []
for simplex in delaunay_simplices:
faces = list(facesiter(simplex))
detached = map(partial(contain_detachededges, distdict=distdict, epsilon=epsilon), faces)
if reduce(or_, detached):
if len(simplex)&gt;2:
for face, notkeep in zip(faces, detached):
if not notkeep:
simplices.append(face)
else:
simplices.append(simplex)
simplices = map(lambda simplex: tuple(sorted(simplex)), simplices)
simplices = list(set(simplices))

allpts = get_allpoints(simplices)
for point in (set(labels)-allpts):
simplices += [(point,)]

return simplices


The scipy package already has a package to calculate Delaunay region. The function contain_detachededges is for constructing the restricted Voronoi region from the calculated Delaunay region.

This class demonstrates how an Alpha Complex is constructed, but this runs slowly once the number of points gets big!

Vietoris-Rips (VR) Complex

Another commonly used complex is called the Vietoris-Rips (VR) Complex, which connects points as the edge of a graph if they are close enough. (Refer to Zomorodian’s review [Zomorodian 2011] or this Wikipedia page for details.) To implement this, import the famous networkx originally designed for network analysis.

import networkx as nx
from scipy.spatial import distance
from itertools import product

class VietorisRipsComplex(SimplicialComplex):
def __init__(self, points, epsilon, labels=None, distfcn=distance.euclidean):
self.pts = points
self.labels = range(len(self.pts)) if labels==None or len(labels)!=len(self.pts) else labels
self.epsilon = epsilon
self.distfcn = distfcn
self.network = self.construct_network(self.pts, self.labels, self.epsilon, self.distfcn)
self.import_simplices(map(tuple, list(nx.find_cliques(self.network))))

def construct_network(self, points, labels, epsilon, distfcn):
g = nx.Graph()
zips = zip(points, labels)
for pair in product(zips, zips):
if pair[0][1]!=pair[1][1]:
dist = distfcn(pair[0][0], pair[1][0])
if dist&lt;epsilon: