Word embedding has been a frequent theme of this blog. But the original embedding has been algorithms that perform a non-linear mapping of higher dimensional data to the lower one. This entry I will talk about one of the most oldest and widely used one: Sammon Embedding, published in 1969. This is an embedding algorithm that preserves the distances between all points. How is it achieved?

Assume there are high dimensional data described by $d$-dimensional vectors, $X_i$ where $i=1, 2, \ldots, N$. And they will be mapped into vectors $Y_i$, with dimensions 2 or 3. Denote the distances to be $d_{ij}^{*} = \sqrt{| X_i - X_j|^2}$ and $d_{ij} = \sqrt{| Y_i - Y_j|^2}$. In this problem, $Y_i$ are the variables to be learned. The cost function to minimize is

$E = \frac{1}{c} \sum_{i,

where $c = \sum_{i. To minimize this, use Newton's method by

$Y_{pq} (m+1) = Y_{pq} (m) - \alpha \Delta_{pq} (m)$,

where $\Delta_{pq} (m) = \frac{\partial E(m)}{\partial Y_{pq}(m)} / \left|\frac{\partial^2 E(m)}{\partial Y_{pq} (m)^2} \right|$, and $\alpha$ is the learning rate.

To implement it, use Theano package of Python to define the cost function for the sake of optimization, and then implement the learning with numpy. Define the cost function with the outline above:

import theano
import theano.tensor as T

import numerical_gradients as ng

# define variables
mf = T.dscalar('mf')         # magic factor / learning rate

# coordinate variables
Xmatrix = T.dmatrix('Xmatrix')
Ymatrix = T.dmatrix('Ymatrix')

# number of points and dimensions (user specify them)
N, d = Xmatrix.shape
_, td = Ymatrix.shape

# grid indices
n_grid = T.mgrid[0:N, 0:N]
ni = n_grid[0].flatten()
nj = n_grid[1].flatten()

# cost function
c_terms, _ = theano.scan(lambda i, j: T.switch(T.lt(i, j),
T.sqrt(T.sum(T.sqr(Xmatrix[i]-Xmatrix[j]))),
0),
sequences=[ni, nj])
c = T.sum(c_terms)

s_term, _ = theano.scan(lambda i, j: T.switch(T.lt(i, j),
T.sqr(T.sqrt(T.sum(T.sqr(Xmatrix[i]-Xmatrix[j])))-T.sqrt(T.sum(T.sqr(Ymatrix[i]-Ymatrix[j]))))/T.sqrt(T.sum(T.sqr(Xmatrix[i]-Xmatrix[j]))),
0),
sequences=[ni, nj])
s = T.sum(s_term)

E = s / c

# function compilation and optimization
Efcn = theano.function([Xmatrix, Ymatrix], E)


And implement the update algorithm with the following function:

import numpy

# training
def sammon_embedding(Xmat, initYmat, alpha=0.3, tol=1e-8, maxsteps=500, return_updates=False):
N, d = Xmat.shape
NY, td = initYmat.shape
if N != NY:
raise ValueError('Number of vectors in Ymat ('+str(NY)+') is not the same as Xmat ('+str(N)+')!')

# iteration
Efcn_X = lambda Ymat: Efcn(Xmat, Ymat)
step = 0
oldYmat = initYmat
oldE = Efcn_X(initYmat)
update_info = {'Ymat': [initYmat], 'cost': [oldE]}
converged = False
while (not converged) and step<=maxsteps:
newYmat = oldYmat - alpha*ng.tensor_gradient(Efcn_X, oldYmat, tol=tol)/ng.tensor_divgrad(Efcn_X, oldYmat, tol=tol)
newE = Efcn_X(newYmat)
if np.all(np.abs(newE-oldE)<tol):
converged = True
oldYmat = newYmat
oldE = newE
step += 1
print 'Step ', step, '\tCost = ', oldE
update_info['Ymat'].append(oldYmat)
update_info['cost'].append(oldE)

# return results
update_info['num_steps'] = step
return oldYmat, update_info
else:
return oldYmat


The above code is stored in the file sammon.py. We can test the algorithm with an example. Remember tetrahedron, a three-dimensional object with four points equidistant from one another. We expect the embedding will reflect this by sampling points around these four points. With the code tetrahedron.py, we implemented it this way:

import argparse

import numpy as np
import matplotlib.pyplot as plt

import sammon as sn

argparser = argparse.ArgumentParser('Embedding points around tetrahedron.')
default='embedded_tetrahedron.png',
help='file name of the output plot')

args = argparser.parse_args()

tetrahedron_points = [np.array([0., 0., 0.]),
np.array([1., 0., 0.]),
np.array([np.cos(np.pi/3), np.sin(np.pi/3), 0.]),
np.array([0.5, 0.5/np.sqrt(3), np.sqrt(2./3.)])]

sampled_points = np.concatenate([np.random.multivariate_normal(point, np.eye(3)*0.0001, 10)
for point in tetrahedron_points])

init_points = np.concatenate([np.random.multivariate_normal(point[:2], np.eye(2)*0.0001, 10)
for point in tetrahedron_points])

embed_points = sn.sammon_embedding(sampled_points, init_points, tol=1e-4)

X, Y = embed_points.transpose()
plt.plot(X, Y, 'x')
plt.savefig(args.output_figurename)


It outputs a graph:

There are other such non-linear mapping algorithms, such as t-SNE (t-distributed stochastic neighbor embedding) and Kohonen’s mapping (SOM, self-organizing map).

The topic of word embedding algorithms has been one of the interests of this blog, as in this entry, with Word2Vec [Mikilov et. al. 2013] as one of the main examples. It is a great tool for text mining, (for example, see [Czerny 2015],) as it reduces the dimensions needed (compared to bag-of-words model). As an algorithm borrowed from computer vision, a lot of these algorithms use deep learning methods to train the model, while it was not exactly sure why it works. Despite that, there are many articles talking about how to train the model. [Goldberg & Levy 2014, Rong 2014 etc.] Addition and subtraction of the word vectors show amazing relationships that carry semantic values, as I have shown in my previous blog entry. [Ho 2015]

However, Tomas Mikolov is no longer working in Google, making the development of this algorithm discontinued. As a follow-up of their work, Stanford NLP group later proposed a model, called GloVe (Global Vectors), that embeds word vectors using probabilistic methods. [Pennington, Socher & Manning 2014] It can be implemented in the package glove-python in python, and text2vec in R (or see their CRAN post).  Their paper is neatly written, and a highly recommended read.

To explain the theory of GloVe, we start with some basic probabilistic picture in basic natural language processing (NLP). We suppose the relation between the words occur in certain text windows within a corpus, but the details are not important here. Assume that $i$, $j$, and $k$ are three words, and the conditional probability $P_{ik}$ is defined as

$P_{ij} = P(j | i) = \frac{X_{ij}}{X_i}$,

where $X$‘s are the counts, and similarly for $P_{jk}$. And we are interested in the following ratio:

$F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$.

The tilde means “context,” but we will later assume it is also a word. Citing the example from their paper, take $i$ as ice, and $j$ as steam. if $k$ is solid, then the ratio is expected to be large; or if $k$ is gas, then it is expected to be low. But if $k$ is water, which are related to both, or fashion, which is related to none, then the ratio is expected to be approximately 1.

And the addition and subtraction in Word2Vec is similar to this. We want the ratio to be like the subtraction as in Word2Vec (and multiplication as in addition), then we should modify the function $F$ such that,

$F(w_i - w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$.

On the other hand, the input arguments of $F$ are vectors, but the output is a scalar. We avoid the issue by making the input argument as a dot product,

$F( (w_i - w_j)^T \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$.

In NLP, the word-word co-occurrence matrices are symmetric, and our function $F$ should also be invariant under switching the labeling. We first require $F$ is be a homomorphism,

$F((w_i - w_j)^T \tilde{w}_k) = \frac{F(w_i^T \tilde{w}_k) }{ F(w_j^T \tilde{w}_k)}$,

where we define,

$F(w_i^T \tilde{w}_k) = P_{ik} = \frac{X_{ik}}{X_i}$.

It is clear that $F$ is an exponential function, but to ensure symmetry, we further define:

$w_i^T \tilde{w}_k + b_i + \tilde{b}_k = \log X_{ik}$.

As a result of this equation, the authors defined the following cost function to optimize for GloVe model:

$J = \sum_{i, j=1}^V f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ik} \right)^2$,

where $w_j$, $\tilde{w}_j$, $b_i$, and $\tilde{b}_j$ are parameters to learn. $f(x)$ is a weighting function. (Refer the details to the paper.) [Pennington, Socher & Manning 2014]

As Radim Řehůřek said in his blog entry, [Řehůřek 2014] it is a neat paper, but their evaluation is crappy.

This theory explained why certain similar relations can be achieved, such as Paris – France is roughly equal to Beijing – China, as both can be transformed to the ratio in the definition of $F$ above.

It is a neat paper, as it employs optimization theory and probability theory, without any dark box deep learning.

Deep learning, a collection of related neural network algorithms, has been proved successful in certain types of machine learning tasks in computer vision, speech recognition, data cleaning, and natural language processing (NLP). [Mikolov et. al. 2013] However, it was unclear how deep learning can be so successful. It looks like a black box with messy inputs and excellent outputs. So why is it so successful?

A friend of mine showed me this article in the preprint (arXiv:1410.3831) [Mehta & Schwab 2014] last year, which mathematically shows the equivalence of deep learning and renormalization group (RG). RG is a concept in theoretical physics that has been widely applied in different problems, including critical phenomena, self-organized criticality, particle physics, polymer physics, and strongly correlated electronic systems. And now, Mehta and Schwab showed that an explanation to the performance of deep learning is available through RG.

So what is RG? Before RG, Leo Kadanoff, a physics professor in University of Chicago, proposed an idea of coarse-graining in studying many-body problems in 1966. [Kadanoff 1966] In 1972, Kenneth Wilson and Michael Fisher succeeded in applying ɛ-expansion in perturbative RG to explain the critical exponents in Landau-Ginzburg-Wilson (LGW) Hamiltonian. [Wilson & Fisher 1972] This work has been the standard material of graduate physics courses. In 1974, Kenneth Wilson applied RG to explain the Kondo problem, which led to his Nobel Prize in Physics in 1982. [Wilson 1983]

RG assumes a system of scale invariance, which means the system are similar in whatever scale you are seeing. One example is the chaotic system as in Fig. 1. The system looks the same when you zoom in. We call this scale-invariant system self-similar. And physical systems closed to phase transition are self-similar. And if it is self-similar, Kadanoff’s idea of coarse-graining is then applicable, as in Fig. 2. Four spins can be viewed as one spin that “summarizes” the four spins in that block without changing the description of the physical system. This is somewhat like we “zoom out” the picture on Photoshop or Web Browser.

[Taken from [Singh 2014]]

So what’s the point of zooming out? Physicists care about the Helmholtz free energies of physical systems, which are similar to cost functions to the computer scientists and machine learning specialists. Both are to be minimized. However, whatever scale we are viewing at, the energy of the system should be scale-invariant. Therefore, as we zoom out, the system “changes” yet “looks the same” due to self-similarity, but the energy stays the same. The form of the model is unchanged, but the parameters change as the scale changes.

This is important, because this process tells us which parameters are relevant, and which others are irrelevant. Why? Think of it this way: we have an awesome computer to simulate a glass of water that contains 1023 water molecules. To describe the systems, you have all parameters, including the position of molecules, strength of Van der Waals force, orbital angular momentum of each atom, strength of the covalent bonds, velocities of the molecules… You might have 1025 parameters. However, this awesome computer cannot handle such a system with so many parameters. Then you try to coarse-grain the system, and you discard some parameters in each step of coarse-graining. After numerous steps, it turns out that the temperature and the pressure are the only relevant parameters.

RG helps you identify the relevant parameters.

And it is exactly what happened in deep learning. In each convolutional cycle, features that are not important are gradually discarded, and those that are important are kept and enhanced. Indeed, in computer vision and NLP, the data are so noisy that there are a lot of unnecessary information. Deep learning gradually discards these information. As Mehta and Schwab stated, [Mehta & Schwab 2014]

Our results suggests that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.

So what is the point of understanding this? Unlike other machine algorithms, we did not know how it works, which sometimes makes model building very difficult because we have no idea how to adjust parameters. I believe understanding its equivalence to RG helps guide us to build a model that works.

Charles Martin also wrote a blog entry with more demonstration about the equivalence of deep learning and RG. [Martin 2015]