New Family of Regularization Methods

In their paper, Kawaguchi, Kaelbling, and Bengio explored the theory of why generalization in deep learning is so good. Based on their theoretical insights, they proposed a new regularization method, called Directly Approximately Regularizing Complexity (DARC), in addition to commonly used Lp-regularization and dropout methods.

From the abstract of their paper:

“This paper explains why deep learning can generalize well, despite large capacity and possible algorithmic instability, nonrobustness, and sharp minima, effectively addressing an open problem in the literature. Based on our theoretical insight, this paper also proposes a family of new regularization methods. Its simplest member was empirically shown to improve base models and achieve state-of-the-art performance on MNIST and CIFAR-10 benchmarks. Moreover, this paper presents both data-dependent and data-independent generalization guarantees with improved convergence rates. Our results suggest several new open areas of research.”




Word Mover’s Distance as a Linear Programming Problem

Much has been written about the use of word-embedding models such as Word2Vec and GloVe. But how do we measure the similarity between phrases or documents? One natural choice is the cosine similarity, which I toyed with in a previous post; however, it smooths out the influence of each individual word. Two years ago, a group at Washington University in St. Louis proposed the Word Mover’s Distance (WMD) in a PMLR paper. It captures the relations between words not simply by distance, but also by the “transportation” of meaning carried by each word from one phrase to another. WMD can be seen as a special case of the Earth Mover’s Distance (EMD), or Wasserstein distance, the one people talk about in Wasserstein GAN. It improves on the bag-of-words (BOW) model in that the word vectors capture the semantic similarities between words.

Word Mover’s Distance (WMD)

The formulation of WMD is beautiful. Consider the embedded word vectors \mathbf{X} \in R^{d \times n}, where d is the dimension of the embeddings, and n is the number of words. For each phrase, there is a normalized BOW vector \mathbf{d} \in R^n, with d_i = \frac{c_i}{\sum_i c_i}, where the i‘s denote the word tokens. The distance between two words is the Euclidean distance between their embedded word vectors, denoted by c(i, j) = || \mathbf{x}_i - \mathbf{x}_j ||_2, where i and j denote word tokens. The document distance, which is the WMD here, is defined by \sum_{i, j} \mathbf{T}_{i j} c(i, j), where \mathbf{T} is an n \times n matrix. Each element \mathbf{T}_{ij} \geq 0 denotes how much of word i in the first document (denoted by \mathbf{d}) travels to word j in the second document (denoted by \mathbf{d}').

Then the problem becomes the minimization of the document distance, or the WMD, and is formulated as:

\text{min}_{\mathbf{T} \geq 0} \sum_{i, j=1}^n \mathbf{T}_{ij} c(i, j),

given the constraints:

\sum_{j=1}^n \mathbf{T}_{ij} = d_i, and

\sum_{i=1}^n \mathbf{T}_{ij} = d_j'.

This is essentially a simplified case of the Earth Mover’s distance (EMD), or the Wasserstein distance. (See the review by Gibbs and Su.)
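To make the formulation concrete, here is a minimal sketch that solves this LP for two toy two-word documents with made-up 2D “word vectors,” using scipy.optimize.linprog (an assumption for illustration; the implementation below uses PuLP instead):

```python
import numpy as np
from scipy.optimize import linprog

# Made-up 2D "word vectors": document 1 = {A1, A2}, document 2 = {B1, B2}
X1 = np.array([[0., 0.], [1., 0.]])   # words of document 1
X2 = np.array([[0., 0.], [0., 1.]])   # words of document 2
d1 = np.array([0.5, 0.5])             # normalized BOW weights of document 1
d2 = np.array([0.5, 0.5])             # normalized BOW weights of document 2

# cost c(i, j) = Euclidean distance between word vectors
C = np.linalg.norm(X1[:, None, :] - X2[None, :, :], axis=-1)

n = len(d1)
# equality constraints: row sums of T equal d1, column sums equal d2
A_eq, b_eq = [], []
for i in range(n):                      # sum_j T_ij = d1_i
    row = np.zeros((n, n)); row[i, :] = 1.
    A_eq.append(row.flatten()); b_eq.append(d1[i])
for j in range(n):                      # sum_i T_ij = d2_j
    col = np.zeros((n, n)); col[:, j] = 1.
    A_eq.append(col.flatten()); b_eq.append(d2[j])

res = linprog(C.flatten(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
print(res.fun)   # A1 -> B1 costs 0, A2 -> B2 costs 0.5*sqrt(2)
```

The optimal plan moves A1 entirely to B1 (zero cost) and A2 to B2, so the WMD is 0.5\sqrt{2}.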

Using PuLP

The WMD is essentially a linear optimization problem. There are many optimization packages on the market, and my stance is that, among the common ones, none is categorically superior to the others. In my job, I once handled a missing-data problem that turned into a non-linear optimization problem with linear constraints, and I chose limSolve after shopping around; I actually like a lot of other packages too. For the WMD problem, I first tried cvxopt, which can solve the exact same problem, but the indexing is hard to maintain. Because I am dealing with words, it is good to have a direct hash map, or a dictionary; I could use the Dictionary class in gensim. But I later found that PuLP is the better choice: it allows indexing variables by words (as a Python dict), and since WMD is a linear programming problem, PuLP is a perfect fit in terms of code efficiency.

An example of using PuLP is the British 1997 UG Exam, the first problem at this link, with a Jupyter Notebook demonstrating it.
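Since that exam problem sits behind the link, here is a tiny, self-contained PuLP example of the same flavor (the numbers are made up for illustration):

```python
import pulp

# maximize 3x + 2y subject to x + y <= 4 and x <= 2, with x, y >= 0
prob = pulp.LpProblem('toy_lp', pulp.LpMaximize)
x = pulp.LpVariable('x', lowBound=0)
y = pulp.LpVariable('y', lowBound=0)
prob += 3*x + 2*y      # objective
prob += x + y <= 4     # resource constraint
prob += x <= 2         # capacity constraint
prob.solve(pulp.PULP_CBC_CMD(msg=0))
print(pulp.value(x), pulp.value(y), pulp.value(prob.objective))
```

The optimum sits at the corner x = 2, y = 2, with objective value 10; adding constraints to a problem with `prob +=` is the same idiom used in the WMD implementation below.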

Implementation of WMD using PuLP

The demonstration can be found in the Jupyter Notebook.

Load the necessary packages:

from itertools import product
from collections import defaultdict

import numpy as np
from scipy.spatial.distance import euclidean
import pulp
import gensim

Then define the function that gives the normalized BOW document vectors:

def tokens_to_fracdict(tokens):
    cntdict = defaultdict(lambda : 0)
    for token in tokens:
        cntdict[token] += 1
    totalcnt = sum(cntdict.values())
    return {token: float(cnt)/totalcnt for token, cnt in cntdict.items()}

Then implement the core calculation. Note that PuLP is actually a symbolic computing package. This function returns a pulp.LpProblem instance:

def word_mover_distance_probspec(first_sent_tokens, second_sent_tokens, wvmodel, lpFile=None):
    all_tokens = list(set(first_sent_tokens+second_sent_tokens))
    wordvecs = {token: wvmodel[token] for token in all_tokens}

    first_sent_buckets = tokens_to_fracdict(first_sent_tokens)
    second_sent_buckets = tokens_to_fracdict(second_sent_tokens)

    T = pulp.LpVariable.dicts('T_matrix', list(product(all_tokens, all_tokens)), lowBound=0)

    prob = pulp.LpProblem('WMD', sense=pulp.LpMinimize)
    prob += pulp.lpSum([T[token1, token2]*euclidean(wordvecs[token1], wordvecs[token2])
                        for token1, token2 in product(all_tokens, all_tokens)])
    for token2 in second_sent_buckets:
        prob += pulp.lpSum([T[token1, token2] for token1 in first_sent_buckets])==second_sent_buckets[token2]
    for token1 in first_sent_buckets:
        prob += pulp.lpSum([T[token1, token2] for token2 in second_sent_buckets])==first_sent_buckets[token1]

    if lpFile is not None:
        prob.writeLP(lpFile)

    prob.solve()

    return prob

To extract the computed WMD value, just run pulp.value(prob.objective).

We use the pretrained Google Word2Vec model. Refer to the \mathbf{T} matrices in the Jupyter Notebook. Running this on a few examples:

  1. document1 = President, talk, Chicago
    document2 = President, speech, Illinois
    WMD = 2.88587622936
  2. document1 = physician, assistant
    document2 = doctor
    WMD = 2.8760048151
  3. document1 = physician, assistant
    document2 = doctor, assistant
    WMD = 1.00465738773
    (compare with example 2!)
  4. document1 = doctors, assistant
    document2 = doctor, assistant
    WMD = 1.02825379372
    (compare with example 3!)
  5. document1 = doctor, assistant
    document2 = doctor, assistant
    WMD = 0.0
    (totally identical; compare with example 3!)

There are more examples in the notebook.


WMD is a good metric for comparing two documents or sentences, as it captures the semantic meanings of the words. It is more powerful than the BOW model because it captures similarity in meaning; it is more powerful than the cosine distance between averaged word vectors because it accounts for how the meaning carried by words is transported from one document to another. But it is not immune to the problem of misspelling.

This algorithm works well for short texts. However, when the documents become large, this formulation becomes computationally expensive. The authors suggested a few speed-ups, such as relaxing the constraints (to obtain cheap lower bounds) and using word centroid distances.
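One of those cheap lower bounds, the word centroid distance (WCD), compares only the weighted centroids of the two documents; a minimal sketch with made-up vectors:

```python
import numpy as np

def word_centroid_distance(X, d, Xp, dp):
    """Lower bound on WMD: distance between the two documents' weighted centroids.
    X, Xp: (n, dim) arrays of word vectors; d, dp: normalized BOW weight vectors."""
    return np.linalg.norm(X.T @ d - Xp.T @ dp)

# toy one-word documents on orthogonal unit vectors: centroid distance is sqrt(2)
e1 = np.array([[1., 0.]])
e2 = np.array([[0., 1.]])
print(word_centroid_distance(e1, np.array([1.]), e2, np.array([1.])))  # ~1.414
```

Identical documents give a distance of zero, and the WCD never exceeds the full WMD, which makes it useful for pruning candidates before solving the LP.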

Example codes can be found in my Github repository: stephenhky/PyWMD.


“selu” Activation Function and 93 Pages of Appendix

A preprint on arXiv recently caught a lot of attention. While deep learning has been successful in various types of neural networks, it had not been so for plain feed-forward neural networks. The authors of this paper proposed normalizing the network with a new activation function, called “selu” (scaled exponential linear unit):

\text{selu}(x) =\lambda \left\{ \begin{array}{cc} x & \text{if } x>0  \\ \alpha e^x - \alpha & \text{if } x \leq 0  \end{array}  \right..

This is an improvement over the existing “elu” function.
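A numpy sketch of this activation (the constants \lambda \approx 1.0507 and \alpha \approx 1.6733 are the fixed-point values derived in the paper):

```python
import numpy as np

# fixed-point constants derived in the self-normalizing networks paper
LAMBDA = 1.0507009873554805
ALPHA = 1.6732632423543772

def selu(x):
    """selu(x) = lambda * x for x > 0, lambda * alpha * (exp(x) - 1) otherwise."""
    x = np.asarray(x, dtype=float)
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))

print(selu(1.0))     # lambda * 1
print(selu(-60.0))   # saturates near -lambda * alpha
```

Note the saturation at -\lambda \alpha \approx -1.7581 for large negative inputs, which is what drives activations back toward zero mean and unit variance.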

Despite this achievement, what caught everyone’s eyes was not the activation function, but the 93-page appendix of mathematical proofs:


And this is one of the pages in the appendix:


Some scholars poked fun at it on Twitter too:


On Wasserstein GAN

A few weeks ago, I introduced the generative model called generative adversarial networks (GAN) and stated the difficulties of training it. Not long after that post, a group of scientists from Facebook and the Courant Institute introduced the Wasserstein GAN, which uses the Wasserstein distance, or Earth-Mover (EM) distance, instead of the Jensen-Shannon (JS) divergence as the final cost function.

In their paper (arXiv:1701.07875), they proposed the following to improve GAN:

  • do not use sigmoid function as the activation function;
  • the cost functions of the discriminator and generator must not have logarithms;
  • cap the parameters at each step of training; and
  • do not use momentum-based optimizers such as momentum or Adam; instead, RMSProp or SGD is recommended.

These recommendations are empirical rather than theoretical. The most important change, however, is to use the Wasserstein distance as the cost function, which the authors explained in detail. There are many metrics for measuring the difference between probability distributions, as summarized by Gibbs and Su (arXiv:math/0209021). The authors of Wasserstein GAN discussed four of them: the total variation (TV) distance, the Kullback-Leibler (KL) divergence, the Jensen-Shannon (JS) divergence, and the Earth-Mover (EM, or Wasserstein) distance. They used an example of two parallel uniform distributions on lines to illustrate that only the EM (Wasserstein) distance captures the continuity of the distance between distributions, solving the problem of zero gradients that arises when the derivative for the generator becomes too small. Intuitively, the EM distance indicates how much “mass” must be transported from one distribution to another.

Formally, the EM distance is

W(\mathbb{P}_r, \mathbb{P}_g) = \inf_{\gamma \in \Pi(\mathbb{P}_r, \mathbb{P}_g)} \mathbb{E}_{(x, y) \sim \gamma} [ \| x-y \| ],

where \Pi(\mathbb{P}_r, \mathbb{P}_g) is the set of joint distributions whose marginals are \mathbb{P}_r and \mathbb{P}_g, and the training involves finding the optimal transport plan \gamma.

Unfortunately, the EM distance cannot be computed directly from the definition. However, as an optimization problem, it has a corresponding dual problem. While the authors did not explain it in much detail, Vincent Herrmann explained the dual problem thoroughly on his blog.
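In one dimension, though, the EM distance has a closed form, and scipy exposes it as wasserstein_distance; a quick sanity check on the parallel-lines example, collapsed to point masses at 0 and \theta (where the EM distance is |\theta| while the JS divergence jumps discontinuously):

```python
from scipy.stats import wasserstein_distance

# point mass at 0 versus point mass at theta: the EM distance is |theta|
for theta in [0.0, 0.25, 1.0]:
    print(theta, wasserstein_distance([0.0], [theta]))
```

The distance shrinks continuously to zero as \theta \rightarrow 0, which is exactly the property the WGAN paper exploits.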

The algorithm is described as follows:



Generative Adversarial Networks

Recently I have been drawn to generative models, such as LDA (latent Dirichlet allocation) and other topic models. In deep learning, there are a few examples, such as FVBN (fully visible belief networks), VAE (variational autoencoder), and RBM (restricted Boltzmann machine). Recently I have been reading about GAN (generative adversarial networks), first published by Ian Goodfellow and his colleagues and collaborators. Goodfellow recently posted his NIPS 2016 tutorial on arXiv.

Yesterday I attended an event at George Mason University organized by Data Science DC Meetup Group. Jennifer Sleeman talked about GAN. It was a very good talk.

In GAN, there are two important functions: the discriminator (D) and the generator (G). As a generative model, the distribution of the training data, all labeled positive, can be thought of as the distribution that the generator is trained to reproduce. The discriminator distinguishes the data with positive labels from those with negative labels. The generator then tries to generate data, typically from noise, which should be labeled negative, to fool the discriminator into classifying it as positive. This process repeats iteratively; eventually the generator is trained to produce data close to the distribution of the training data, and the discriminator is confused, classifying the generated data as positive with probability \frac{1}{2}. The intuition behind this competitive game comes from the minimax game in game theory. The formal algorithm is described in the original paper as follows:


The original paper argued that the optimum of the model is reached when the distribution of the generated data is identical to that of the training data, with the argument based on the Jensen-Shannon (JS) divergence. Ferenc Huszár discussed on his blog the relations among maximum likelihood, the Kullback-Leibler (KL) divergence, and the Jensen-Shannon (JS) divergence.
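Formally, the competitive game described above is the minimax objective from the original paper:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)} [\log D(x)] + \mathbb{E}_{z \sim p_z(z)} [\log (1 - D(G(z)))],

where the discriminator D is trained to maximize V while the generator G is trained to minimize it.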

I have asked the speaker a few questions about the concepts of GAN as well.

GAN is not yet a very sophisticated framework, but it has already found a few industrial uses. Some of its descendants include LapGAN (Laplacian GAN) and DCGAN (deep convolutional GAN). Applications include voice generation, image super-resolution, pix2pix (image-to-image translation), text-to-image synthesis, iGAN (interactive GAN), etc.

“Adversarial training is the coolest thing since sliced bread.” – Yann LeCun




Author-Topic Models in gensim

Recently, gensim, a Python package for topic modeling, released a new version that includes an implementation of author-topic models.

The most famous topic model is undoubtedly latent Dirichlet allocation (LDA), proposed by David Blei and his colleagues. Such a topic model is a generative model, described by the following directed graphical model:


In the graph, \alpha and \beta are hyperparameters. \theta is the topic distribution of a document, z is the topic for each word in each document, \phi is the word distributions for each topic, and w is the generated word for a place in a document.

There are models similar to LDA, such as the correlated topic model (CTM), in which the topic distribution \theta is drawn not from a Dirichlet distribution but from a logistic normal distribution with a covariance matrix \Sigma, capturing correlations between topics.

There also exists an author model, a simpler topic model in which the words in a document are generated directly from its authors, as in the following graphical model. Here x is the author of a given word in the document.


Combining these two gives the author-topic model as a hybrid, shown below:


The new release of gensim supports the author-topic model, as demonstrated in this Jupyter Notebook.
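As a toy illustration of the generative process above (the \theta and \phi below are made up, not learned from data):

```python
import numpy as np

rng = np.random.default_rng(42)

n_topics, n_wordtypes = 2, 5
# each author's topic distribution theta (made-up numbers)
theta = np.array([[0.9, 0.1],    # author 0 favors topic 0
                  [0.2, 0.8]])   # author 1 favors topic 1
# word distribution phi for each topic (random Dirichlet draws)
phi = rng.dirichlet(np.ones(n_wordtypes), size=n_topics)

def generate_document(authors, n_tokens):
    """Author-topic process: pick author x, topic z ~ theta[x], word w ~ phi[z]."""
    words = []
    for _ in range(n_tokens):
        x = rng.choice(authors)                  # author responsible for this word
        z = rng.choice(n_topics, p=theta[x])     # topic drawn from the author's theta
        w = rng.choice(n_wordtypes, p=phi[z])    # word drawn from the topic's phi
        words.append(int(w))
    return words

doc = generate_document(authors=[0, 1], n_tokens=30)
```

Fitting the model inverts this process: given documents and their author lists, it recovers \theta for each author and \phi for each topic.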


  • I am also aware of another topic model, the structural topic model (STM), developed for the social sciences. There is no Python package supporting it, but an R package, called stm, is available. You can refer to its homepage too.
  • I may consider including author-topic model and STM in the next release of the Python package shorttext.


Tensor Networks and Density Matrix Renormalization Group

A while ago, Mehta and Schwab drew a connection between the restricted Boltzmann machine (RBM), a type of deep learning algorithm, and the renormalization group (RG), a theoretical tool in physics applied to critical phenomena. [Mehta & Schwab, 2014; see previous entry] Can RG relate to other deep learning algorithms?

Schwab, together with Stoudenmire, wrote a paper on a new machine learning algorithm that directly exploits a type of RG used in physics: the density matrix renormalization group (DMRG). DMRG is used in condensed matter physics for low-dimensional (d=1 or 2) lattice systems. It was invented by Steve White, using diagonalization of reduced density matrices on each site. [White 1992] Nowadays, however, it is performed using singular value decomposition on each successive pair of lattice sites.

DMRG is closely related to quantum entanglement: for a two-site quantum system, the entanglement can be characterized by either of its reduced density matrices. DMRG, however, deals with the reduced density matrices over all sites. Traditionally, such a many-body system is represented by the ket:

|\Psi \rangle = \sum_{\sigma_1 \ldots \sigma_L} c^{\sigma_1} \ldots c^{\sigma_L} |\sigma_1 \ldots \sigma_L \rangle.

These c‘s are c-numbers. To describe the entanglement of these states while remaining numerically tractable, it is desirable to convert these c-numbers into matrices: [Schollwöck 2013]

c^{\sigma_1} \ldots c^{\sigma_L} \rightarrow M^{\sigma_1} \ldots M^{\sigma_L}.

And these are tensor networks. DMRG aims at finding a good description of the states with these tensor networks, which have a nice graphical representation, as in the appendix of the paper by Stoudenmire and Schwab. The training is also described elegantly in their paper using these tensor-network diagrams. Their new algorithm proves to be a good machine learning algorithm, probably fit for small data with complicated features. This is a direct application of real-space RG to machine learning. Stoudenmire wrote on Quora about the value of this work:

“In our work… we reached state-of-the-art accuracy for the MNIST dataset without needing extra techniques such as convolutional layers. One exciting aspect of these proposals is that their cost scales at most linearly in the number of training examples, versus quadratically for most kernel methods. Representing parameters by a tensor network gives them a structure that can be analyzed to better understand the model and what it has learned. Also tensor network optimization methods are adaptive, automatically selecting the minimum number of parameters necessary for the optimal solution within a certain tensor network class.” – Miles Stoudenmire, in Quora

There are some extension algorithms from DMRG, such as multiscale entanglement renormalization ansatz (MERA), developed by Vidal and his colleagues. [Vidal 2008]
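The c-to-M conversion above can be sketched in a few lines of numpy: factorize the state vector into one matrix per site by successive SVDs (a minimal illustration for a spin-1/2 chain; the function name is mine):

```python
import numpy as np

def to_matrix_product(psi, n_sites, d=2):
    """Factor c^{sigma_1...sigma_L} into site matrices M^{sigma} by successive SVDs."""
    mats = []
    rest = psi.reshape(1, -1)
    chi = 1                                  # current left bond dimension
    for _ in range(n_sites - 1):
        rest = rest.reshape(chi * d, -1)
        U, S, Vh = np.linalg.svd(rest, full_matrices=False)
        mats.append(U.reshape(chi, d, -1))   # M^{sigma} at this site
        rest = np.diag(S) @ Vh               # push the remainder to the right
        chi = U.shape[-1]
    mats.append(rest.reshape(chi, d, 1))
    return mats

# check: contracting the site matrices back reproduces the original state
rng = np.random.default_rng(0)
psi = rng.normal(size=2**4)
psi /= np.linalg.norm(psi)
mats = to_matrix_product(psi, n_sites=4)
recon = mats[0]
for M in mats[1:]:
    recon = np.tensordot(recon, M, axes=([-1], [0]))
print(np.allclose(recon.flatten(), psi))   # True
```

Truncating the singular values S at each step is what gives DMRG its compression: the kept bond dimension controls how much entanglement the tensor network can represent.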


Steve R. White (adapted from his faculty homepage)


Tensor Diagram of the Training of this New Algorithm. (Take from arXiv:1605.05775)


Sammon Embedding

Word embedding has been a frequent theme of this blog, but the original “embedding” algorithms performed a non-linear mapping of higher-dimensional data to a lower-dimensional space. In this entry I will talk about one of the oldest and most widely used of them: Sammon embedding, published in 1969. It is an embedding algorithm that tries to preserve the distances between all pairs of points. How is this achieved?

Assume there are high-dimensional data described by d-dimensional vectors X_i, where i=1, 2, \ldots, N. They will be mapped to vectors Y_i of dimension 2 or 3. Denote the distances by d_{ij}^{*} = \| X_i - X_j \| and d_{ij} = \| Y_i - Y_j \|. In this problem, the Y_i are the variables to be learned. The cost function to minimize is

E = \frac{1}{c} \sum_{i<j} \frac{(d_{ij}^{*} - d_{ij})^2}{d_{ij}^{*}},

where c = \sum_{i<j} d_{ij}^{*}. To minimize this, use Newton's method by

Y_{pq} (m+1) = Y_{pq} (m) - \alpha \Delta_{pq} (m),

where \Delta_{pq} (m) = \frac{\partial E(m)}{\partial Y_{pq}(m)} / \left|\frac{\partial^2 E(m)}{\partial Y_{pq} (m)^2} \right|, and \alpha is the learning rate.
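Before the implementation below, the cost E can be written compactly in plain numpy as a reference sketch of the same formula:

```python
import numpy as np

def sammon_stress(X, Y):
    """Sammon stress E for high-dim points X (N, d) and low-dim points Y (N, t)."""
    iu = np.triu_indices(X.shape[0], k=1)          # index pairs with i < j
    dstar = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)[iu]
    dlow = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)[iu]
    c = dstar.sum()
    return ((dstar - dlow) ** 2 / dstar).sum() / c

# a perfect "embedding" (identity map on 2-D points) has zero stress
pts = np.array([[0., 0.], [1., 0.], [0., 1.]])
print(sammon_stress(pts, pts))
```

Any distortion of the pairwise distances raises E, with short original distances d_{ij}^{*} weighted more heavily, which is the distinctive feature of Sammon’s cost.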

To implement it, we use the Theano package in Python to define the cost function for the sake of optimization, and then implement the learning with numpy. Define the cost function following the outline above:

import theano
import theano.tensor as T

import numerical_gradients as ng

# define variables
mf = T.dscalar('mf')         # magic factor / learning rate

# coordinate variables
Xmatrix = T.dmatrix('Xmatrix')
Ymatrix = T.dmatrix('Ymatrix')

# number of points and dimensions (user specify them)
N, d = Xmatrix.shape
_, td = Ymatrix.shape

# grid indices
n_grid = T.mgrid[0:N, 0:N]
ni = n_grid[0].flatten()
nj = n_grid[1].flatten()

# cost function
# constant c = sum over i < j of the high-dimensional distances d*_ij
c_terms, _ = theano.scan(lambda i, j: T.switch(T.lt(i, j),
                                               T.sqrt(T.sum(T.sqr(Xmatrix[i]-Xmatrix[j]))),
                                               0.),
                         sequences=[ni, nj])
c = T.sum(c_terms)

# stress terms (d*_ij - d_ij)^2 / d*_ij for i < j
s_term, _ = theano.scan(lambda i, j: T.switch(T.lt(i, j),
                                              T.sqr(T.sqrt(T.sum(T.sqr(Xmatrix[i]-Xmatrix[j])))
                                                    - T.sqrt(T.sum(T.sqr(Ymatrix[i]-Ymatrix[j]))))
                                              / T.sqrt(T.sum(T.sqr(Xmatrix[i]-Xmatrix[j]))),
                                              0.),
                        sequences=[ni, nj])
s = T.sum(s_term)

E = s / c

# function compilation and optimization
Efcn = theano.function([Xmatrix, Ymatrix], E)

And implement the update algorithm with the following function:

import numpy as np

# training
def sammon_embedding(Xmat, initYmat, alpha=0.3, tol=1e-8, maxsteps=500, return_updates=False):
    N, d = Xmat.shape
    NY, td = initYmat.shape
    if N != NY:
        raise ValueError('Number of vectors in Ymat ('+str(NY)+') is not the same as Xmat ('+str(N)+')!')

    # iteration
    Efcn_X = lambda Ymat: Efcn(Xmat, Ymat)
    step = 0
    oldYmat = initYmat
    oldE = Efcn_X(initYmat)
    update_info = {'Ymat': [initYmat], 'cost': [oldE]}
    converged = False
    while (not converged) and step<=maxsteps:
        newYmat = oldYmat - alpha*ng.tensor_gradient(Efcn_X, oldYmat, tol=tol)/ng.tensor_divgrad(Efcn_X, oldYmat, tol=tol)
        newE = Efcn_X(newYmat)
        if np.all(np.abs(newE-oldE)<tol):
            converged = True
        oldYmat = newYmat
        oldE = newE
        step += 1
        if return_updates:
            print('Step ', step, '\tCost = ', oldE)

    # return results
    if return_updates:
        update_info['num_steps'] = step
        return oldYmat, update_info
    else:
        return oldYmat

The above code is stored in the file sammon.py. We can test the algorithm with an example. Recall that a tetrahedron is a three-dimensional object with four vertices equidistant from one another. We expect the embedding to reflect this when we sample points around these four vertices. With the code, we implement it this way:

import argparse

import numpy as np
import matplotlib.pyplot as plt

import sammon as sn

argparser = argparse.ArgumentParser('Embedding points around tetrahedron.')
argparser.add_argument('output_plot',
                       help='file name of the output plot')

args = argparser.parse_args()

tetrahedron_points = [np.array([0., 0., 0.]),
                      np.array([1., 0., 0.]),
                      np.array([np.cos(np.pi/3), np.sin(np.pi/3), 0.]),
                      np.array([0.5, 0.5/np.sqrt(3), np.sqrt(2./3.)])]

sampled_points = np.concatenate([np.random.multivariate_normal(point, np.eye(3)*0.0001, 10)
                                 for point in tetrahedron_points])

init_points = np.concatenate([np.random.multivariate_normal(point[:2], np.eye(2)*0.0001, 10)
                              for point in tetrahedron_points])

embed_points = sn.sammon_embedding(sampled_points, init_points, tol=1e-4)

X, Y = embed_points.transpose()
plt.plot(X, Y, 'x')

It outputs a graph:


There are other such non-linear mapping algorithms, such as t-SNE (t-distributed stochastic neighbor embedding) and Kohonen’s mapping (SOM, self-organizing map).


Topological Phases

Michael Kosterlitz, Duncan Haldane, and David J. Thouless are the laureates of the Nobel Prize in Physics 2016, “for theoretical discoveries of topological phase transitions and topological phases of matter.” Before Thouless’s work, topology was hardly used in condensed matter physics; nowadays it is basic knowledge.

I am particularly familiar with the Berezinskii-Kosterlitz-Thouless phase transition. What is it? Before it, phase transitions had been studied through the framework of spontaneous symmetry breaking, employing the tools of functional field theory and the renormalization group. Matter can be in either a disordered state, in which the symmetry is not broken, or an ordered state, in which a particular continuous symmetry is broken. Near the critical point, many observables were found to exhibit long-range order, with C(r) \sim \frac{1}{r}, so universal that all physical systems described by the same Landau-Ginzburg-Wilson (LGW) model obey it. But in lower dimensions such as d=2, as proved by Mermin, Wagner, and Hohenberg, an ordered state is not stable against its huge fluctuations.

The absence of an ordered state does not exclude the possibility of a phase transition. Berezinskii in 1971, and Kosterlitz and Thouless in 1973, suggested a phase transition driven by the proliferation of topological objects. While the correlation must be short-ranged, i.e., C(r) \sim e^{-\frac{r}{\xi}}, a normal description using the LGW model in d=2 does not permit that unless vortices are considered. Below a certain temperature, it is energetically favorable for these vortices to be bound in pairs; the material then exhibits quasi-long-range order, i.e., C(r) \sim \frac{1}{r^{\alpha}}. This change in the correlation function marks a phase transition driven not by spontaneous symmetry breaking, but by the proliferation of isolated topological defects.

Their work started the study of topology in condensed matter systems. Not long afterward came studies of topological defects in various condensed matter systems, and then the fractional quantum Hall effect, topological insulators, topological quantum computing, the A phase of liquid crystals and helimagnets, etc. “Topological” is the buzzword in the condensed matter physics community nowadays. Recently, there is a preprint article connecting machine learning and topological physical states. (See: arXiv:1609.09060.)

In machine learning, deep learning is the buzzword. However, to understand how these things work, we may need a theory, or we may need to construct our own features if a large amount of data is not available. Then topological data analysis (TDA) becomes important, in the same way as in condensed matter physics.

