Google launched its AutoML project last year, in an effort to automate the search for the most appropriate neural network design for a particular classification problem. Designing neural networks has been time-consuming, even with TensorFlow, Keras, or other deep learning frameworks. The Google Brain team therefore devised Neural Architecture Search (NAS), which uses a recurrent neural network trained with reinforcement learning. (See their blog entry.) It has been used to find neural networks for image classifiers. (See their blog entry.)

Of course, with state-of-the-art hardware, Google has the advantage of being able to run such an experiment on the CIFAR-10 dataset using 450 GPUs for 3-4 days. But this puts the work out of reach for small companies or anyone working on a personal computer.

Then came an improvement to NAS: Efficient Neural Architecture Search via Parameter Sharing (ENAS), a much more efficient method of searching for a neural network, which narrows the search down to subgraphs of one large graph. It reduces the number of GPUs needed.

While I do not think it is a threat to machine learning engineers, it is an algorithm worth noting. It looks to me like a brute-force algorithm, and it still takes scientists and engineers to gain insights from it. I still believe that developing the theory behind neural networks is much needed.

There are many challenging tasks in natural language processing. This blog entry is on text summarization, and briefly summarizes a survey article on the topic (arXiv:1707.02268). The authors of the article define the task as follows:

Automatic text summarization is the task of producing a concise and fluent summary while preserving key information content and overall meaning.

There are basically two approaches to this task:

• extractive summarization: identifying important sections of the text, and extracting them; and
• abstractive summarization: producing summary text in a new way.

Most algorithmic methods developed so far are of the extractive type, while most human writers summarize using the abstractive approach. There are many extractive methods, such as identifying given keywords, identifying sentences similar to the title, or taking the text at the beginning of the document.

How do we instruct machines to perform extractive summarization? The authors mention two representations: topic and indicator. In topic representations, word frequencies, tf-idf, latent semantic indexing (LSI), or topic models (such as latent Dirichlet allocation, LDA) are used. However, simply extracting sentences with these algorithms may not produce a readable summary. Employing knowledge bases or taking context into account (from web search, e-mail conversation threads, scientific articles, author styles, etc.) is useful.
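As a rough illustration of the simplest topic representation (plain word frequencies), here is a minimal sketch. The function names, the tiny stop-word list, and the scoring rule (average corpus frequency of the words in a sentence) are my own assumptions for illustration, not the survey's exact method.

```python
import re
from collections import Counter

# a tiny illustrative stop-word list (an assumption, not a standard list)
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "it", "on"}

def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def frequency_summary(sentences, n_keep=1):
    """Score each sentence by the average relative frequency of its words
    and return the n_keep best sentences in their original order."""
    tokenized = [tokenize(s) for s in sentences]
    freq = Counter(w for tokens in tokenized for w in tokens)
    total = sum(freq.values())

    def score(tokens):
        if not tokens:
            return 0.0
        return sum(freq[w] / total for w in tokens) / len(tokens)

    scores = [score(tokens) for tokens in tokenized]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n_keep]
    return [sentences[i] for i in sorted(top)], scores

sentences = ["The cat sat on the mat.",
             "The cat chased the cat toy.",
             "Rain fell in Paris."]
summary, scores = frequency_summary(sentences)
```

The sentence that repeats the most frequent word ("cat") wins here, which also shows the weakness mentioned above: high-scoring sentences need not read well together.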

In indicator representations, the authors mention graph methods inspired by PageRank. (see this) “Sentences form vertices of the graph and edges between the sentences indicate how similar the two sentences are.” The key sentences are then identified with ranking algorithms. Of course, machine learning methods can be used too.
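The graph idea can be sketched in a few lines. This is only a toy version under my own assumptions: Jaccard word overlap as the sentence similarity, and a plain power iteration of PageRank with damping 0.85. The function names are hypothetical.

```python
import re

def tokenize(text):
    """Return the set of lowercase alphabetic tokens in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Word-overlap similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def rank_sentences(sentences, damping=0.85, iterations=50):
    """Power iteration of PageRank on a weighted sentence-similarity graph."""
    bags = [tokenize(s) for s in sentences]
    n = len(bags)
    # edge weight = similarity of the two sentences; no self-loops
    weights = [[jaccard(bags[i], bags[j]) if i != j else 0.0 for j in range(n)]
               for i in range(n)]
    out_sums = [sum(row) for row in weights]
    ranks = [1.0 / n] * n
    for _ in range(iterations):
        new_ranks = []
        for j in range(n):
            # isolated sentences (out_sums == 0) pass no rank on
            incoming = sum(ranks[i] * weights[i][j] / out_sums[i]
                           for i in range(n) if out_sums[i] > 0)
            new_ranks.append((1 - damping) / n + damping * incoming)
        ranks = new_ranks
    return ranks

sentences = ["the cat sat on the mat",
             "the cat lay on the mat",
             "the cat sat on the rug",
             "stock prices fell sharply today"]
ranks = rank_sentences(sentences)
```

The first sentence overlaps most with the others and gets the highest rank, while the unrelated last sentence only keeps the baseline rank.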

Evaluating the performance of text summarization is difficult. Human evaluation is unavoidable, but given manually written reference summaries, some statistics can be calculated automatically, such as ROUGE.
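To give a feeling for what such a statistic looks like, here is a minimal sketch of ROUGE-1 (unigram overlap between a candidate summary and one reference summary). It assumes whitespace tokenization and a single reference; the official ROUGE toolkit does more (stemming, multiple references, longer n-grams), so treat this as an illustration only.

```python
from collections import Counter

def rouge_1(candidate, reference):
    """Unigram-overlap recall, precision and F1 against one reference summary."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # clipped matches: a word counts at most as often as it occurs in both texts
    overlap = sum((cand & ref).values())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"recall": recall, "precision": precision, "f1": f1}

result = rouge_1("the cat sat", "the cat sat on the mat")
```

Here all three candidate words appear in the six-word reference, so precision is 1 and recall is 0.5.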

A while ago, Mehta and Schwab drew a connection between the restricted Boltzmann machine (RBM), a type of deep learning algorithm, and the renormalization group (RG), a theoretical tool in physics applied to critical phenomena. [Mehta & Schwab, 2014; see previous entry] Can RG be related to other deep learning algorithms?

Stoudenmire and Schwab wrote a paper on a new machine learning algorithm that directly exploits a type of RG in physics: the density matrix renormalization group (DMRG). DMRG is used in condensed matter physics for low-dimensional (d=1 or 2) lattice systems. It was invented by Steve White, originally using diagonalization of the reduced density matrix at each site. [White 1992] Nowadays, however, it is performed using singular value decomposition on each successive pair of lattice sites.

DMRG is closely related to quantum entanglement. In a two-site quantum system, the entanglement can be characterized by either of its reduced density matrices. DMRG, however, deals with the reduced density matrices of all sites. Traditionally, the state of this kind of many-body system is represented by a ket:

$|\Psi \rangle = \sum_{\sigma_1 \ldots \sigma_L} c^{\sigma_1} \ldots c^{\sigma_L} |\sigma_1 \ldots \sigma_L \rangle$.

These $c$'s are c-numbers. To describe the entanglement of these states while remaining numerically convenient, it is desirable to convert these c-numbers into matrices: [Schollwöck 2013]

$c^{\sigma_1} \ldots c^{\sigma_L} \rightarrow M^{\sigma_1} \ldots M^{\sigma_L}$.

And these are tensor networks. DMRG aims at finding a good description of the states with these tensor networks. Tensor networks have a nice graphical representation, as in the appendix of the paper by Stoudenmire and Schwab. The training is also described elegantly in their paper using tensor network diagrams. Their algorithm proves to be a good new machine learning algorithm, probably suited to small data sets with complicated features. This is a direct application of real-space RG to machine learning. Stoudenmire wrote on Quora about the value of this work:

“In our work… we reached state-of-the-art accuracy for the MNIST dataset without needing extra techniques such as convolutional layers. One exciting aspect of these proposals is that their cost scales at most linearly in the number of training examples, versus quadratically for most kernel methods. Representing parameters by a tensor network gives them a structure that can be analyzed to better understand the model and what it has learned. Also tensor network optimization methods are adaptive, automatically selecting the minimum number of parameters necessary for the optimal solution within a certain tensor network class.” – Miles Stoudenmire, in Quora

There are extensions of DMRG, such as the multiscale entanglement renormalization ansatz (MERA), developed by Vidal and his colleagues. [Vidal 2008]
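To make the reduced density matrix mentioned earlier concrete, here is a small sketch for the simplest case of two sites with two states each. It assumes a real coefficient matrix `c[s1][s2]` for the state, which is my own simplification; the function names are hypothetical. The entanglement entropy is computed from the eigenvalues of the reduced density matrix of the first site.

```python
import math

def reduced_density_matrix(c):
    """rho_A = C C^T for a real 2x2 coefficient matrix c[s1][s2] of a two-site state."""
    return [[sum(c[i][k] * c[j][k] for k in range(2)) for j in range(2)]
            for i in range(2)]

def entanglement_entropy(c):
    """Von Neumann entropy of the first site, -sum p ln p over eigenvalues of rho_A."""
    # normalize the state first
    norm = math.sqrt(sum(x * x for row in c for x in row))
    c = [[x / norm for x in row] for row in c]
    rho = reduced_density_matrix(c)
    # closed-form eigenvalues of a symmetric 2x2 matrix [[a, b], [b, d]]
    a, b, d = rho[0][0], rho[0][1], rho[1][1]
    mean = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + b * b)
    eigenvalues = [mean + radius, mean - radius]
    return -sum(p * math.log(p) for p in eigenvalues if p > 1e-12)

product_state = [[1.0, 0.0], [0.0, 0.0]]   # |00>, not entangled
bell_state = [[1.0, 0.0], [0.0, 1.0]]      # (|00> + |11>)/sqrt(2) after normalization
s_product = entanglement_entropy(product_state)
s_bell = entanglement_entropy(bell_state)
```

The product state has zero entropy, while the Bell state gives ln 2, the maximum for a two-state site.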

Steve R. White (adapted from his faculty homepage)

Tensor Diagram of the Training of this New Algorithm. (Taken from arXiv:1605.05775)

There are situations in which we deal with short text, probably messy, without a lot of training data. In that case, we need external semantic information. Instead of using the conventional bag-of-words (BOW) model, we should employ word-embedding models, such as Word2Vec, GloVe, etc.

Suppose we want to perform supervised learning, with three subjects, described by the following Python dictionary:

classdict = {'mathematics': ['linear algebra',
                             'topology',
                             'algebra',
                             'calculus',
                             'variational calculus',
                             'functional field',
                             'real analysis',
                             'complex analysis',
                             'differential equation',
                             'statistics',
                             'statistical optimization',
                             'probability',
                             'stochastic calculus',
                             'numerical analysis',
                             'differential geometry'],
             'physics': ['renormalization',
                         'classical mechanics',
                         'quantum mechanics',
                         'statistical mechanics',
                         'functional field',
                         'path integral',
                         'quantum field theory',
                         'electrodynamics',
                         'condensed matter',
                         'particle physics',
                         'topological solitons',
                         'astrophysics',
                         'spontaneous symmetry breaking',
                         'atomic molecular and optical physics',
                         'quantum chaos'],
             'theology': ['divine providence',
                          'soteriology',
                          'anthropology',
                          'pneumatology',
                          'Christology',
                          'Holy Trinity',
                          'eschatology',
                          'scripture',
                          'ecclesiology',
                          'predestination',
                          'divine degree',
                          'creedal confessionalism',
                          'scholasticism',
                          'prayer',
                          'eucharist']}


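We use Word2Vec here. To add external information, we use a pre-trained Word2Vec model from Google, downloaded here. We can use it with the Python package gensim. To load it, enter

from gensim.models import KeyedVectors

# The path below is a placeholder for wherever the downloaded file is stored.
# Older gensim releases exposed the same loader as Word2Vec.load_word2vec_format.
wvmodel = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)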


How do we represent a phrase in Word2Vec? How do we do the classification? Here I wrote two classes to do it.

#### Average

We can represent a phrase by summing the word-embedding vectors of its words and normalizing the sum. The class, inside SumWord2VecClassification.py, is coded as follows:

from collections import defaultdict

import numpy as np
from nltk import word_tokenize
from scipy.spatial.distance import cosine

from utils import ModelNotTrainedException

class SumEmbeddedVecClassifier:
    def __init__(self, wvmodel, classdict, vecsize=300):
        self.wvmodel = wvmodel
        self.classdict = classdict
        self.vecsize = vecsize
        self.trained = False

    def train(self):
        # for each class, sum the phrase vectors and normalize the sum
        self.addvec = defaultdict(lambda: np.zeros(self.vecsize))
        for classtype in self.classdict:
            for shorttext in self.classdict[classtype]:
                self.addvec[classtype] += self.shorttext_to_embedvec(shorttext)
            norm = np.linalg.norm(self.addvec[classtype])
            if norm != 0:
                self.addvec[classtype] /= norm
        self.addvec = dict(self.addvec)
        self.trained = True

    def shorttext_to_embedvec(self, shorttext):
        # sum the vectors of all tokens known to the model, then normalize
        vec = np.zeros(self.vecsize)
        tokens = word_tokenize(shorttext)
        for token in tokens:
            if token in self.wvmodel:
                vec += self.wvmodel[token]
        norm = np.linalg.norm(vec)
        if norm != 0:
            vec /= norm
        return vec

    def score(self, shorttext):
        if not self.trained:
            raise ModelNotTrainedException()
        vec = self.shorttext_to_embedvec(shorttext)
        scoredict = {}
        # cosine similarity between the phrase and each class vector
        for classtype in self.addvec:
            try:
                scoredict[classtype] = 1 - cosine(vec, self.addvec[classtype])
            except ValueError:
                scoredict[classtype] = np.nan
        return scoredict


Here ModelNotTrainedException is just an exception raised if the scoring function is called before the model has been trained. (The code is listed in my Github repository.) The similarity is calculated as the cosine similarity.
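The idea behind the class can be tried out without downloading any model. The sketch below uses made-up 2-dimensional "embeddings" (the `toy_vectors` dictionary and all helper names are hypothetical, purely for illustration) and follows the same steps: sum and normalize the word vectors of a phrase, sum and normalize the phrase vectors of a class, then compare with cosine similarity.

```python
import math

# hypothetical 2-dimensional "embeddings", made up for illustration only
toy_vectors = {
    "quantum": [0.9, 0.1], "mechanics": [0.8, 0.2], "particle": [0.85, 0.15],
    "prayer": [0.1, 0.9], "scripture": [0.2, 0.8],
}

def normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

def embed(text):
    """Sum the vectors of the known words in a phrase, then normalize."""
    total = [0.0, 0.0]
    for token in text.split():
        if token in toy_vectors:
            total = [t + v for t, v in zip(total, toy_vectors[token])]
    return normalize(total)

def class_vector(phrases):
    """Sum the normalized phrase vectors of a class, then normalize."""
    total = [0.0, 0.0]
    for phrase in phrases:
        total = [t + v for t, v in zip(total, embed(phrase))]
    return normalize(total)

def cosine_similarity(u, v):
    """Dot product; both inputs are already unit vectors (or zero)."""
    return sum(a * b for a, b in zip(u, v))

classes = {"physics": class_vector(["quantum mechanics"]),
           "theology": class_vector(["prayer", "scripture"])}
scores = {label: cosine_similarity(embed("particle"), vec)
          for label, vec in classes.items()}
```

With these toy numbers, "particle" scores close to 1 for physics and much lower for theology, even though the word itself is not in the training phrases.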

Such an implementation is easy to understand and carry out. It is good enough for a lot of applications. However, it does not take the relations between words or the word order into account.

#### Convolutional Neural Network

To tackle the problem of word relations, we have to use deeper neural networks. Yoon Kim published a well-cited paper on this at EMNLP in 2014, titled “Convolutional Neural Networks for Sentence Classification.” The model architecture is as follows: (taken from his paper)

Each word is represented by an embedded vector, but neighboring words are related through the convolutional matrix. Max pooling and a dense layer are applied afterwards. His paper involves multiple filters with variable window sizes / spatial extents, but for our case of short phrases, I just use one window of size 2 (similar to dealing with bigrams). While Kim's implementation uses Theano (see his Github repository), mine uses keras with the Theano backend. The code, inside CNNEmbedVecClassification.py, is as follows:

import numpy as np
from keras.layers import Convolution1D, MaxPooling1D, Flatten, Dense
from keras.models import Sequential
from nltk import word_tokenize

from utils import ModelNotTrainedException

class CNNEmbeddedVecClassifier:
    def __init__(self,
                 wvmodel,
                 classdict,
                 n_gram,
                 vecsize=300,
                 nb_filters=1200,
                 maxlen=15):
        self.wvmodel = wvmodel
        self.classdict = classdict
        self.n_gram = n_gram
        self.vecsize = vecsize
        self.nb_filters = nb_filters
        self.maxlen = maxlen
        self.trained = False

    def convert_trainingdata_matrix(self):
        classlabels = list(self.classdict.keys())
        lblidx_dict = dict(zip(classlabels, range(len(classlabels))))

        # tokenize the phrases, and build the one-hot class indicators
        phrases = []
        indices = []
        for label in classlabels:
            for shorttext in self.classdict[label]:
                category_bucket = [0]*len(classlabels)
                category_bucket[lblidx_dict[label]] = 1
                indices.append(category_bucket)
                phrases.append(word_tokenize(shorttext))

        # store embedded vectors, zero-padded (or truncated) to maxlen words
        train_embedvec = np.zeros(shape=(len(phrases), self.maxlen, self.vecsize))
        for i in range(len(phrases)):
            for j in range(min(self.maxlen, len(phrases[i]))):
                train_embedvec[i, j] = self.word_to_embedvec(phrases[i][j])
        indices = np.array(indices, dtype=int)

        return classlabels, train_embedvec, indices

    def train(self):
        # convert classdict to training input vectors
        self.classlabels, train_embedvec, indices = self.convert_trainingdata_matrix()

        # build the deep neural network model (Keras 1.x argument names).
        # The opening of the Convolution1D call and the layers after it were
        # missing from this listing; they are reconstructed from the imports and
        # the description in the text, so their arguments are assumptions.
        model = Sequential()
        model.add(Convolution1D(nb_filter=self.nb_filters,
                                filter_length=self.n_gram,
                                border_mode='valid',
                                activation='relu',
                                input_shape=(self.maxlen, self.vecsize)))
        model.add(MaxPooling1D(pool_length=self.maxlen - self.n_gram + 1))
        model.add(Flatten())
        model.add(Dense(len(self.classlabels), activation='softmax'))
        model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

        # train the model
        model.fit(train_embedvec, indices)

        # flag switch
        self.model = model
        self.trained = True

    def word_to_embedvec(self, word):
        return self.wvmodel[word] if word in self.wvmodel else np.zeros(self.vecsize)

    def shorttext_to_matrix(self, shorttext):
        tokens = word_tokenize(shorttext)
        matrix = np.zeros((self.maxlen, self.vecsize))
        for i in range(min(self.maxlen, len(tokens))):
            matrix[i] = self.word_to_embedvec(tokens[i])
        return matrix

    def score(self, shorttext):
        if not self.trained:
            raise ModelNotTrainedException()

        # retrieve vector
        matrix = np.array([self.shorttext_to_matrix(shorttext)])

        # classification using the neural network
        predictions = self.model.predict(matrix)

        # wrangle output result
        scoredict = {}
        for idx, classlabel in enumerate(self.classlabels):
            scoredict[classlabel] = predictions[0][idx]
        return scoredict


The output is a vector of length equal to the number of class labels, 3 in our example. The elements of the output vector add up to one, so each element is the score of a class and can be read as a probability.
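To see what the network does to a phrase, here is a toy forward pass written in plain Python: a convolution over word pairs (window size 2) with ReLU, max pooling over positions, and a dense softmax layer. All weights and the 2-dimensional word vectors are made-up numbers for illustration; this is a sketch of the architecture described above, not the trained model.

```python
import math

def conv_bigram(matrix, filters):
    """'Valid' 1-D convolution with window size 2 followed by ReLU.
    matrix: list of word vectors; filters: list of (w_first, w_second, bias)."""
    feature_maps = []
    for w_first, w_second, bias in filters:
        row = []
        for t in range(len(matrix) - 1):
            pre = (bias
                   + sum(a * b for a, b in zip(w_first, matrix[t]))
                   + sum(a * b for a, b in zip(w_second, matrix[t + 1])))
            row.append(max(0.0, pre))
        feature_maps.append(row)
    return feature_maps

def max_over_time(feature_maps):
    """Keep the strongest response of each filter over all positions."""
    return [max(row) for row in feature_maps]

def dense(features, weights, biases):
    """Fully connected layer producing one logit per class."""
    return [b + sum(w * f for w, f in zip(row, features))
            for row, b in zip(weights, biases)]

def softmax(logits):
    """Turn logits into scores that are positive and add up to one."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# two words plus one zero-padding row, with hypothetical 2-d embeddings
matrix = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
filters = [([1.0, 0.0], [0.0, 1.0], 0.0), ([-1.0, 0.0], [0.0, -1.0], 0.0)]
features = max_over_time(conv_bigram(matrix, filters))
probs = softmax(dense(features, [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]], [0.0, 0.0, 0.0]))
```

Whatever the input, the softmax at the end forces the three scores to add up to one.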

#### Evaluation

A simple cross-validation on the example data set does not show a difference between the two algorithms.

However, we can test the algorithms with a few examples:

Example 1: “renormalization”

• Average: {‘mathematics’: 0.54135105096749336, ‘physics’: 0.63665460856632494, ‘theology’: 0.31014049736087901}
• CNN: {‘mathematics’: 0.093827009201049805, ‘physics’: 0.85451591014862061, ‘theology’: 0.051657050848007202}

As “renormalization” appears in the training data, this is an easy case. The CNN distinguishes the classes much more clearly.

Example 2: “salvation”

• Average: {‘mathematics’: 0.14939650156482298, ‘physics’: 0.21692765541184023, ‘theology’: 0.5698233329716329}
• CNN: {‘mathematics’: 0.012395491823554039, ‘physics’: 0.022725773975253105, ‘theology’: 0.96487873792648315}

“Salvation” is not found in the training data, but it is closely related to “soteriology,” which means the doctrine of salvation. So it correctly identifies it with theology.

Example 3: “coffee”

• Average: {‘mathematics’: 0.096820211601723272, ‘physics’: 0.081567332119268032, ‘theology’: 0.15962682945135631}
• CNN: {‘mathematics’: 0.27321341633796692, ‘physics’: 0.1950736939907074, ‘theology’: 0.53171288967132568}

Coffee is not related to any of the subjects. The first architecture correctly indicates this by giving low scores to all classes, but the CNN, because of its probabilistic nature, has to distribute a total score of one among the classes (and it does not do so very evenly).

The code can be found in my Github repository: stephenhky/PyShortTextCategorization. (This repository has been updated since this article was published. The link shows the version of the code when this appeared online.)

At the end of last month, Yahoo open-sourced the code for training a model to classify not suitable/safe for work (NSFW) images, particularly pornographic images, using a convolutional neural network (CNN). It is implemented with Caffe. Users need to supply the training data, positive being the NSFW images and negative being the suitable/safe for work (SFW) images, to train the model. The model takes an image as input and outputs a score between 0 and 1.

The code is available on Github, with installation instructions in README.md.

Ever since Mehta and Schwab laid out mathematically the relationship between restricted Boltzmann machines (RBM), a type of deep learning algorithm, and the renormalization group (see my previous entry), scientists have been discussing why deep learning works so well. Recently, Henry Lin and Max Tegmark put a preprint on arXiv (arXiv:1608.08225), arguing that deep learning works because it captures a few essential physical laws and properties. Tegmark is a cosmologist.

Physical laws are simple in the sense that a few properties, such as locality, symmetry, hierarchy, etc., lead to large-scale, universal, and often complex phenomena. A lot of machine learning algorithms, including deep learning algorithms, have deep relations with the formalisms of statistical mechanics.

A lot of machine learning algorithms are basically applications of probability theory. Lin and Tegmark outlined a few types of algorithms that seek various types of probabilities, and related these probabilities to Hamiltonians in many-body systems.

They argued why neural networks can approximate functions (polynomials) so well, giving a simple neural network that performs multiplication. With the central limit theorem or Jaynes’ arguments (see my previous entry), a lot of multiplications, they said, can be approximated by a low-order polynomial Hamiltonian. This is like the many many-body systems that can be approximated by the 4th-order Landau-Ginzburg-Wilson (LGW) functional.
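The multiplication network can be checked numerically. The sketch below is my reading of the construction: four hidden neurons with a smooth activation $\sigma$ whose second derivative at zero is nonzero, combined so that the constant and linear terms cancel and only the product survives. I use softplus as the activation (my choice, since $\sigma''(0) = 1/4 \neq 0$ for it), and the `scale` parameter that keeps the inputs small is also an assumption for illustration.

```python
import math

def softplus(x):
    """Smooth activation ln(1 + e^x); its second derivative at 0 is 1/4."""
    return math.log1p(math.exp(x))

SOFTPLUS_SECOND_DERIVATIVE_AT_ZERO = 0.25

def multiply_gate(u, v, scale=0.01):
    """Approximate u*v with four hidden neurons.
    Inputs are scaled down so that the Taylor expansion around 0 is accurate."""
    a, b = scale * u, scale * v
    hidden = (softplus(a + b) + softplus(-a - b)
              - softplus(a - b) - softplus(-a + b))
    # leading term of `hidden` is 4 * sigma''(0) * a * b
    return hidden / (4 * SOFTPLUS_SECOND_DERIVATIVE_AT_ZERO * scale ** 2)

approx_product = multiply_gate(3.0, -2.0)
```

The smaller `scale` is, the smaller the error from the higher-order terms, at the price of subtracting nearly equal numbers.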

Properties such as locality reduce the number of hyper-parameters needed, because interactions are restricted to close neighbors. Symmetry reduces the number further, along with the computational complexity. Symmetry and second-order phase transitions make the scaling hypothesis possible, leading to the use of tools such as the renormalization group (RG). As many people have been arguing, deep learning resembles RG because it filters out unnecessary information and maps out the crucial features. Tegmark uses classifying cats vs. dogs as an example, in analogy to retrieving the temperature of a many-body system using an RG procedure. They gave a counter-example to Schwab’s paper, in which the probabilities cannot be preserved by the RG procedure; while the counter-example is sound, preserving the probabilities is not the point of the RG procedure anyway.

They also discussed the no-flattening theorems for neural networks.

There are many tasks that involve coding, for example, putting kids into groups according to their age, labeling webpages by their kinds, or putting students in Hogwarts into four colleges… And researchers or lawyers need to code people, according to their filled-in information, into occupations. Melissa Friesen, an investigator in the Division of Cancer Epidemiology and Genetics (DCEG), National Cancer Institute (NCI), National Institutes of Health (NIH), saw the need for large-scale coding. Many researchers are dealing with big data concerning epidemiology. She led a research project, in collaboration with the Office of Intramural Research (OIR), Center for Information Technology (CIT), National Institutes of Health (NIH), to develop an artificial intelligence system to cope with the problem. This led to a publicly available tool called SOCcer, an acronym for “Standardized Occupation Coding for Computer-assisted Epidemiological Research.” (URL: http://soccer.nci.nih.gov/soccer/)

The system was initially developed in an attempt to find the correlation between the onset of cancers and other diseases and the occupation. “The application is not intended to replace expert coders, but rather to prioritize which job descriptions would benefit most from expert review,” said Friesen in an interview. She mainly works with Daniel Russ in CIT.

SOCcer takes the job title, industry code (in terms of SIC, Standard Industrial Classification), and job duties, and gives an occupational code in SOC 2010 (Standard Occupational Classification), used by U.S. federal government agencies. The data involve short text, often messy. There are 840 codes in the SOC 2010 system. Conventional natural language processing (NLP) methods may not apply. Friesen, Russ, and Kwan-Yuet (Stephen) Ho (also in OIR, CIT; a CSRA staff member) use fuzzy logic and maximum entropy (maxent) methods, with some feature engineering, to build various classifiers. These classifiers are aggregated, as in stacked generalization (see my previous entry), using logistic regression, to give a final score.
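The aggregation step can be pictured with a small sketch. The classifier names, weights, and intercept below are made up for illustration and are not SOCcer's actual parameters; the only point is that a logistic regression turns the scores of the base classifiers into one final score between 0 and 1.

```python
import math

def stacked_score(base_scores, weights, intercept):
    """Combine base classifier scores with a logistic regression."""
    z = intercept + sum(weights[name] * score for name, score in base_scores.items())
    return 1.0 / (1.0 + math.exp(-z))

# hypothetical weights, as if learned on held-out predictions of the base classifiers
weights = {"title": 2.0, "industry": 1.0, "duties": 1.5}
intercept = -2.0

# hypothetical scores of the base classifiers for one candidate occupation code
base_scores = {"title": 0.9, "industry": 0.5, "duties": 0.4}
final_score = stacked_score(base_scores, weights, intercept)
```

In stacked generalization, the weights are fitted on predictions that the base classifiers made for data they were not trained on, so that the combiner does not simply reward overfitting.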

SOCcer has a companion software tool, called SOCAssign, for expert coders to prioritize the codings. It won the DCEG Informatics Tool Challenge in 2015, and SOCcer itself won in 2016. The SOCcer team also received the Scientific Award of Merit from CIT/OCIO in 2016 (see this). Their work was published in Occup. Environ. Med.

On August 1, my friends and I attended a meetup hosted by DC Data Science, titled “Predicting and Understanding Law with Machine Learning.” The speaker was John Nay, a Ph.D. candidate at Vanderbilt University. He presented his research, which is an application of natural language processing to legal enactment documents.

His talk was very interesting, from the similarity of presidents and the chambers, to the kind of topics each party focused on. He used a variety of techniques such as Word2Vec, STM (structural topic modeling), and some common textual and statistical analysis. It is quite a comprehensive study.

His work is demonstrated at predictgov.com, and can be found on arXiv.

The topic of word embedding algorithms has been one of the interests of this blog, as in this entry, with Word2Vec [Mikolov et al. 2013] as one of the main examples. It is a great tool for text mining (for example, see [Czerny 2015]), as it reduces the dimensions needed (compared to the bag-of-words model). As an algorithm borrowed from computer vision, a lot of these algorithms use deep learning methods to train the model, while it is not exactly clear why they work. Despite that, there are many articles about how to train the model. [Goldberg & Levy 2014, Rong 2014 etc.] Addition and subtraction of the word vectors show amazing relationships that carry semantic values, as I have shown in my previous blog entry. [Ho 2015]

However, Tomas Mikolov is no longer working at Google, and the development of this algorithm has been discontinued. As a follow-up to their work, the Stanford NLP group later proposed a model called GloVe (Global Vectors), which embeds word vectors using probabilistic methods. [Pennington, Socher & Manning 2014] It is implemented in the package glove-python in Python, and text2vec in R (or see their CRAN post). Their paper is neatly written, and a highly recommended read.

To explain the theory of GloVe, we start with a basic probabilistic picture in natural language processing (NLP). We suppose that the relations between words occur within certain text windows in a corpus, but the details are not important here. Assume that $i$, $j$, and $k$ are three words, and the conditional probability $P_{ij}$ is defined as

$P_{ij} = P(j | i) = \frac{X_{ij}}{X_i}$,

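where the $X$'s are the counts: $X_{ij}$ is the number of times word $j$ occurs in the context of word $i$, and $X_i = \sum_k X_{ik}$. $P_{ik}$ and $P_{jk}$ are defined similarly. We are interested in the following ratio: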

$F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$.

The tilde means “context,” but we will later assume it is also a word. Citing the example from their paper, take $i$ as ice, and $j$ as steam. If $k$ is solid, then the ratio is expected to be large; if $k$ is gas, then it is expected to be small. But if $k$ is water, which is related to both, or fashion, which is related to neither, then the ratio is expected to be approximately 1.
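The behaviour of the ratio can be illustrated with a few lines of code. The co-occurrence counts below are invented for illustration (they are not the numbers from the paper), and the vocabulary of context words is assumed to consist of only the four words shown.

```python
# hypothetical co-occurrence counts X[word][context word], invented for illustration
X = {
    "ice":   {"solid": 80, "gas": 2,  "water": 300, "fashion": 1},
    "steam": {"solid": 2,  "gas": 60, "water": 280, "fashion": 1},
}

def cond_prob(word, context):
    """P(context | word) = X_{word,context} / X_word."""
    return X[word][context] / sum(X[word].values())

def ratio(i, j, k):
    """The ratio P_ik / P_jk studied above."""
    return cond_prob(i, k) / cond_prob(j, k)

ratios = {k: ratio("ice", "steam", k) for k in ["solid", "gas", "water", "fashion"]}
```

With these counts, the ratio is large for "solid", small for "gas", and close to 1 for both "water" and "fashion".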

The addition and subtraction in Word2Vec are similar to this. We want the ratio to behave like the subtraction in Word2Vec (and multiplication like addition), so we should modify the function $F$ such that

$F(w_i - w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$.

However, the input arguments of $F$ are vectors, while the output is a scalar. We resolve the issue by making the input argument a dot product,

$F( (w_i - w_j)^T \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$.

In NLP, the word-word co-occurrence matrices are symmetric, and our function $F$ should also be invariant under switching the labeling. We first require $F$ to be a homomorphism,

$F((w_i - w_j)^T \tilde{w}_k) = \frac{F(w_i^T \tilde{w}_k) }{ F(w_j^T \tilde{w}_k)}$,

where we define,

$F(w_i^T \tilde{w}_k) = P_{ik} = \frac{X_{ik}}{X_i}$.

It is clear that $F$ is an exponential function, but to ensure symmetry, we further define:

$w_i^T \tilde{w}_k + b_i + \tilde{b}_k = \log X_{ik}$.
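The step leading to this equation can be spelled out as follows (this is my reading of the argument in the paper). Taking $F = \exp$ in the definition of $F(w_i^T \tilde{w}_k)$ above gives

$w_i^T \tilde{w}_k = \log P_{ik} = \log X_{ik} - \log X_i$.

The term $\log X_i$ does not depend on $k$, so it plays the role of the bias $b_i$ (up to a sign). It is the only term that breaks the symmetry under exchanging $i$ and $k$, and adding the matching bias $\tilde{b}_k$ for the context word restores that symmetry.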

As a result of this equation, the authors defined the following cost function to optimize for GloVe model:

$J = \sum_{i, j=1}^V f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$,

where $w_i$, $\tilde{w}_j$, $b_i$, and $\tilde{b}_j$ are parameters to learn, and $f(x)$ is a weighting function. (Refer to the paper for the details.) [Pennington, Socher & Manning 2014]
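The cost function is simple enough to evaluate directly. The sketch below computes $J$ for a tiny co-occurrence matrix in plain Python. The weighting function uses the form and the values $x_{max} = 100$ and $\alpha = 3/4$ reported in the paper; the function names and the toy numbers are my own, and no optimization is performed.

```python
import math

def weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f(x): (x / x_max)^alpha below x_max, and 1 otherwise."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_cost(X, w, w_tilde, b, b_tilde):
    """J = sum over i, j of f(X_ij) (w_i . w~_j + b_i + b~_j - log X_ij)^2."""
    cost = 0.0
    for i, row in enumerate(X):
        for j, count in enumerate(row):
            if count == 0:
                continue  # f(0) = 0, so empty cells contribute nothing (and log 0 is avoided)
            dot = sum(a * c for a, c in zip(w[i], w_tilde[j]))
            residual = dot + b[i] + b_tilde[j] - math.log(count)
            cost += weight(count) * residual ** 2
    return cost

# toy co-occurrence matrix for a two-word vocabulary, with 1-dimensional vectors
X = [[0, 100], [100, 0]]
w = [[1.0], [1.0]]
w_tilde = [[0.0], [0.0]]
cost_untrained = glove_cost(X, w, w_tilde, [0.0, 0.0], [0.0, 0.0])
cost_perfect = glove_cost(X, w, w_tilde, [math.log(100)] * 2, [0.0, 0.0])
```

Training GloVe means minimizing this quantity over the vectors and biases, typically with a stochastic gradient method.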

As Radim Řehůřek said in his blog entry, [Řehůřek 2014] it is a neat paper, but their evaluation is crappy.

This theory explains why certain similar relations can be achieved, such as Paris – France being roughly equal to Beijing – China, as both can be transformed to the ratio in the definition of $F$ above.

It is a neat paper, as it employs optimization theory and probability theory, without any black-box deep learning.