Deep learning has achieved a big success in the past few years, but its interpretive power is limited. They work largely because of the abundance of data. On the other hand, traditional machine learning algorithms are much better in interpretive power, but manual feature engineering costs a lot, due to the lack of data in earlier era. In light of this, a group of scientists initiated the work of graph networks, aiming at devising new artificial intelligence algorithms that exploits the advantages of two worlds, while still holding the principle of combinatorial generalization in constructing methods by using known building blocks to build new methods. Graph is good at interpretation as it is good for relational representation.

The use of graph networks is more than the graph convolutional neural networks (GCN) in the previous two blog entries. (part I and part II) However, to achieve relational inductive biases, an entity (an element with attributes), a relation, (a property between entities) and a rule. (a function that maps entities and relations to other entities and relations) This can be realized using graph, which is a mathematical structure that contains nodes and edges (that connect nodes.) To generalize the use of graph networks in various machine learning and deep learning methods, they reviewed the graph block, which is basically a function, or a mapping, from a graph to another graph, as shown in the algorithm below:

Works of graph networks are not non-existent; the authors listed previous works that can be seen as graph networks, for example:

• Message-passing neural network (MPNN) (2017);
• Non-local neural networks (NLNN) (2018).

The use of graph networks, I believe, is the next trend. There have been works regarding the graph-powered machine learning. (see Google AI blog, GraphAware Slideshare) I recently started an open-source project, a Python package called graphflow, to explore various algorithms using graphs, including PageRank, HITS, resistance, and non-linear resistance.

In the previous post, the convolution of the graph Laplacian is defined in its graph Fourier space as outlined in the paper of Bruna et. al. (arXiv:1312.6203) However, the eigenmodes of the graph Laplacian are not ideal because it makes the bases to be graph-dependent. A lot of works were done in order to solve this problem, with the help of various special functions to express the filter functions. Examples include Chebyshev polynomials and Cayley transform.

## ChebNet

Kipf and Welling proposed the ChebNet (arXiv:1609.02907) to approximate the filter using Chebyshev polynomial, using the result proved by Hammond et. al. (arXiv:0912.3848) With a convolutional layer $g_{\theta} \circ x$, instead of computing it directly by $\Phi^T g_{\theta} \Phi x$, it approximates the filter in terms of Chebyshev polynomial:

$g_{\theta} (\Lambda) \approx \sum_{k=0}^K \theta_k T_k(\Lambda)$,

where $T_k(x)$ are the Chebyshev polynomials. It has been proved by Hammond et. al. that by well truncating this representation, Chebyshev polynomials are very good to approximate the function. Then the convolution is:

$g_{\theta} \circ x \approx \sum_{k=0}^K \theta_k T_k (\Lambda) x$.

Here, $\theta$’s are the parameters to be trained. This fixed the bases of the representation, and it speeds up computation. The disadvantage is that eigenvalues are clusters in a few values with large gaps.

## CayleyNet

The problem of ChebNet led to the work of Levie et. al., (arXiv:1705.07664) who proposed another approximation is used using Cayley transform. They made use of the Cayley function:

$C(\lambda) = \frac{\lambda - i}{\lambda + i}$,

which is a bijective function from $\mathbb{R}^+$ to complex unit half-circle. Instead of Chebyshev polynomials, it approximates the filter as:

$g(\lambda) = c_0 + \sum_{j=1}^r \left[ c_j C^j (h \lambda) + c_j^{\ast} C^{j \ast} (h \lambda) \right]$,

where $c_0$ is real and other $c_j$’s are generally complex, and $h$ is a zoom parameter, and $\lambda$’s are the eigenvalues of the graph Laplacian. Tuning $h$ makes one find the best zoom that spread the top eigenvalues. $c$‘s are computed by training. This solves the problem of unfavorable clusters in ChebNet.

## MotifNet

All previous works are undirected graph. How do we deal with directed graph with an asymmetric graph Laplacian? Benson, Gleich, Leskovec published an important work on Science in 2016 (arXiv:1612.08447) to address this problem. Their approach is reducing a directed graph to a higher order structure called network motifs. There are 13 network motifs. For each network motif, one can define an adjacency matrix for that motif by $\mathcal{M}_k$, with elements $\mathcal{M}_{k, ij}$ being the number of motifs in the graph the the edge $(i, j)$ that it belongs to.

Then one computes 13 graph Laplacians from these 13 adjacency matrices. These graph Laplacians are symmetric, like those of undirected graphs. Then any filters can be approximated by the following multivariate matrix polynomial, as suggested by Monti, Otness, and Bronstein in their MotifNet paper (arXiv:1802.01572):

$f_{\Theta} (\Delta_1, \dots, \Delta_k) = \sum_{j=0}^P \sum_{k_1, \dots, k_j \in {1, \ldots, K}} \theta_{\theta_1, \ldots, \theta_k} \Delta_{k_1}, \ldots, \Delta_{k_j}$.

Applications of image processing, citation networks etc.

The Python package for text mining shorttext has a new release: 0.5.4. It can be installed by typing in the command line:

pip install -U shorttext

For some people, you may need to install it from “root”, i.e., adding sudo in front of the command. Since the version 0.5 (including releases 0.5.1 and 0.5.4), there have been substantial addition of functionality, mostly about comparisons between short phrases without running a supervised or unsupervised machine learning algorithm, but calculating the “similarity” with various metrics, including:

• soft Jaccard score (the same kind of fuzzy scores based on edit distance in SOCcer),
• Word Mover’s distance (WMD, detailedly described in a previous post), and
• Jaccard index due to word-embedding model.

For the soft Jaccard score due to edit distance, we can call it by:

>>> from shorttext.metrics.dynprog import soft_jaccard_score
>>> soft_jaccard_score(['book', 'seller'], ['blok', 'sellers'])     # gives 0.6716417910447762
>>> soft_jaccard_score(['police', 'station'], ['policeman'])        # gives 0.2857142857142858

The core of this code was written in C, and interfaced to Python using SWIG.

For the Word Mover’s Distance (WMD), while the source codes are the same as my previous post, it can now be called directly. First, load the modules and the word-embedding model:

>>> from shorttext.metrics.wasserstein import word_mover_distance
>>> from shorttext.utils import load_word2vec_model
>>> wvmodel = load_word2vec_model('/path/to/model_file.bin')

And compute the WMD with a single function:

>>> word_mover_distance(['police', 'station'], ['policeman'], wvmodel)                      # gives 3.060708999633789
>>> word_mover_distance(['physician', 'assistant'], ['doctor', 'assistants'], wvmodel)      # gives 2.276337146759033

And the Jaccard index due to cosine distance in Word-embedding model can be called like this:

>>> from shorttext.metrics.embedfuzzy import jaccardscore_sents
>>> jaccardscore_sents('doctor', 'physician', wvmodel)   # gives 0.6401538990056869
>>> jaccardscore_sents('chief executive', 'computer cluster', wvmodel)   # gives 0.0022515450768836143
>>> jaccardscore_sents('topological data', 'data of topology', wvmodel)   # gives 0.67588977344632573

Most new functions can be found in this tutorial.

And there are some minor bugs fixed.

Much about the use of word-embedding models such as Word2Vec and GloVe have been covered. However, how to measure the similarity between phrases or documents? One natural choice is the cosine similarity, as I have toyed with in a previous post. However, it smoothed out the influence of each word. Two years ago, a group in Washington University in St. Louis proposed the Word Mover’s Distance (WMD) in a PMLR paper that captures the relations between words, not simply by distance, but also the “transportation” from one phrase to another conveyed by each word. This Word Mover’s Distance (WMD) can be seen as a special case of Earth Mover’s Distance (EMD), or Wasserstein distance, the one people talked about in Wasserstein GAN. This is better than bag-of-words (BOW) model in a way that the word vectors capture the semantic similarities between words.

# Word Mover’s Distance (WMD)

The formulation of WMD is beautiful. Consider the embedded word vectors $\mathbf{X} \in R^{d \times n}$, where $d$ is the dimension of the embeddings, and $n$ is the number of words. For each phrase, there is a normalized BOW vector $d \in R^n$, and $d_i = \frac{c_i}{\sum_i c_i}$, where $i$‘s denote the word tokens. The distance between words are the Euclidean distance of their embedded word vectors, denoted by $c(i, j) = || \mathbf{x}_i - \mathbf{x}_j ||_2$, where $i$ and $j$ denote word tokens. The document distance, which is WMD here, is defined by $\sum_{i, j} \mathbf{T}_{i j} c(i, j)$, where $\mathbf{T}$ is a $n \times n$ matrix. Each element $\mathbf{T}_{ij} \geq 0$ denote how nuch of word $i$ in the first document (denoted by $\mathbf{d}$) travels to word $j$ in the new document (denoted by $\mathbf{d}'$).

Then the problem becomes the minimization of the document distance, or the WMD, and is formulated as:

$\text{min}_{\mathbf{T} \geq 0} \sum_{i, j=1}^n \mathbf{T}_{ij} c(i, j)$,

given the constraints:

$\sum_{j=1}^n \mathbf{T}_{ij} = d_i$, and

$\sum_{i=1}^n \mathbf{T}_{ij} = d_j'$.

This is essentially a simplified case of the Earth Mover’s distance (EMD), or the Wasserstein distance. (See the review by Gibbs and Su.)

# Using PuLP

The WMD is essentially a linear optimization problem. There are many optimization packages on the market, and my stance is that, for those common ones, there are no packages that are superior than others. In my job, I happened to handle a missing data problem, in turn becoming a non-linear optimization problem with linear constraints, and I chose limSolve, after I shop around. But I actually like a lot of other packages too. For WMD problem, I first tried out cvxopt first, which should actually solve the exact same problem, but the indexing is hard to maintain. Because I am dealing with words, it is good to have a direct hash map, or a dictionary. I can use the Dictionary class in gensim. But I later found out I should use PuLP, as it allows indices with words as a hash map (dict in Python), and WMD is a linear programming problem, making PuLP is a perfect choice, considering code efficiency.

An example of using PuLP can be demonstrated by the British 1997 UG Exam, as in the first problem of this link, with the Jupyter Notebook demonstrating this.

# Implementation of WMD using PuLP

The demonstration can be found in the Jupyter Notebook.

Load the necessary packages:

from itertools import product
from collections import defaultdict

import numpy as np
from scipy.spatial.distance import euclidean
import pulp
import gensim

Then define the functions the gives the BOW document vectors:

def tokens_to_fracdict(tokens):
cntdict = defaultdict(lambda : 0)
for token in tokens:
cntdict[token] += 1
totalcnt = sum(cntdict.values())
return {token: float(cnt)/totalcnt for token, cnt in cntdict.items()}

Then implement the core calculation. Note that PuLP is actually a symbolic computing package. This function return a pulp.LpProblem class:

def word_mover_distance_probspec(first_sent_tokens, second_sent_tokens, wvmodel, lpFile=None):
all_tokens = list(set(first_sent_tokens+second_sent_tokens))
wordvecs = {token: wvmodel[token] for token in all_tokens}

first_sent_buckets = tokens_to_fracdict(first_sent_tokens)
second_sent_buckets = tokens_to_fracdict(second_sent_tokens)

T = pulp.LpVariable.dicts('T_matrix', list(product(all_tokens, all_tokens)), lowBound=0)

prob = pulp.LpProblem('WMD', sense=pulp.LpMinimize)
prob += pulp.lpSum([T[token1, token2]*euclidean(wordvecs[token1], wordvecs[token2])
for token1, token2 in product(all_tokens, all_tokens)])
for token2 in second_sent_buckets:
prob += pulp.lpSum([T[token1, token2] for token1 in first_sent_buckets])==second_sent_buckets[token2]
for token1 in first_sent_buckets:
prob += pulp.lpSum([T[token1, token2] for token2 in second_sent_buckets])==first_sent_buckets[token1]

if lpFile!=None:
prob.writeLP(lpFile)

prob.solve()

return prob

To extract the value, just run pulp.value(prob.objective)

We use Google Word2Vec. Refer the $\mathbf{T}$ matrices in the Jupyter Notebook. Running this by a few examples:

1. document1 = President, talk, Chicago
document2 = President, speech, Illinois
WMD = 2.88587622936
2. document1 = physician, assistant
document2 = doctor
WMD = 2.8760048151
3. document1 = physician, assistant
document2 = doctor, assistant
WMD = 1.00465738773
(compare with example 2!)
4. document1 = doctors, assistant
document2 = doctor, assistant
WMD = 1.02825379372
(compare with example 3!)
5. document1 = doctor, assistant
document2 = doctor, assistant
WMD = 0.0
(totally identical; compare with example 3!)

There are more examples in the notebook.

# Conclusion

WMD is a good metric comparing two documents or sentences, by capturing the semantic meanings of the words. It is more powerful than BOW model as it captures the meaning similarities; it is more powerful than the cosine distance between average word vectors, as the transfer of meaning using words from one document to another is considered. But it is not immune to the problem of misspelling.

This algorithm works well for short texts. However, when the documents become large, this formulation will be computationally expensive. The author actually suggested a few modifications, such as the removal of constraints, and word centroid distances.

Example codes can be found in my Github repository: stephenhky/PyWMD.

On 07/28/2017, shorttext published its release 0.4.1, with a few important updates. To install it, type the following in the OS X / Linux command line:

>>> pip install -U shorttext

The documentation in PythonHosted.org has been abandoned. It has been migrated to readthedocs.org. (URL: http://shorttext.readthedocs.io/ or http:// shorttext.rtfd.io)

## Exploiting the Word-Embedding Layer

This update is mainly due to an important update in gensim, motivated by earlier shorttext‘s effort in integrating scikit-learn and keras. And gensim also provides a keras layer, on the same footing as other neural networks, activation function, or dropout layers, for Word2Vec models. Because shorttext has been making use of keras layers for categorization, such advance in gensim in fact makes it a natural step to add an embedding layer of all neural networks provided in shorttext. How to do it? (See shorttext tutorial for “Deep Neural Networks with Word Embedding.”)

import shorttext
wvmodel = shorttext.utils.load_word2vec_model('/path/to/GoogleNews-vectors-negative300.bin.gz') &nbsp; # load the pre-trained Word2Vec model
trainclassdict = shorttext.data.subjectkeywords() &nbsp; # load an example data set


To train a model, you can do it the old way, or do it the new way with additional gensim function:

kmodel = shorttext.classifiers.frameworks.CNNWordEmbed(wvmodel=wvmodel, nb_labels=len(trainclassdict.keys()), vecsize=100, with_gensim=True) &nbsp; # keras model, setting with_gensim=True
classifier = shorttext.classifiers.VarNNEmbeddedVecClassifier(wvmodel, with_gensim=True, vecsize=100) &nbsp; # instantiate the classifier, setting with_gensim=True
classifier.train(trainclassdict, kmodel)


The parameters with_gensim in both CNNWordEmbed and VarNNEmbeddedVecClassifier are set to be False by default, because of backward compatibility. However, setting it to be True will enable it to use the new gensim Word2Vec layer.

These change in gensim and shorttext are the works mainly contributed by Chinmaya Pancholi, a very bright student at Indian Institute of Technology, Kharagpur, and a GSoC (Google Summer of Code) student in 2017. He revolutionized gensim by integrating scikit-learn and keras into gensim. He also used what he did in gensim to improve the pipelines of shorttext. He provided valuable technical suggestions. You can read his GSoC proposal, and his blog posts in RaRe Technologies, Inc. Chinmaya has been diligently mentored by Ivan Menshikh and Lev Konstantinovskiy of RaRe Technologies.

## Maxent Classifier

Another important update is the adding of maximum entropy (maxent) classifier. (See the corresponding tutorial on “Maximum Entropy (MaxEnt) Classifier.”) I will devote a separate entry on the theory, but it is very easy to use it,

import shorttext
from shorttext.classifiers import MaxEntClassifier

classifier = MaxEntClassifier()


Use the NIHReports dataset as the example:

classdict = shorttext.data.nihreports()
classifier.train(classdict, nb_epochs=1000)


The classification is just like other classifiers provided by shorttext:

classifier.score('cancer immunology') # NCI tops the score
classifier.score('children health') # NIAID tops the score
classifier.score('Alzheimer disease and aging') # NIAID tops the score


In these few days, Facebook published a new research paper, regarding the use of sequence to sequence (seq2seq) model for machine translation. What is special about this seq2seq model is that it uses convolutional neural networks (ConvNet, or CNN), instead of recurrent neural networks (RNN).

The original seq2seq model is implemented with Long Short-Term Memory (LSTM) model, published by Google.(see their paper) It is basically a character-based model that generates texts according to a sequence of input characters. And the same author constructed a neural conversational model, (see their paper) as mentioned in a previous blog post. Daewoo Chong, from Booz Allen Hamilton, presented its implementation using Tensorflow in DC Data Education Meetup on April 13, 2017. Johns Hopkins also published a spell correction algorithm implemented in seq2seq. (see their paper) The real advantage of RNN over CNN is that there is no limit about the size of the tokens input or output.

While the fixing of the size of vectors for CNN is obvious, using CNN serves the purpose of limiting the size of input vectors, and thus limiting the size of contexts. This limits the contents, and speeds up the training process. RNN is known to be trained slow. Facebook uses this CNN seq2seq model for their machine translation model. For more details, take a look at their paper and their Github repository.

At the end of last month, Yahoo opened the sources of training a model to classify not suitable/safe for work (NSFW) images, particularly pornographic images, using convolutional neural network (CNN). It was implemented with Caffe. Users need to supply the training data, positive being the NSFW images, and negative being the suitable/safe for work (SFW) images, to train the model. The model takes an image as the input, and output a score between 0 and 1.

The codes are available on Github, with the README.md about the installation.

Ever since Mehta and Schwab laid out the relationship between restricted Boltzmann machines (RBM) and deep learning mathematically (see my previous entry), scientists have been discussing why deep learning works so well. Recently, Henry Lin and Max Tegmark put a preprint on arXiv (arXiv:1609.09225), arguing that deep learning works because it captures a few essential physical laws and properties. Tegmark is a cosmologist.

Physical laws are simple in a way that a few properties, such as locality, symmetry, hierarchy etc., lead to large-scale, universal, and often complex phenomena. A lot of machine learning algorithms, including deep learning algorithms, have deep relations with formalisms outlined in statistical mechanics.

A lot of machine learning algorithms are basically probability theory. They outlined a few types of algorithms that seek various types of probabilities. They related the probabilities to Hamiltonians in many-body systems.

They argued why neural networks can approximate functions (polynomials) so well, giving a simple neural network performing multiplication. With central limit theorem or Jaynes’ arguments (see my previous entry), a lot of multiplications, they said, can be approximated by low-order polynomial Hamiltonian. This is like a lot of many-body systems that can be approximated by 4-th order Landau-Ginzburg-Wilson (LGW) functional.

Properties such as locality reduces the number of hyper-parameters needed because it restricts to interactions among close proximities. Symmetry further reduces it, and also computational complexities. Symmetry and second order phase transition make scaling hypothesis possible, leading to the use of the tools such as renormalization group (RG). As many people have been arguing, deep learning resembles RG because it filters out unnecessary information and maps out the crucial features. Tegmark use classifying cats vs. dogs as an example, as in retrieving temperatures of a many-body systems using RG procedure. They gave a counter-example to Schwab’s paper with the probabilities cannot be preserved by RG procedure, but while it is sound, but it is not the point of the RG procedure anyway.

They also discussed about the no-flattening theorems for neural networks.

There are many tasks that involve coding, for example, putting kids into groups according to their age, labeling the webpages about their kinds, or putting students in Hogwarts into four colleges… And researchers or lawyers need to code people, according to their filled-in information, into occupations. Melissa Friesen, an investigator in Division of Cancer Epidemiology and Genetics (DCEG), National Cancer Institute (NCI), National Institutes of Health (NIH), saw the need of large-scale coding. Many researchers are dealing with big data concerning epidemiology. She led a research project, in collaboration with Office of Intramural Research (OIR), Center for Information Technology (CIT), National Institutes of Health (NIH), to develop an artificial intelligence system to cope with the problem. This leads to a publicly available tool called SOCcer, an acronym for “Standardized Occupation Coding for Computer-assisted Epidemiological Research.” (URL: http://soccer.nci.nih.gov/soccer/)

The system was initially developed in an attempt to find the correlation between the onset of cancers and other diseases and the occupation. “The application is not intended to replace expert coders, but rather to prioritize which job descriptions would benefit most from expert review,” said Friesen in an interview. She mainly works with Daniel Russ in CIT.

SOCcer takes job title, industry codes (in terms of SIC, Standard Industrial Classification), and job duties, and gives an occupational code called SOC 2010 (Standard Occupational Classification), used by U. S. federal government agencies. The data involves short text, often messy. There are 840 codes in SOC 2010 systems. Conventional natural language processing (NLP) methods may not apply. Friesen, Russ, and Kwan-Yuet (Stephen) Ho (also in OIR, CIT; a CSRA staff) use fuzzy logic, and maximum entropy (maxent) methods, with some feature engineering, to build various classifiers. These classifiers are aggregated together, as in stacked generalization (see my previous entry), using logistic regression, to give a final score.

SOCcer has a companion software, called SOCAssign, for expert coders to prioritize the codings. It was awarded with DCEG Informatics Tool Challenge 2015. SOCcer itself was awarded in 2016. And the SOCcer team was awarded for Scientific Award of Merit by CIT/OCIO in 2016 as well (see this). Their work was published in Occup. Environ. Med.