In 2019, Google published a new Python library called “tensornetwork” (arXiv:1905.01330) that facilitates the computation of… tensor networks. Tensor network is a tool from quantum many-body theory, widely used in condensed matter physics. There have been a lot of numerical packages for tensor computation, but this library takes it to the next level because of its distinctive framework.

What is a tensor network, though?

“A tensor network is a collection of tensors with indices connected according to a network pattern. It can be used to efficiently represent a many-body wave-function in an otherwise exponentially large Hilbert space.”

https://www.perimeterinstitute.ca/research/research-initiatives/tensor-networks-initiative

## Renormalization Group (RG)

It is not until recently that tensor networks have its application in machine learning. As stated in a previous post, a mathematical connection between restricted Boltzmann machine (RBM) and variational renormalization group (RG) was drawn. (arXiv:1410.1831) It shedded light to the understanding of interpretability of deep learning, which has been criticized to be a black box. However, RBM is just a type of unsupervised machine learning, but how about others?

Seeing this, Schwab, one of the authors of the RG paper, and Stoudenmire did some work to realize the use of RG in machine learning. Stoudenmore is a physicist, and he made use of density matrix renormalization group (DMRG) that he is familiar with, and invented a supervised learning algorithm, which is later renamed tensor network machine learning (TNML). The training is adapted from the sweeping algorithm, the standard of DMRG, that combining bipartite site one-by-one, updating it, and decomposing into two site by studying its quantum entanglment (using singular value decomposition, or Schmidt decomposition).

Instead of bringing interpretability to deep learning, this work in fact opened a new path of new machine learning algorithms with known techniques.

## What is RG?

Renormalization group (RG) is a formalism of “zooming out” in scale-invariant system, determining which terms to truncate in a model. It is an important formalism in high energy physics and statistical field theory. (See Ma’s book for reference.)

Density matrix renormalization group (RG) is a variational real-space numerical technique that look at collections of quantum bits (zoomed-out) as a block. It was invented by Steven White, and it is useful in studying strongly correlated electronic systems. (PRL 69 (19): 2863-2866 (1992)). However, the original DMRG paper is not very accessible, until it is rephrased using the tensor network notation (TNN), as shown in Schollwoeck’s article.

## Is Tensor Network Related to Quantum Computing?

This is not an easy question to answer. Tensor networks come from quantum physics, but quantum physics is usually not directly leading to quantum computing. In fact, classical computing hardwares have a lot of quantum physics in it. A simple answer to this question is no, as the algorithm using tensor network is implemented in classical computers.

There have been a lot of publications on quantum machine learning lately. A classic book on this topic is written by Peter Wittek. The book covers topics on basic machine learning and quantum computing, and then quantum machine learning algorithms. There is a quantum counterpart of each of the common machine learning algorithms in the book. However, we know it would be much more useful if there are new algorithms exploiting the advantages of quantum computing. Tensor network is a natural choice as it builds on qubits, and the representations and operations are naturally quantum.

## Next…

Tensor network is an interesting subject from both a theoretical and applicational perspective. In coming posts I will talk about its application on machine learning and a taste of codes.

• Chase Roberts, Ashley Milsted, Martin Ganahl, Adam Zalcman, Bruce Fontaine, Yijian Zou, Jack Hidary, Guifre Vidal, Stefan Leichenauer, “TensorNetwork: A Library for Physics and Machine Learning,” arXiv:1905.01330 (2019). [arXiv]
• “Google TensorNetwork Library Dramatically Accelerates ML & Physics Tasks,” Syncedreview. (2019) [Medium]
• Chase Roberts, “Introducing TensorNetwork, an Open Source Library for Efficient Tensor Calculations,” Google AI Blog. (2019) [GoogleAIBlog]
• “Tensor Networks and Density Matrix Renormalization Group,” Everything About Data Analytics. (2016) [WordPress]
• P. Mehta, D. J. Schwab, “An exact mapping between the Variational Renormalization Group and Deep Learning,” arXiv:1410.3831 (2014). [arXiv]
• Sheng-kang Ma, Modern Theory of Critical Phenomena, (New York, NY: Routledge, 2018). [Amazon]
• S. R. White, “Density matrix formulation for quantum renormalization groups,” Phys. Rev. Lett. 69, 2863 (1992). [APS]
• Ulrich Schollwoeck, “The density-matrix renormalization group,” Rev. Mod. Phys. 77, 259 (2005); arXiv:cond-mat/0409292. [arXiv]
• Ulrich Schollwoeck, “The density-matrix renormalization group in the age of matrix product states,” Annals of Physics 326, 96 (2011); arXiv:1008.3477. [arXiv]
• Peter Wittek, Quantum Machine Learning: What Quantum Computing Means to Data Mining (San Diego, CA: Academic Press, 2014). [Amazon] [PDF]
• Jacob Biamonte, Peter Wittek, Nicola Pancotti, Patrick Rebentrost, Nathan Wiebe, Seth Lloyd, “Quantum Machine Learning,” Nature 549, 195-202 (2017). [Nature]
• Tensor Networks: From Entangled Quantum Matter to Emergent Space Time, Perimeter Institute. [Perimeter]

Feature picture taken from Perimeter Institute.

Deep learning has achieved a big success in the past few years, but its interpretive power is limited. They work largely because of the abundance of data. On the other hand, traditional machine learning algorithms are much better in interpretive power, but manual feature engineering costs a lot, due to the lack of data in earlier era. In light of this, a group of scientists initiated the work of graph networks, aiming at devising new artificial intelligence algorithms that exploits the advantages of two worlds, while still holding the principle of combinatorial generalization in constructing methods by using known building blocks to build new methods. Graph is good at interpretation as it is good for relational representation.

The use of graph networks is more than the graph convolutional neural networks (GCN) in the previous two blog entries. (part I and part II) However, to achieve relational inductive biases, an entity (an element with attributes), a relation, (a property between entities) and a rule. (a function that maps entities and relations to other entities and relations) This can be realized using graph, which is a mathematical structure that contains nodes and edges (that connect nodes.) To generalize the use of graph networks in various machine learning and deep learning methods, they reviewed the graph block, which is basically a function, or a mapping, from a graph to another graph, as shown in the algorithm below:

Works of graph networks are not non-existent; the authors listed previous works that can be seen as graph networks, for example:

• Message-passing neural network (MPNN) (2017);
• Non-local neural networks (NLNN) (2018).

The use of graph networks, I believe, is the next trend. There have been works regarding the graph-powered machine learning. (see Google AI blog, GraphAware Slideshare) I recently started an open-source project, a Python package called graphflow, to explore various algorithms using graphs, including PageRank, HITS, resistance, and non-linear resistance.

It has been a while since I wrote about topological data analysis (TDA). For pedagogical reasons, a lot of the codes were demonstrated in the Github repository PyTDA. However, it is not modularized as a package, and those codes run in Python 2.7 only.

Upon a few inquiries, I decided to release the codes as a PyPI package, and I named it mogutda, under the MIT license. It is open-source, and the codes can be found at the Github repository MoguTDA. It runs in Python 2.7, 3.5, and 3.6.

Quantum computation was proposed initially partly to simulate the physical universe because of the likeness of the nature and quantum systems. Some experimental simulations of Hawking radiation or Kibble-Zurek mechanisms were carried out in condensed matter systems, but they are simply too expensive to carry out. However, some scientists performed simulations on molecular systems using a quantum computer with an array of superconducting qubits. They performed the electronic structure calculation, as reported in “Scalable Quantum Simulation of Molecular Energies,” published in Physical Review X. Later, Google’s Quantum AI Team, Microsoft’s QuArC Team, and Caltech reports their work on simulating electronic structure using a quantum computer, that reduces running time but increases accuracies. Their work was reported in “Low-Depth Quantum Simulation of Materials,” also published in Physical Review X. The same team, adding a Harvard’s group, further studied the application of these molecular systems lined up as a linear array to design algorithms in quantum computers. It is reported in “Quantum Simulation of Electronic Structure with Linear Depth and Connectivity,” published in Physical Review Letters.

These people published an open-source software package, a Python library, called OpenFermion. It facilitates simulation of quantum algorithms in fermionic systems.

For a completeness, a few years ago, another group of scientists published a Python package, QuTiP, that helps simulating the open quantum systems.

Almost three years ago, I wrote a blog entry titled Useful Python Packages, which listed the essential packages that I deemed important. How has the list been changed over the past three years?

First of all, three years ago, most people were still writing Python 2.7. But now there is a trend to switch to Python 3. I admitted that I still have not started the switch yet, but in the short term, I will have no choice and I will.

What are some of the essential packages?
Numerical Packages

• numpy: numerical Python, containing most basic numerical routines such as matrix manipulation, linear algebra, random sampling, numerical integration etc. There is a built-in wrapper for Fortran as well. Actually, numpy is so important that some Linux system includes it with Python.
• scipy: scientific Python, containing some functions useful for scientific computing, such as sparse matrices, numerical differential equations, advanced linear algebra, special functions etc.
• networkx: package that handles various types of networks
• PuLP: linear programming
• cvxopt: convex optimization

Data Visualization

• matplotlib: basic plotting.
• ggplot2: the ggplot2 counterpart in Python for producing quality publication plots.

Data Manipulation

• pandas: data manipulation, working with data frames in Python, and save/load of various formats such as CSV and Excel

Machine Learning

• scikit-learn: machine-learning library in Python, containing classes and functions for supervised and unsupervised learning

Probabilistic Programming

• PyMC: Metropolis-Hasting algorithm
• Edward: deep probabilistic programing

Deep Learning Frameworks

• TensorFlow: because of Google’s marketing effort, TensorFlow is now the industrial standard for building deep learning networks, with rich source of mathematical functions, esp. for neural network cells, with GPU capability
• Keras: containing routines of high-level layers for deep learning neural networks, with TensorFlow, Theano, or CNTK as the backbone
• PyTorch: a rivalry against TensorFlow

Natural Language Processing

• nltk: natural language processing toolkit for Python, containing bag-of-words model, tokenizer, stemmers, chunker, lemmatizers, part-of-speech taggers etc.
• gensim: a useful natural language processing package useful for topic modeling, word-embedding, latent semantic indexing etc., running in a fast fashion
• shorttext: text mining package good for handling short sentences, that provide high-level routines for training neural network classifiers, or generating feature represented by topic models or autoencodings.
• spacy: industrial standard for natural language processing common tools

GUI

I can probably list more, but I think I covered most of them. If you do not find something useful, it is probably time for you to write a brand new package.

In text mining, it is important to create the document-term matrix (DTM) of the corpus we are interested in. A DTM is basically a matrix, with documents designated by rows and words by columns, that the elements are the counts or the weights (usually by tf-idf). Subsequent analysis is usually based creatively on DTM.

Exploring with DTM therefore becomes an important issues with a good text-mining tool. How do we perform exploratory data analysis on DTM using R and Python? We will demonstrate it using the data set of U. S. Presidents’ Inaugural Address, preprocessed, and can be downloaded here.

# R: textmineR

In R, we can use the package textmineR, which has been in introduced in a previous post. Together with other packages such as dplyr (for tidy data analysis) and snowBall (for stemming), load all of them at the beginning:

library(dplyr)
library(textmineR)
library(SnowballC)


usprez.df<- read.csv('inaugural.csv', stringsAsFactors = FALSE)


Then we create the DTM, while we remove all digits and punctuations, make all letters lowercase, and stem all words using Porter stemmer.

dtm<- CreateDtm(usprez.df$speech, doc_names = usprez.df$yrprez,
ngram_window = c(1, 1),
lower = TRUE,
remove_punctuation = TRUE,
remove_numbers = TRUE,
stem_lemma_function = wordStem)


Then defining a set of functions:

get.doc.tokens<- function(dtm, docid)
dtm[docid, ] %>% as.data.frame() %>% rename(count=".") %>%
mutate(token=row.names(.)) %>% arrange(-count)

get.token.occurrences<- function(dtm, token)
dtm[, token] %>% as.data.frame() %>% rename(count=".") %>%
mutate(token=row.names(.)) %>% arrange(-count)

get.total.freq<- function(dtm, token) dtm[, token] %>% sum

get.doc.freq<- function(dtm, token)
dtm[, token] %>% as.data.frame() %>% rename(count=".") %>%
filter(count>0) %>% pull(count) %>% length


Then we can happily extract information. For example, if we want to get the top-most common words in 2009’s Obama’s speech, enter:

dtm %>% get.doc.tokens('2009-Obama') %>% head(10)


Or which speeches have the word “change”: (but need to stem the word before extraction)

dtm %>% get.token.occurrences(wordStem('change')) %>% head(10)


You can also get the total number of occurrence of the words by:

dtm %>% get.doc.freq(wordStem('change'))   # gives 28


# Python: shorttext

In Python, similar things can be done using the package shorttext, described in a previous post. It uses other packages such as pandas and stemming. Load all packages first:

import shorttext
import numpy as np
import pandas as pd
from stemming.porter import stem

import re


And define the preprocessing pipelines:

pipeline = [lambda s: re.sub('[^\w\s]', '', s),
lambda s: re.sub('[\d]', '', s),
lambda s: s.lower(),
lambda s: ' '.join(map(stem, shorttext.utils.tokenize(s)))
]
txtpreproceesor = shorttext.utils.text_preprocessor(pipeline)


The function <code>txtpreprocessor</code> above perform the functions we talked about in R.

usprezdf = pd.read_csv('inaugural.csv')


The corpus needs to be preprocessed before putting into the DTM:

docids = list(usprezdf['yrprez'])    # defining document IDs
corpus = [txtpreproceesor(speech).split(' ') for speech in usprezdf['speech']]


Then create the DTM:

dtm = shorttext.utils.DocumentTermMatrix(corpus, docids=docids, tfidf=False)


Then we do the same thing as we have done above. To get the top-most common words in 2009’s Obama’s speech, enter:

dtm.get_doc_tokens('2009-Obama')


Or we look up which speeches have the word “change”:

dtm.get_token_occurences(stem('change'))


Or to get the document frequency of the word:

dtm.get_doc_frequency(stem('change'))


They Python and R codes give different document frequencies probably because the two stemmers work slightly differently.

Quantum computing has been picking up the momentum, and there are many startups and scholars discussing quantum machine learning. A basic knowledge of quantum two-level computation ought to be acquired.

Recently, Rigetti, a startup for quantum computing service in Bay Area, published that they opened to public their cloud server for users to simulate the use of quantum instruction language, as described in their blog and their White Paper. It is free.

Go to their homepage, http://rigetti.com/, click on “Get Started,” and fill in your information and e-mail. Then you will be e-mailed keys of your cloud account. Copy the information to a file .pyquil_config, and in your .bash_profile, add a line

export PYQUIL_CONFIG="\$HOME/.pyquil_config"

More information can be found in their Installation tutorial. Then install the Python package pyquil, by typing in the command line:

pip install -U pyquil

Some of you may need to root (adding sudo in front).

Then we can go ahead to open Python, or iPython, or Jupyter notebook, to play with it. For the time being, let me play with creating an entangled singlet state, $\frac{1}{\sqrt{2}} (|01\rangle - |10\rangle )$. The corresponding quantum circuit is like this:

First of all, import all necessary libraries:

import numpy as np

from pyquil.quil import Program
import pyquil.api as api
from pyquil.gates import H, X, Z, CNOT


You can see that the package includes a lot of quantum gates. First, we need to instantiate a quantum simulator:

# starting the quantum simulator
quantum_simulator = api.SyncConnection()


Then we implement the quantum circuit with a “program” as follow:

# generating singlet state
# 2. Pauli-Z
# 3. CNOT
# 4. NOT
p = Program(H(0), Z(0), CNOT(0, 1), X(1))
wavefunc, _ = quantum_simulator.wavefunction(p)


The last line gives the final wavefunction after running the quantum circuit, or “program.” For the ket, the rightmost qubit is qubit 0, and the left of it is qubit 1, and so on. Therefore, in the first line of the program, H, the Hadamard gate, acts on qubit 0, i.e., the rightmost qubit. Running a simple print statement:

print wavefunc


gives

(-0.7071067812+0j)|01> + (0.7071067812+0j)|10>

The coefficients are complex, and the imaginary part is described by j. You can extract it as a numpy array:

wavefunc.amplitudes


If we want to calculate the metric of entanglement, we can use the Python package pyqentangle, which can be installed by running on the console:

pip install -U pyqentangle

Import them:

from pyqentangle import schmidt_decomposition
from pyqentangle.schmidt import bipartitepurestate_reduceddensitymatrix
from pyqentangle.metrics import entanglement_entropy, negativity


Because pyqentangle does not recognize the coefficients in the same way as pyquil, but see each element as the coefficients of $|j i \rangle$, we need to reshape the final state first, by:

tensorcomp = wavefunc.amplitudes.reshape((2, 2))


Then perform Schmidt decomposition (which the Schmidt modes are actually trivial in this example):

# Schmidt decomposition
schmidt_modes = schmidt_decomposition(tensorcomp)
for prob, modeA, modeB in schmidt_modes:
print prob, ' : ', modeA, ' ', modeB


This outputs:

0.5  :  [ 0.+0.j  1.+0.j]   [ 1.+0.j  0.+0.j]
0.5  :  [-1.+0.j  0.+0.j]   [ 0.+0.j  1.+0.j]

Calculate the entanglement entropy and negativity from its reduced density matrix:

print 'Entanglement entropy = ', entanglement_entropy(bipartitepurestate_reduceddensitymatrix(tensorcomp, 0))
print 'Negativity = ', negativity(bipartitepurestate_reduceddensitymatrix(tensorcomp, 0))


which prints:

Entanglement entropy =  0.69314718056
Negativity =  -1.11022302463e-16

The calculation can be found in this thesis.

P.S.: The circuit was drawn by using the tool in this website, introduced by the Marco Cezero’s blog post. The corresponding json for the circuit is:

{"gate":[],{"gate":[], "circuit": [{"type":"h", "time":0, "targets":[0], "controls":[]},     {"type":"z", "time":1, "targets":[0], "controls":[]},     {"type":"x", "time":2, "targets":[1], "controls":[0]},     {"type":"x", "time":3, "targets":[1], "controls":[]}], "qubits":2,"input":[0,0]}
Continue reading "A First Glimpse of Rigetti’s Quantum Computing Cloud" →

The Python package for text mining shorttext has a new release: 0.5.4. It can be installed by typing in the command line:

pip install -U shorttext

For some people, you may need to install it from “root”, i.e., adding sudo in front of the command. Since the version 0.5 (including releases 0.5.1 and 0.5.4), there have been substantial addition of functionality, mostly about comparisons between short phrases without running a supervised or unsupervised machine learning algorithm, but calculating the “similarity” with various metrics, including:

• soft Jaccard score (the same kind of fuzzy scores based on edit distance in SOCcer),
• Word Mover’s distance (WMD, detailedly described in a previous post), and
• Jaccard index due to word-embedding model.

For the soft Jaccard score due to edit distance, we can call it by:

>>> from shorttext.metrics.dynprog import soft_jaccard_score
>>> soft_jaccard_score(['book', 'seller'], ['blok', 'sellers'])     # gives 0.6716417910447762
>>> soft_jaccard_score(['police', 'station'], ['policeman'])        # gives 0.2857142857142858

The core of this code was written in C, and interfaced to Python using SWIG.

For the Word Mover’s Distance (WMD), while the source codes are the same as my previous post, it can now be called directly. First, load the modules and the word-embedding model:

>>> from shorttext.metrics.wasserstein import word_mover_distance
>>> wvmodel = load_word2vec_model('/path/to/model_file.bin')

And compute the WMD with a single function:

>>> word_mover_distance(['police', 'station'], ['policeman'], wvmodel)                      # gives 3.060708999633789
>>> word_mover_distance(['physician', 'assistant'], ['doctor', 'assistants'], wvmodel)      # gives 2.276337146759033

And the Jaccard index due to cosine distance in Word-embedding model can be called like this:

>>> from shorttext.metrics.embedfuzzy import jaccardscore_sents
>>> jaccardscore_sents('doctor', 'physician', wvmodel)   # gives 0.6401538990056869
>>> jaccardscore_sents('chief executive', 'computer cluster', wvmodel)   # gives 0.0022515450768836143
>>> jaccardscore_sents('topological data', 'data of topology', wvmodel)   # gives 0.67588977344632573

Most new functions can be found in this tutorial.

And there are some minor bugs fixed.

Much about the use of word-embedding models such as Word2Vec and GloVe have been covered. However, how to measure the similarity between phrases or documents? One natural choice is the cosine similarity, as I have toyed with in a previous post. However, it smoothed out the influence of each word. Two years ago, a group in Washington University in St. Louis proposed the Word Mover’s Distance (WMD) in a PMLR paper that captures the relations between words, not simply by distance, but also the “transportation” from one phrase to another conveyed by each word. This Word Mover’s Distance (WMD) can be seen as a special case of Earth Mover’s Distance (EMD), or Wasserstein distance, the one people talked about in Wasserstein GAN. This is better than bag-of-words (BOW) model in a way that the word vectors capture the semantic similarities between words.

# Word Mover’s Distance (WMD)

The formulation of WMD is beautiful. Consider the embedded word vectors $\mathbf{X} \in R^{d \times n}$, where $d$ is the dimension of the embeddings, and $n$ is the number of words. For each phrase, there is a normalized BOW vector $d \in R^n$, and $d_i = \frac{c_i}{\sum_i c_i}$, where $i$‘s denote the word tokens. The distance between words are the Euclidean distance of their embedded word vectors, denoted by $c(i, j) = || \mathbf{x}_i - \mathbf{x}_j ||_2$, where $i$ and $j$ denote word tokens. The document distance, which is WMD here, is defined by $\sum_{i, j} \mathbf{T}_{i j} c(i, j)$, where $\mathbf{T}$ is a $n \times n$ matrix. Each element $\mathbf{T}_{ij} \geq 0$ denote how nuch of word $i$ in the first document (denoted by $\mathbf{d}$) travels to word $j$ in the new document (denoted by $\mathbf{d}'$).

Then the problem becomes the minimization of the document distance, or the WMD, and is formulated as:

$\text{min}_{\mathbf{T} \geq 0} \sum_{i, j=1}^n \mathbf{T}_{ij} c(i, j)$,

given the constraints:

$\sum_{j=1}^n \mathbf{T}_{ij} = d_i$, and

$\sum_{i=1}^n \mathbf{T}_{ij} = d_j'$.

This is essentially a simplified case of the Earth Mover’s distance (EMD), or the Wasserstein distance. (See the review by Gibbs and Su.)

# Using PuLP

The WMD is essentially a linear optimization problem. There are many optimization packages on the market, and my stance is that, for those common ones, there are no packages that are superior than others. In my job, I happened to handle a missing data problem, in turn becoming a non-linear optimization problem with linear constraints, and I chose limSolve, after I shop around. But I actually like a lot of other packages too. For WMD problem, I first tried out cvxopt first, which should actually solve the exact same problem, but the indexing is hard to maintain. Because I am dealing with words, it is good to have a direct hash map, or a dictionary. I can use the Dictionary class in gensim. But I later found out I should use PuLP, as it allows indices with words as a hash map (dict in Python), and WMD is a linear programming problem, making PuLP is a perfect choice, considering code efficiency.

An example of using PuLP can be demonstrated by the British 1997 UG Exam, as in the first problem of this link, with the Jupyter Notebook demonstrating this.

# Implementation of WMD using PuLP

The demonstration can be found in the Jupyter Notebook.

from itertools import product
from collections import defaultdict

import numpy as np
from scipy.spatial.distance import euclidean
import pulp
import gensim

Then define the functions the gives the BOW document vectors:

def tokens_to_fracdict(tokens):
cntdict = defaultdict(lambda : 0)
for token in tokens:
cntdict[token] += 1
totalcnt = sum(cntdict.values())
return {token: float(cnt)/totalcnt for token, cnt in cntdict.items()}

Then implement the core calculation. Note that PuLP is actually a symbolic computing package. This function return a pulp.LpProblem class:

def word_mover_distance_probspec(first_sent_tokens, second_sent_tokens, wvmodel, lpFile=None):
all_tokens = list(set(first_sent_tokens+second_sent_tokens))
wordvecs = {token: wvmodel[token] for token in all_tokens}

first_sent_buckets = tokens_to_fracdict(first_sent_tokens)
second_sent_buckets = tokens_to_fracdict(second_sent_tokens)

T = pulp.LpVariable.dicts('T_matrix', list(product(all_tokens, all_tokens)), lowBound=0)

prob = pulp.LpProblem('WMD', sense=pulp.LpMinimize)
prob += pulp.lpSum([T[token1, token2]*euclidean(wordvecs[token1], wordvecs[token2])
for token1, token2 in product(all_tokens, all_tokens)])
for token2 in second_sent_buckets:
prob += pulp.lpSum([T[token1, token2] for token1 in first_sent_buckets])==second_sent_buckets[token2]
for token1 in first_sent_buckets:
prob += pulp.lpSum([T[token1, token2] for token2 in second_sent_buckets])==first_sent_buckets[token1]

if lpFile!=None:
prob.writeLP(lpFile)

prob.solve()

return prob

To extract the value, just run pulp.value(prob.objective)

We use Google Word2Vec. Refer the $\mathbf{T}$ matrices in the Jupyter Notebook. Running this by a few examples:

1. document1 = President, talk, Chicago
document2 = President, speech, Illinois
WMD = 2.88587622936
2. document1 = physician, assistant
document2 = doctor
WMD = 2.8760048151
3. document1 = physician, assistant
document2 = doctor, assistant
WMD = 1.00465738773
(compare with example 2!)
4. document1 = doctors, assistant
document2 = doctor, assistant
WMD = 1.02825379372
(compare with example 3!)
5. document1 = doctor, assistant
document2 = doctor, assistant
WMD = 0.0
(totally identical; compare with example 3!)

There are more examples in the notebook.

# Conclusion

WMD is a good metric comparing two documents or sentences, by capturing the semantic meanings of the words. It is more powerful than BOW model as it captures the meaning similarities; it is more powerful than the cosine distance between average word vectors, as the transfer of meaning using words from one document to another is considered. But it is not immune to the problem of misspelling.

This algorithm works well for short texts. However, when the documents become large, this formulation will be computationally expensive. The author actually suggested a few modifications, such as the removal of constraints, and word centroid distances.

Example codes can be found in my Github repository: stephenhky/PyWMD.

On 07/28/2017, shorttext published its release 0.4.1, with a few important updates. To install it, type the following in the OS X / Linux command line:

>>> pip install -U shorttext

The documentation in PythonHosted.org has been abandoned. It has been migrated to readthedocs.org. (URL: http://shorttext.readthedocs.io/ or http:// shorttext.rtfd.io)

## Exploiting the Word-Embedding Layer

This update is mainly due to an important update in gensim, motivated by earlier shorttext‘s effort in integrating scikit-learn and keras. And gensim also provides a keras layer, on the same footing as other neural networks, activation function, or dropout layers, for Word2Vec models. Because shorttext has been making use of keras layers for categorization, such advance in gensim in fact makes it a natural step to add an embedding layer of all neural networks provided in shorttext. How to do it? (See shorttext tutorial for “Deep Neural Networks with Word Embedding.”)

import shorttext
trainclassdict = shorttext.data.subjectkeywords() &nbsp; # load an example data set


To train a model, you can do it the old way, or do it the new way with additional gensim function:

kmodel = shorttext.classifiers.frameworks.CNNWordEmbed(wvmodel=wvmodel, nb_labels=len(trainclassdict.keys()), vecsize=100, with_gensim=True) &nbsp; # keras model, setting with_gensim=True
classifier = shorttext.classifiers.VarNNEmbeddedVecClassifier(wvmodel, with_gensim=True, vecsize=100) &nbsp; # instantiate the classifier, setting with_gensim=True
classifier.train(trainclassdict, kmodel)


The parameters with_gensim in both CNNWordEmbed and VarNNEmbeddedVecClassifier are set to be False by default, because of backward compatibility. However, setting it to be True will enable it to use the new gensim Word2Vec layer.

These change in gensim and shorttext are the works mainly contributed by Chinmaya Pancholi, a very bright student at Indian Institute of Technology, Kharagpur, and a GSoC (Google Summer of Code) student in 2017. He revolutionized gensim by integrating scikit-learn and keras into gensim. He also used what he did in gensim to improve the pipelines of shorttext. He provided valuable technical suggestions. You can read his GSoC proposal, and his blog posts in RaRe Technologies, Inc. Chinmaya has been diligently mentored by Ivan Menshikh and Lev Konstantinovskiy of RaRe Technologies.

## Maxent Classifier

Another important update is the adding of maximum entropy (maxent) classifier. (See the corresponding tutorial on “Maximum Entropy (MaxEnt) Classifier.”) I will devote a separate entry on the theory, but it is very easy to use it,

import shorttext
from shorttext.classifiers import MaxEntClassifier

classifier = MaxEntClassifier()


Use the NIHReports dataset as the example:

classdict = shorttext.data.nihreports()
classifier.train(classdict, nb_epochs=1000)


The classification is just like other classifiers provided by shorttext:

classifier.score('cancer immunology') # NCI tops the score
classifier.score('children health') # NIAID tops the score
classifier.score('Alzheimer disease and aging') # NIAID tops the score