Release of shorttext 0.5.4

The Python package for text mining shorttext has a new release: 0.5.4. It can be installed by typing in the command line:

pip install -U shorttext

Some users may need to install it as “root”, i.e., by adding sudo in front of the command. Since version 0.5 (including releases 0.5.1 and 0.5.4), there has been a substantial addition of functionality, mostly for comparing short phrases without running a supervised or unsupervised machine learning algorithm, by calculating the “similarity” with various metrics instead, including:

  • soft Jaccard score (the same kind of fuzzy scores based on edit distance in SOCcer),
  • Word Mover’s distance (WMD, described in detail in a previous post), and
  • Jaccard index due to word-embedding model.

For the soft Jaccard score due to edit distance, we can call it by:

>>> from shorttext.metrics.dynprog import soft_jaccard_score
>>> soft_jaccard_score(['book', 'seller'], ['blok', 'sellers'])     # gives 0.6716417910447762
>>> soft_jaccard_score(['police', 'station'], ['policeman'])        # gives 0.2857142857142858

The core of this code was written in C, and interfaced to Python using SWIG.
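To make the idea concrete, here is a pure-Python sketch of a soft Jaccard score of this kind. It is not the package's C implementation: the helper names (edit_distance, similarity, soft_jaccard) are hypothetical, and the sketch assumes a plain Levenshtein distance, a similarity of one minus the distance divided by the longer word's length, and a greedy one-to-one pairing of the most similar tokens. Under these assumptions it agrees with the two values quoted above.

```python
from itertools import product


def edit_distance(a, b):
    """Plain Levenshtein distance (assumed here; the package may differ)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Fuzzy match between two words: 1 for identical, towards 0 for unrelated."""
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


def soft_jaccard(tokens1, tokens2):
    """Soft intersection over soft union of two non-empty token lists."""
    # rank all cross pairs by similarity, then pair greedily, each token used once
    pairs = sorted(product(range(len(tokens1)), range(len(tokens2))),
                   key=lambda ij: similarity(tokens1[ij[0]], tokens2[ij[1]]),
                   reverse=True)
    used1, used2 = set(), set()
    soft_intersection = 0.0
    for i, j in pairs:
        if i not in used1 and j not in used2:
            soft_intersection += similarity(tokens1[i], tokens2[j])
            used1.add(i)
            used2.add(j)
    soft_union = len(tokens1) + len(tokens2) - soft_intersection
    return soft_intersection / soft_union


print(soft_jaccard(['book', 'seller'], ['blok', 'sellers']))
print(soft_jaccard(['police', 'station'], ['policeman']))
```

In the first example, 'book' pairs with 'blok' (similarity 3/4) and 'seller' with 'sellers' (similarity 6/7), so the score is (45/28) / (4 - 45/28) = 45/67.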

For the Word Mover’s Distance (WMD), the source code is the same as in my previous post, but it can now be called directly. First, load the modules and the word-embedding model:

>>> from shorttext.metrics.wasserstein import word_mover_distance
>>> from shorttext.utils import load_word2vec_model
>>> wvmodel = load_word2vec_model('/path/to/model_file.bin')

And compute the WMD with a single function:

>>> word_mover_distance(['police', 'station'], ['policeman'], wvmodel)                      # gives 3.060708999633789
>>> word_mover_distance(['physician', 'assistant'], ['doctor', 'assistants'], wvmodel)      # gives 2.276337146759033

And the Jaccard index based on cosine distance in a word-embedding model can be called like this:

>>> from shorttext.metrics.embedfuzzy import jaccardscore_sents
>>> jaccardscore_sents('doctor', 'physician', wvmodel)   # gives 0.6401538990056869
>>> jaccardscore_sents('chief executive', 'computer cluster', wvmodel)   # gives 0.0022515450768836143
>>> jaccardscore_sents('topological data', 'data of topology', wvmodel)   # gives 0.67588977344632573

Most new functions can be found in this tutorial.

Some minor bugs have also been fixed.


Word Mover’s Distance as a Linear Programming Problem

Much has been covered about the use of word-embedding models such as Word2Vec and GloVe. However, how do we measure the similarity between phrases or documents? One natural choice is the cosine similarity, as I have toyed with in a previous post. However, it smooths out the influence of each word. Two years ago, a group at Washington University in St. Louis proposed the Word Mover’s Distance (WMD) in a PMLR paper. It captures the relations between words, not simply by distance, but also by the “transportation” from one phrase to another conveyed by each word. WMD can be seen as a special case of the Earth Mover’s Distance (EMD), or Wasserstein distance, the one people talk about in Wasserstein GAN. It is better than the bag-of-words (BOW) model in that the word vectors capture the semantic similarities between words.

Word Mover’s Distance (WMD)

The formulation of WMD is beautiful. Consider the embedded word vectors \mathbf{X} \in R^{d \times n}, where d is the dimension of the embeddings, and n is the number of words. For each phrase, there is a normalized BOW vector \mathbf{d} \in R^n, with d_i = \frac{c_i}{\sum_{k=1}^n c_k}, where c_i is the count of word token i in the phrase. The distance between words is the Euclidean distance between their embedded word vectors, denoted by c(i, j) = || \mathbf{x}_i - \mathbf{x}_j ||_2, where i and j denote word tokens. The document distance, which is WMD here, is defined by \sum_{i, j} \mathbf{T}_{i j} c(i, j), where \mathbf{T} is an n \times n matrix. Each element \mathbf{T}_{ij} \geq 0 denotes how much of word i in the first document (denoted by \mathbf{d}) travels to word j in the new document (denoted by \mathbf{d}').

Then the problem becomes the minimization of the document distance, or the WMD, and is formulated as:

\text{min}_{\mathbf{T} \geq 0} \sum_{i, j=1}^n \mathbf{T}_{ij} c(i, j),

given the constraints:

\sum_{j=1}^n \mathbf{T}_{ij} = d_i, and

\sum_{i=1}^n \mathbf{T}_{ij} = d_j'.

This is essentially a simplified case of the Earth Mover’s distance (EMD), or the Wasserstein distance. (See the review by Gibbs and Su.)
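As a small numerical illustration of the minimization above, here is a sketch that solves the transport problem for a toy vocabulary of three words with scipy.optimize.linprog. The two-dimensional "embeddings" and the two BOW vectors are made up for illustration and are not real word vectors; the matrix \mathbf{T} is flattened row by row so that the two sets of equality constraints become one matrix.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical 2-D "embeddings", one row per word (the transpose of X in the text).
X = np.array([[0.0, 0.0],   # word 0
              [1.0, 0.0],   # word 1
              [0.0, 2.0]])  # word 2
n = X.shape[0]

d = np.array([0.5, 0.5, 0.0])        # first document: words 0 and 1
d_prime = np.array([0.0, 0.5, 0.5])  # second document: words 1 and 2

# cost matrix c(i, j) = ||x_i - x_j||_2
C = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

# T is flattened row-major, so T[i, j] sits at position i * n + j
A_eq = np.zeros((2 * n, n * n))
for i in range(n):
    A_eq[i, i * n:(i + 1) * n] = 1.0   # sum_j T_ij = d_i
for j in range(n):
    A_eq[n + j, j::n] = 1.0            # sum_i T_ij = d'_j
b_eq = np.concatenate([d, d_prime])

res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
wmd = res.fun
T = res.x.reshape(n, n)

print(wmd)
print(T)
```

In this toy case the cheapest plan keeps the mass of word 1 in place and moves the mass of word 0 to word 2, a distance of 2 away, so the minimum is 0.5 × 2 = 1.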

Using PuLP

The WMD is essentially a linear optimization problem. There are many optimization packages on the market, and my stance is that, among the common ones, no package is superior to the others. In my job, I happened to handle a missing data problem, which turned into a non-linear optimization problem with linear constraints, and I chose limSolve after shopping around. But I actually like a lot of other packages too. For the WMD problem, I first tried cvxopt, which should solve the exact same problem, but the indexing is hard to maintain. Because I am dealing with words, it is good to have a direct hash map, or a dictionary. I could use the Dictionary class in gensim. But I later found that I should use PuLP: it allows indexing variables by words through a hash map (dict in Python), and WMD is a linear programming problem, making PuLP a perfect choice in terms of coding efficiency.

An example of using PuLP is the British 1997 UG Exam, the first problem in this link, with a Jupyter Notebook demonstrating it.

Implementation of WMD using PuLP

The demonstration can be found in the Jupyter Notebook.

Load the necessary packages:

from itertools import product
from collections import defaultdict

import numpy as np
from scipy.spatial.distance import euclidean
import pulp
import gensim

Then define the function that gives the BOW document vectors:

def tokens_to_fracdict(tokens):
    cntdict = defaultdict(lambda : 0)
    for token in tokens:
        cntdict[token] += 1
    totalcnt = sum(cntdict.values())
    return {token: float(cnt)/totalcnt for token, cnt in cntdict.items()}
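As a quick illustration of what such a normalized BOW vector looks like, here is an equivalent sketch built on collections.Counter. The function name normalized_bow is hypothetical and only serves this example; it is meant to produce the same kind of dictionary as tokens_to_fracdict.

```python
from collections import Counter


def normalized_bow(tokens):
    """Map each distinct token to its relative frequency in the token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {token: count / total for token, count in counts.items()}


fractions = normalized_bow(['doctor', 'assistant', 'doctor'])
print(fractions)   # 'doctor' gets 2/3 of the weight, 'assistant' gets 1/3
```

The fractions always sum to one, which is what allows them to play the role of the marginals d_i and d_j' in the constraints.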

Then implement the core calculation. Note that PuLP is actually a symbolic computing package. This function returns a pulp.LpProblem object:

def word_mover_distance_probspec(first_sent_tokens, second_sent_tokens, wvmodel, lpFile=None):
    all_tokens = list(set(first_sent_tokens+second_sent_tokens))
    wordvecs = {token: wvmodel[token] for token in all_tokens}

    first_sent_buckets = tokens_to_fracdict(first_sent_tokens)
    second_sent_buckets = tokens_to_fracdict(second_sent_tokens)

    T = pulp.LpVariable.dicts('T_matrix', list(product(all_tokens, all_tokens)), lowBound=0)

    prob = pulp.LpProblem('WMD', sense=pulp.LpMinimize)
    prob += pulp.lpSum([T[token1, token2]*euclidean(wordvecs[token1], wordvecs[token2])
                        for token1, token2 in product(all_tokens, all_tokens)])
    for token2 in second_sent_buckets:
        prob += pulp.lpSum([T[token1, token2] for token1 in first_sent_buckets])==second_sent_buckets[token2]
    for token1 in first_sent_buckets:
        prob += pulp.lpSum([T[token1, token2] for token2 in second_sent_buckets])==first_sent_buckets[token1]

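    # optionally write the problem specification to an LP file
    if lpFile is not None:
        prob.writeLP(lpFile)

    # solve the linear program so that the objective value can be read off
    prob.solve()

    return prob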

To extract the value, just run pulp.value(prob.objective).

We use the Google Word2Vec model. Refer to the Jupyter Notebook for the \mathbf{T} matrices. Running this on a few examples:

  1. document1 = President, talk, Chicago
    document2 = President, speech, Illinois
    WMD = 2.88587622936
  2. document1 = physician, assistant
    document2 = doctor
    WMD = 2.8760048151
  3. document1 = physician, assistant
    document2 = doctor, assistant
    WMD = 1.00465738773
    (compare with example 2!)
  4. document1 = doctors, assistant
    document2 = doctor, assistant
    WMD = 1.02825379372
    (compare with example 3!)
  5. document1 = doctor, assistant
    document2 = doctor, assistant
    WMD = 0.0
    (totally identical; compare with example 3!)

There are more examples in the notebook.


WMD is a good metric for comparing two documents or sentences, as it captures the semantic meanings of the words. It is more powerful than the BOW model because it captures similarities in meaning; it is more powerful than the cosine distance between average word vectors because the transfer of meaning through words from one document to another is considered. But it is not immune to the problem of misspelling.

This algorithm works well for short texts. However, when the documents become large, this formulation becomes computationally expensive. The authors actually suggested a few modifications, such as the removal of constraints, and word centroid distances.
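The two cheaper variants can be sketched in a few lines of NumPy. This is my reading of the paper rather than code from it: the word centroid distance compares the weighted mean vectors of the two documents, and the relaxed distance drops one of the two constraints, so that every word simply sends all of its weight to the nearest word in the other document. Both are lower bounds of the exact WMD. The function names and the toy vectors are hypothetical.

```python
import numpy as np


def word_centroid_distance(X, d, d_prime):
    """Distance between the weighted centroids of two documents.

    X has one row per word (the transpose of X in the text);
    d and d_prime are normalized BOW vectors over the same vocabulary.
    """
    return float(np.linalg.norm(X.T @ d - X.T @ d_prime))


def relaxed_wmd(X, d, d_prime):
    """Larger of the two bounds obtained by dropping one constraint at a time."""
    C = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    src, tgt = d > 0, d_prime > 0
    # each word of the first document moves entirely to its nearest word in the second
    bound1 = float(np.sum(d[src] * C[np.ix_(src, tgt)].min(axis=1)))
    # and the other way round
    bound2 = float(np.sum(d_prime[tgt] * C[np.ix_(tgt, src)].min(axis=1)))
    return max(bound1, bound2)


# toy 2-D "embeddings" for three words, made up for illustration
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
d = np.array([1.0, 0.0, 0.0])        # first document: word 0 only
d_prime = np.array([0.0, 0.5, 0.5])  # second document: words 1 and 2

wcd = word_centroid_distance(X, d, d_prime)
rwmd = relaxed_wmd(X, d, d_prime)
print(wcd, rwmd)
```

In this toy case the exact WMD is 0.5 × 1 + 0.5 × 2 = 1.5, because word 0 has to supply both words of the second document. The relaxed distance reaches that value, while the centroid distance is the looser bound \sqrt{1.25}.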

Example code can be found in my Github repository: stephenhky/PyWMD.

Short Text Mining using Advanced Keras Layers and Maxent: shorttext 0.4.1

On 07/28/2017, shorttext published its release 0.4.1, with a few important updates. To install it, type the following in the OS X / Linux command line:

pip install -U shorttext

The documentation at its old location has been abandoned. It has been migrated to a new site.

Exploiting the Word-Embedding Layer

This update is mainly due to an important update in gensim, motivated by shorttext’s earlier effort in integrating scikit-learn and keras. gensim now also provides a keras layer for Word2Vec models, on the same footing as other neural network, activation function, or dropout layers. Because shorttext has been making use of keras layers for categorization, this advance in gensim makes it a natural step to add an embedding layer to all neural networks provided in shorttext. How to do it? (See the shorttext tutorial for “Deep Neural Networks with Word Embedding.”)

import shorttext
wvmodel = shorttext.utils.load_word2vec_model('/path/to/GoogleNews-vectors-negative300.bin.gz')   # load the pre-trained Word2Vec model
trainclassdict =   # load an example data set


To train a model, you can do it the old way, or do it the new way with the additional gensim function:

kmodel = shorttext.classifiers.frameworks.CNNWordEmbed(wvmodel=wvmodel, nb_labels=len(trainclassdict.keys()), vecsize=100, with_gensim=True)   # keras model, setting with_gensim=True
classifier = shorttext.classifiers.VarNNEmbeddedVecClassifier(wvmodel, with_gensim=True, vecsize=100)   # instantiate the classifier, setting with_gensim=True
classifier.train(trainclassdict, kmodel)

The parameter with_gensim in both CNNWordEmbed and VarNNEmbeddedVecClassifier is set to False by default for backward compatibility. Setting it to True enables the new gensim Word2Vec layer.

These changes in gensim and shorttext were mainly contributed by Chinmaya Pancholi, a very bright student at the Indian Institute of Technology, Kharagpur, and a GSoC (Google Summer of Code) student in 2017. He revolutionized gensim by integrating scikit-learn and keras into it. He also used what he did in gensim to improve the pipelines of shorttext, and he provided valuable technical suggestions. You can read his GSoC proposal, and his blog posts at RaRe Technologies, Inc. Chinmaya has been diligently mentored by Ivan Menshikh and Lev Konstantinovskiy of RaRe Technologies.

Maxent Classifier

Another important update is the addition of a maximum entropy (maxent) classifier. (See the corresponding tutorial on “Maximum Entropy (MaxEnt) Classifier.”) I will devote a separate entry to the theory, but it is very easy to use:

import shorttext
from shorttext.classifiers import MaxEntClassifier

classifier = MaxEntClassifier()

Use the NIHReports dataset as the example:

classdict =
classifier.train(classdict, nb_epochs=1000)

The classification is just like other classifiers provided by shorttext:

classifier.score('cancer immunology') # NCI tops the score
classifier.score('children health') # NIAID tops the score
classifier.score('Alzheimer disease and aging') # NIAID tops the score

ConvNet Seq2seq for Machine Translation

In the past few days, Facebook published a new research paper on the use of a sequence-to-sequence (seq2seq) model for machine translation. What is special about this seq2seq model is that it uses convolutional neural networks (ConvNet, or CNN) instead of recurrent neural networks (RNN).

The original seq2seq model, published by Google, is implemented with a Long Short-Term Memory (LSTM) model (see their paper). It is basically a character-based model that generates texts according to a sequence of input characters. The same author constructed a neural conversational model (see their paper), as mentioned in a previous blog post. Daewoo Chong, from Booz Allen Hamilton, presented its implementation using TensorFlow at the DC Data Education Meetup on April 13, 2017. Johns Hopkins also published a spell correction algorithm implemented in seq2seq (see their paper). The real advantage of RNN over CNN is that there is no limit on the number of input or output tokens.

While the fixed size of vectors for CNN is an obvious constraint, using CNN serves the purpose of limiting the size of input vectors, and thus limiting the size of contexts. This limits the contents, and speeds up the training process. RNN is known to be slow to train. Facebook uses this CNN seq2seq model for their machine translation model. For more details, take a look at their paper and their Github repository.


NSFW Image Classification

At the end of last month, Yahoo open-sourced the code for training a model to classify not suitable/safe for work (NSFW) images, particularly pornographic images, using a convolutional neural network (CNN). It was implemented with Caffe. Users need to supply the training data, with NSFW images as positives and suitable/safe for work (SFW) images as negatives, to train the model. The model takes an image as the input, and outputs a score between 0 and 1.

The code is available on Github, along with instructions for the installation.


Linking Fundamental Physics to Deep Learning

Ever since Mehta and Schwab laid out the relationship between restricted Boltzmann machines (RBM) and deep learning mathematically (see my previous entry), scientists have been discussing why deep learning works so well. Recently, Henry Lin and Max Tegmark put a preprint on arXiv (arXiv:1609.09225), arguing that deep learning works because it captures a few essential physical laws and properties. Tegmark is a cosmologist.

Physical laws are simple in the sense that a few properties, such as locality, symmetry, hierarchy, etc., lead to large-scale, universal, and often complex phenomena. A lot of machine learning algorithms, including deep learning algorithms, have deep relations with formalisms outlined in statistical mechanics.

A lot of machine learning algorithms are basically probability theory. Lin and Tegmark outlined a few types of algorithms that seek various types of probabilities, and related the probabilities to Hamiltonians in many-body systems.

They argued why neural networks can approximate functions (polynomials) so well, giving a simple neural network that performs multiplication. With the central limit theorem or Jaynes’ arguments (see my previous entry), a lot of multiplications, they said, can be approximated by a low-order polynomial Hamiltonian. This is like a lot of many-body systems that can be approximated by the 4th-order Landau-Ginzburg-Wilson (LGW) functional.
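The multiplication network can be sketched numerically. As I recall the construction, four hidden neurons with a smooth activation \sigma combine as [\sigma(u+v) + \sigma(-u-v) - \sigma(u-v) - \sigma(-u+v)] / (4 \sigma''(0)), which approaches uv when the inputs are scaled to be small. The sketch below assumes the softplus activation (my choice, since its second derivative at zero is 1/4 rather than 0), and the function names and the scale parameter are hypothetical.

```python
import numpy as np


def softplus(x):
    """Smooth activation ln(1 + e^x); its second derivative at 0 is 1/4."""
    return np.logaddexp(0.0, x)


def approx_multiply(u, v, scale=0.01):
    """Approximate u * v with four softplus neurons and one linear output."""
    a, b = scale * u, scale * v                  # shrink inputs towards zero
    hidden = softplus(np.array([a + b, -a - b, a - b, -a + b]))
    output_weights = np.array([1.0, 1.0, -1.0, -1.0])
    # 4 * sigma''(0) = 1 for softplus, so only the input scaling is undone
    return float(hidden @ output_weights) / scale ** 2


print(approx_multiply(1.5, -2.0))   # close to -3
print(approx_multiply(0.3, 0.7))    # close to 0.21
```

The Taylor expansion explains why this works: \sigma(x) + \sigma(-x) = 2\sigma(0) + \sigma''(0) x^2 + O(x^4), so the difference between the terms in (u+v) and (u-v) leaves 4 \sigma''(0) uv, and the leftover error shrinks as the scale is reduced.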

Properties such as locality reduce the number of hyper-parameters needed because they restrict interactions to close proximity. Symmetry further reduces it, along with the computational complexity. Symmetry and second-order phase transitions make the scaling hypothesis possible, leading to the use of tools such as the renormalization group (RG). As many people have been arguing, deep learning resembles RG because it filters out unnecessary information and maps out the crucial features. Tegmark uses classifying cats vs. dogs as an example, analogous to retrieving the temperature of a many-body system using the RG procedure. They gave a counter-example to Schwab’s paper, in which the probabilities cannot be preserved by the RG procedure; while it is sound, it is not the point of the RG procedure anyway.

They also discussed the no-flattening theorems for neural networks.

SOCcer: Computerized Coding In Epidemiology

There are many tasks that involve coding, for example, putting kids into groups according to their age, labeling webpages by their kinds, or putting students in Hogwarts into four colleges… And researchers or lawyers need to code people, according to their filled-in information, into occupations. Melissa Friesen, an investigator in the Division of Cancer Epidemiology and Genetics (DCEG), National Cancer Institute (NCI), National Institutes of Health (NIH), saw the need for large-scale coding. Many researchers are dealing with big data concerning epidemiology. She led a research project, in collaboration with the Office of Intramural Research (OIR), Center for Information Technology (CIT), National Institutes of Health (NIH), to develop an artificial intelligence system to cope with the problem. This led to a publicly available tool called SOCcer, an acronym for “Standardized Occupation Coding for Computer-assisted Epidemiological Research.”

The system was initially developed in an attempt to find the correlation between the onset of cancers and other diseases and the occupation. “The application is not intended to replace expert coders, but rather to prioritize which job descriptions would benefit most from expert review,” said Friesen in an interview. She mainly works with Daniel Russ in CIT.

SOCcer takes the job title, industry codes (in terms of SIC, Standard Industrial Classification), and job duties, and gives an occupational code called SOC 2010 (Standard Occupational Classification), used by U.S. federal government agencies. The data involves short text, often messy. There are 840 codes in the SOC 2010 system. Conventional natural language processing (NLP) methods may not apply. Friesen, Russ, and Kwan-Yuet (Stephen) Ho (also in OIR, CIT; a CSRA staff member) use fuzzy logic and maximum entropy (maxent) methods, with some feature engineering, to build various classifiers. These classifiers are aggregated together, as in stacked generalization (see my previous entry), using logistic regression, to give a final score.

SOCcer has a companion tool, called SOCAssign, for expert coders to prioritize the codings. It won the DCEG Informatics Tool Challenge 2015. SOCcer itself won an award in 2016. And the SOCcer team received the Scientific Award of Merit from CIT/OCIO in 2016 as well (see this). Their work was published in Occup. Environ. Med.


Law Prediction

On August 1, my friends and I attended a meetup hosted by DC Data Science, titled “Predicting and Understanding Law with Machine Learning.” The speaker was John Nay, a Ph.D. candidate at Vanderbilt University. He presented his research, which is an application of natural language processing to legal enactment documents.

His talk was very interesting, from the similarity of presidents and the chambers, to the kind of topics each party focused on. He used a variety of techniques such as Word2Vec, STM (structural topic modeling), and some common textual and statistical analysis. It is quite a comprehensive study.

His work is demonstrated online, and it can be found on arXiv.

Book on NLP Annotation

It is really a pleasure to participate in the DC NLP meetup group. And it was so nice to listen to Ari Chanen’s talk on his application of ensembles of Word2Vec, and Mark Lidd’s on the product of DataFission.

Before the talks, the co-organizers gave out sponsored books. I was so lucky to get James Pustejovsky’s and Amber Stubbs’s Natural Language Annotation for Machine Learning. It is a book on annotating NLP corpora, and its process cycle. I have never annotated a corpus, but I know it is an important task, however tedious it is.

