The Python package for text mining shorttext has a new release: 0.5.4. It can be installed by typing in the command line:

pip install -U shorttext

For some people, you may need to install it from “root”, i.e., adding sudo in front of the command. Since the version 0.5 (including releases 0.5.1 and 0.5.4), there have been substantial addition of functionality, mostly about comparisons between short phrases without running a supervised or unsupervised machine learning algorithm, but calculating the “similarity” with various metrics, including:

• soft Jaccard score (the same kind of fuzzy scores based on edit distance in SOCcer),
• Word Mover’s distance (WMD, detailedly described in a previous post), and
• Jaccard index due to word-embedding model.

For the soft Jaccard score due to edit distance, we can call it by:

>>> from shorttext.metrics.dynprog import soft_jaccard_score
>>> soft_jaccard_score(['book', 'seller'], ['blok', 'sellers'])     # gives 0.6716417910447762
>>> soft_jaccard_score(['police', 'station'], ['policeman'])        # gives 0.2857142857142858

The core of this code was written in C, and interfaced to Python using SWIG.

For the Word Mover’s Distance (WMD), while the source codes are the same as my previous post, it can now be called directly. First, load the modules and the word-embedding model:

>>> from shorttext.metrics.wasserstein import word_mover_distance
>>> wvmodel = load_word2vec_model('/path/to/model_file.bin')

And compute the WMD with a single function:

>>> word_mover_distance(['police', 'station'], ['policeman'], wvmodel)                      # gives 3.060708999633789
>>> word_mover_distance(['physician', 'assistant'], ['doctor', 'assistants'], wvmodel)      # gives 2.276337146759033

And the Jaccard index due to cosine distance in Word-embedding model can be called like this:

>>> from shorttext.metrics.embedfuzzy import jaccardscore_sents
>>> jaccardscore_sents('doctor', 'physician', wvmodel)   # gives 0.6401538990056869
>>> jaccardscore_sents('chief executive', 'computer cluster', wvmodel)   # gives 0.0022515450768836143
>>> jaccardscore_sents('topological data', 'data of topology', wvmodel)   # gives 0.67588977344632573

Most new functions can be found in this tutorial.

And there are some minor bugs fixed.

On November 21, 2016, the Python package shorttext’ was published. Until today, more than seven versions have been published. There have been a drastic architecture change, but the overall purpose is still the same, as summarized in the first introduction entry:

This package shorttext‘ was designed to tackle all these problems… It contains the following features:

• example data provided (including subject keywords and NIH RePORT);
• text preprocessing;
• pre-trained word-embedding support;
• gensim topic models (LDA, LSI, Random Projections) and autoencoder;
• topic model representation supported for supervised learning using scikit-learn;
• cosine distance classification; and
• neural network classification (including ConvNet, and C-LSTM).

And since the first version, there have been updates, as summarized in the documention (News):

## Version 0.3.3 (Apr 19, 2017)

• Deleted CNNEmbedVecClassifier.

## Version 0.3.2 (Mar 28, 2017)

• Bug fixed for gensim model I/O;
• Console scripts update;
• Neural networks up to Keras 2 standard (refer to this).

## Version 0.3.1 (Mar 14, 2017)

• Compact model I/O: all models are in single files;
• Implementation of stacked generalization using logistic regression.

## Version 0.2.1 (Feb 23, 2017)

• Removal attempts of loading GloVe model, as it can be run using gensim script;
• Confirmed compatibility of the package with tensorflow;
• Use of spacy for tokenization, instead of nltk;
• Use of stemming for Porter stemmer, instead of nltk;
• Removal of nltk dependencies;
• Simplifying the directory and module structures;
• Module packages updated.

Although there are still additions that I would love to add, but it would not change the overall architecture. I may add some more supervised learning algorithms, but under the same network. The upcoming big additions will be generative models or seq2seq models, but I do not see them coming in the short term. I will add corpuses.

I may add tutorials if I have time.

I am thankful that there is probably some external collaboration with other Python packages. Some people have already made some useful contributions. It will be updated if more things are confirmed.

I have worked a lot on text categorization in the past few months, and I started to get bored. I started to become more interested in generative models, and generating texts.

Generative models are not new. Topic models such as LDA, or STM are generative models. However, I have been using the topic vectors or other topic models such as LDA2Vec as the feature of another supervised algorithm. And it is basically the design of my shorttext package.

I attended a meetup event held by DC Data Science and Data Education DC. The speaker, Daewoo Chong, is a senior Data Scientist at Booz Allen Hamilton. He talked about chatbot, building on RNN models on characters. His talk was not exactly about generative models, but it is indeed about generating texts. With the sophistication of GANs (see my entry on GAN and WGAN), it will surely be my next focus of my toy projects.

Ran Chen wrote a blog on his company homepage about natural language generation in his system, Trulia.

And there are a few GAN applications on text:

• “Generating Text via Adversarial Learning” [PDF]
• Lantao Yu, Weinan Zhang, Jun Wang, Yong Yu, “SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient,” arXiv:1609.05473 [arXiv]
• Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, Dan Jurafsky, “Adversarial Learning for Neural Dialogue Generation,” arXiv:1701.06547 [arXiv]
• Matt J. Kusner, José Miguel Hernández-Lobato, “GANs for sequence of discrete elements with the Gumbel-softmax distribution,” arXiv:1611.04051 [arXiv]
• David Pfau, Oriol Vinyals, “Connecting generative adversarial network and actor-critic methods,” arXiv:1610.01945 [arXiv]
• Xuerong Xiao, “Text Generation usingGenerative Adversarial Training” [PDF]

The first presidential debate 2016 was held on September 26, 2016 in Hofstra University in New York. An interesting analysis will be the literacy level demonstrated by the two candidates using Flesch readability ease and Flesch-Kincaid grade level, demonstrated in my previous blog entry and my Github: stephenhky/PyReadability.

First, we need to get the transcript of the debate, which can be found in an article in New York Times. Copy and paste the text into a file called first_debate_transcript.txt. Then we want to extract out speech of each person. To do this, store the following Python code in first_debate_segment.py.

# Trump and Clinton 1st debate on Sept 26, 2016

from nltk import word_tokenize
from collections import defaultdict
import re

def untokenize(words):
"""
Untokenizing a text undoes the tokenizing operation, restoring
punctuation and spaces to the places that people expect them to be.
Ideally, untokenize(tokenize(text)) should be identical to text,
except for line breaks.
"""
text = ' '.join(words)
step1 = text.replace(" ", '"').replace(" ''", '"').replace('. . .',  '...')
step2 = step1.replace(" ( ", " (").replace(" ) ", ") ")
step3 = re.sub(r' ([.,:;?!%]+)([ \'"])', r"\1\2", step2)
step4 = re.sub(r' ([.,:;?!%]+)\$', r"\1", step3)
step5 = step4.replace(" '", "'").replace(" n't", "n't").replace(
"can not", "cannot")
step6 = step5.replace("  ", " '")
return step6.strip()

ignored_phrases = ['(APPLAUSE)', '(CROSSTALK)']
persons = ['TRUMP', 'CLINTON', 'HOLT']
fin = open('first_debate_transcript.txt', 'rb')
fin.close()

lines = filter(lambda s: len(s)>0, map(lambda s: s.strip(), lines))
speeches = defaultdict(lambda : '')
person = None

for line in lines:
tokens = word_tokenize(line.strip())
ignore_colon = False
for token in tokens:
if token in ignored_phrases:
pass
elif token in persons:
person = token
ignore_colon = True
elif token == ':':
ignore_colon = False
else:
speeches[person] += ' ' + untokenize(added_tokens)

for person in persons:
fout = open('speeches_'+person+'.txt', 'wb')
fout.write(speeches[person])
fout.close()


There is an untokenize function adapted from a code in StackOverflow. This segmented the transcript into the individual speech of Lester Holt (the host of the debate), Donald Trump (GOP presidential candidate), and Hillary Clinton (DNC presidential candidate) in separate files. Then, on UNIX or Linux command line, run score_readability.py on each person’s script, by, for example, for Holt’s speech,

python score_readability.py speeches_HOLT.txt --utf8

Beware that it is encoded in UTF-8. For Lester Holt, we have

Word count = 1935
Sentence count = 157
Syllable count = 2732
Flesch-Kincaid grade level = 5.87694629602

For Donald Trump,

Word count = 8184
Sentence count = 693
Syllable count = 10665
Flesch-Kincaid grade level = 4.3929136992

And for Hillary Clinton,

Word count = 6179
Sentence count = 389
Syllable count = 8395
Flesch-Kincaid grade level = 6.63676650035

Apparently, compared to Donald Trump, Hillary Clinton has a higher literary level, but her speech is less easy to understand.

Recalling from my previous entry, for Shakespeare’s MacBeth, the Flesch readability ease is 112.278048591, and Flesch-Kincard grade level 0.657934056288; for King James Version Bible (KJV), they are 79.6417489428 and 9.0085275366 respectively.

This is just a simple text analytics. However, the content is not analyzed here. Augustine of Hippo wrote in his Book IV of On Christian Teaching (Latin: De doctrina christiana) about rhetoric and eloquence:

“… wisdom without eloquence is of little value to the society… eloquence without wisdom is… a great nuisance, and never beneficial.” — Augustine of Hippo, Book IV of On Christian Teaching

There are many tasks that involve coding, for example, putting kids into groups according to their age, labeling the webpages about their kinds, or putting students in Hogwarts into four colleges… And researchers or lawyers need to code people, according to their filled-in information, into occupations. Melissa Friesen, an investigator in Division of Cancer Epidemiology and Genetics (DCEG), National Cancer Institute (NCI), National Institutes of Health (NIH), saw the need of large-scale coding. Many researchers are dealing with big data concerning epidemiology. She led a research project, in collaboration with Office of Intramural Research (OIR), Center for Information Technology (CIT), National Institutes of Health (NIH), to develop an artificial intelligence system to cope with the problem. This leads to a publicly available tool called SOCcer, an acronym for “Standardized Occupation Coding for Computer-assisted Epidemiological Research.” (URL: http://soccer.nci.nih.gov/soccer/)

The system was initially developed in an attempt to find the correlation between the onset of cancers and other diseases and the occupation. “The application is not intended to replace expert coders, but rather to prioritize which job descriptions would benefit most from expert review,” said Friesen in an interview. She mainly works with Daniel Russ in CIT.

SOCcer takes job title, industry codes (in terms of SIC, Standard Industrial Classification), and job duties, and gives an occupational code called SOC 2010 (Standard Occupational Classification), used by U. S. federal government agencies. The data involves short text, often messy. There are 840 codes in SOC 2010 systems. Conventional natural language processing (NLP) methods may not apply. Friesen, Russ, and Kwan-Yuet (Stephen) Ho (also in OIR, CIT; a CSRA staff) use fuzzy logic, and maximum entropy (maxent) methods, with some feature engineering, to build various classifiers. These classifiers are aggregated together, as in stacked generalization (see my previous entry), using logistic regression, to give a final score.

SOCcer has a companion software, called SOCAssign, for expert coders to prioritize the codings. It was awarded with DCEG Informatics Tool Challenge 2015. SOCcer itself was awarded in 2016. And the SOCcer team was awarded for Scientific Award of Merit by CIT/OCIO in 2016 as well (see this). Their work was published in Occup. Environ. Med.

It is really a pleasure to participate in DC NLP meetup group. And it was so nice to listen to Ari Chanen’s talk on his application of ensembles of Word2Vec, and Mark Lidd’s on the product of DataFission.

Before the talks, the co-organizers gave out sponsored books. I am so lucky to get James Pustejovsky’s and Amber Stubb’s Natual Language Annotation for Machine Learning. It is a book on annotating NLP corpuses, and its process cycle. I never annotate the corpus, but I know it is an important task, however tedious it is.

The topic of word embedding algorithms has been one of the interests of this blog, as in this entry, with Word2Vec [Mikilov et. al. 2013] as one of the main examples. It is a great tool for text mining, (for example, see [Czerny 2015],) as it reduces the dimensions needed (compared to bag-of-words model). As an algorithm borrowed from computer vision, a lot of these algorithms use deep learning methods to train the model, while it was not exactly sure why it works. Despite that, there are many articles talking about how to train the model. [Goldberg & Levy 2014, Rong 2014 etc.] Addition and subtraction of the word vectors show amazing relationships that carry semantic values, as I have shown in my previous blog entry. [Ho 2015]

However, Tomas Mikolov is no longer working in Google, making the development of this algorithm discontinued. As a follow-up of their work, Stanford NLP group later proposed a model, called GloVe (Global Vectors), that embeds word vectors using probabilistic methods. [Pennington, Socher & Manning 2014] It can be implemented in the package glove-python in python, and text2vec in R (or see their CRAN post).  Their paper is neatly written, and a highly recommended read.

To explain the theory of GloVe, we start with some basic probabilistic picture in basic natural language processing (NLP). We suppose the relation between the words occur in certain text windows within a corpus, but the details are not important here. Assume that $i$, $j$, and $k$ are three words, and the conditional probability $P_{ik}$ is defined as

$P_{ij} = P(j | i) = \frac{X_{ij}}{X_i}$,

where $X$‘s are the counts, and similarly for $P_{jk}$. And we are interested in the following ratio:

$F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$.

The tilde means “context,” but we will later assume it is also a word. Citing the example from their paper, take $i$ as ice, and $j$ as steam. if $k$ is solid, then the ratio is expected to be large; or if $k$ is gas, then it is expected to be low. But if $k$ is water, which are related to both, or fashion, which is related to none, then the ratio is expected to be approximately 1.

And the addition and subtraction in Word2Vec is similar to this. We want the ratio to be like the subtraction as in Word2Vec (and multiplication as in addition), then we should modify the function $F$ such that,

$F(w_i - w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$.

On the other hand, the input arguments of $F$ are vectors, but the output is a scalar. We avoid the issue by making the input argument as a dot product,

$F( (w_i - w_j)^T \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$.

In NLP, the word-word co-occurrence matrices are symmetric, and our function $F$ should also be invariant under switching the labeling. We first require $F$ is be a homomorphism,

$F((w_i - w_j)^T \tilde{w}_k) = \frac{F(w_i^T \tilde{w}_k) }{ F(w_j^T \tilde{w}_k)}$,

where we define,

$F(w_i^T \tilde{w}_k) = P_{ik} = \frac{X_{ik}}{X_i}$.

It is clear that $F$ is an exponential function, but to ensure symmetry, we further define:

$w_i^T \tilde{w}_k + b_i + \tilde{b}_k = \log X_{ik}$.

As a result of this equation, the authors defined the following cost function to optimize for GloVe model:

$J = \sum_{i, j=1}^V f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ik} \right)^2$,

where $w_j$, $\tilde{w}_j$, $b_i$, and $\tilde{b}_j$ are parameters to learn. $f(x)$ is a weighting function. (Refer the details to the paper.) [Pennington, Socher & Manning 2014]

As Radim Řehůřek said in his blog entry, [Řehůřek 2014] it is a neat paper, but their evaluation is crappy.

This theory explained why certain similar relations can be achieved, such as Paris – France is roughly equal to Beijing – China, as both can be transformed to the ratio in the definition of $F$ above.

It is a neat paper, as it employs optimization theory and probability theory, without any dark box deep learning.

The emotional flows of stories are important to engage the readers. Skillful writers grasp this very well by natural instinct. There are theories about this, called folkloristics. However, is there a way to see the flows in a graph? Linear algebra and natural language processing (NLP) kick in.

Andrew Reagan at the Computational Story Lab, University of Vermont, together with his colleagues and collaborators, did a numerical studies about this. [Reagan et. al., 2016] Their paper is now on the arXiv. He prepared a set of words with scores that quantitatively describe their sentiments, as in sentiment analysis. He then went through the text with a sliding window to measure the sentiments. Then for each book, there is a vector of a time series of these sentiment scores. For example, using this method, the plot of the emotional scores, or the emotional arc, of J. K. Rowling’s Harry Potter and the Deathly Hallows is as shown in the following plot: [Reagan et. al., 2016]

They did the same thing with other English fictions in the Project Gutenberg Corpus, giving a vector of these emotional scores for each fiction. They performed a principal component analysis (PCA) for all these books (represented by a matrix containing all vectors). PCA is a common dimensionality reduction techniques, and also useful for information retrieval (IR) in another name called latent semantic analysis (LSA). Reagan and his colleagues identify six major components of these emotional arcs, as shown below: [Reagan et. al., 2016]

These computational studies on fictions further reinforce our common belief that (good-selling) fictions do have resonating themes to keep the readers.

Previously, I wrote an entry on text mining on R and Python, and did a comparison. However, the text mining package employed was tm for R. But it has some problems:

1. The syntax is not natural for an experienced R users.
2. tm uses simple_triplet_matrix from the slam library for document-term matrix (DTM) and term-occurrence matrix (TCM), which is not as widely used as dgCMatrix from the Matrix library.

Tommy Jones, a Ph.D. student in George Mason University, and a data scientist at Impact Research, developed an alternative text mining package called textmineR. He presented in a Stat Prog DC Meetup on April 27, 2016. It employed a better syntax, and dgCMatrix. All in all, it is a wrapper for a lot of existing R packages to facilitate the text mining process, like creating DTM matrices with stopwords or appropriate stemming/lemmatizing functions. Here is a sample code to create a DTM with the example from the previous entry:

library(tm)
library(textmineR)

texts <- c('I love Python.',
'R is good for analytics.',
'Mathematics is fun.')

dtm<-CreateDtm(texts,
doc_names = c(1:length(texts)),
ngram_window = c(1, 1),
stopword_vec = c(tm::stopwords('english'), tm::stopwords('SMART')),
lower = TRUE,
remove_punctuation = TRUE,
remove_numbers = TRUE
)


The DTM is a sparse matrix:

3 x 6 sparse Matrix of class &amp;quot;dgCMatrix&amp;quot;
analytics fun mathematics good python love
1         .   .           .    .      1    1
2         1   .           .    1      .    .
3         .   1           1    .      .    .


On the other hand, it wraps text2vec, an R package that wraps the word-embedding algorithm named gloVe. And it wraps a number of topic modeling algorithms, such as latent Dirichlet allocation (LDA) and correlated topic models (CTM).

In addition, it contains a parallel computing loop function called TmParallelApply, analogous to the original R parallel loop function mclapply, but TmParallelApply works on Windows as well.

textmineR is an open-source project, with source code available on github, which contains his example codes.