The Python text-mining package shorttext has a new release: 0.5.4. It can be installed by typing on the command line:

pip install -U shorttext

Some users may need to install it as root, i.e., by prefixing the command with sudo. Since version 0.5 (including releases 0.5.1 and 0.5.4), there have been substantial additions of functionality, mostly for comparing short phrases without running any supervised or unsupervised machine learning algorithm, instead computing their “similarity” with various metrics, including:

• soft Jaccard score (the same kind of fuzzy score, based on edit distance, as used in SOCcer),
• Word Mover’s distance (WMD, described in detail in a previous post), and
• Jaccard index based on a word-embedding model.

The soft Jaccard score based on edit distance can be called by:

>>> from shorttext.metrics.dynprog import soft_jaccard_score
>>> soft_jaccard_score(['book', 'seller'], ['blok', 'sellers'])     # gives 0.6716417910447762
>>> soft_jaccard_score(['police', 'station'], ['policeman'])        # gives 0.2857142857142858

The core of this code was written in C, and interfaced to Python using SWIG.
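For intuition, the idea behind a soft Jaccard score can be sketched in pure Python in a few lines: an edit-distance-based similarity plays the role of a soft set intersection. This is only an illustrative approximation of the idea, not shorttext’s exact implementation, so the numbers need not match the ones above:

```python
def edit_distance(a, b):
    """Levenshtein distance by dynamic programming."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))  # substitution
    return d[m][n]

def soft_jaccard(tokens1, tokens2):
    """Soft Jaccard: edit-distance similarity acts as a soft
    set intersection (an illustrative approximation only)."""
    sim = lambda a, b: 1.0 - edit_distance(a, b) / max(len(a), len(b))
    inter = sum(max(sim(a, b) for b in tokens2) for a in tokens1)
    union = len(tokens1) + len(tokens2) - inter
    return inter / union

print(soft_jaccard(['book', 'seller'], ['blok', 'sellers']))
```

With identical token lists the score is exactly 1; totally dissimilar tokens push it toward 0.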

For the Word Mover’s Distance (WMD), while the source code is the same as in my previous post, it can now be called directly. First, load the modules and the word-embedding model:

>>> from shorttext.metrics.wasserstein import word_mover_distance
>>> from shorttext.utils import load_word2vec_model
>>> wvmodel = load_word2vec_model('/path/to/model_file.bin')

And compute the WMD with a single function:

>>> word_mover_distance(['police', 'station'], ['policeman'], wvmodel)                      # gives 3.060708999633789
>>> word_mover_distance(['physician', 'assistant'], ['doctor', 'assistants'], wvmodel)      # gives 2.276337146759033
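Under the hood, WMD is an optimal-transport problem: minimize the cost of moving the normalized bag-of-words mass of one document onto the other, with distances between word vectors as transport costs. A minimal sketch with scipy’s linear-programming solver and toy two-dimensional “word vectors” (the real scores above come from a trained word2vec model):

```python
import numpy as np
from scipy.optimize import linprog

def wmd(tokens1, tokens2, wv):
    """Word Mover's Distance as a transport linear program."""
    n1, n2 = len(tokens1), len(tokens2)
    # cost matrix: Euclidean distances between word vectors
    C = np.array([[np.linalg.norm(wv[a] - wv[b]) for b in tokens2]
                  for a in tokens1])
    A_eq, b_eq = [], []
    for i in range(n1):              # each source word ships out mass 1/n1
        row = np.zeros((n1, n2)); row[i, :] = 1
        A_eq.append(row.ravel()); b_eq.append(1.0 / n1)
    for j in range(n2):              # each target word receives mass 1/n2
        col = np.zeros((n1, n2)); col[:, j] = 1
        A_eq.append(col.ravel()); b_eq.append(1.0 / n2)
    res = linprog(C.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method='highs')
    return res.fun

# toy 2-D "embeddings", purely for illustration
toy_wv = {'police': np.array([1.0, 0.0]),
          'station': np.array([0.0, 1.0]),
          'policeman': np.array([1.0, 1.0])}
print(wmd(['police', 'station'], ['policeman'], toy_wv))   # 1.0
```

Here all the mass of 'police' and 'station' must flow to 'policeman', each at unit distance, so the distance is 1.0.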

And the Jaccard index based on cosine similarity in a word-embedding model can be called like this:

>>> from shorttext.metrics.embedfuzzy import jaccardscore_sents
>>> jaccardscore_sents('doctor', 'physician', wvmodel)   # gives 0.6401538990056869
>>> jaccardscore_sents('chief executive', 'computer cluster', wvmodel)   # gives 0.0022515450768836143
>>> jaccardscore_sents('topological data', 'data of topology', wvmodel)   # gives 0.67588977344632573

Most new functions can be found in this tutorial.

Some minor bugs have also been fixed.

The first presidential debate of 2016 was held on September 26, 2016, at Hofstra University in New York. An interesting analysis is of the literacy levels demonstrated by the two candidates, using the Flesch readability ease and the Flesch-Kincaid grade level, as demonstrated in my previous blog entry and on my GitHub: stephenhky/PyReadability.

First, we need the transcript of the debate, which can be found in an article in the New York Times. Copy and paste the text into a file called first_debate_transcript.txt. Then we want to extract each person's speech. To do this, store the following Python code in first_debate_segment.py.

# Trump and Clinton 1st debate on Sept 26, 2016

from nltk import word_tokenize
from collections import defaultdict
import re

def untokenize(words):
    """
    Untokenizing a text undoes the tokenizing operation, restoring
    punctuation and spaces to the places that people expect them to be.
    Ideally, untokenize(tokenize(text)) should be identical to text,
    except for line breaks.
    """
    text = ' '.join(words)
    step1 = text.replace("`` ", '"').replace(" ''", '"').replace('. . .', '...')
    step2 = step1.replace(" ( ", " (").replace(" ) ", ") ")
    step3 = re.sub(r' ([.,:;?!%]+)([ \'"])', r"\1\2", step2)
    step4 = re.sub(r' ([.,:;?!%]+)$', r"\1", step3)
    step5 = step4.replace(" '", "'").replace(" n't", "n't").replace(
        "can not", "cannot")
    step6 = step5.replace(" ` ", " '")
    return step6.strip()

ignored_phrases = ['(APPLAUSE)', '(CROSSTALK)']
persons = ['TRUMP', 'CLINTON', 'HOLT']

fin = open('first_debate_transcript.txt', 'r')
lines = fin.readlines()
fin.close()

lines = filter(lambda s: len(s) > 0, map(lambda s: s.strip(), lines))

speeches = defaultdict(lambda: '')
person = None
for line in lines:
    tokens = word_tokenize(line.strip())
    ignore_colon = False
    added_tokens = []
    for token in tokens:
        if token in ignored_phrases:
            pass
        elif token in persons:
            person = token
            ignore_colon = True
        elif token == ':':
            ignore_colon = False
        else:
            added_tokens += [token]
    speeches[person] += ' ' + untokenize(added_tokens)

for person in persons:
    fout = open('speeches_' + person + '.txt', 'w')
    fout.write(speeches[person])
    fout.close()

The untokenize function is adapted from a code snippet on StackOverflow. This script segments the transcript into the individual speeches of Lester Holt (the host of the debate), Donald Trump (GOP presidential candidate), and Hillary Clinton (DNC presidential candidate), in separate files. Then, on a UNIX or Linux command line, run score_readability.py on each person's script; for example, for Holt's speech:

python score_readability.py speeches_HOLT.txt --utf8

Beware that the text is encoded in UTF-8.
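The two readability measures reported by score_readability.py are the standard Flesch formulas, which depend only on the word, sentence, and syllable counts. A quick sketch (plugging in Holt's counts from the results reproduces the reported scores):

```python
def flesch_scores(words, sentences, syllables):
    """Flesch readability ease and Flesch-Kincaid grade level
    from raw counts, using the standard formulas."""
    wps = words / sentences      # average words per sentence
    spw = syllables / words      # average syllables per word
    ease = 206.835 - 1.015 * wps - 84.6 * spw
    grade = 0.39 * wps + 11.8 * spw - 15.59
    return ease, grade

# Lester Holt's counts, as reported in the results
ease, grade = flesch_scores(1935, 157, 2732)
print(ease, grade)   # ≈ 74.88 and 5.88
```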
For Lester Holt, we have:

Word count = 1935
Sentence count = 157
Syllable count = 2732
Flesch readability ease = 74.8797052289
Flesch-Kincaid grade level = 5.87694629602

For Donald Trump:

Word count = 8184
Sentence count = 693
Syllable count = 10665
Flesch readability ease = 84.6016324536
Flesch-Kincaid grade level = 4.3929136992

And for Hillary Clinton:

Word count = 6179
Sentence count = 389
Syllable count = 8395
Flesch readability ease = 75.771973015
Flesch-Kincaid grade level = 6.63676650035

Apparently, compared to Donald Trump, Hillary Clinton spoke at a higher literary level, but her speech was less easy to understand. Recall from my previous entry that for Shakespeare's Macbeth, the Flesch readability ease is 112.278048591 and the Flesch-Kincaid grade level is 0.657934056288; for the King James Version Bible (KJV), they are 79.6417489428 and 9.0085275366 respectively.

This is just simple text analytics; the content itself is not analyzed here. Augustine of Hippo wrote in Book IV of On Christian Teaching (Latin: De doctrina christiana) about rhetoric and eloquence:

“… wisdom without eloquence is of little value to the society… eloquence without wisdom is… a great nuisance, and never beneficial.” — Augustine of Hippo, Book IV of On Christian Teaching

Ever since Mehta and Schwab laid out the relationship between restricted Boltzmann machines (RBM) and deep learning mathematically (see my previous entry), scientists have been discussing why deep learning works so well. Recently, Henry Lin and Max Tegmark put a preprint on arXiv (arXiv:1609.09225), arguing that deep learning works because it captures a few essential physical laws and properties. Tegmark is a cosmologist.

Physical laws are simple in the sense that a few properties, such as locality, symmetry, and hierarchy, lead to large-scale, universal, and often complex phenomena. A lot of machine learning algorithms, including deep learning algorithms, have deep relations with formalisms outlined in statistical mechanics.
A lot of machine learning algorithms are basically probability theory. Lin and Tegmark outlined a few types of algorithms that seek various types of probabilities, and related these probabilities to Hamiltonians in many-body systems. They argued why neural networks can approximate functions (polynomials) so well, exhibiting a simple neural network that performs multiplication. With the central limit theorem or Jaynes' arguments (see my previous entry), a lot of distributions, they said, can be approximated by low-order polynomial Hamiltonians, much as many many-body systems can be approximated by the fourth-order Landau-Ginzburg-Wilson (LGW) functional.

Properties such as locality reduce the number of hyperparameters needed, because they restrict interactions to close proximity. Symmetry reduces them further, along with the computational complexity. Symmetry and second-order phase transitions make the scaling hypothesis possible, leading to tools such as the renormalization group (RG). As many people have argued, deep learning resembles RG because it filters out unnecessary information and maps out the crucial features. Lin and Tegmark use classifying cats vs. dogs as an example, analogous to retrieving the temperature of a many-body system with an RG procedure. They gave a counter-example to Schwab's paper, in which the probabilities are not preserved by the RG procedure; while the observation is sound, preserving probabilities is not the point of the RG procedure anyway. They also discussed the no-flattening theorems for neural networks.

There are many tasks that involve coding: for example, putting kids into groups according to their age, labeling webpages by their kinds, or sorting students in Hogwarts into four houses… And researchers or lawyers need to code people, according to their filled-in information, into occupations.
Melissa Friesen, an investigator in the Division of Cancer Epidemiology and Genetics (DCEG), National Cancer Institute (NCI), National Institutes of Health (NIH), saw the need for large-scale coding, as many researchers deal with big data concerning epidemiology. She led a research project, in collaboration with the Office of Intramural Research (OIR), Center for Information Technology (CIT), NIH, to develop an artificial intelligence system to cope with the problem. This led to a publicly available tool called SOCcer, an acronym for “Standardized Occupation Coding for Computer-assisted Epidemiological Research.” (URL: http://soccer.nci.nih.gov/soccer/) The system was initially developed in an attempt to find correlations between the onset of cancers and other diseases and one's occupation. “The application is not intended to replace expert coders, but rather to prioritize which job descriptions would benefit most from expert review,” said Friesen in an interview. She works mainly with Daniel Russ in CIT.

SOCcer takes a job title, industry codes (in terms of SIC, the Standard Industrial Classification), and job duties, and outputs an occupational code in SOC 2010 (Standard Occupational Classification), the system used by U.S. federal government agencies. The data involve short, often messy, text, and there are 840 codes in the SOC 2010 system, so conventional natural language processing (NLP) methods may not apply. Friesen, Russ, and Kwan-Yuet (Stephen) Ho (also in OIR, CIT; a CSRA staff member) used fuzzy logic and maximum entropy (maxent) methods, with some feature engineering, to build various classifiers. These classifiers are aggregated together, as in stacked generalization (see my previous entry), using logistic regression to give a final score.

SOCcer has a companion software tool, called SOCAssign, for expert coders to prioritize the codings. It won a DCEG Informatics Tool Challenge award in 2015; SOCcer itself was awarded in 2016.
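The aggregation step described above, stacked generalization with a logistic-regression combiner, can be sketched with scikit-learn. The features, labels, and base learners here are hypothetical stand-ins; SOCcer's actual features are engineered from job titles, SIC codes, and duties:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# hypothetical stand-in features and binary labels
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

base_learners = [GaussianNB(),
                 DecisionTreeClassifier(max_depth=3, random_state=0)]

# out-of-fold probability predictions of each base classifier
# become the meta-features for the combiner
meta_features = np.column_stack([
    cross_val_predict(clf, X, y, cv=5, method='predict_proba')[:, 1]
    for clf in base_learners
])

meta = LogisticRegression().fit(meta_features, y)   # aggregation step
scores = meta.predict_proba(meta_features)[:, 1]    # final scores
```

Using out-of-fold predictions keeps the meta-learner from merely memorizing the base learners' training fits.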
And the SOCcer team received the Scientific Award of Merit from CIT/OCIO in 2016 as well (see this). Their work was published in Occup. Environ. Med.

The topic of word-embedding algorithms has been one of the interests of this blog, as in this entry, with Word2Vec [Mikolov et. al. 2013] as one of the main examples. It is a great tool for text mining (for example, see [Czerny 2015]), as it reduces the dimensions needed (compared to the bag-of-words model). As with many algorithms borrowed from computer vision, a lot of these algorithms use deep learning methods to train the model, while it is not exactly clear why they work. Despite that, there are many articles about how to train the model. [Goldberg & Levy 2014, Rong 2014 etc.] Addition and subtraction of the word vectors reveal amazing relationships that carry semantic values, as I have shown in my previous blog entry. [Ho 2015] However, Tomas Mikolov has left Google, and the development of this algorithm has been discontinued.

As a follow-up of their work, the Stanford NLP group later proposed a model, called GloVe (Global Vectors), that embeds word vectors using probabilistic methods. [Pennington, Socher & Manning 2014] It is implemented in the package glove-python in Python, and text2vec in R (or see their CRAN post). Their paper is neatly written, and a highly recommended read.

To explain the theory of GloVe, we start with the basic probabilistic picture in basic natural language processing (NLP). We suppose that related words occur in certain text windows within a corpus, but the details are not important here. Assume that $i$, $j$, and $k$ are three words, and define the conditional probability $P_{ij} = P(j | i) = \frac{X_{ij}}{X_i}$, where the $X$'s are the counts, and similarly for $P_{ik}$ and $P_{jk}$. We are interested in the following ratio:

$F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$.

The tilde means “context,” but we will later assume a context is also a word.
Citing the example from their paper, take $i$ to be ice and $j$ to be steam. If $k$ is solid, the ratio is expected to be large; if $k$ is gas, it is expected to be small. But if $k$ is water, which is related to both, or fashion, which is related to neither, then the ratio is expected to be approximately 1.

The addition and subtraction in Word2Vec are similar to this. We want the ratio to behave like the subtraction in Word2Vec (with multiplication playing the role of addition), so we modify the function $F$ such that

$F(w_i - w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$.

On the other hand, the input arguments of $F$ are vectors, but the output is a scalar. We resolve this by making the input argument a dot product,

$F( (w_i - w_j)^T \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$.

In NLP, word-word co-occurrence matrices are symmetric, and our function $F$ should be invariant under switching the labeling. We first require $F$ to be a homomorphism,

$F((w_i - w_j)^T \tilde{w}_k) = \frac{F(w_i^T \tilde{w}_k) }{ F(w_j^T \tilde{w}_k)}$,

where we define

$F(w_i^T \tilde{w}_k) = P_{ik} = \frac{X_{ik}}{X_i}$.

It is clear that $F$ is an exponential function; to ensure symmetry, we further define

$w_i^T \tilde{w}_k + b_i + \tilde{b}_k = \log X_{ik}$.

From this equation, the authors defined the following cost function to optimize the GloVe model:

$J = \sum_{i, j=1}^V f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$,

where $w_i$, $\tilde{w}_j$, $b_i$, and $\tilde{b}_j$ are parameters to learn, and $f(x)$ is a weighting function. (Refer to the paper for details.) [Pennington, Socher & Manning 2014]

As Radim Řehůřek said in his blog entry, [Řehůřek 2014] it is a neat paper, but its evaluation is crappy. This theory explains why certain relations can be captured, such as Paris – France being roughly equal to Beijing – China, as both can be transformed to the ratio in the definition of $F$ above.
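The cost function $J$ above is straightforward to write down in code. A minimal numpy sketch, with the weighting function $f(x) = \min(1, (x/x_{max})^{\alpha})$ of the paper ($x_{max}=100$, $\alpha=0.75$); the model parameters here are random, purely for illustration:

```python
import numpy as np

def glove_cost(X, W, Wt, b, bt, x_max=100.0, alpha=0.75):
    """Weighted least-squares cost J of the GloVe model for a
    V x V co-occurrence matrix X."""
    J = 0.0
    V = X.shape[0]
    for i in range(V):
        for j in range(V):
            if X[i, j] > 0:
                f = min(1.0, (X[i, j] / x_max) ** alpha)   # weighting f(x)
                err = W[i] @ Wt[j] + b[i] + bt[j] - np.log(X[i, j])
                J += f * err ** 2
    return J

rng = np.random.default_rng(0)
V, d = 5, 3
W, Wt = rng.normal(size=(V, d)), rng.normal(size=(V, d))   # word / context vectors
b, bt = rng.normal(size=V), rng.normal(size=V)             # biases
X = rng.integers(1, 20, size=(V, V)).astype(float)         # toy co-occurrence counts
print(glove_cost(X, W, Wt, b, bt))
```

Training then amounts to minimizing this cost with stochastic gradient methods such as AdaGrad, as in the reference implementation.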
It is a neat paper, as it employs optimization theory and probability theory, without any black-box deep learning.

Previously, I wrote an entry on text mining in R and Python, with a comparison. However, the text mining package employed there for R was tm, which has some problems:

1. The syntax is not natural for experienced R users.
2. tm uses simple_triplet_matrix from the slam library for the document-term matrix (DTM) and the term co-occurrence matrix (TCM), which is not as widely used as dgCMatrix from the Matrix library.

Tommy Jones, a Ph.D. student at George Mason University and a data scientist at Impact Research, developed an alternative text mining package called textmineR. He presented it at a Stat Prog DC Meetup on April 27, 2016. It employs a more natural syntax, and dgCMatrix. All in all, it is a wrapper around a lot of existing R packages that facilitates the text mining process, such as creating DTMs with stopwords and appropriate stemming/lemmatizing functions. Here is a sample code snippet that creates a DTM with the example from the previous entry:

library(tm)
library(textmineR)

texts <- c('I love Python.',
           'R is good for analytics.',
           'Mathematics is fun.')

dtm <- CreateDtm(texts,
                 doc_names = c(1:length(texts)),
                 ngram_window = c(1, 1),
                 stopword_vec = c(tm::stopwords('english'), tm::stopwords('SMART')),
                 lower = TRUE,
                 remove_punctuation = TRUE,
                 remove_numbers = TRUE)

The DTM is a sparse matrix:

3 x 6 sparse Matrix of class "dgCMatrix"
  analytics fun mathematics good python love
1         .   .           .    .      1    1
2         1   .           .    1      .    .
3         .   1           1    .      .    .

On the other hand, textmineR wraps text2vec, an R package that implements the word-embedding algorithm GloVe. It also wraps a number of topic modeling algorithms, such as latent Dirichlet allocation (LDA) and correlated topic models (CTM). In addition, it contains a parallel computing loop function called TmParallelApply, analogous to the original R parallel loop function mclapply, but TmParallelApply works on Windows as well.
textmineR is an open-source project, with source code available on GitHub, which contains his example codes.

Biomedical research and clinical data are often collected on the same sample of subjects at different points in time. Such data are called “longitudinal data.” (See the definition by BLS.) When performing supervised learning (e.g., SVM) on data of this kind, the impact of the time-varying correlation of the features on the outcomes / predictions may be blurred. In order to smooth out the temporal effect of the features, changes to the original learning algorithms are necessary.

In a study conducted by the Center for Information Technology (CIT) and the National Institute on Aging (NIA) at the National Institutes of Health (NIH), with some clinical data as the training data, a longitudinal support vector regression (LSVR) algorithm was presented and shown to outperform other machine learning methods. [Du et. al. 2015] Their results were published at the IEEE BIBM (Bioinformatics and Biomedicine) conference. Their work is adapted from an earlier work by Chen and Bowman. [Chen & Bowman 2011] The dataset is longitudinal because it contains N patients with p features, taken at T points in time.

Traditional support vector regression (SVR) solves the following optimization problem: $\text{min} \frac{1}{2} || \mathbf{w} ||^2$, where $\mathbf{w}$ is the normal to the hyperplane surface, under the constraints:

$y_i - \mathbf{w}^T \Phi (\mathbf{x}_i) - b \leq \epsilon$,
$\mathbf{w}^T \Phi (\mathbf{x}_i) + b - y_i \leq \epsilon$.

In LSVR, however, the data points are more complicated: for each patient s, the features at time t are given by a vector $\mathbf{x}_s^{(t)}$.
The first goal of LSVR is to assign each patient a T-by-p matrix $\tilde{\mathbf{x}}_s = \left[ \mathbf{x}_s^{(1)}, \mathbf{x}_s^{(2)}, \ldots, \mathbf{x}_s^{(T)} \right]^T$ and a T-by-1 vector $\tilde{y}_s = \left[ y_s^{(1)}, y_s^{(2)}, \ldots, y_s^{(T)} \right]^T$, with an unknown parameter vector $\mathbf{\beta} = \left[ 1, \beta_1, \beta_2, \ldots, \beta_{T-1} \right]^T$, such that the constraints become:

$\tilde{y}_i^T \beta - \langle \mathbf{w}^T, \tilde{\mathbf{x}}_i^T \beta \rangle - b \leq \epsilon + \xi_i$,
$\langle \mathbf{w}^T, \tilde{\mathbf{x}}_i^T \beta \rangle + b - \tilde{y}_i^T \beta \leq \epsilon + \xi_i^{*}$,

where the $\xi_i$'s are additional regularization parameters. The parameters $\beta$ can be found by iterative quadratic optimization, with the constraints handled by Lagrange multipliers. For details, please refer to [Du et. al. 2015]. This decouples, or smoothes out, the temporal covariation within the patients, so that a better prediction can be made.

Previously, I went through, heuristically, the description of topology using homology groups in this entry. [Ho 2015] This is the essence of algebraic topology: we describe the topology using Betti numbers, the ranks of the homology groups. What they mean can be summarized as follows: [Bubenik 2015]

“… homology in degree 0 describes the connectedness of the data; homology in degree 1 detects holes and tunnels; homology in degree 2 captures voids; and so on.”

## Concept of Persistence

However, in computational problems, it is discrete points that we are dealing with. We formulate their connectedness through constructing complexes, as described in another blog entry of mine. [Ho 2015] From the Wolfram Demonstration that I quoted previously, connectedness depends on some parameters, such as the radii within which points are considered connected. Whether it is the Čech complex, the Vietoris-Rips complex, or the alpha complex, the idea is similar.
With discrete data, therefore, there is no definite answer to what the connectedness among the points is, as it depends on the parameters. The concept of persistence was developed to tackle this problem, and it is the core concept of computational topology. There are a lot of papers about persistence, but the most famous work is by Zomorodian and Carlsson, who studied it algebraically. [Zomorodian & Carlsson 2005] The idea is that as one increases the radii of the points, the complexes change, and so do the homology groups. By varying the radii, we can observe which topology persists.

From the diagram above, we can see that as the radius ε increases, the diagram becomes more connected. To understand the changes in the homologies, there are a few ways. In the diagram above, barcodes represent the “life span” of a connected component as ε increases. The Betti number of a certain degree (0, 1, or 2 in this example) at a certain value of ε is the number of barcodes at that degree. For example, look at the leftmost vertical dashed line: $\beta_0=10$, as there are 10 barcodes existing for $H_0$; note that there are indeed 10 separate connected components. For the second leftmost vertical dashed line, $\beta_0=6$ (6 connected components) and $\beta_1=2$ (2 holes).

Another way is the persistence diagram, which basically plots the “birth” and “death” times of all the barcodes above. For an explanation of the persistence diagram, please refer to this blog entry by Sebastien Bubeck [Bubeck 2013] or the paper by Fasy et. al. [Fasy et. al. 2014] Yet another way to describe persistent topology is the persistence landscape. [Bubenik 2015]

## TDA Package in R

There are a lot of tools for performing topological data analysis. Ayasdi Core is a famous one. There are open-source C++ libraries such as Dionysus and PHAT, and there is a Python binding for Dionysus too. There is a package in R, called TDA, that wraps Dionysus and PHAT.
To install it, simply open an R session and enter

install.packages('TDA')

To load it, simply enter

library(TDA)

We know that for a circle, $\beta_0=\beta_1=1$, as it has one connected component and one hole. Prepare the circle and store it in X using the function circleUnif:

X <- circleUnif(n=1000, r=1)
plot(X)

Then we can see a two-dimensional circle like this:

To calculate the persistent homology, use the function gridDiag:

diag.info <- gridDiag(X=X, FUN=kde, h=0.3,
                      lim=cbind(c(-1, 1), c(-1, 1)), by=0.01,
                      sublevel=FALSE, library='PHAT',
                      printProgress=FALSE)

To plot the barcodes and the persistence diagram, enter:

par(mfrow=c(2,1))
plot(diag.info$diagram)
plot(diag.info$diagram, barcode=TRUE)


In the plots, black refers to degree 0, and red refers to degree 1.

We can play the same game by adding a horizontal line to cut the circle into two halves:

X<- circleUnif(n=1000, r=1)
hl<- matrix(c(seq(-1, 1, 0.02), rep(0, 201)), ncol=2)
X<- rbind(X, hl)
plot(X)


And the barcodes and persistence diagram are:

We can try this with three-dimensional objects like a sphere or a torus, but I never finished the calculations at a reasonable speed.
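For intuition, zero-dimensional persistence (the H0 barcodes above) can be computed with nothing more than union-find over pairwise distances sorted in increasing order: every point is born at scale 0, and a component dies when the growing scale first connects it to an older one. A minimal pure-Python sketch, independent of the TDA package:

```python
from itertools import combinations

def h0_barcodes(points):
    """H0 persistence barcodes (birth=0, death=merge scale) for a
    point cloud, via single-linkage union-find."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # find the root, with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    dist = lambda p, q: sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    edges = sorted((dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(n), 2))

    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(d)      # one component dies at scale d
    # n-1 components die; one persists forever
    return [(0.0, d) for d in deaths] + [(0.0, float('inf'))]

bars = h0_barcodes([(0, 0), (0, 1), (5, 0), (5, 1)])
# the two tight pairs merge at scale 1; the two clusters merge at scale 5
```

Higher-degree homology (the holes detected by $\beta_1$) needs the full boundary-matrix reduction that libraries like Dionysus and PHAT implement.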

On October 14, 2015, I attended the regular meeting of the DCNLP meetup group, a group on natural language processing (NLP) in the Washington, DC area. The talk, titled “Deep Learning for Question Answering,” was given by Mr. Mohit Iyyer, a Ph.D. student in the Department of Computer Science, University of Maryland (my alma mater!). He is a very good speaker.

I have no experience with deep learning at all, although I did write a remotely related blog post. I didn't even start training my first neural network until the day after the talk. However, Mr. Iyyer explained what recurrent neural networks (RNN), recursive neural networks, and deep averaging networks (DAN) are. This helped me a lot in understanding the principles of the famous word2vec model (which is something I am going to write about soon!). You can refer to his slides for more details. There are really a lot of talents in College Park, like another expert, Joe Yue Hei Ng, who is exploiting deep learning a lot as well.

The applications are awesome: factual question answering with external knowledge, reasoning-based question answering, and visual question answering, in increasing order of difficulty.

Mr. Iyyer and the participants discussed a lot about different packages. Mr. Iyyer uses Theano, a Python package for deep learning, which is good for model building and other analytical work. Some prefer Caffe. Some people, who are Java developers, also use deeplearning4j.

This meetup was a special one too, because it was the last held at Stetsons Famous Bar & Grill on U Street, which will close permanently on Halloween this year. The group is eagerly looking for a new venue for upcoming meetups. The meeting was a crowded one. I sincerely thank the organizers, Charlie Greenbacker and Liz Merkhofer, for hosting all these meetings, and Chris Phipps (a linguist from IBM Watson) for recording.