Simple Literary Analytics on Presidential Candidates in the First 2016 Presidential Debate

The first presidential debate 2016 was held on September 26, 2016 in Hofstra University in New York. An interesting analysis will be the literacy level demonstrated by the two candidates using Flesch readability ease and Flesch-Kincaid grade level, demonstrated in my previous blog entry and my Github: stephenhky/PyReadability.

First, we need to get the transcript of the debate, which can be found in an article in New York Times. Copy and paste the text into a file called first_debate_transcript.txt. Then we want to extract out speech of each person. To do this, store the following Python code in

# Trump and Clinton 1st debate on Sept 26, 2016

from nltk import word_tokenize
from collections import defaultdict
import re

# adopted from
def untokenize(words):
    Untokenizing a text undoes the tokenizing operation, restoring
    punctuation and spaces to the places that people expect them to be.
    Ideally, `untokenize(tokenize(text))` should be identical to `text`,
    except for line breaks.
    text = ' '.join(words)
    step1 = text.replace("`` ", '"').replace(" ''", '"').replace('. . .',  '...')
    step2 = step1.replace(" ( ", " (").replace(" ) ", ") ")
    step3 = re.sub(r' ([.,:;?!%]+)([ \'"`])', r"\1\2", step2)
    step4 = re.sub(r' ([.,:;?!%]+)$', r"\1", step3)
    step5 = step4.replace(" '", "'").replace(" n't", "n't").replace(
         "can not", "cannot")
    step6 = step5.replace(" ` ", " '")
    return step6.strip()

ignored_phrases = ['(APPLAUSE)', '(CROSSTALK)']
persons = ['TRUMP', 'CLINTON', 'HOLT']
fin = open('first_debate_transcript.txt', 'rb')
lines = fin.readlines()

lines = filter(lambda s: len(s)>0, map(lambda s: s.strip(), lines))
speeches = defaultdict(lambda : '')
person = None

for line in lines:
    tokens = word_tokenize(line.strip())
    ignore_colon = False
    added_tokens = []
    for token in tokens:
        if token in ignored_phrases:
        elif token in persons:
            person = token
            ignore_colon = True
        elif token == ':':
            ignore_colon = False
            added_tokens += [token]
            speeches[person] += ' ' + untokenize(added_tokens)

for person in persons:
    fout = open('speeches_'+person+'.txt', 'wb')

There is an untokenize function adapted from a code in StackOverflow. This segmented the transcript into the individual speech of Lester Holt (the host of the debate), Donald Trump (GOP presidential candidate), and Hillary Clinton (DNC presidential candidate) in separate files. Then, on UNIX or Linux command line, run on each person’s script, by, for example, for Holt’s speech,

python speeches_HOLT.txt --utf8

Beware that it is encoded in UTF-8. For Lester Holt, we have

Word count = 1935
Sentence count = 157
Syllable count = 2732
Flesch readability ease = 74.8797052289
Flesch-Kincaid grade level = 5.87694629602

For Donald Trump,

Word count = 8184
Sentence count = 693
Syllable count = 10665
Flesch readability ease = 84.6016324536
Flesch-Kincaid grade level = 4.3929136992

And for Hillary Clinton,

Word count = 6179
Sentence count = 389
Syllable count = 8395
Flesch readability ease = 75.771973015
Flesch-Kincaid grade level = 6.63676650035

Apparently, compared to Donald Trump, Hillary Clinton has a higher literary level, but her speech is less easy to understand.

Recalling from my previous entry, for Shakespeare’s MacBeth, the Flesch readability ease is 112.278048591, and Flesch-Kincard grade level 0.657934056288; for King James Version Bible (KJV), they are 79.6417489428 and 9.0085275366 respectively.

This is just a simple text analytics. However, the content is not analyzed here. Augustine of Hippo wrote in his Book IV of On Christian Teaching (Latin: De doctrina christiana) about rhetoric and eloquence:

“… wisdom without eloquence is of little value to the society… eloquence without wisdom is… a great nuisance, and never beneficial.” — Augustine of Hippo, Book IV of On Christian Teaching


Continue reading “Simple Literary Analytics on Presidential Candidates in the First 2016 Presidential Debate”


Linking Fundamental Physics to Deep Learning

Ever since Mehta and Schwab laid out the relationship between restricted Boltzmann machines (RBM) and deep learning mathematically (see my previous entry), scientists have been discussing why deep learning works so well. Recently, Henry Lin and Max Tegmark put a preprint on arXiv (arXiv:1609.09225), arguing that deep learning works because it captures a few essential physical laws and properties. Tegmark is a cosmologist.

Physical laws are simple in a way that a few properties, such as locality, symmetry, hierarchy etc., lead to large-scale, universal, and often complex phenomena. A lot of machine learning algorithms, including deep learning algorithms, have deep relations with formalisms outlined in statistical mechanics.

A lot of machine learning algorithms are basically probability theory. They outlined a few types of algorithms that seek various types of probabilities. They related the probabilities to Hamiltonians in many-body systems.

They argued why neural networks can approximate functions (polynomials) so well, giving a simple neural network performing multiplication. With central limit theorem or Jaynes’ arguments (see my previous entry), a lot of multiplications, they said, can be approximated by low-order polynomial Hamiltonian. This is like a lot of many-body systems that can be approximated by 4-th order Landau-Ginzburg-Wilson (LGW) functional.

Properties such as locality reduces the number of hyper-parameters needed because it restricts to interactions among close proximities. Symmetry further reduces it, and also computational complexities. Symmetry and second order phase transition make scaling hypothesis possible, leading to the use of the tools such as renormalization group (RG). As many people have been arguing, deep learning resembles RG because it filters out unnecessary information and maps out the crucial features. Tegmark use classifying cats vs. dogs as an example, as in retrieving temperatures of a many-body systems using RG procedure. They gave a counter-example to Schwab’s paper with the probabilities cannot be preserved by RG procedure, but while it is sound, but it is not the point of the RG procedure anyway.

They also discussed about the no-flattening theorems for neural networks.

Continue reading “Linking Fundamental Physics to Deep Learning”

SOCcer: Computerized Coding In Epidemiology

There are many tasks that involve coding, for example, putting kids into groups according to their age, labeling the webpages about their kinds, or putting students in Hogwarts into four colleges… And researchers or lawyers need to code people, according to their filled-in information, into occupations. Melissa Friesen, an investigator in Division of Cancer Epidemiology and Genetics (DCEG), National Cancer Institute (NCI), National Institutes of Health (NIH), saw the need of large-scale coding. Many researchers are dealing with big data concerning epidemiology. She led a research project, in collaboration with Office of Intramural Research (OIR), Center for Information Technology (CIT), National Institutes of Health (NIH), to develop an artificial intelligence system to cope with the problem. This leads to a publicly available tool called SOCcer, an acronym for “Standardized Occupation Coding for Computer-assisted Epidemiological Research.” (URL:

The system was initially developed in an attempt to find the correlation between the onset of cancers and other diseases and the occupation. “The application is not intended to replace expert coders, but rather to prioritize which job descriptions would benefit most from expert review,” said Friesen in an interview. She mainly works with Daniel Russ in CIT.

SOCcer takes job title, industry codes (in terms of SIC, Standard Industrial Classification), and job duties, and gives an occupational code called SOC 2010 (Standard Occupational Classification), used by U. S. federal government agencies. The data involves short text, often messy. There are 840 codes in SOC 2010 systems. Conventional natural language processing (NLP) methods may not apply. Friesen, Russ, and Kwan-Yuet (Stephen) Ho (also in OIR, CIT; a CSRA staff) use fuzzy logic, and maximum entropy (maxent) methods, with some feature engineering, to build various classifiers. These classifiers are aggregated together, as in stacked generalization (see my previous entry), using logistic regression, to give a final score.

SOCcer has a companion software, called SOCAssign, for expert coders to prioritize the codings. It was awarded with DCEG Informatics Tool Challenge 2015. SOCcer itself was awarded in 2016. And the SOCcer team was awarded for Scientific Award of Merit by CIT/OCIO in 2016 as well (see this). Their work was published in Occup. Environ. Med.


Continue reading “SOCcer: Computerized Coding In Epidemiology”

Temporal Influence in Data Analysis


Biomedical research and clinical data are often collected on the same sample of data at different points in time. These data are called “longitudinal data.” (See the definition by BLS.) When performing supervised learning (e.g., SVM) on data of this kind, the impact of time-varying correlation of the features on the outcomes / predictions may be blurred. In order to smoothing out the temporal effect of the features, changes to the original learning algorithms are necessary.

In a study conducted by Center for Information Technology (CIT), and National Institutes on Aging (NIA) in National Institutes of Health (NIH), with some clinical data as the training data, a longitudinal support vector regression (LSVR) algorithm was presented, and shown to outperform other machine learning methods. [Du et. al. 2015] Their results were published in IEEE BIBM (Bioinformatics and Biomedicine) conference. Their work is adapted from an earlier work by Chen and Bowman. [Chen & Bowman 2011] The dataset is a longitudinal, because it contains N patients with p features, taken at T points in time.

Traditional support vector regression (SVR) is to solve the following optimization problem:

\text{min} \frac{1}{2} || \mathbf{w} ||^2,

where \mathbf{w} is a hyperplane surface, under the constraints:

y_i - \mathbf{w^T} \Phi (\mathbf{x}_i) - b \leq \epsilon,
\mathbf{w^T} \Phi (\mathbf{x}_i) + b - y_i \leq \epsilon.

However, in LSVR, the data points are more complicated. For each patient s, its features at time t is given by a vector \mathbf{x}_s^{(t)}. The first goal of LSVR is to assign each patient a T-by-p matrix \tilde{\mathbf{x}}_s = \left[ \mathbf{x}_s^{(1)}, \mathbf{x}_s^{(2)}, \ldots, \mathbf{x}_s^{(T)} \right]^T, and a T-by-1 vector \tilde{y}_s =\left[ y_s^{(1)}, y_s^{(2)}, \ldots, y_s^{(T)} \right]^T, with an unknown parameter vector \mathbf{\beta} = \left[ 1, \beta_1, \beta_2, \ldots, \beta_{T-1} \right]^T such that the constraints becomes:

\tilde{y}_i^T \beta - \langle \mathbf{w^T}, \mathbf{\tilde{x}_i}^T \beta \rangle - b \leq \epsilon + \xi_i,
\langle \mathbf{w^T}, \mathbf{\tilde{x}_i}^T \beta \rangle+ b -\tilde{y}_i^T \beta \leq \epsilon + \xi_i^{*},

where \xi_i‘s are additional regularization parameters. The parameters \beta‘s can be found by iteratively quadratic optimization. The constraints are handled with Lagrangian’s multipliers.

For details, please refer to [Du et. al. 2015]. This way decouples, or smoothes out, the temporal covariation within the patients. A better prediction can be made.

Continue reading “Temporal Influence in Data Analysis”


Previously, I have went through heuristically the description of topology using homology groups in this entry. [Ho 2015] This is the essence of algebraic topology. We describe the topology using Betti numbers, the rank of the homolog groups. What they mean can be summarized as: [Bubenik 2015]

“… homology in degree 0 describes the connectedness of the data; homology in degree 1 detects holes and tunnels; homology in degree 2 captures voids; and so on.

Concept of Persistence

However, in computational problems, it is the discrete points that we are dealing with. We formulate their connectedness through constructing complexes, as described by my another blog entry. [Ho 2015] From the Wolfram Demonstration that I quoted previously, connectedness depends on some parameters, such as the radii of points that are considered connected. Whether it is Čech Complex, RP complex, or Alpha complex, the idea is similar. With discrete data, therefore, there is no definite answer how the connectedness among the points are, as it depends on the parameters.

Therefore, the concept of persistence has been developed to tackle this problem. This is the core concept for computational topology. There are a lot of papers about persistence, but the most famous work is done by Zomorodian and Carlsson, who algebraically studied it. [Zomorodian & Carlsson 2005] The idea is that as one increases the radii of points, the complexes change, and so do the homology groups. By varying the radii, we can observe which topology persists.

Persistence, and barcodes [Christ 2008]
From the diagram above, we can see that as the radii ε increase, the diagram becomes more connected. To understand the changes of homologies, there are a few ways. In the diagram above, barcodes represent the “life span” of a connected component as ε increases. The Betti numbers of a certain degree (0, 1, or 2 in this example) at a certain value of ε is the number of barcodes at that degree. For example, look at the left most vertical dashed line, \beta_0=10, as there are 10 barcodes existing for H_0. Note there are indeed 10 separate connected components. For the second leftmost vertical dashed line, \beta_0=6 (6 connected components), and \beta_1=2 (2 holes).

Another way is using the persistence diagram, basically plotting the “birth” and “death” times of all the barcodes above. For an explanation of persistence diagram, please refer to this blog entry by Sebastien Bubeck, [Bubeck 2013] or the paper by Fasy et. al. [Fasy et. al. 2014] Another way to describe persistent topology is the persistence landscape. [Bubenik 2015]

TDA Package in R

There are a lot of tools to perform topological data analysis. Ayasdi Core is a famous one. There are open sources C++ libraries such as Dionysus, or PHAT. There is a Python binding for Dionysus too.

There is a package in R that wraps Dionysus and PHAT, called TDA. To install it, simply open an R session, and enter


To load it, simply enter


We know that for a circle, \beta_0=\beta_1=1, as it has on connected components, and a hole. Prepare the circle and store it in X by the function circleUnif:

X<- circleUnif(n=1000, r=1)

Then we can see a 2-dimensional circle like this:

To calculate the persistent homology, use the function gridDiag:<- gridDiag(X=X, FUN=kde, h=0.3, lim=cbind(c(-1, 1), c(-1, 1)), by=0.01, sublevel = FALSE, library = 'PHAT', printProgress=FALSE)

To plot the barcodes and persistence diagram, enter:

plot($diagram, barcode=TRUE)


In the plots, black refers to degree 0, and red refers to degree 1.

We can play the same game by adding a horizontal line to cut the circle into two halves:

X<- circleUnif(n=1000, r=1)
hl<- matrix(c(seq(-1, 1, 0.02), rep(0, 201)), ncol=2)
X<- rbind(X, hl)

And the barcodes and persistence diagram are:


We can try this with three-dimensional objects like sphere, or torus, but I never finished the calculation in reasonable speeds.

Continue reading “Persistence”


Text mining can be applied on rap lyrics.

Today I attended an event organized by Data Science MD Meetup Group, a talk titled “Lose Yourself in Rapalytics,” by Abhay, a PhD student in University of Maryland, Baltimore County (UMBC). Rapalytics is an online tool analyzing raps.

It is another task of text mining and natural language processing. He mentioned a few common tools. However, he also specifically looked at rhymes (as rhyme is an important element of rap lyrics), and profanity (as rap music is commonly, or stereotypically, dirty).

Screen Shot 2015-11-19 at 11.51.33 PM.png
Analysis of the use of F*ck word in rap lyrics using Rapalytics

Play with it!

Continue reading “Rapalytics”

Topology in Physics and Computing

Topology has been shown to reveal important information about geometry and shape from data, [Carlsson 2015][Carlsson 2009] as I have talked about in various TDA blog entries. I have also demonstrated how to describe the topology if discrete data points by constructing simplicial complexes, and then calculated the homology and Betti numbers. (I will talk about persistent homology in the future.) Dealing with discrete data points in our digital storage devices, homology is the best way to describe it.

But if you are from a physics background, you may be familiar with the concept of homotopy and fundamental group. Some physicists deal with topology without digging into advanced mathematical tools but simply through solitons. There is a well-written introduction in this blog. In the physical world, an object is said to be topological if:

  • there is a singular point that cannot be removed by a continuous deformation of field; [Mermin 1979]
  • it has a saddle-point equation of the model that is different from another object of another topology, [Rajaraman 1987] inducing different kinds of physical dynamics; [Bray 1994]
  • it can only be removed by crossing an energy barrier, which can be described by an instanton; [Calzetta, Ho, Hu 2010]
  • it can proliferate through Kosterlitz-Thouless (BKT) phase transition; [Kosterliz, Thouless 1973]
  • it can form in a system through a second-order phase transition at a finite rate, a process known as Kibble-Zurek mechanism (KZM); [Kibble 1976] [Zurek 1985] and
  • its topology can be described by a winding number. (c.f. Betti numbers in homology)

Topological objects include vortices in magnets, superfluids, superconductors, or Skyrmions in helimagnets. [Mühlbauer et. al. 2009] [Ho et. al. 2010] They may come in honeycomb order, like Abrikosov vortices in type-II superconductors, [Abrikosov 1957] and helical nanofilaments in smectics. [Matsumoto et. al. 2009] It is widely used in fractional quantum Hall effect [Tsui et. al. 1982] and topological insulators (a lot of references can be found…). They can all be described using homotopy and winding numbers. We can see that topology is useful to describe the physical world for the complexities and patterns. There are ideas in string-net theory to use topology to describe the emergence of patterns and new phases of quantum matter. [Zeng et. al. 2015] Of course, I must not omit topological quantum computing that makes the qubits immune to environmental noise. [Das Sarma, Freedman, Nayak 2005]

However in data analytics, we do not use homotopy, albeit its beauty and usefulness in the physical world. Here are some of the reasons:

  • In using homotopy, sometimes it takes decades for a lot of brains to figure out which homotopy groups to use. But in data analysis, we want to grasp the topology simply from data.
  • Homotopy deals with continuous mappings, but data are discrete. Simplicial homology captures it more easily.
  • In a physical system, we deal with usually one type of homotopy groups. But in data, we often deal with various topologies which we are not aware of in advance. Betti numbers can describe the topology easily by looking at data.
  • Of course, homotopy is difficult to compute numerically.

Afra Zomorodian argued the use of homology over homotopy in his book as well. [Zomorodian 2009]

Continue reading “Topology in Physics and Computing”

Homology and Betti Numbers

We have been talking about the elements of topological data analysis. In my previous post, I introduced simplicial complexes, concerning the ways to connect points together. In topology, it is the shape and geometry, not distances, which matter ( although while constructing the distance does play a role).

With the simplicial complexes, we can go ahead to describe its topology. We will use the techniques in algebraic topology without going into too much details. The techniques involves homology, but a full explanation of it requires the concepts of normal subgroup, kernel, image, quotient group in group theory. I will not talk about them, although I admit that there is no easy ways to talk about computational topology without touching them. I highly recommend the readers can refer to Zomorodian’s textbook for more details. [Zomorodian 2009]

I will continue with the Python class


that I wrote in the previous blog post. Suppose we have an k-simplex, then the n-th face is any combinations with n+1 vertices. A simplicial complex is such that a face contained in a face is also a face of the complex. In this, we can define the boundary operator by

\partial_k \sigma = \sum_i (-1)^i [v_0 v_1 \ldots \hat{v}_i \ldots v_k],

where \hat{v}_i indicates the i-th vertex be removed. This operator gives all the boundary faces of a face \sigma. The faces being operated are k-faces, and this operator will be mapped to a (k-1)-faces. Then the boundary operator can be seen as a (n_k \times n_{k-1})-matrix, where n_k is the number of k-faces. This can be easily calculated with the following method:

class SimplicialComplex:
  def boundary_operator(self, i):
    source_simplices = self.n_faces(i)
    target_simplices = self.n_faces(i-1)

    if len(target_simplices)==0:
      S = dok_matrix((1, len(source_simplices)), dtype=np.float32)
      S[0, 0:len(source_simplices)] = 1
      source_simplices_dict = {}
      for j in range(len(source_simplices)):
        source_simplices_dict[source_simplices[j]] = j
      target_simplices_dict = {}
      for i in range(len(target_simplices)):
        target_simplices_dict[target_simplices[i]] = i

      S = dok_matrix((len(target_simplices), len(source_simplices)), dtype=np.float32)
      for source_simplex in source_simplices:
        for a in range(len(source_simplex)):
          target_simplex = source_simplex[:a]+source_simplex[(a+1):]
          i = target_simplices_dict[target_simplex]
          j = source_simplices_dict[source_simplex]
          S[i, j] = -1 if a % 2==1 else 1 # S[i, j] = (-1)**a
   return S

With the boundary operator, we can calculate the Betti numbers that characterize uniquely the topology of the shapes. Actually it involves the concept of homology groups that we are going to omit. To calculate the k-th Betti numbers, we calculate:

\beta_k = \text{rank} (\text{ker} \partial_k) - \text{rank} (\text{Im} \partial_{k+1}).

By rank-nullity theorem, [Jackson]

\text{rank} (\text{ker} \partial_k) +\text{rank} (\text{Im} \partial_k) = \text{dim} (\partial_k)

the Betti number is then

\beta_k = \text{dim} (\partial_k) - \text{rank}(\text{Im} \partial_k)) - \text{rank} (\text{Im} \partial_{k+1})

where the rank of the image of an operator can be easily computed using the rank method available in numpy. Then the method of calculating the Betti number is

class SimplicialComplex:
  def betti_number(self, i):
    boundop_i = self.boundary_operator(i)
    boundop_ip1 = self.boundary_operator(i+1)

    if i==0:
      boundop_i_rank = 0
        boundop_i_rank = np.linalg.matrix_rank(boundop_i.toarray())
      except np.linalg.LinAlgError:
        boundop_i_rank = boundop_i.shape[1]
      boundop_ip1_rank = np.linalg.matrix_rank(boundop_ip1.toarray())
    except np.linalg.LinAlgError:
      boundop_ip1_rank = boundop_ip1.shape[1]

    return ((boundop_i.shape[1]-boundop_i_rank)-boundop_ip1_rank)

If we draw a simplicial complex on a 2-dimensional plane, we almost have \beta_0, \beta_1 and \beta_2. $\beta_0$ indicates the number of components, \beta_1 the number of bases for a tunnel, and \beta_2 the number of voids.

Let’s have some examples. Suppose we have a triangle, not filled.

e1 = [(0, 1), (1, 2), (2, 0)]
sc1 = SimplicialComplex(e1)

Then the Betti numbers are:

In [5]: sc1.betti_number(0)
Out[5]: 1

In [6]: sc1.betti_number(1)
Out[6]: 1

In [7]: sc1.betti_number(2)
Out[7]: 0

Let’s try another example with multiple components.

e2 = [(1,2), (2,3), (3,1), (4,5,6), (6,7), (7,4)]
sc2 = SimplicialComplex(e2)

We can graphically represent it using networkx:

import networkx as nx
import matplotlib.pyplot as plt
n2 = nx.Graph()
Simplicial Complex of e2
Simplicial Complex of e2

And its Betti numbers are as follow:

In [13]: sc2.betti_number(0)
Out[13]: 2

In [14]: sc2.betti_number(1)
Out[14]: 2

In [15]: sc2.betti_number(2)
Out[15]: 0

A better illustration is the Wolfram Demonstration, titled “Simplicial Homology of the Alpha Complex”.

On top of the techniques in this current post, we can describe the homology of discrete points using persistent homology, which I will describe in my future posts. I will probably spend a post on homotopy in comparison to other types of quantitative problems.

Continue reading “Homology and Betti Numbers”

Talking Not So Deep About Deep Learning


On October 14, 2015, I attended the regular meeting of the DCNLP meetup group, a group on natural language processing (NLP) in Washington, DC area. The talk was titled “Deep Learning for Question Answering“, spoken by Mr. Mohit Iyyer, a Ph.D. student in Department of Computer Science, University of Maryland (my alma mater!). He is a very good speaker.

I have no experience on deep learning at all although I did write a blog post remotely related. I even didn’t start training my first neural network until the next day after the talk. However, Mr. Iyyer explained what recurrent neural network (RNN), recursive neural network, and deep averaging network (DAN) are. This helped me a lot in order to understanding more about the principles of the famous word2vec model (which is something I am going to write about soon!). You can refer to his slides for more details. There are really a lot of talents in College Park, like another expert, Joe Yue Hei Ng, who is exploiting deep learning a lot as well.

The applications are awesome: with external knowledge to factual question answering, reasoning-based question answering, and visual question answering, with increasing order of challenging levels.

Mr. Iyyer and the participants discussed a lot about different packages. Mr. Iyyer uses Theano, a Python package for deep learning, which is good for model building and other analytical work. Some prefer Caffe. Some people, who are Java developers, also use deeplearning4j.

Stetsons Famous Bar & Grill (photo from Yelp)

This meetup was a sacred one too, because it is the last time it was held in Stetsons Famous Bar & Grill at U Street, which is going to permanently close on Halloween this year. The group is eagerly looking for a new venue for the upcoming meetup. This meeting was a crowded one. I sincerely thank the organizers, Charlie Greenbacker and Liz Merkhofer, for hosting all these meetings, and Chris Phipps (a linguist from IBM Watson) for recording.


Continue reading “Talking Not So Deep About Deep Learning”

Create a free website or blog at

Up ↑