Simple Literary Analytics on Presidential Candidates in the First 2016 Presidential Debate

The first presidential debate of 2016 was held on September 26, 2016, at Hofstra University in New York. An interesting analysis is the literacy level demonstrated by the two candidates, measured with the Flesch readability ease and the Flesch-Kincaid grade level, both of which were demonstrated in my previous blog entry and in my Github: stephenhky/PyReadability.

First, we need the transcript of the debate, which can be found in an article in the New York Times. Copy and paste the text into a file called first_debate_transcript.txt. Then we want to extract the speech of each person. To do this, store the following Python code in a script file:

# Trump and Clinton 1st debate on Sept 26, 2016

from nltk import word_tokenize
from collections import defaultdict
import re

# adapted from a StackOverflow answer
def untokenize(words):
    """
    Untokenizing a text undoes the tokenizing operation, restoring
    punctuation and spaces to the places that people expect them to be.
    Ideally, `untokenize(tokenize(text))` should be identical to `text`,
    except for line breaks.
    """
    text = ' '.join(words)
    step1 = text.replace("`` ", '"').replace(" ''", '"').replace('. . .',  '...')
    step2 = step1.replace(" ( ", " (").replace(" ) ", ") ")
    step3 = re.sub(r' ([.,:;?!%]+)([ \'"`])', r"\1\2", step2)
    step4 = re.sub(r' ([.,:;?!%]+)$', r"\1", step3)
    step5 = step4.replace(" '", "'").replace(" n't", "n't").replace(
         "can not", "cannot")
    step6 = step5.replace(" ` ", " '")
    return step6.strip()

ignored_phrases = ['(APPLAUSE)', '(CROSSTALK)']
persons = ['TRUMP', 'CLINTON', 'HOLT']

# read the transcript as UTF-8 text and drop empty lines
with open('first_debate_transcript.txt', 'r', encoding='utf-8') as fin:
    lines = [s.strip() for s in fin if len(s.strip()) > 0]

speeches = defaultdict(lambda : '')
person = None

for line in lines:
    tokens = word_tokenize(line)
    ignore_colon = False
    added_tokens = []
    for token in tokens:
        if token in ignored_phrases:
            continue
        elif token in persons:
            # a speaker label starts a new turn; skip the colon that follows it
            person = token
            ignore_colon = True
        elif token == ':' and ignore_colon:
            ignore_colon = False
        else:
            added_tokens += [token]
    # append the rest of the line to the current speaker's speech
    speeches[person] += ' ' + untokenize(added_tokens)

# write each person's speech to its own file
for person in persons:
    with open('speeches_'+person+'.txt', 'w', encoding='utf-8') as fout:
        fout.write(speeches[person])
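For readers who do not have NLTK installed, the same segmentation idea can be sketched with regular expressions only. This is a minimal sketch, not the script above: it assumes every turn starts with a speaker label followed by a colon at the beginning of a line, and the function name split_by_speaker is made up for illustration.

```python
import re
from collections import defaultdict

# Assumption: a turn begins with "NAME:" at the start of a line.
SPEAKER = re.compile(r'^(TRUMP|CLINTON|HOLT):\s*')
# Stage directions to drop, as in the ignored_phrases list.
IGNORED = re.compile(r'\((?:APPLAUSE|CROSSTALK)\)')

def split_by_speaker(transcript):
    """Return a dict mapping each speaker to the text of all their turns."""
    speeches = defaultdict(list)
    current = None
    for raw in transcript.splitlines():
        line = IGNORED.sub('', raw).strip()
        if not line:
            continue
        match = SPEAKER.match(line)
        if match:
            # a new speaker label: switch speaker and drop the label itself
            current = match.group(1)
            line = line[match.end():]
        if current is not None and line:
            speeches[current].append(line)
    return {name: ' '.join(parts) for name, parts in speeches.items()}

sample = (
    "HOLT: Good evening.\n"
    "(APPLAUSE)\n"
    "CLINTON: Thank you, Lester.\n"
    "It is great to be here.\n"
    "TRUMP: Thank you. (CROSSTALK)\n"
)
result = split_by_speaker(sample)
print(result['CLINTON'])
```

Lines without a label are attached to whoever spoke last, which mirrors how the variable person is carried across lines in the script above.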

The untokenize function is adapted from code posted on StackOverflow. The script segments the transcript into the individual speeches of Lester Holt (the host of the debate), Donald Trump (GOP presidential candidate), and Hillary Clinton (Democratic presidential candidate), and writes them to separate files. Then, on the UNIX or Linux command line, run the readability script from PyReadability on each person's speech file; for example, for Holt's speech,

python speeches_HOLT.txt --utf8

Note the --utf8 flag: the speech files are encoded in UTF-8. For Lester Holt, we have

Word count = 1935
Sentence count = 157
Syllable count = 2732
Flesch readability ease = 74.8797052289
Flesch-Kincaid grade level = 5.87694629602

For Donald Trump,

Word count = 8184
Sentence count = 693
Syllable count = 10665
Flesch readability ease = 84.6016324536
Flesch-Kincaid grade level = 4.3929136992

And for Hillary Clinton,

Word count = 6179
Sentence count = 389
Syllable count = 8395
Flesch readability ease = 75.771973015
Flesch-Kincaid grade level = 6.63676650035
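The two scores can be recomputed from the three counts alone. The sketch below uses the standard Flesch and Flesch-Kincaid formulas; I am assuming these are the ones the readability script applies, which is consistent with the numbers printed above. The function names are my own.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Standard Flesch reading ease: higher means easier to read."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words, sentences, syllables):
    """Standard Flesch-Kincaid grade level: roughly a U.S. school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# (word count, sentence count, syllable count) as reported above
counts = {
    'HOLT': (1935, 157, 2732),
    'TRUMP': (8184, 693, 10665),
    'CLINTON': (6179, 389, 8395),
}

for name, (w, s, y) in counts.items():
    print(name, flesch_reading_ease(w, s, y), flesch_kincaid_grade(w, s, y))
```

Both formulas depend only on the average sentence length and the average number of syllables per word, so longer sentences and longer words push the ease down and the grade up.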

Apparently, compared to Donald Trump, Hillary Clinton speaks at a higher grade level (about 6.6 versus 4.4), but her speech is correspondingly less easy to understand (readability ease of about 75.8 versus 84.6).

Recall from my previous entry that, for Shakespeare’s Macbeth, the Flesch readability ease is 112.278048591 and the Flesch-Kincaid grade level is 0.657934056288; for the King James Version of the Bible (KJV), they are 79.6417489428 and 9.0085275366 respectively.

This is just a simple piece of text analytics; the content of the speeches is not analyzed here. Augustine of Hippo wrote about rhetoric and eloquence in Book IV of On Christian Teaching (Latin: De doctrina christiana):

“… wisdom without eloquence is of little value to the society… eloquence without wisdom is… a great nuisance, and never beneficial.” — Augustine of Hippo, Book IV of On Christian Teaching




Linking Fundamental Physics to Deep Learning

Ever since Mehta and Schwab laid out the mathematical relationship between restricted Boltzmann machines (RBM) and deep learning (see my previous entry), scientists have been discussing why deep learning works so well. Recently, Henry Lin and Max Tegmark put a preprint on arXiv (arXiv:1608.08225), arguing that deep learning works because it captures a few essential physical laws and properties. Tegmark is a cosmologist.

Physical laws are simple in the sense that a few properties, such as locality, symmetry, and hierarchy, lead to large-scale, universal, and often complex phenomena. A lot of machine learning algorithms, including deep learning algorithms, have deep relations with formalisms outlined in statistical mechanics.

A lot of machine learning algorithms are basically probability theory. Lin and Tegmark outlined a few types of algorithms that seek various types of probabilities, and related these probabilities to Hamiltonians in many-body systems.
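To make the link between probabilities and Hamiltonians concrete, here is a minimal sketch in the usual statistical-mechanics sense, where a Hamiltonian is the negative logarithm of a probability and probabilities are recovered as Boltzmann weights. This is my reading of the relation, not code from the preprint, and the function names are my own.

```python
import math

def hamiltonian_from_probability(p):
    """H = -ln p: a likely state has a low 'energy'."""
    return -math.log(p)

def boltzmann_probabilities(hamiltonians):
    """p_i = exp(-H_i) / Z, i.e. a softmax over the negated Hamiltonians."""
    lowest = min(hamiltonians)  # shift for numerical stability; cancels in the ratio
    weights = [math.exp(-(h - lowest)) for h in hamiltonians]
    partition = sum(weights)    # the partition function Z (up to the shift)
    return [w / partition for w in weights]

probs = [0.5, 0.3, 0.2]
energies = [hamiltonian_from_probability(p) for p in probs]
recovered = boltzmann_probabilities(energies)
print(energies)
print(recovered)
```

The round trip shows that nothing is lost by working with Hamiltonians instead of probabilities; the softmax layer of a classifier plays the role of the normalization by Z.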

They argued why neural networks can approximate functions (polynomials) so well, giving a simple neural network that performs multiplication. With the central limit theorem or Jaynes’ arguments (see my previous entry), they said, many Hamiltonians can be approximated by low-order polynomials, which need only a small number of multiplications. This is like the many-body systems that can be approximated by a fourth-order Landau-Ginzburg-Wilson (LGW) functional.
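The multiplication network can be sketched in a few lines. The idea is that four hidden neurons with a smooth nonlinearity, combined with fixed weights, approximate the product of two small inputs. The sketch below is an illustration under my own assumptions, not the authors' code: the activation is softplus (whose second derivative at zero is 1/4), and the inputs are scaled down by a factor I call scale so that the approximation is accurate.

```python
import math

def softplus(x):
    """Smooth activation ln(1 + e^x); its second derivative at 0 is 1/4."""
    return math.log1p(math.exp(x))

def multiply_gate(x, y, scale=0.01):
    """Approximate x * y with four softplus neurons.

    The even part of the activation is quadratic near zero, so
    s(u+v) + s(-u-v) - s(u-v) - s(-u+v) is about 4 * s''(0) * u * v.
    """
    u, v = scale * x, scale * y
    hidden = softplus(u + v) + softplus(-u - v) - softplus(u - v) - softplus(-u + v)
    second_derivative_at_zero = 0.25
    # undo the input scaling and the 4 * s''(0) prefactor
    return hidden / (4.0 * second_derivative_at_zero * scale ** 2)

print(multiply_gate(3.0, 4.0))
print(multiply_gate(-2.0, 5.0))
```

The relative error shrinks as the scale is made smaller, because the neglected terms are of higher order in the scaled inputs.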

Properties such as locality reduce the number of parameters needed, because they restrict interactions to close neighbors. Symmetry reduces the number further, and lowers the computational complexity as well. Symmetry and second-order phase transitions make the scaling hypothesis possible, leading to the use of tools such as the renormalization group (RG). As many people have argued, deep learning resembles RG because it filters out unnecessary information and maps out the crucial features. Lin and Tegmark use the classification of cats vs. dogs as an example, likening it to retrieving the temperature of a many-body system using an RG procedure. They gave a counter-example to Mehta and Schwab’s paper, showing that the probabilities cannot be preserved by the RG procedure; the counter-example is sound, but preserving probabilities is not the point of the RG procedure anyway.

They also discussed the no-flattening theorems for neural networks.


rJava: Running Java from R, and Building R Packages Wrapping a .jar

R is a good tool for exploratory analysis, but we sometimes want to invoke stable Java tools from it. This is what the R package rJava is for. To install it, simply enter on the R Console:

install.packages('rJava')
And to load it, enter:

library(rJava)
As a simple demonstration, we find the length of a string. To start the JVM, enter:

.jinit()
Then we create an instance of a Java string, and find its length as follows:

s <- .jnew('java/lang/String', 'Hello World!')
.jcall(s, 'I', 'length')

The first line, with the function .jnew, creates a Java string instance. It is safest to give the full package path of the class. The second line, with the function .jcall, calls the method length() of String. The second parameter, ‘I’, indicates that it returns an integer. The type has to follow the JNI notation for native types. If it is a two-dimensional integer array, type ‘[[I’. If it is not a primitive type, as with String, use its full package path in JNI form, e.g. ‘Ljava/lang/String;’.
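The JNI notation is mechanical enough to write down as a small lookup. The following Python helper is only an illustration of the notation (it is not part of rJava, and the name jni_descriptor is made up): primitive types map to single letters, each array dimension adds a leading ‘[’, and classes are written as ‘L’, the slash-separated class path, and ‘;’.

```python
# JNI letters for the primitive types and void
PRIMITIVES = {
    'boolean': 'Z', 'byte': 'B', 'char': 'C', 'short': 'S',
    'int': 'I', 'long': 'J', 'float': 'F', 'double': 'D', 'void': 'V',
}

def jni_descriptor(java_type):
    """Translate a Java type name such as 'int[][]' into JNI notation."""
    java_type = java_type.strip()
    depth = 0
    # each trailing [] is one array dimension
    while java_type.endswith('[]'):
        java_type = java_type[:-2].strip()
        depth += 1
    base = PRIMITIVES.get(java_type)
    if base is None:
        # not a primitive: use the full class path with slashes
        base = 'L' + java_type.replace('.', '/') + ';'
    return '[' * depth + base

for name in ['int', 'void', 'int[][]', 'java.lang.String', 'java.lang.String[]']:
    print(name, '->', jni_descriptor(name))
```

The strings it produces for int, void, and java.lang.String are the same ones passed as the return type to .jcall in this entry.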

Example: Peter Norvig’s Spell Corrector Written in Scala

What should we do if we already have a .jar file that we want to wrap? I will start with a simple one. Two years ago, I implemented Peter Norvig’s spell corrector (see his article) in Scala (which is also a language for the Java Virtual Machine (JVM), see this entry), and posted it on my Github repository: stephenhky/SpellCorrector. You may check it out into Eclipse or IntelliJ IDEA, and build a .jar file. (Or you can download the .jar file here.) For the program to run, do not forget to download his corpus named big.txt. The project has a class called SpellCorrector, of which only the essential code is listed below:

package home.kwyho.spellcheck

import java.io.File
import scala.io.Source
import scala.collection.mutable.Map

class SpellCorrector {
  var wordCounts : Map[String, Int] = Map()
  val alphabets = ('a' to 'z').toSet

  // count the occurrences of every word in the training corpus
  def train(trainFile : File) = {
    val lines = Source.fromFile(trainFile).mkString
    val wordREPattern = "[A-Za-z]+"
    wordREPattern.r.findAllIn(lines).foreach( txtWord => {
      val word = txtWord.toLowerCase
      if (wordCounts.keySet contains(word)) {
        wordCounts(word) = wordCounts(word)+1
      } else {
        wordCounts += (word -> 1)
      }
    })
  }

  // other codes here ....

  // return the most frequent known word within the smallest edit distance
  def correct(wrongSpelling: String) : String = {
    val edit0words = Set(wrongSpelling) intersect wordCounts.keySet
    if (edit0words.size>0) return edit0words.maxBy( s => wordCounts(s))
    val edit1words = getEditOneSpellings(wrongSpelling)
    if (edit1words.size>0) return edit1words.maxBy( s => wordCounts(s))
    val edit2words = getEditTwoSpellings(wrongSpelling)
    edit2words.maxBy( s => wordCounts(s))
  }
}
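Since the edit-distance helpers are omitted from the listing, here is a compact Python sketch of the same cascade, following Peter Norvig's well-known approach: try the word itself, then known words one edit away, then two edits away, and pick the most frequent candidate. It is an illustration rather than a translation of the Scala class; in particular, returning the input unchanged when no candidate is found is my own choice.

```python
import re
from collections import Counter

ALPHABET = 'abcdefghijklmnopqrstuvwxyz'

def train(text):
    """Count lower-cased alphabetic words in the corpus text."""
    return Counter(w.lower() for w in re.findall(r'[A-Za-z]+', text))

def edits1(word):
    """All strings one deletion, transposition, replacement or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)

def known(words, counts):
    """Keep only the candidates that occur in the corpus."""
    return {w for w in words if w in counts}

def correct(word, counts):
    """Most frequent known word at edit distance 0, then 1, then 2."""
    candidates = known([word], counts)
    if not candidates:
        candidates = known(edits1(word), counts)
    if not candidates:
        candidates = known((e2 for e1 in edits1(word) for e2 in edits1(e1)), counts)
    if not candidates:
        return word  # assumption: give the input back when nothing matches
    return max(candidates, key=lambda w: counts[w])

counts = train("the spelling of the word spelling is correct the")
print(correct('speling', counts), correct('teh', counts), correct('korrect', counts))
```

With the real big.txt corpus the counts would come from train(open('big.txt').read()); the tiny corpus here only keeps the example self-contained.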

Put the .jar file and big.txt into the same folder. Then initialize the JVM, and add the .jar file to the classpath:


Create an instance of SpellCorrector, and train it on the corpus big.txt. Remember to give the whole package path of the class:

corrector <- .jnew('home/kwyho/spellcheck/SpellCorrector')
bigfile <- .jnew('java/io/File', 'big.txt')
.jcall(corrector, 'V', 'train', bigfile)

The first line creates a SpellCorrector instance, the second line creates a File instance for big.txt, and the third line calls the train() method. The JNI notation ‘V’ denotes ‘void.’ Entering ‘corrector’ will give a string indicating that it is a Java object:

[1] "Java-Object{home.kwyho.spellcheck.SpellCorrector@5812f9ee}"

Then we can do spell correction by defining the following function:

correct<-function(word) {
   javaStrtext <- .jnew('java/lang/String', word)
   .jcall(corrector, 'Ljava/lang/String;', 'correct', javaStrtext)
}

Then you can easily perform spell correction as follows:


Some people put .class file instead of .jar file. In that case, you need to put the compiled Java class into the working directory. You can refer to an entry in Darren Wilkinson’s research blog for more details.

Building an R Package

It is another matter to build an R package that wraps a .jar file. Hilary Parker’s entry and my previous entry give details about building an R package with roxygen2. There is also documentation written by Tobias Verbeke.

So to start building it, in RStudio, start a project by clicking on the button “Project: (None)” on the top right corner of RStudio, choose “New Directory,” and then “R Package.” Type in the name (“RSpellCorrection” here), and specify a directory. Then click “Create Project.” A new RStudio window will show up. From the menu bar, choose “Build” > “Configure Build Tools”. Then click on “Configure…” button. There is a dialog box coming out. Check everything, and click “OK”.


The instructions above are rather detailed, but from now on I will skip the procedural details. Start a file named, say, onLoad.R under the subfolder R/, and put the following code there:

.onLoad <- function(libname, pkgname) {
  .jpackage(pkgname, lib.loc=libname)
}

This is a hook function that R will call when this package is being loaded. You must include it. Then in the file named DESCRIPTION, put in the relevant information:

Package: RSpellCorrection
Type: Package
Title: Spell Correction, Scala implementation run in R
Version: 0.1.0
Author: Kwan-Yuet Ho, Ph.D.
Maintainer: Kwan-Yuet Ho, Ph.D. <>
Description: Implementation of Peter Norvig's spell corrector in Scala, wrapped in R
License: N/A
LazyData: TRUE
RoxygenNote: 5.0.1
Depends: R(>= 2.7.0), rJava (>= 0.5-0)

Note the last line (“Depends…”), which you have to include because R will parse this line, and load rJava automatically. Remember that there is a space between “>=” and the version number. Do not call the library function in your package code.

Next, create a subfolder inst/java, and put the .jar file there.

Then start a file, called correct.R under subfolder R/, and write a function:

#' Retrieve a Java instance of SpellCorrector.
#'
#' Retrieve a Java instance of SpellCorrector, with the training file
#' specified. Language model is trained before the instance is returned.
#' The spell corrector is adapted from Peter Norvig's demonstration.
#'
#' @param filepath Path of the corpus.
#' @return a Java instance of SpellCorrector
#' @export
getcorrector<-function(filepath='big.txt') {
    .jaddLibrary('spellchecker', 'inst/java/spellcorrector.jar')
    corrector<- .jnew('home/kwyho/spellcheck/SpellCorrector')
    bigfile<- .jnew('java/io/File', filepath)
    .jcall(corrector, 'V', 'train', bigfile)
    return(corrector)
}

This returns a Java instance of SpellCorrector, as in the previous section. The large block of comments above the function is for producing the manual using roxygen2. The tag “@export” is important: it tells roxygen2 to make this function visible to users.

Then add another function:

#' Correct spelling.
#'
#' Given an instance of SpellCorrector, return the most probable
#' corrected spelling of the given word.
#'
#' @param word A token.
#' @param corrector A Java instance of SpellCorrector, given by \code{getcorrector}.
#' @return Corrected spelling
#' @export
correct<-function(word, corrector) {
    javaStrtext <- .jnew('java/lang/String', word)
    .jcall(corrector, 'Ljava/lang/String;', 'correct', javaStrtext)
}

Then click “Build & Reload” button on the “Build” Tab:


Then the package will be built, and reloaded. The manual documents (*.Rd) will be produced as well. You can then play with the spell corrector again like this:


Assuming you put this into a Github repository as I did (link here), you can install the new R package like this:

library(devtools)
install_github('stephenhky/RSpellCorrection')
Then the R package will be downloaded and installed for use. Alternatively, if you wish to install it from your local directory, just enter:

install.packages('<path-to>/RSpellCorrection', repos = NULL, type = 'source')

A complete version of this R package can be found in my Github repository: stephenhky/RSpellCorrection. You may want to add a README.md to the repository, for which you need to know the Markdown language; Lei Feng’s blog entry is a good reference.


Developing R Packages

For work, I developed two R packages to host the functions that I use a lot. They brought me a lot of convenience; for example, I no longer have to start my data analysis in a particular folder and switch later on.

To do that, you need to use RStudio. Then you have to install the devtools package by calling in the R console:

install.packages('devtools')
and load it by simply calling:

library(devtools)
And then you have to install the roxygen2 package by calling:

install.packages('roxygen2')
There are a lot of good tutorials about writing an R package. I especially like this Youtube video clip about building an R package with RStudio and roxygen2:

And Hilary Parker’s blog entry is useful as well.

On the other hand, if you are publishing your R package to your Github repository, it would be nice to include a README file introducing your work. You need to know the Markdown language to write the file, named README.md, and put it in the root folder of your repository. My friend, Qianli Deng, showed me Lei Feng’s blog entry, which I found extremely useful. Markdown is remarkably simpler than LaTeX.


SOCcer: Computerized Coding In Epidemiology

There are many tasks that involve coding: for example, putting kids into groups according to their age, labeling webpages by their type, or sorting students at Hogwarts into four houses. Researchers and lawyers likewise need to code people into occupations according to the information they filled in. Melissa Friesen, an investigator in the Division of Cancer Epidemiology and Genetics (DCEG), National Cancer Institute (NCI), National Institutes of Health (NIH), saw the need for large-scale coding, as many researchers are dealing with big data in epidemiology. She led a research project, in collaboration with the Office of Intramural Research (OIR), Center for Information Technology (CIT), NIH, to develop an artificial intelligence system to cope with the problem. This led to a publicly available tool called SOCcer, an acronym for “Standardized Occupation Coding for Computer-assisted Epidemiological Research.” (URL:

The system was initially developed in an attempt to find the correlation between the onset of cancers and other diseases and the occupation. “The application is not intended to replace expert coders, but rather to prioritize which job descriptions would benefit most from expert review,” said Friesen in an interview. She mainly works with Daniel Russ in CIT.

SOCcer takes a job title, industry codes (in terms of SIC, Standard Industrial Classification), and job duties, and gives an occupational code in SOC 2010 (Standard Occupational Classification), the system used by U.S. federal government agencies. The data involve short text, often messy, and there are 840 codes in the SOC 2010 system, so conventional natural language processing (NLP) methods may not apply. Friesen, Russ, and Kwan-Yuet (Stephen) Ho (also in OIR, CIT; a CSRA staff member) use fuzzy logic and maximum entropy (maxent) methods, with some feature engineering, to build various classifiers. These classifiers are aggregated together, as in stacked generalization (see my previous entry), using logistic regression, to give a final score.
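The aggregation step can be pictured with a toy example. The sketch below only illustrates stacking with a logistic function; it is not SOCcer's code, and the weights, bias, base scores, and candidate labels are all made up for the illustration (in practice the weights would be fitted by logistic regression on held-out data).

```python
import math

def stacked_score(base_scores, weights, bias):
    """Combine the base classifiers' scores with a logistic function."""
    z = bias + sum(w * s for w, s in zip(weights, base_scores))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights for three base classifiers
# (say, one each for job title, industry, and job duties).
weights = [1.5, 2.0, 0.5]
bias = -2.0

# Hypothetical base scores for two candidate occupation codes.
candidates = {
    'code-A': [0.9, 0.8, 0.7],
    'code-B': [0.2, 0.1, 0.4],
}

# Rank the candidate codes by their final stacked score, best first.
ranked = sorted(
    ((code, stacked_score(scores, weights, bias)) for code, scores in candidates.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for code, score in ranked:
    print(code, round(score, 3))
```

A ranking like this is what lets low-scoring job descriptions be singled out for expert review.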

SOCcer has a companion software tool, called SOCAssign, for expert coders to prioritize the codings. SOCAssign won the DCEG Informatics Tool Challenge in 2015, and SOCcer itself won in 2016. The SOCcer team also received the Scientific Award of Merit from CIT/OCIO in 2016 (see this). Their work was published in Occup. Environ. Med.

