Simple Literary Analytics on Presidential Candidates in the First 2016 Presidential Debate

The first presidential debate of 2016 was held on September 26, 2016, at Hofstra University in New York. An interesting analysis is the literary level demonstrated by the two candidates, measured with the Flesch readability ease and the Flesch-Kincaid grade level, as demonstrated in my previous blog entry and on my GitHub: stephenhky/PyReadability.

First, we need the transcript of the debate, which can be found in an article in the New York Times. Copy and paste the text into a file called first_debate_transcript.txt. Then we want to extract the speech of each person. To do this, store the following Python code in first_debate_segment.py.

# Trump and Clinton 1st debate on Sept 26, 2016

from nltk import word_tokenize
from collections import defaultdict
import re

# adapted from http://stackoverflow.com/questions/21948019/python-untokenize-a-sentence
def untokenize(words):
    """
    Untokenizing a text undoes the tokenizing operation, restoring
    punctuation and spaces to the places that people expect them to be.
    Ideally, `untokenize(tokenize(text))` should be identical to `text`,
    except for line breaks.
    """
    text = ' '.join(words)
    step1 = text.replace("`` ", '"').replace(" ''", '"').replace('. . .',  '...')
    step2 = step1.replace(" ( ", " (").replace(" ) ", ") ")
    step3 = re.sub(r' ([.,:;?!%]+)([ \'"`])', r"\1\2", step2)
    step4 = re.sub(r' ([.,:;?!%]+)$', r"\1", step3)
    step5 = step4.replace(" '", "'").replace(" n't", "n't").replace(
         "can not", "cannot")
    step6 = step5.replace(" ` ", " '")
    return step6.strip()

ignored_phrases = ['(APPLAUSE)', '(CROSSTALK)']
persons = ['TRUMP', 'CLINTON', 'HOLT']
fin = open('first_debate_transcript.txt', 'rb')
lines = fin.readlines()
fin.close()

lines = filter(lambda s: len(s)>0, map(lambda s: s.strip(), lines))
speeches = defaultdict(lambda : '')
person = None

for line in lines:
    tokens = word_tokenize(line.strip())
    ignore_colon = False
    added_tokens = []
    for token in tokens:
        if token in ignored_phrases:
            pass
        elif token in persons:
            person = token
            ignore_colon = True
        elif token == ':':
            ignore_colon = False
        else:
            added_tokens += [token]
    # after processing all tokens in the line, append them to the current speaker's speech
    speeches[person] += ' ' + untokenize(added_tokens)

for person in persons:
    fout = open('speeches_'+person+'.txt', 'wb')
    fout.write(speeches[person])
    fout.close()
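To see what the untokenize helper does, here is a quick illustration (run after the definitions above; the expected output follows from the replacement rules in the function):

print(untokenize(['Do', 'not', 'worry', ',', 'be', 'happy', '!']))
# expected output: Do not worry, be happy!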

The untokenize function is adapted from a snippet on StackOverflow. The script segments the transcript into the individual speeches of Lester Holt (the host of the debate), Donald Trump (GOP presidential candidate), and Hillary Clinton (Democratic presidential candidate), stored in separate files. Then, on a UNIX or Linux command line, run score_readability.py on each person’s speech; for example, for Holt’s speech,

python score_readability.py speeches_HOLT.txt --utf8

Beware that it is encoded in UTF-8. For Lester Holt, we have

Word count = 1935
Sentence count = 157
Syllable count = 2732
Flesch readability ease = 74.8797052289
Flesch-Kincaid grade level = 5.87694629602

For Donald Trump,

Word count = 8184
Sentence count = 693
Syllable count = 10665
Flesch readability ease = 84.6016324536
Flesch-Kincaid grade level = 4.3929136992

And for Hillary Clinton,

Word count = 6179
Sentence count = 389
Syllable count = 8395
Flesch readability ease = 75.771973015
Flesch-Kincaid grade level = 6.63676650035
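As a quick sanity check, plugging Clinton’s counts into the Flesch-Kincaid grade-level formula from my previous entry reproduces the reported score:

0.39 \times \frac{6179}{389} + 11.8 \times \frac{8395}{6179} - 15.59 \approx 6.19 + 16.03 - 15.59 \approx 6.64.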

Apparently, compared to Donald Trump, Hillary Clinton speaks at a higher literary level, which also means her speech is less easy to understand.

Recalling from my previous entry, for Shakespeare’s Macbeth, the Flesch readability ease is 112.278048591 and the Flesch-Kincaid grade level 0.657934056288; for the King James Version Bible (KJV), they are 79.6417489428 and 9.0085275366 respectively.

This is just simple text analytics; the content itself is not analyzed here. Augustine of Hippo wrote in Book IV of On Christian Teaching (Latin: De doctrina christiana) about rhetoric and eloquence:

“… wisdom without eloquence is of little value to the society… eloquence without wisdom is… a great nuisance, and never beneficial.” — Augustine of Hippo, Book IV of On Christian Teaching

[Video: highlights from the first presidential debate]


rJava: Running Java from R, and Building R Packages Wrapping a .jar

R is a good tool for performing exploratory analysis, but we sometimes want to invoke stable Java tools from it. That is what the R package rJava is for. To install it, simply enter on the R console:

install.packages('rJava')

And to load it, enter:

library(rJava)

As a simple demonstration, let’s find the length of a string. To start the JVM, enter:

.jinit('.')

Then we create an instance of a Java string, and find its length as follows:

s <- .jnew('java/lang/String', 'Hello World!')
.jcall(s, 'I', 'length')

The first line, with the function .jnew, creates a Java String instance; it is safest to give the full package path of the class. The second line, with the function .jcall, calls its length() method. The second parameter, ‘I’, indicates that the method returns an integer. The return type has to follow the JNI notation for native types: for example, ‘I’ for int, ‘D’ for double, ‘Z’ for boolean, and ‘[[I’ for a two-dimensional integer array. If the return type is not a native type but a class such as String, use its full package path in JNI form, e.g., ‘Ljava/lang/String;’.

Example: Peter Norvig’s Spell Corrector Written in Scala

What should we do if we already have a .jar file we want to wrap? I would start with a simple one. Two years ago, I implemented Peter Norvig’s spell corrector (see his article) in Scala (which is a language for the Java Virtual Machine (JVM) as well; see this entry), and posted it on my GitHub repository: stephenhky/SpellCorrector. You may check it out into Eclipse or IntelliJ IDEA and build a .jar file. (Or you can download the .jar file here.) For the program to run, do not forget to download his corpus named big.txt. The project has a class called SpellCorrector; only the necessary code is listed below:

package home.kwyho.spellcheck

/*
 Reference: http://norvig.com/spell-correct.html
 */

import java.io.File
import scala.io.Source
import scala.collection.mutable.Map

class SpellCorrector {
 var wordCounts : Map[String, Int] = Map()
 val alphabets = ('a' to 'z').toSet

 def train(trainFile : File) = {
    val lines = Source.fromFile(trainFile) mkString
    val wordREPattern = "[A-Za-z]+"
    wordREPattern.r.findAllIn(lines).foreach( txtWord => {
       val word = txtWord.toLowerCase
       if (wordCounts.keySet contains(word)) {
          wordCounts(word) = wordCounts(word)+1
       } else {
          wordCounts += (word -> 1)
       }
    })
 }

// other codes here ....

 def correct(wrongSpelling: String) : String = {
    val edit0words = Set(wrongSpelling) intersect wordCounts.keySet
    if (edit0words.size>0) return edit0words.maxBy( s => wordCounts(s))
    val edit1words = getEditOneSpellings(wrongSpelling)
    if (edit1words.size>0) return edit1words.maxBy( s => wordCounts(s))
    val edit2words = getEditTwoSpellings(wrongSpelling)
    edit2words.maxBy( s => wordCounts(s))
 }
}

Put the .jar file and big.txt into the same folder. Then initialize the JVM, and add the .jar file to the classpath:

.jinit('.')
.jaddClassPath('spellcorrector.jar')

Create an instance of SpellCorrector, and train it with the corpus big.txt. Remember to give the whole package path as the class:

corrector <- .jnew('home/kwyho/spellcheck/SpellCorrector')
bigfile <- .jnew('java/io/File', 'big.txt')
.jcall(corrector, 'V', 'train', bigfile)

The first line creates a SpellCorrector instance, the second line creates a File instance for big.txt, and the third line calls the train() method. The JNI notation ‘V’ denotes ‘void.’ Entering corrector will give a string indicating that it is a Java object:

[1] "Java-Object{home.kwyho.spellcheck.SpellCorrector@5812f9ee}"

Then we can do spell correction by designing the following function:

correct<-function(word) {
   javaStrtext <- .jnew('java/lang/String', word)
   .jcall(corrector, 'Ljava/lang/String;', 'correct', javaStrtext)
}

Then you can easily perform spell correction as follows:

[Screenshot: spell correction in the R console]

Some people distribute a .class file instead of a .jar file. In that case, you need to put the compiled Java class into the working directory. You can refer to an entry on Darren Wilkinson’s research blog for more details.

Building an R Package

It is another matter to build an R package that wraps a .jar file. Hilary Parker’s entry and my previous entry give details about building an R package with roxygen2. There is also documentation written by Tobias Verbeke.

So to start building it, in RStudio, start a project by clicking on the button “Project: (None)” in the top right corner of RStudio, choose “New Directory,” and then “R Package.” Type in the name (“RSpellCorrection” here), and specify a directory. Then click “Create Project.” A new RStudio window will show up. From the menu bar, choose “Build” > “Configure Build Tools”, then click on the “Configure…” button. A dialog box will come out; check everything, and click “OK”.

[Screenshot: RStudio build tools configuration dialog]

The instructions above are rather detailed; from now on, I will skip the procedural details. Start a file named, say, onLoad.R under the subfolder R/, and put the following code there:

.onLoad <- function(libname, pkgname) {
  .jpackage(pkgname, lib.loc=libname)
}

This is a hook function that R will call when this package is being loaded. You must include it. Then in the file named DESCRIPTION, put in the relevant information:

Package: RSpellCorrection
Type: Package
Title: Spell Correction, Scala implementation run in R
Version: 0.1.0
Author: Kwan-Yuet Ho, Ph.D.
Maintainer: Kwan-Yuet Ho, Ph.D. <stephenhky@yahoo.com.hk>
Description: Implementation of Peter Norvig's spell corrector in Scala, wrapped in R
License: N/A
LazyData: TRUE
RoxygenNote: 5.0.1
Depends: R(>= 2.7.0), rJava (>= 0.5-0)

Note the last line (“Depends…”), which you have to include because R parses this line and loads rJava automatically. Remember there is a space between “>=” and the version number. Do not use the library function in your code.

First, create a subfolder inst/java, and put the .jar file there.

Then start a file, called correct.R under subfolder R/, and write a function:

#' Retrieve a Java instance of SpellCorrector.
#'
#' Retrieve a Java instance of SpellCorrector, with the training file
#' specified. Language model is trained before the instance is returned.
#' The spell corrector is adapted from Peter Norvig's demonstration.
#'
#' @param filepath Path of the corpus.
#' @return a Java instance of SpellCorrector
#' @export
getcorrector<-function(filepath='big.txt') {
    .jaddLibrary('spellchecker', 'inst/java/spellcorrector.jar')
    .jaddClassPath('inst/java/spellcorrector.jar')
    corrector<- .jnew('home/kwyho/spellcheck/SpellCorrector')
    bigfile<- .jnew('java/io/File', filepath)
    .jcall(corrector, 'V', 'train', bigfile)
    return(corrector)
}

This returns a Java instance of SpellCorrector, as in the previous section. The large block of comments above the function is for producing the manual using roxygen2. The tag “@export” is important: it tells roxygen2 to make this function visible to users.

Then add another function:

#' Correct spelling.
#'
#' Given an instance of SpellCorrector, return the most probably
#' corrected spelling of the given word.
#'
#' @param word A token.
#' @param corrector A Java instance of SpellCorrector, given by \code{getcorrector}.
#' @return Corrected spelling
#' @export
correct<-function(word, corrector) {
    javaStrtext <- .jnew('java/lang/String', word)
    .jcall(corrector, 'Ljava/lang/String;', 'correct', javaStrtext)
}

Then click the “Build & Reload” button on the “Build” tab:

[Screenshot: the “Build & Reload” button on the Build tab in RStudio]

Then the package will be built and reloaded. The manual documents (*.Rd) will be produced as well. You can then play with the spell corrector again like this:

[Screenshot: using the RSpellCorrection package in the R console]

Assuming you put this into a GitHub repository as I did (link here), you can install the new R package like this:

library(devtools)
install_github('stephenhky/RSpellCorrection')

Then the R package will be downloaded and installed for use. Alternatively, if you wish to install from your local directory, just enter:

install.packages('<path-to>/RSpellCorrection', repos = NULL, type = 'source')

A complete version of this R package can be found in my GitHub repository: stephenhky/RSpellCorrection. You may want to add a README.md to the repository; to write it, you need to know the Markdown language, which you can learn from Lei Feng’s blog entry.


Developing R Packages

Because of work, I developed two R packages to host the functions that I use a lot. They have brought me a lot of convenience; for example, I no longer have to start my data analysis in a particular folder and switch folders later on.

To do that, you need RStudio. Then you have to install the devtools package by calling, in the R console:

install.packages('devtools')

and load it by simply calling:

library(devtools)

And then you have to install the roxygen2 package by calling:

install_github("klutometis/roxygen")
library(roxygen2)

There are a lot of good tutorials about writing an R package. I especially like this YouTube video clip about building an R package with RStudio and roxygen2:

And Hilary Parker’s blog entry is useful as well.

On the other hand, if you are publishing your R package on your GitHub repository, it would be nice to include a README file introducing your work. You need to know the Markdown language to write the file named README.md, and put it in the root folder of your repository. My friend Qianli Deng showed me Lei Feng’s blog entry, which I found extremely useful. Markdown is remarkably simpler than LaTeX.


Coffee Cooling: a Demo of Numerical ODEs in Python

A lot of Americans are addicted to coffee. I have one cup a day. I know someone who has a few cups every day. Weirdly enough, I know someone who needs coffee to go to bed.

Coffee is for sipping, but we do want to enjoy it as soon as possible at the right temperature. Whether we add sugar cubes at the beginning or at the end can make a lot of difference. Empirically, coffee cooling follows Newton’s law of cooling:

\frac{dT}{dt} = -k (T-T_r),

where T_r is the room temperature, and k is the cooling parameter. It is a simple ordinary differential equation (ODE) that has an analytical solution. However, let’s solve it numerically while we demonstrate the NumPy/SciPy capabilities. The function we are using is odeint. Let’s import the necessary packages:

import numpy as np
from scipy.integrate import odeint
from functools import partial

We will need the partial function from functools shortly. Let’s define the right-hand-side of the ODE above as the cooling function in terms of a lambda expression:

coolingfcn = lambda temp, t0, k, roomtemp: -k*(temp-roomtemp)

And we need a set of cooling parameters that makes sense from our daily experience:

cooling_parameters = {'init_temp': 94.0, # initial temperature of the coffee (degrees Celsius)
                      'cube_dectemp': 5.0, # temperature decrease due to the sugar cube (degrees Celsius)
                      'roomtemp': 25.0, # room temperature (degrees Celsius)
                      'k': 1.0/15.0 # cooling parameter (inverse minutes)
                     }

The fresh-brewed coffee is at 94 degrees Celsius, and the room temperature is 25 degrees Celsius. Whenever a whole cube is added to the coffee, its temperature drops by 5 degrees Celsius, which is not a good assumption when the coffee has already cooled down to room temperature. The cooling parameter is k=\frac{1}{15} \text{min}^{-1}, meaning that in 15 minutes the temperature difference from the environment decays by a factor of e^{-1}.
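For reference, Newton’s law of cooling has the closed-form solution

T(t) = T_r + \left( T(0) - T_r \right) e^{-kt},

which the numerical solution below should reproduce.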

From the SciPy documentation of odeint, the first three parameters are important:

scipy.integrate.odeint(func, y0, t, args=(), Dfun=None, col_deriv=0, full_output=0, ml=None, mu=None, rtol=None, atol=None, tcrit=None, h0=0.0, hmax=0.0, hmin=0.0, ixpr=0, mxstep=0, mxhnil=0, mxordn=12, mxords=5, printmessg=0)

The first parameter is the cooling function, the second is the initial value of the dependent variable (the temperature in this case), and the third is the ndarray of times at which the temperatures are to be calculated. We can easily write the functions that return the ndarray of temperatures for the cases of adding the sugar cube before and after the cooling process respectively:

# adding sugar cube before cooling
def cube_before_cooling(t, init_temp, cube_dectemp, roomtemp, k):
    y0 = init_temp - cube_dectemp
    temps = odeint(partial(coolingfcn, k=k, roomtemp=roomtemp), y0, t)
    return temps

# adding sugar cube after cooling
def cube_after_cooling(t, init_temp, cube_dectemp, roomtemp, k):
    temps = odeint(partial(coolingfcn, k=k, roomtemp=roomtemp), init_temp, t)
    temps[-1] -= cube_dectemp
    return temps
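As a quick check, we can compare the numerical integration against the closed-form solution above (a minimal sketch, assuming the coolingfcn and cooling_parameters defined earlier; no sugar cube is added here):

# compare odeint against the closed-form solution T(t) = T_r + (T0 - T_r)*exp(-k*t)
t_check = np.linspace(0, 20, 201)
k, T_r, T0 = cooling_parameters['k'], cooling_parameters['roomtemp'], cooling_parameters['init_temp']
numerical = odeint(partial(coolingfcn, k=k, roomtemp=T_r), T0, t_check)[:, 0]
analytical = T_r + (T0 - T_r) * np.exp(-k * t_check)
print(np.max(np.abs(numerical - analytical)))   # should be tiny, on the order of the solver tolerance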

We can plot the temperature changes on graphs using matplotlib.

import matplotlib.pyplot as plt

And we are interested in the first 20 minutes:

times = np.linspace(0, 20, 201)

For adding a sugar cube before cooling,

temps_beforecooling = cube_before_cooling(times, **cooling_parameters)
plt.plot(times, temps_beforecooling)
plt.xlabel('Time (min)')
plt.ylabel('Coffee Temperature (degrees Celsius)')

(If you are typing in an IPython shell, you need to call plt.show() to show the plot.) This gives the following plot:

[Plot: coffee temperature vs. time, sugar cube added before cooling]

And for adding the sugar cube after cooling,

temps_aftercooling = cube_after_cooling(times, **cooling_parameters)
plt.plot(times, temps_aftercooling)
plt.xlabel('Time (min)')
plt.ylabel('Coffee Temperature (degrees Celsius)')

This gives the following plot:

[Plot: coffee temperature vs. time, sugar cube added after cooling]

Obviously, adding the sugar cube after cooling gives a lower temperature.

Advanced Gimmicks

How about we add sugar continuously throughout the 20-minute time span?  We need to adjust our differential equation to be:

\frac{dT}{dt} = -k (T-T_r) - \frac{T_{\text{drop}}}{\delta t},

which can be implemented by the following code:

# adding sugar continuously
def continuous_sugar_cooling(t, init_temp, cube_dectemp, roomtemp, k):
    timespan = t[-1] - t[0]
    newcoolingfcn = lambda temp, t0: coolingfcn(temp, t0, k, roomtemp) - cube_dectemp / timespan
    temps = odeint(newcoolingfcn, init_temp, t)
    return temps

or we can divide the cube into portions added to the cup at a few specified times, which can be implemented by the following code that employs our previously defined functions:

# adding sugar at a specified time(s)
def sugar_specifiedtime_cooling(t, sugar_time_indices, init_temp, cube_dectemp, roomtemp, k):
    sorted_sugar_time_indices = np.sort(sugar_time_indices)
    num_portions = len(sorted_sugar_time_indices)

    temps = np.array([])
    temp = init_temp
    t_segments = np.split(t, sorted_sugar_time_indices)
    num_segments = len(t_segments)
    for i in range(num_segments):
        temp_segment = cube_after_cooling(t_segments[i],
                                          temp,
                                          float(cube_dectemp)/num_portions,
                                          roomtemp,
                                          k)
        temps = np.append(temps, temp_segment)
        temp = temp_segment[-1]
    temps[-1] += float(cube_dectemp)/num_portions   # undo the drop applied at the final time point, where no cube is added

    return temps

Let’s calculate all the temperatures:

temps_cont = continuous_sugar_cooling(times, **cooling_parameters)
temps_n = sugar_specifiedtime_cooling(times, [50, 100, 150], **cooling_parameters)

And plot all the graphs together:

plt.plot(times, temps_beforecooling, label='cube before cooling')
plt.plot(times, temps_aftercooling, label='cube after cooling')
plt.plot(times, temps_cont, label='continuous')
plt.plot(times, temps_n, label='3 times of dropping cube')
plt.xlabel('Time (min)')
plt.ylabel('Coffee Temperature (degrees Celsius)')
plt.legend()

[Plot: comparison of the four sugar-adding strategies]

A complete code demonstration can be found on my GitHub: stephenhky/CoffeeCooling.


textmineR: a New Text Mining Package for R

Previously, I wrote an entry on text mining in R and Python, and did a comparison. However, the text mining package employed for R was tm, which has some problems:

  1. The syntax is not natural for experienced R users.
  2. tm uses simple_triplet_matrix from the slam library for the document-term matrix (DTM) and the term co-occurrence matrix (TCM), which is not as widely used as dgCMatrix from the Matrix library.

Tommy Jones, a Ph.D. student at George Mason University and a data scientist at Impact Research, developed an alternative text mining package called textmineR. He presented it at a Stat Prog DC Meetup on April 27, 2016. It employs a more natural syntax, and dgCMatrix. All in all, it is a wrapper around a lot of existing R packages that facilitates the text mining process, such as creating DTM matrices with stopword removal or appropriate stemming/lemmatizing functions applied. Here is a sample code snippet to create a DTM with the example from the previous entry:

library(tm)
library(textmineR)

texts <- c('I love Python.',
           'R is good for analytics.',
           'Mathematics is fun.')

dtm<-CreateDtm(texts,
               doc_names = c(1:length(texts)),
               ngram_window = c(1, 1),
               stopword_vec = c(tm::stopwords('english'), tm::stopwords('SMART')),
               lower = TRUE,
               remove_punctuation = TRUE,
               remove_numbers = TRUE
               )

The DTM is a sparse matrix:

3 x 6 sparse Matrix of class "dgCMatrix"
  analytics fun mathematics good python love
1         .   .           .    .      1    1
2         1   .           .    1      .    .
3         .   1           1    .      .    .

On the other hand, it wraps text2vec, an R package that implements the word-embedding algorithm GloVe. And it wraps a number of topic modeling algorithms, such as latent Dirichlet allocation (LDA) and correlated topic models (CTM).

In addition, it contains a parallel computing loop function called TmParallelApply, analogous to the original R parallel loop function mclapply, but TmParallelApply works on Windows as well.

textmineR is an open-source project, with source code available on GitHub, which also contains his example code.


Word Embedding Algorithms

Embedding has been hot in recent years, partly due to the success of Word2Vec (see the demo in my previous entry), although the idea has been around in academia for more than a decade. The idea is to transform a vector of integers into continuous, or embedded, representations. Keras, a Python package that implements neural network models (including ANNs, RNNs, CNNs, etc.) by wrapping Theano or TensorFlow, implements it, as shown in the example below (which maps integer features in the range 0 to 199 into continuous vectors of 10 dimensions):

from keras.layers import Embedding
from keras.models import Sequential

# define and compile the embedding model
model = Sequential()
model.add(Embedding(200, 10, input_length=1))
model.compile('rmsprop', 'mse')  # optimizer: rmsprop; loss function: mean-squared error

We can then convert any feature from 0 to 199 into a vector of 10, as shown below:

import numpy as np

model.predict(np.array([10, 90, 151]))

It outputs:

array([[[ 0.02915354,  0.03084954, -0.04160764, -0.01752155, -0.00056815,
         -0.02512387, -0.02073313, -0.01154278, -0.00389587, -0.04596512]],

       [[ 0.02981793, -0.02618774,  0.04137352, -0.04249889,  0.00456919,
          0.04393572,  0.04139435,  0.04415271,  0.02636364, -0.04997493]],

       [[ 0.00947296, -0.01643104, -0.03241419, -0.01145032,  0.03437041,
          0.00386361, -0.03124221, -0.03837727, -0.04804075, -0.01442516]]])
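Note that the embedding layer here is randomly initialized and never trained, so the actual numbers will differ from run to run; only the shape is meaningful. A quick check (assuming the model defined above):

embedded = model.predict(np.array([10, 90, 151]))
print(embedded.shape)   # (3, 1, 10): one 10-dimensional vector per input feature, since input_length=1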

Of course, one must not omit a similar algorithm called GloVe, developed by the Stanford NLP group. Their code has been wrapped in both Python (a package called glove) and R (a library called text2vec).

Besides Word2Vec, there are other word embedding algorithms that try to complement it, although many of them are more computationally costly. I introduced LDA2Vec in a previous entry, an algorithm that combines the locality of words and their global distribution in the corpus. In fact, word embedding algorithms with similar ideas have also been invented by other scientists, as I have introduced in another entry.

New word embedding algorithms keep coming out. Since most English words carry more than a single sense, different senses of a word might be best represented by different embedded vectors. Incorporating word sense disambiguation, a method called sense2vec has been introduced by Trask, Michalak, and Liu (arXiv:1511.06388). Matthew Honnibal wrote a nice blog entry demonstrating its use.

There is also other related work, such as wang2vec, which is more sensitive to word order.

Big Bang Theory (Season 2, Episode 5): Euclid Alternative

DMV staff: Application?
Sheldon: I’m actually more of a theorist.

Note: feature image taken from Big Bang Theory (CBS).


Local and Global: Words and Topics

Word2Vec has been a hit in the NLP world for a while, as it is a nice method for word embeddings or word representations. Its use of the skip-gram model and deep learning made a big impact too. It has been my favorite toy indeed. However, even though words do correlate across a small segment of text, this is still only local coherence. On the other hand, topic models such as latent Dirichlet allocation (LDA) capture the distribution of words within a topic, that of topics within a document, etc., and they provide a representation of a new document in terms of topics.

In my previous blog entry, I introduced Chris Moody’s LDA2Vec algorithm (see his SlideShare). Unfortunately, not many papers or blogs have covered this new algorithm in depth despite its potential. The API is not yet completely documented, although you can see examples in its source code on GitHub. In its documentation, it gives an example of deriving topics from an array of random numbers, in its lda2vec/lda2vec.py code:

from lda2vec import LDA2Vec
import numpy as np   # needed for np.random, np.unique and np.arange below
n_words = 10
n_docs = 15
n_hidden = 8
n_topics = 2
n_obs = 300
words = np.random.randint(n_words, size=(n_obs))
_, counts = np.unique(words, return_counts=True)
model = LDA2Vec(n_words, n_hidden, counts)
model.add_categorical_feature(n_docs, n_topics, name='document id')
model.finalize()
doc_ids = np.arange(n_obs) % n_docs
loss = model.fit_partial(words, 1.0, categorical_features=doc_ids)

A more comprehensive example is in examples/twenty_newsgroup/lda.py.

Besides LDA2Vec, there is some related research work on topical word embeddings too. A group of Australian and American scientists studied topic modeling with pre-trained Word2Vec (or GloVe) embeddings before performing LDA (see their paper and code). On the other hand, another group of Chinese and Singaporean scientists performs LDA first, and then trains a Word2Vec model (see their paper and code). LDA2Vec concatenates the Word2Vec and LDA representations, like an early fusion.

No matter what, representations from LDA models (or related topic models such as correlated topic models (CTM)) can be useful even outside NLP. I have found them useful at some intermediate layer of calculation lately.

[Figure: graphical representation of the LDA model. Source: http://salsahpc.indiana.edu/b649proj/images/proj3_LDA%20structure.png]


Flesch-Kincaid Readability Measure

In the 1970s, long before artificial intelligence and natural language processing became hot, there were already metrics to measure the ease of reading, or readability, of a text. These metrics were designed to limit the difficulty level of government, legal, and commercial documents.

The Flesch-Kincaid readability measures, developed by the United States Navy, are among the most popular. There are two metrics under this umbrella, namely the Flesch readability ease and the Flesch-Kincaid grade level. Despite their differences, the intuition behind both measures is that a text is more difficult to read if 1) there are more words per sentence on average, and 2) the words are longer, i.e., have more syllables. This makes #words/#sentences and #syllables/#words the important terms in both metrics. The formulae for both metrics are given as:

\text{Flesch readability ease} = 206.835 - 1.015 \frac{\text{number of words}}{\text{number of sentences}} - 84.6 \frac{\text{number of syllables}}{\text{number of words}},

\text{Flesch-Kincaid grade level} = 0.39 \frac{\text{number of words}}{\text{number of sentences}} + 11.8 \frac{\text{number of syllables}}{\text{number of words}} - 15.59.

Therefore, the more difficult the passage is, the lower its Flesch readability ease, and the higher its Flesch-Kincaid grade level.

With natural language processing packages, it is not at all difficult to calculate these metrics. We can use the NLTK library in Python. To calculate the numbers of words and sentences in a text, we need the tokenizers, which can be imported easily:

from nltk.tokenize import sent_tokenize, word_tokenize

And the counts can be easily implemented with the following functions:

not_punctuation = lambda w: not (len(w)==1 and (not w.isalpha()))
get_word_count = lambda text: len(filter(not_punctuation, word_tokenize(text)))
get_sent_count = lambda text: len(sent_tokenize(text))
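A quick check of these two counters:

print(get_word_count('This is a test.'))   # 4, since the trailing period is filtered out
print(get_sent_count('This is a test.'))   # 1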

The first function, not_punctuation, filters out punctuation tokens (single characters that are not alphabetic). For the number of syllables, we need the Carnegie Mellon University (CMU) Pronouncing Dictionary, which is also included in NLTK:

from nltk.corpus import cmudict
prondict = cmudict.dict()

It is helpful to go through some examples. This dictionary outputs the pronunciations of a word. For example, typing prondict[‘apple’] gives:

[[u'AE1', u'P', u'AH0', u'L']]

Note that the vowels come with a digit at the end. By counting these digits, we retrieve the number of syllables. It is also useful to look at a word with more than one pronunciation; for example, prondict[‘orange’] gives:

[[u'AO1', u'R', u'AH0', u'N', u'JH'], [u'AO1', u'R', u'IH0', u'N', u'JH']]

If the word is not in the dictionary, it throws a KeyError. We can implement the counting of syllables with the following code:

from curses.ascii import isdigit   # provides the isdigit function used in the lambda below

numsyllables_pronlist = lambda l: len(filter(lambda s: isdigit(s.encode('ascii', 'ignore').lower()[-1]), l))
def numsyllables(word):
  try:
    return list(set(map(numsyllables_pronlist, prondict[word.lower()])))
  except KeyError:
    return [0]
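For instance, both pronunciations of “orange” above carry two digit-tagged vowels, so:

print(numsyllables('orange'))   # [2]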

For simplicity, if there is more than one pronunciation, I take the largest number of syllables in subsequent calculations. The counts of words, sentences, and syllables can then be summarized in the following function:

def text_statistics(text):
  word_count = get_word_count(text)
  sent_count = get_sent_count(text)
  syllable_count = sum(map(lambda w: max(numsyllables(w)), word_tokenize(text)))
  return word_count, sent_count, syllable_count

And the two metrics can be implemented as:

flesch_formula = lambda word_count, sent_count, syllable_count : 206.835 - 1.015*word_count/sent_count - 84.6*syllable_count/word_count
def flesch(text):
  word_count, sent_count, syllable_count = text_statistics(text)
  return flesch_formula(word_count, sent_count, syllable_count)

fk_formula = lambda word_count, sent_count, syllable_count : 0.39 * word_count / sent_count + 11.8 * syllable_count / word_count - 15.59
def flesch_kincaid(text):
  word_count, sent_count, syllable_count = text_statistics(text)
  return fk_formula(word_count, sent_count, syllable_count)

Let’s go through a few examples. We can access the text of Macbeth by William Shakespeare through the Gutenberg corpus:

from nltk.corpus import gutenberg
macbeth = gutenberg.raw('shakespeare-macbeth.txt')
print flesch(macbeth)
print flesch_kincaid(macbeth)

This prints 112.27804859129883, and 0.6579340562875089, respectively, indicating it is easy to understand. The next example is the King James Version (KJV) of the Holy Bible:

kjv = gutenberg.raw('bible-kjv.txt')
print flesch(kjv)
print flesch_kincaid(kjv)

This prints 79.64174894275615, and 9.008527536596926, respectively, implying that it is less easy to understand.

Other metrics include the Gunning fog index.

Last month, Matthew Lipson wrote on his blog about the language used by the candidates in the 2016 United States presidential election. The metrics introduced here can serve as an indication of the literary level of the candidates. By the above two metrics, Hillary Clinton’s language sits at the highest grade level, and Donald Trump’s at the lowest.


Persistence

Previously, I went through, heuristically, the description of topology using homology groups in this entry. [Ho 2015] This is the essence of algebraic topology. We describe the topology using Betti numbers, the ranks of the homology groups. What they mean can be summarized as follows: [Bubenik 2015]

“… homology in degree 0 describes the connectedness of the data; homology in degree 1 detects holes and tunnels; homology in degree 2 captures voids; and so on.”

Concept of Persistence

However, in computational problems, we deal with discrete points. We formulate their connectedness by constructing complexes, as described in another blog entry of mine. [Ho 2015] From the Wolfram Demonstration that I quoted previously, connectedness depends on some parameters, such as the radii within which points are considered connected. Whether it is the Čech complex, the Vietoris-Rips (VR) complex, or the alpha complex, the idea is similar. With discrete data, therefore, there is no definite answer to how the points are connected, as it depends on the parameters.

Therefore, the concept of persistence has been developed to tackle this problem. This is the core concept of computational topology. There are a lot of papers about persistence, but the most famous work is by Zomorodian and Carlsson, who studied it algebraically. [Zomorodian & Carlsson 2005] The idea is that as one increases the radii, the complexes change, and so do the homology groups. By varying the radii, we can observe which topological features persist.

[Figure: persistence and barcodes. (Christ 2008)]
From the diagram above, we can see that as the radius ε increases, the complex becomes more connected. There are a few ways to understand the changes of the homologies. In the diagram above, barcodes represent the “life span” of a connected feature as ε increases. The Betti number of a certain degree (0, 1, or 2 in this example) at a certain value of ε is the number of barcodes of that degree present at that ε. For example, at the leftmost vertical dashed line, \beta_0=10, as there are 10 barcodes for H_0; note that there are indeed 10 separate connected components. At the second leftmost vertical dashed line, \beta_0=6 (6 connected components) and \beta_1=2 (2 holes).

Another way is the persistence diagram, which basically plots the “birth” and “death” times of all the barcodes above. For an explanation of the persistence diagram, please refer to this blog entry by Sebastien Bubeck [Bubeck 2013] or the paper by Fasy et al. [Fasy et al. 2014] Yet another way to describe persistent topology is the persistence landscape. [Bubenik 2015]

TDA Package in R

There are a lot of tools to perform topological data analysis. Ayasdi Core is a famous one. There are open-source C++ libraries such as Dionysus and PHAT, and there is a Python binding for Dionysus too.

There is a package in R that wraps Dionysus and PHAT, called TDA. To install it, simply open an R session, and enter

install.packages('TDA')

To load it, simply enter

library(TDA)

We know that for a circle, \beta_0=\beta_1=1, as it has one connected component and one hole. Prepare the circle and store it in X with the function circleUnif:

X<- circleUnif(n=1000, r=1)
plot(X)

Then we can see a 2-dimensional circle like this:
[Plot: points sampled uniformly on a circle]

To calculate the persistent homology, use the function gridDiag:

diag.info<- gridDiag(X=X, FUN=kde, h=0.3, lim=cbind(c(-1, 1), c(-1, 1)), by=0.01, sublevel = FALSE, library = 'PHAT', printProgress=FALSE)

To plot the barcodes and persistence diagram, enter:

par(mfrow=c(2,1))
plot(diag.info$diagram)
plot(diag.info$diagram, barcode=TRUE)

[Plot: persistence diagram and barcodes for the circle]

In the plots, black refers to degree 0, and red refers to degree 1.

We can play the same game by adding a horizontal line to cut the circle into two halves:

X<- circleUnif(n=1000, r=1)
hl<- matrix(c(seq(-1, 1, 0.02), rep(0, 101)), ncol=2)   # 101 points along the horizontal diameter
X<- rbind(X, hl)
plot(X)

[Plot: the circle with a horizontal diameter added]
And the barcodes and persistence diagram are:

[Plot: persistence diagram and barcodes for the circle with a diameter]
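For this shape, we expect \beta_0 = 1 (the points now form a single connected component) and \beta_1 = 2 (the diameter splits the original loop into two independent loops), so there should be one long-lived barcode in degree 0 and two in degree 1.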

We can try this with three-dimensional objects like a sphere or a torus, but I never finished those calculations at a reasonable speed.


Homology and Betti Numbers

We have been talking about the elements of topological data analysis. In my previous post, I introduced simplicial complexes, which concern the ways to connect points together. In topology, it is the shape and geometry, not distances, that matter (although distances do play a role in constructing the complexes).

With the simplicial complexes, we can go ahead and describe their topology. We will use techniques from algebraic topology without going into too much detail. The techniques involve homology, but a full explanation requires the concepts of normal subgroups, kernels, images, and quotient groups in group theory. I will not talk about them here, although I admit there is no easy way to talk about computational topology without touching them. I highly recommend that readers refer to Zomorodian’s textbook for more details. [Zomorodian 2009]

I will continue with the Python class SimplicialComplex that I wrote in the previous blog post. Suppose we have a k-simplex; then an n-face is any combination of n+1 of its vertices. A simplicial complex is such that a face contained in a face of the complex is also a face of the complex. With this, we can define the boundary operator by

\partial_k \sigma = \sum_i (-1)^i [v_0 v_1 \ldots \hat{v}_i \ldots v_k],

where \hat{v}_i indicates that the i-th vertex is removed. This operator gives all the boundary faces of a face \sigma: it takes k-faces to (k-1)-faces. For example, \partial_1 [v_0 v_1] = [v_1] - [v_0], and \partial_2 [v_0 v_1 v_2] = [v_1 v_2] - [v_0 v_2] + [v_0 v_1]. The boundary operator \partial_k can then be seen as an (n_{k-1} \times n_k)-matrix, where n_k is the number of k-faces. This can be easily calculated with the following method:

# imports needed by the methods below (if not already present in the class file)
from scipy.sparse import dok_matrix
import numpy as np

class SimplicialComplex:
  ...
  def boundary_operator(self, i):
    source_simplices = self.n_faces(i)
    target_simplices = self.n_faces(i-1)

    if len(target_simplices)==0:
      S = dok_matrix((1, len(source_simplices)), dtype=np.float32)
      S[0, 0:len(source_simplices)] = 1
    else:
      source_simplices_dict = {}
      for j in range(len(source_simplices)):
        source_simplices_dict[source_simplices[j]] = j
      target_simplices_dict = {}
      for i in range(len(target_simplices)):
        target_simplices_dict[target_simplices[i]] = i

      S = dok_matrix((len(target_simplices), len(source_simplices)), dtype=np.float32)
      for source_simplex in source_simplices:
        for a in range(len(source_simplex)):
          target_simplex = source_simplex[:a]+source_simplex[(a+1):]
          i = target_simplices_dict[target_simplex]
          j = source_simplices_dict[source_simplex]
          S[i, j] = -1 if a % 2==1 else 1 # S[i, j] = (-1)**a
    return S

With the boundary operators, we can calculate the Betti numbers that characterize the topology of the shape. (Strictly, this involves the concept of homology groups, which we are going to omit.) To calculate the k-th Betti number, we calculate:

\beta_k = \text{rank} (\text{ker} \partial_k) - \text{rank} (\text{Im} \partial_{k+1}).

By the rank-nullity theorem, [Jackson]

\text{rank} (\text{ker} \partial_k) +\text{rank} (\text{Im} \partial_k) = \text{dim} (\partial_k)

the Betti number is then

\beta_k = \text{dim} (\partial_k) - \text{rank}(\text{Im} \partial_k) - \text{rank} (\text{Im} \partial_{k+1}),

where \text{dim}(\partial_k) denotes the dimension of the domain of \partial_k, i.e., the number of k-faces (the number of columns of the matrix), and the rank of the image of an operator can be easily computed using the matrix_rank function available in numpy.linalg. The method for calculating the Betti number is then:

class SimplicialComplex:
  ...
  def betti_number(self, i):
    boundop_i = self.boundary_operator(i)
    boundop_ip1 = self.boundary_operator(i+1)

    if i==0:
      boundop_i_rank = 0
    else:
      try:
        boundop_i_rank = np.linalg.matrix_rank(boundop_i.toarray())
      except np.linalg.LinAlgError:
        boundop_i_rank = boundop_i.shape[1]
    try:
      boundop_ip1_rank = np.linalg.matrix_rank(boundop_ip1.toarray())
    except np.linalg.LinAlgError:
      boundop_ip1_rank = boundop_ip1.shape[1]

    return ((boundop_i.shape[1]-boundop_i_rank)-boundop_ip1_rank)

If we draw a simplicial complex on a 2-dimensional plane, essentially only \beta_0, \beta_1 and \beta_2 matter: \beta_0 indicates the number of connected components, \beta_1 the number of independent tunnels, and \beta_2 the number of voids.

Let’s have some examples. Suppose we have a triangle, not filled.

e1 = [(0, 1), (1, 2), (2, 0)]
sc1 = SimplicialComplex(e1)

Then the Betti numbers are:


In [5]: sc1.betti_number(0)
Out[5]: 1

In [6]: sc1.betti_number(1)
Out[6]: 1

In [7]: sc1.betti_number(2)
Out[7]: 0
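These agree with a hand calculation: the hollow triangle has n_0 = 3 vertices, n_1 = 3 edges and no 2-faces, and \partial_1 has rank 2 (the boundaries of the three edges sum to zero), so

\beta_0 = 3 - 0 - 2 = 1, \qquad \beta_1 = 3 - 2 - 0 = 1, \qquad \beta_2 = 0.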

Let’s try another example with multiple components.

e2 = [(1,2), (2,3), (3,1), (4,5,6), (6,7), (7,4)]
sc2 = SimplicialComplex(e2)

We can graphically represent it using networkx:

import networkx as nx
import matplotlib.pyplot as plt
n2 = nx.Graph()
n2.add_edges_from(sc2.n_faces(1))
nx.draw(n2)
plt.show()
[Figure: the simplicial complex of e2, drawn with networkx]

And its Betti numbers are as follows:


In [13]: sc2.betti_number(0)
Out[13]: 2

In [14]: sc2.betti_number(1)
Out[14]: 2

In [15]: sc2.betti_number(2)
Out[15]: 0

A better illustration is the Wolfram Demonstration, titled “Simplicial Homology of the Alpha Complex”.

On top of the techniques in this post, we can describe the homology of discrete points using persistent homology, which I will describe in future posts. I will probably also spend a post on homotopy, in comparison to other types of quantitative problems.

