Simple Literary Analytics on Presidential Candidates in the First 2016 Presidential Debate

The first 2016 presidential debate was held on September 26, 2016, at Hofstra University in New York. An interesting analysis is the literacy level demonstrated by the two candidates, measured with the Flesch readability ease and the Flesch-Kincaid grade level, as demonstrated in my previous blog entry and on my GitHub: stephenhky/PyReadability.

First, we need the transcript of the debate, which can be found in an article in the New York Times. Copy and paste the text into a file called first_debate_transcript.txt. Then we want to extract the speech of each person. To do this, store the following Python code in first_debate_segment.py.

# Trump and Clinton 1st debate on Sept 26, 2016

from nltk import word_tokenize
from collections import defaultdict
import re

# adapted from http://stackoverflow.com/questions/21948019/python-untokenize-a-sentence
def untokenize(words):
    """
    Untokenizing a text undoes the tokenizing operation, restoring
    punctuation and spaces to the places that people expect them to be.
    Ideally, `untokenize(tokenize(text))` should be identical to `text`,
    except for line breaks.
    """
    text = ' '.join(words)
    step1 = text.replace("`` ", '"').replace(" ''", '"').replace('. . .',  '...')
    step2 = step1.replace(" ( ", " (").replace(" ) ", ") ")
    step3 = re.sub(r' ([.,:;?!%]+)([ \'"`])', r"\1\2", step2)
    step4 = re.sub(r' ([.,:;?!%]+)$', r"\1", step3)
    step5 = step4.replace(" '", "'").replace(" n't", "n't").replace(
         "can not", "cannot")
    step6 = step5.replace(" ` ", " '")
    return step6.strip()

ignored_phrases = ['(APPLAUSE)', '(CROSSTALK)']
persons = ['TRUMP', 'CLINTON', 'HOLT']
fin = open('first_debate_transcript.txt', 'rb')
lines = fin.readlines()
fin.close()

lines = filter(lambda s: len(s)>0, map(lambda s: s.strip(), lines))
speeches = defaultdict(lambda : '')
person = None

for line in lines:
    tokens = word_tokenize(line.strip())
    ignore_colon = False
    added_tokens = []
    for token in tokens:
        if token in ignored_phrases:
            pass
        elif token in persons:
            person = token
            ignore_colon = True
        elif token == ':':
            ignore_colon = False
        else:
            added_tokens += [token]
    # append this line's reconstructed text to the current speaker's speech
    speeches[person] += ' ' + untokenize(added_tokens)

for person in persons:
    fout = open('speeches_'+person+'.txt', 'wb')
    fout.write(speeches[person])
    fout.close()
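Running this script from the command line,

python first_debate_segment.py

produces the files speeches_HOLT.txt, speeches_TRUMP.txt, and speeches_CLINTON.txt, one for each speaker.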

There is an untokenize function adapted from code on StackOverflow. The script segments the transcript into the individual speeches of Lester Holt (the moderator of the debate), Donald Trump (the Republican presidential candidate), and Hillary Clinton (the Democratic presidential candidate), stored in separate files. Then, on a UNIX or Linux command line, run score_readability.py on each speaker's file; for example, for Holt's speech,

python score_readability.py speeches_HOLT.txt --utf8

Note that the file is encoded in UTF-8. For Lester Holt, we have

Word count = 1935
Sentence count = 157
Syllable count = 2732
Flesch readability ease = 74.8797052289
Flesch-Kincaid grade level = 5.87694629602

For Donald Trump,

Word count = 8184
Sentence count = 693
Syllable count = 10665
Flesch readability ease = 84.6016324536
Flesch-Kincaid grade level = 4.3929136992

And for Hillary Clinton,

Word count = 6179
Sentence count = 389
Syllable count = 8395
Flesch readability ease = 75.771973015
Flesch-Kincaid grade level = 6.63676650035
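As a sanity check, we can plug Clinton's counts into the two formulas from my previous entry on the Flesch-Kincaid readability measures:

206.835 - 1.015 \times \frac{6179}{389} - 84.6 \times \frac{8395}{6179} \approx 75.77,

0.39 \times \frac{6179}{389} + 11.8 \times \frac{8395}{6179} - 15.59 \approx 6.64,

which reproduce the scores reported by the script.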

Compared to Donald Trump, Hillary Clinton demonstrates a higher literary level: her speech has a higher Flesch-Kincaid grade level and a lower Flesch readability ease, i.e., it is less easy to understand.

Recall from my previous entry that for Shakespeare's Macbeth, the Flesch readability ease is 112.278048591 and the Flesch-Kincaid grade level 0.657934056288; for the King James Version of the Bible (KJV), they are 79.6417489428 and 9.0085275366, respectively.

This is just a simple piece of text analytics; the content of the speeches is not analyzed here. Augustine of Hippo wrote about rhetoric and eloquence in Book IV of On Christian Teaching (Latin: De doctrina christiana):

“… wisdom without eloquence is of little value to the society… eloquence without wisdom is… a great nuisance, and never beneficial.” — Augustine of Hippo, Book IV of On Christian Teaching



Flesch-Kincaid Readability Measure

In the 1970s, long before artificial intelligence and natural language processing became hot topics, there were already metrics to measure the ease of reading, or readability, of a text. These metrics were designed to limit the difficulty level of government, legal, and commercial documents.

The Flesch-Kincaid readability measures, developed for the United States Navy, are among the most popular. There are two metrics under this umbrella, namely the Flesch readability ease and the Flesch-Kincaid grade level. Despite their differences, the intuition behind both measures is that a text is more difficult to read if 1) its sentences contain more words on average, and 2) its words are longer, i.e., have more syllables. This makes #words/#sentences and #syllables/#words the important terms in both metrics. The formulae for both metrics are given as:

\text{Flesch readability ease} = 206.835 - 1.015 \frac{\text{number of words}}{\text{number of sentences}} - 84.6 \frac{\text{number of syllables}}{\text{number of words}},

\text{Flesch-Kincaid grade level} = 0.39 \frac{\text{number of words}}{\text{number of sentences}} + 11.8 \frac{\text{number of syllables}}{\text{number of words}} - 15.59.

Therefore, the more difficult the passage is, the lower its Flesch readability ease, and the higher its Flesch-Kincaid grade level.
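As a quick illustration with made-up counts, consider a passage of 100 words spread over 5 sentences with 130 syllables in total, i.e., 20 words per sentence and 1.3 syllables per word:

\text{Flesch readability ease} = 206.835 - 1.015 \times 20 - 84.6 \times 1.3 = 76.555,

\text{Flesch-Kincaid grade level} = 0.39 \times 20 + 11.8 \times 1.3 - 15.59 = 7.55,

which, by design, corresponds roughly to the reading level of a U.S. seventh- or eighth-grader.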

With natural language processing packages, it is not difficult at all to calculate these metrics. We can use the NLTK library in Python. To count the words and sentences in a text, we need the tokenizers, which can be imported easily.

from nltk.tokenize import sent_tokenize, word_tokenize

And the counts can be easily implemented with the following functions:

not_punctuation = lambda w: not (len(w)==1 and (not w.isalpha()))
get_word_count = lambda text: len(filter(not_punctuation, word_tokenize(text)))
get_sent_count = lambda text: len(sent_tokenize(text))
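For instance, on a toy string (this assumes NLTK's punkt tokenizer data has been downloaded via nltk.download('punkt')):

print get_word_count('Hello world. This is a test.')   # 6: punctuation tokens are filtered out
print get_sent_count('Hello world. This is a test.')   # 2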

The first function, not_punctuation, filters out single-character tokens that are not letters, i.e., punctuation marks. For the number of syllables, we need the Carnegie Mellon University (CMU) Pronouncing Dictionary, which is also included in NLTK:

from nltk.corpus import cmudict
prondict = cmudict.dict()

It is helpful to go through some examples. The dictionary outputs the pronunciations of a word. For example, typing prondict['apple'] gives:

[[u'AE1', u'P', u'AH0', u'L']]

Note that each vowel carries a digit at the end, marking the stress. By counting these digits, we retrieve the number of syllables. It is also useful to look at a word with more than one pronunciation; for example, prondict['orange'] gives:

[[u'AO1', u'R', u'AH0', u'N', u'JH'], [u'AO1', u'R', u'IH0', u'N', u'JH']]

If the word is not in the dictionary, it raises a KeyError. We can implement the counting of syllables with the following code:

numsyllables_pronlist = lambda l: len(filter(lambda s: s.encode('ascii', 'ignore').lower()[-1].isdigit(), l))
def numsyllables(word):
  try:
    return list(set(map(numsyllables_pronlist, prondict[word.lower()])))
  except KeyError:
    return [0]
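For the two words above, this gives:

print numsyllables('apple')    # [2]
print numsyllables('orange')   # [2] -- both pronunciations have two syllables

A word missing from the dictionary falls back to [0], so out-of-vocabulary tokens and punctuation simply contribute zero syllables.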

For simplicity, if there is more than one pronunciation, I take the largest number of syllables in subsequent calculations. The counts of words, sentences, and syllables can then be summarized in the following function:

def text_statistics(text):
  word_count = get_word_count(text)
  sent_count = get_sent_count(text)
  syllable_count = sum(map(lambda w: max(numsyllables(w)), word_tokenize(text)))
  return word_count, sent_count, syllable_count
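For example, on a short sentence it gives counts like these (the exact syllable total depends on the CMU dictionary entries):

print text_statistics('The quick brown fox jumps over the lazy dog.')
# roughly (9, 1, 11): 9 words, 1 sentence, and about 11 syllables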

And the two metrics can be implemented as:

flesch_formula = lambda word_count, sent_count, syllable_count : 206.835 - 1.015*word_count/sent_count - 84.6*syllable_count/word_count
def flesch(text):
  word_count, sent_count, syllable_count = text_statistics(text)
  return flesch_formula(word_count, sent_count, syllable_count)

fk_formula = lambda word_count, sent_count, syllable_count : 0.39 * word_count / sent_count + 11.8 * syllable_count / word_count - 15.59
def flesch_kincaid(text):
  word_count, sent_count, syllable_count = text_statistics(text)
  return fk_formula(word_count, sent_count, syllable_count)

Let’s go through a few examples. We can access the text of Macbeth by William Shakespeare through the Gutenberg corpus in NLTK:

from nltk.corpus import gutenberg
macbeth = gutenberg.raw('shakespeare-macbeth.txt')
print flesch(macbeth)
print flesch_kincaid(macbeth)

This prints 112.27804859129883 and 0.6579340562875089, respectively, indicating that the text is easy to understand. The next example is the King James Version (KJV) of the Holy Bible:

kjv = gutenberg.raw('bible-kjv.txt')
print flesch(kjv)
print flesch_kincaid(kjv)

This prints 79.64174894275615 and 9.008527536596926, respectively, implying that it is less easy to understand.

Other metrics include the Gunning fog index.

Last month, Matthew Lipson wrote on his blog about the language used by the candidates in the 2016 United States presidential election. The metrics introduced here can serve as an indication of the candidates' literary level. By these two metrics, Hillary Clinton's speech shows the highest literary level, and Donald Trump's the lowest.

