The first 2016 presidential debate was held on September 26, 2016, at Hofstra University in New York. An interesting analysis is to measure the reading level of the two candidates' speech using the Flesch reading ease and the Flesch-Kincaid grade level, as demonstrated in my previous blog entry and in my GitHub repository: stephenhky/PyReadability.
First, we need the transcript of the debate, which can be found in an article in The New York Times. Copy and paste the text into a file called first_debate_transcript.txt. Then we want to extract the speech of each person. To do this, save the following Python code as first_debate_segment.py.
```python
# Trump and Clinton 1st debate on Sept 26, 2016

from nltk import word_tokenize
from collections import defaultdict
import re


# adapted from http://stackoverflow.com/questions/21948019/python-untokenize-a-sentence
def untokenize(words):
    """
    Untokenizing a text undoes the tokenizing operation, restoring
    punctuation and spaces to the places that people expect them to be.
    Ideally, `untokenize(tokenize(text))` should be identical to `text`,
    except for line breaks.
    """
    text = ' '.join(words)
    step1 = text.replace("`` ", '"').replace(" ''", '"').replace('. . .', '...')
    step2 = step1.replace(" ( ", " (").replace(" ) ", ") ")
    step3 = re.sub(r' ([.,:;?!%]+)([ \'"`])', r"\1\2", step2)
    step4 = re.sub(r' ([.,:;?!%]+)$', r"\1", step3)
    step5 = step4.replace(" '", "'").replace(" n't", "n't").replace(
        "can not", "cannot")
    step6 = step5.replace(" ` ", " '")
    return step6.strip()


ignored_phrases = ['(APPLAUSE)', '(CROSSTALK)']
persons = ['TRUMP', 'CLINTON', 'HOLT']

# Read the transcript as UTF-8 text (text mode, so this runs on Python 3).
with open('first_debate_transcript.txt', 'r', encoding='utf-8') as fin:
    lines = fin.readlines()

# Strip whitespace and drop empty lines.
lines = [s.strip() for s in lines if len(s.strip()) > 0]

speeches = defaultdict(lambda: '')
person = None    # current speaker; carries over to lines without a speaker tag

for line in lines:
    tokens = word_tokenize(line)
    ignore_colon = False
    added_tokens = []
    for token in tokens:
        # Note: word_tokenize typically splits '(APPLAUSE)' into separate
        # tokens, so this check may never match.
        if token in ignored_phrases:
            pass
        elif token in persons:
            # A speaker name switches the current speaker.
            person = token
            ignore_colon = True
        elif token == ':':
            # Colons are dropped from the output.
            ignore_colon = False
        else:
            added_tokens += [token]
    speeches[person] += ' ' + untokenize(added_tokens)

# Write one file per speaker.
for person in persons:
    with open('speeches_'+person+'.txt', 'w', encoding='utf-8') as fout:
        fout.write(speeches[person])
```
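To make the segmentation idea in the script above easier to see, here is a minimal sketch that uses only the standard library instead of NLTK. It assumes each speaker turn begins with an uppercase name followed by a colon (for example `TRUMP:`), which is a simplification of what the script does. The function `segment_by_speaker` and the toy transcript are invented for illustration and are not part of first_debate_segment.py.

```python
import re
from collections import defaultdict

# Hypothetical helper for illustration; not part of first_debate_segment.py.
SPEAKERS = ('TRUMP', 'CLINTON', 'HOLT')
SPEAKER_TAG = re.compile(r'^(%s):\s*' % '|'.join(SPEAKERS))
STAGE_DIRECTION = re.compile(r'\((?:APPLAUSE|CROSSTALK)\)')


def segment_by_speaker(lines):
    """Group transcript lines under the most recent speaker tag."""
    speeches = defaultdict(list)
    speaker = None
    for line in lines:
        # Remove stage directions, then skip lines that become empty.
        line = STAGE_DIRECTION.sub('', line).strip()
        if not line:
            continue
        match = SPEAKER_TAG.match(line)
        if match:
            # A new speaker tag starts a new turn; drop the tag itself.
            speaker = match.group(1)
            line = line[match.end():]
        # Lines before the first speaker tag are ignored (an assumption).
        if speaker is not None and line:
            speeches[speaker].append(line)
    return {name: ' '.join(parts) for name, parts in speeches.items()}


# Toy transcript, invented for this example.
toy = [
    'HOLT: Good evening.',
    'CLINTON: Thank you, Lester.',
    '(APPLAUSE)',
    'It is good to be here.',
    'TRUMP: Thank you.',
]
result = segment_by_speaker(toy)
print(result['CLINTON'])
```

A line without a speaker tag is attributed to whoever spoke last, which is the same carry-over behavior the `person` variable provides in the script.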
The untokenize function is adapted from an answer on Stack Overflow. The script segments the transcript into the individual speeches of Lester Holt (the moderator of the debate), Donald Trump (the Republican presidential candidate), and Hillary Clinton (the Democratic presidential candidate), and writes each to a separate file. Then, on a UNIX or Linux command line, run score_readability.py on each person’s speech. For example, for Holt’s speech:
python score_readability.py speeches_HOLT.txt --utf8
Note that the file is encoded in UTF-8, hence the --utf8 flag. For Lester Holt, we have
```
Word count = 1935
Sentence count = 157
Syllable count = 2732
Flesch readability ease = 74.8797052289
Flesch-Kincaid grade level = 5.87694629602
```
For Donald Trump,
```
Word count = 8184
Sentence count = 693
Syllable count = 10665
Flesch readability ease = 84.6016324536
Flesch-Kincaid grade level = 4.3929136992
```
And for Hillary Clinton,
```
Word count = 6179
Sentence count = 389
Syllable count = 8395
Flesch readability ease = 75.771973015
Flesch-Kincaid grade level = 6.63676650035
```
By these measures, Hillary Clinton spoke at a higher grade level than Donald Trump (about 6.6 versus 4.4), and her speech is correspondingly less easy to read (a reading ease of about 75.8 versus 84.6).
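For reference, both scores follow directly from the word, sentence, and syllable counts. The sketch below applies the standard Flesch reading ease and Flesch-Kincaid grade level formulas to the counts reported above. The function names are my own for this example and are not necessarily those used in PyReadability.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Standard Flesch reading ease formula; higher means easier to read."""
    return (206.835
            - 1.015 * (words / float(sentences))
            - 84.6 * (syllables / float(words)))


def flesch_kincaid_grade(words, sentences, syllables):
    """Standard Flesch-Kincaid grade level formula (US school grade)."""
    return (0.39 * (words / float(sentences))
            + 11.8 * (syllables / float(words))
            - 15.59)


# (word count, sentence count, syllable count) as reported above.
counts = {
    'HOLT': (1935, 157, 2732),
    'TRUMP': (8184, 693, 10665),
    'CLINTON': (6179, 389, 8395),
}

scores = {
    name: (flesch_reading_ease(*c), flesch_kincaid_grade(*c))
    for name, c in counts.items()
}

for name in ('HOLT', 'TRUMP', 'CLINTON'):
    ease, grade = scores[name]
    print('%-8s reading ease = %.2f, grade level = %.2f' % (name, ease, grade))
```

Longer sentences and more syllables per word both lower the reading ease and raise the grade level, which is why the two scores move in opposite directions for Clinton and Trump.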
Recall from my previous entry that for Shakespeare’s Macbeth, the Flesch reading ease is 112.278048591 and the Flesch-Kincaid grade level is 0.657934056288; for the King James Version (KJV) of the Bible, they are 79.6417489428 and 9.0085275366 respectively.
This is only a simple text analysis; the content of the speeches is not analyzed here. Augustine of Hippo wrote about rhetoric and eloquence in Book IV of On Christian Teaching (Latin: De doctrina christiana):
“… wisdom without eloquence is of little value to the society… eloquence without wisdom is… a great nuisance, and never beneficial.” — Augustine of Hippo, Book IV of On Christian Teaching