In the 1970s, long before artificial intelligence and natural language processing became hot, there were already metrics to measure the ease of reading, or readability, of a given text. These metrics were designed to limit the difficulty level of government, legal, and commercial documents.
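The Flesch-Kincaid readability measures, developed for the United States Navy, are among the most popular. There are two metrics under this umbrella, namely Flesch reading ease and Flesch-Kincaid grade level. Despite their differences, the intuition behind both measures is that a text is more difficult to read if 1) there are more words in a sentence on average, and 2) the words are longer, that is, have more syllables. This makes #words/#sentences and #syllables/#words the important terms in both metrics. The formulae for both metrics are given as:

$$\text{Flesch reading ease} = 206.835 - 1.015 \times \frac{\#\text{words}}{\#\text{sentences}} - 84.6 \times \frac{\#\text{syllables}}{\#\text{words}}$$

$$\text{Flesch-Kincaid grade level} = 0.39 \times \frac{\#\text{words}}{\#\text{sentences}} + 11.8 \times \frac{\#\text{syllables}}{\#\text{words}} - 15.59$$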
Both ratios are subtracted in the reading-ease formula and added in the grade-level formula. Therefore, the more difficult the passage is, the lower its Flesch reading ease, and the higher its Flesch-Kincaid grade level.
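As a quick arithmetic check of this relationship, here is a minimal sketch that evaluates both formulas on hand-picked counts. The function names and the counts are made up for illustration; the constants are the standard ones, the same that appear in the implementation later in this post.

```python
def reading_ease(words, sentences, syllables):
    """Flesch reading ease from raw counts (higher means easier)."""
    return 206.835 - 1.015 * words / sentences - 84.6 * syllables / words


def grade_level(words, sentences, syllables):
    """Flesch-Kincaid grade level from raw counts (higher means harder)."""
    return 0.39 * words / sentences + 11.8 * syllables / words - 15.59


# Illustrative counts only: (words, sentences, syllables).
easier = (100, 5, 150)   # 20 words per sentence, 1.5 syllables per word
harder = (100, 4, 170)   # 25 words per sentence, 1.7 syllables per word

for counts in (easier, harder):
    print(counts, reading_ease(*counts), grade_level(*counts))
```

The first passage comes out at roughly 59.6 for reading ease and 9.9 for grade level. The second, with longer sentences and longer words, scores lower on reading ease and higher on grade level, as expected.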
With natural language processing packages, these metrics are not difficult to calculate. Here we use the NLTK library in Python. To count the words and sentences in a text, we need the tokenizers, which can be imported easily:
from nltk.tokenize import sent_tokenize, word_tokenize
And the counts can be easily implemented with the following functions:
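# True for every token except single characters that are not letters (mostly punctuation)
not_punctuation = lambda w: not (len(w) == 1 and not w.isalpha())
# list(...) keeps this working in Python 3, where filter returns an iterator
get_word_count = lambda text: len(list(filter(not_punctuation, word_tokenize(text))))
get_sent_count = lambda text: len(sent_tokenize(text))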
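To make the filtering rule concrete, here is a small check on a hand-written token list. The tokens are an assumption, chosen to resemble tokenizer output, so the snippet runs without NLTK; the rule itself is the same as in not_punctuation.

```python
def not_punctuation(token):
    """Keep a token unless it is a single character that is not a letter."""
    return not (len(token) == 1 and not token.isalpha())


# Hand-written tokens for illustration (not actual word_tokenize output).
tokens = ["Hello", ",", "world", "!", "I", "'s", "...", "7", "42"]
kept = [t for t in tokens if not_punctuation(t)]
print(kept)
```

Note that the rule is deliberately crude: it drops single punctuation marks and single digits, but multi-character tokens such as "..." or "42" still count as words.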
The first function, not_punctuation, filters out single-character tokens that are not letters, which removes most punctuation marks from the word count. For the number of syllables, we need the Carnegie Mellon University (CMU) Pronouncing Dictionary, which is also included in NLTK:
from nltk.corpus import cmudict

prondict = cmudict.dict()
It helps to go through some examples. The dictionary maps a word to a list of its pronunciations. For example, prondict['apple'] gives:
[[u'AE1', u'P', u'AH0', u'L']]
Note that each vowel phoneme ends with a digit, which marks its stress. By counting these digits, we get the number of syllables. Some words have more than one pronunciation; for example, prondict['orange'] gives:
[[u'AO1', u'R', u'AH0', u'N', u'JH'], [u'AO1', u'R', u'IH0', u'N', u'JH']]
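As a worked check of the digit-counting idea, the sketch below applies it to the two dictionary entries shown above. The pronunciations are copied by hand so that the snippet runs without NLTK; count_syllables is a name chosen here for illustration.

```python
def count_syllables(pronunciation):
    """Count phonemes ending in a stress digit, i.e. the vowel sounds."""
    return sum(1 for phoneme in pronunciation if phoneme[-1].isdigit())


# Entries copied from the dictionary output shown above.
apple = [["AE1", "P", "AH0", "L"]]
orange = [["AO1", "R", "AH0", "N", "JH"], ["AO1", "R", "IH0", "N", "JH"]]

apple_counts = [count_syllables(p) for p in apple]
orange_counts = [count_syllables(p) for p in orange]
print(apple_counts, orange_counts)
```

Both pronunciations of "orange" have two syllables, so in this case the choice of pronunciation does not affect the count.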
If the word is not in the dictionary, it throws a KeyError. We can implement the counting of syllables with the following code:
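# A phoneme ending in a digit (the stress marker) is a vowel, i.e. one syllable
numsyllables_pronlist = lambda l: len([s for s in l if s[-1].isdigit()])

def numsyllables(word):
    try:
        # One syllable count per distinct pronunciation
        return list(set(map(numsyllables_pronlist, prondict[word.lower()])))
    except KeyError:
        # Assumption: a word missing from the dictionary counts as zero syllables
        return [0]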
For simplicity, if a word has more than one pronunciation, I take the largest number of syllables in subsequent calculations. The counts of words, sentences, and syllables can then be summarized in the following function:
def text_statistics(text):
    word_count = get_word_count(text)
    sent_count = get_sent_count(text)
    # Use the largest syllable count when a word has several pronunciations
    syllable_count = sum(map(lambda w: max(numsyllables(w)), word_tokenize(text)))
    return word_count, sent_count, syllable_count
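To see the three counts on a tiny input without downloading any NLTK data, here is a self-contained sketch under loud assumptions: TOY_PRONDICT is a hypothetical five-word stand-in for the CMU dictionary, and simple regular expressions replace the NLTK tokenizers. It mirrors the logic of text_statistics but is not the same implementation.

```python
import re

# Hypothetical miniature dictionary standing in for cmudict.dict();
# entries follow the CMU format, where vowel phonemes end in a stress digit.
TOY_PRONDICT = {
    "the": [["DH", "AH0"]],
    "cat": [["K", "AE1", "T"]],
    "sat": [["S", "AE1", "T"]],
    "apple": [["AE1", "P", "AH0", "L"]],
    "fell": [["F", "EH1", "L"]],
}


def toy_syllables(word):
    """Largest syllable count over all pronunciations; 0 for unknown words."""
    pronunciations = TOY_PRONDICT.get(word.lower(), [])
    counts = [sum(1 for p in pron if p[-1].isdigit()) for pron in pronunciations]
    return max(counts) if counts else 0


def toy_text_statistics(text):
    """Return (word_count, sent_count, syllable_count) using regex tokenization."""
    # Assumption: sentences end at ., ! or ?; words are runs of letters/apostrophes.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllable_count = sum(toy_syllables(w) for w in words)
    return len(words), len(sentences), syllable_count


stats = toy_text_statistics("The cat sat. The apple fell.")
print(stats)
```

For this input the sketch finds 6 words, 2 sentences, and 7 syllables. The regex tokenization is much cruder than NLTK's and will miscount abbreviations such as "Dr.", which is one reason to prefer the real tokenizers.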
And the two metrics can be implemented as:
flesch_formula = lambda word_count, sent_count, syllable_count: 206.835 - 1.015*word_count/sent_count - 84.6*syllable_count/word_count

def flesch(text):
    word_count, sent_count, syllable_count = text_statistics(text)
    return flesch_formula(word_count, sent_count, syllable_count)

fk_formula = lambda word_count, sent_count, syllable_count: 0.39*word_count/sent_count + 11.8*syllable_count/word_count - 15.59

def flesch_kincaid(text):
    word_count, sent_count, syllable_count = text_statistics(text)
    return fk_formula(word_count, sent_count, syllable_count)
Let's go through a few examples. We can get the text of Macbeth by William Shakespeare from the Gutenberg corpus:
from nltk.corpus import gutenberg

macbeth = gutenberg.raw('shakespeare-macbeth.txt')
print(flesch(macbeth))
print(flesch_kincaid(macbeth))
This prints 112.27804859129883 and 0.6579340562875089, respectively, indicating that, by these measures, the text is easy to read. The next example is the King James Version (KJV) of the Holy Bible:
kjv = gutenberg.raw('bible-kjv.txt')
print(flesch(kjv))
print(flesch_kincaid(kjv))
This prints 79.64174894275615 and 9.008527536596926, respectively, implying that it is harder to read than Macbeth.
Other readability metrics include the Gunning fog index.
Last month, Matthew Lipson wrote on his blog about the language used by the candidates in the 2016 United States presidential election. The metrics introduced here can be used as an indication of the literary level of the candidates' language. On the above two metrics, Hillary Clinton scores the highest in readability, and Donald Trump the lowest.