Flesch-Kincaid Readability Measure

In the 1970s, long before artificial intelligence and natural language processing became hot, there were already metrics for measuring the ease of reading, or readability, of a given text. These metrics were designed to limit the difficulty level of government, legal, and commercial documents.

The Flesch-Kincaid readability measures, developed for the United States Navy, are among the most popular. There are two metrics under this umbrella, namely the Flesch readability ease and the Flesch-Kincaid grade level. Despite their differences, the intuition behind both measures is that a text is more difficult to read if 1) its sentences contain more words on average, and 2) its words are longer, i.e., have more syllables. This makes #words/#sentences and #syllables/#words the key terms in both metrics. The formulae for the two metrics are given as:

\text{Flesch readability ease} = 206.835 - 1.015 \frac{\text{number of words}}{\text{number of sentences}} - 84.6 \frac{\text{number of syllables}}{\text{number of words}},

\text{Flesch-Kincaid grade level} = 0.39 \frac{\text{number of words}}{\text{number of sentences}} + 11.8 \frac{\text{number of syllables}}{\text{number of words}} - 15.59.

Therefore, the more difficult the passage is, the lower its Flesch readability ease, and the higher its Flesch-Kincaid grade level.
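To make the formulae concrete, consider a hypothetical passage with 100 words, 5 sentences, and 140 syllables (these counts are made up purely for illustration):

\text{Flesch readability ease} = 206.835 - 1.015 \times 20 - 84.6 \times 1.4 \approx 68.1,

\text{Flesch-Kincaid grade level} = 0.39 \times 20 + 11.8 \times 1.4 - 15.59 \approx 8.7,

which corresponds to fairly plain English at roughly an eighth- to ninth-grade reading level.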

With today's natural language processing packages, it is not difficult at all to calculate these metrics. We can use the NLTK library in Python. To count the words and sentences in a text, we need the tokenizers, which can be imported easily:

from nltk.tokenize import sent_tokenize, word_tokenize

And the counts can be easily implemented with the following functions:

not_punctuation = lambda w: not (len(w)==1 and (not w.isalpha()))
get_word_count = lambda text: len(filter(not_punctuation, word_tokenize(text)))
get_sent_count = lambda text: len(sent_tokenize(text))
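
As a quick check, these helpers can be applied to a short made-up string (this assumes the NLTK tokenizer models, e.g. punkt, have already been downloaded via nltk.download):

sample_text = "This is a short sentence. Here is another one."
print get_word_count(sample_text)   # 9: the two punctuation tokens are filtered out
print get_sent_count(sample_text)   # 2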

The function not_punctuation filters out single-character tokens that are not alphabetic, i.e., punctuation marks, so that they are not counted as words. For the number of syllables, we need the Carnegie Mellon University (CMU) Pronouncing Dictionary, which is also included in NLTK:

from nltk.corpus import cmudict
prondict = cmudict.dict()

It is helpful to go through some examples. This dictionary maps a word to its pronunciations. For example, prondict['apple'] gives:

[[u'AE1', u'P', u'AH0', u'L']]

Note that each vowel ends with a digit (the stress marker). By counting these digits, we retrieve the number of syllables. Some words have more than one pronunciation; for example, prondict['orange'] gives:

[[u'AO1', u'R', u'AH0', u'N', u'JH'], [u'AO1', u'R', u'IH0', u'N', u'JH']]

If the word is not in the dictionary, it throws a KeyError. We can implement the syllable count with the following code:

# a vowel phoneme ends with a stress digit (e.g. AE1, AH0); counting these gives the number of syllables
numsyllables_pronlist = lambda l: len(filter(lambda s: s[-1].isdigit(), l))
def numsyllables(word):
  try:
    # a word may have several pronunciations; return the distinct syllable counts
    return list(set(map(numsyllables_pronlist, prondict[word.lower()])))
  except KeyError:
    # the word is not in the CMU dictionary
    return [0]
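
As a quick sanity check against the dictionary entries shown above:

print numsyllables('apple')    # [2]
print numsyllables('orange')   # [2]: both pronunciations have two syllables
print numsyllables('.')        # [0]: not in the dictionary, caught by the KeyError branch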

For simplicity, if there is more than one pronunciation, I take the largest number of syllables in the subsequent calculations. The counts of words, sentences, and syllables can then be summarized in the following function:

def text_statistics(text):
  word_count = get_word_count(text)
  sent_count = get_sent_count(text)
  syllable_count = sum(map(lambda w: max(numsyllables(w)), word_tokenize(text)))
  return word_count, sent_count, syllable_count

And the two metrics can be implemented as:

flesch_formula = lambda word_count, sent_count, syllable_count : 206.835 - 1.015*word_count/sent_count - 84.6*syllable_count/word_count
def flesch(text):
  word_count, sent_count, syllable_count = text_statistics(text)
  return flesch_formula(word_count, sent_count, syllable_count)

fk_formula = lambda word_count, sent_count, syllable_count : 0.39 * word_count / sent_count + 11.8 * syllable_count / word_count - 15.59
def flesch_kincaid(text):
  word_count, sent_count, syllable_count = text_statistics(text)
  return fk_formula(word_count, sent_count, syllable_count)

Let’s go through a few examples. We can retrieve the text of Macbeth by William Shakespeare from the Gutenberg corpus:

from nltk.corpus import gutenberg
macbeth = gutenberg.raw('shakespeare-macbeth.txt')
print flesch(macbeth)
print flesch_kincaid(macbeth)

This prints 112.27804859129883 and 0.6579340562875089, respectively, indicating that, by these measures, the text is easy to understand. The next example is the King James Version (KJV) of the Holy Bible:

kjv = gutenberg.raw('bible-kjv.txt')
print flesch(kjv)
print flesch_kincaid(kjv)

This prints 79.64174894275615 and 9.008527536596926, respectively, implying that it is harder to understand.

Other metrics include the Gunning fog index.
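
As a rough sketch of how that index could be computed with the helpers above (here every word with three or more syllables is treated as “complex”, which ignores the usual exceptions for proper nouns, compound words, and common suffixes):

def gunning_fog(text):
  # Gunning fog index: 0.4 * (average sentence length + percentage of complex words)
  word_count, sent_count, _ = text_statistics(text)
  words = filter(not_punctuation, word_tokenize(text))
  complex_count = len(filter(lambda w: max(numsyllables(w)) >= 3, words))
  return 0.4 * (float(word_count) / sent_count + 100.0 * complex_count / word_count)

print gunning_fog(macbeth)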

Last month, Matthew Lipson wrote on his blog about the language used by the candidates in the 2016 United States presidential election. The metrics introduced here can serve as an indication of the reading level of the candidates’ language. By the above two metrics, Hillary Clinton scores the highest in readability, and Donald Trump the lowest.


Beauty of Math and Information

My cousin in China bought me this book.

The title, Shu Xue Zhi Mei, can be translated literally as “The Beauty of Math,” but the content is about information theory and data mining. The author, Jun Wu, was a scientist at Google in its early days. He graduated from Tsinghua University and Johns Hopkins University, and he is an expert in natural language processing and search engines.

I have just started reading this book, but I would like to share the very first section I read, which I found very interesting. In it, he beautifully tells a story that combines entropy and information theory.

A function of languages is to convey information (theologians would add that language is also related to action, as in speech-act theory and the doctrine of Scripture). The ancient Egyptians and Chinese invented hieroglyphs, writing systems that represent information, which can be seen as clustering in the machine-learning sense. Indeed, a character or symbol in Chinese represents an area of meaning, and when we have more concepts, we introduce more characters, or equivalently, add more clusters. This is indeed what has been happening: the Chinese invented new words to cover new knowledge.

Thanks to the Phoenicians, phonetic languages reduce the need to keep introducing new clusters, which take much effort for humans to learn. A combination of a small number of letters (or an alphabet, or aleph-bet…), together with a set of grammar rules, can represent sufficiently complicated concepts.

Later, Claude Shannon (with the name reportedly suggested by John von Neumann) introduced the concept of information entropy, which is essentially the number of bits (0s and 1s) required to represent a variety of concepts. See my previous post on entropy. Bits may be the most compact way of representing information, but the redundancy found in all languages is necessary in case of loss in transmission.
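
As a toy illustration of entropy as an expected number of bits (this snippet is my own, not from the book), one can compute the Shannon entropy of the character distribution of a text, for example the Macbeth text loaded earlier:

from collections import Counter
from math import log

def char_entropy(text):
  # empirical distribution over the characters of the text
  counts = Counter(text)
  total = float(sum(counts.values()))
  # Shannon entropy in bits: the expected number of bits per character
  return -sum(c / total * log(c / total, 2) for c in counts.values())

print char_entropy(macbeth)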
