textmineR: a New Text Mining Package for R

Previously, I wrote an entry on text mining on R and Python, and did a comparison. However, the text mining package employed was tm for R. But it has some problems:

  1. The syntax is not natural for an experienced R users.
  2. tm uses simple_triplet_matrix from the slam library for document-term matrix (DTM) and term-occurrence matrix (TCM), which is not as widely used as dgCMatrix from the Matrix library.

Tommy Jones, a Ph.D. student in George Mason University, and a data scientist at Impact Research, developed an alternative text mining package called textmineR. He presented in a Stat Prog DC Meetup on April 27, 2016. It employed a better syntax, and dgCMatrix. All in all, it is a wrapper for a lot of existing R packages to facilitate the text mining process, like creating DTM matrices with stopwords or appropriate stemming/lemmatizing functions. Here is a sample code to create a DTM with the example from the previous entry:


texts <- c('I love Python.',
           'R is good for analytics.',
           'Mathematics is fun.')

               doc_names = c(1:length(texts)),
               ngram_window = c(1, 1),
               stopword_vec = c(tm::stopwords('english'), tm::stopwords('SMART')),
               lower = TRUE,
               remove_punctuation = TRUE,
               remove_numbers = TRUE

The DTM is a sparse matrix:

3 x 6 sparse Matrix of class &amp;quot;dgCMatrix&amp;quot;
  analytics fun mathematics good python love
1         .   .           .    .      1    1
2         1   .           .    1      .    .
3         .   1           1    .      .    .

On the other hand, it wraps text2vec, an R package that wraps the word-embedding algorithm named gloVe. And it wraps a number of topic modeling algorithms, such as latent Dirichlet allocation (LDA) and correlated topic models (CTM).

In addition, it contains a parallel computing loop function called TmParallelApply, analogous to the original R parallel loop function mclapply, but TmParallelApply works on Windows as well.

textmineR is an open-source project, with source code available on github, which contains his example codes.

Continue reading “textmineR: a New Text Mining Package for R”

Word Embedding Algorithms

Embedding has been hot in recent years partly due to the success of Word2Vec, (see demo in my previous entry) although the idea has been around in academia for more than a decade. The idea is to transform a vector of integers into continuous, or embedded, representations. Keras, a Python package that implements neural network models (including the ANN, RNN, CNN etc.) by wrapping Theano or TensorFlow, implemented it, as shown in the example below (which converts a vector of 200 features into a continuous vector of 10):

from keras.layers import Embedding
from keras.models import Sequential

# define and compile the embedding model
model = Sequential()
model.add(Embedding(200, 10, input_length=1))
model.compile('rmsprop', 'mse')  # optimizer: rmsprop; loss function: mean-squared error

We can then convert any features from 0 to 199 into vectors of 20, as shown below:

import numpy as np

model.predict(np.array([10, 90, 151]))

It outputs:

array([[[ 0.02915354,  0.03084954, -0.04160764, -0.01752155, -0.00056815,
         -0.02512387, -0.02073313, -0.01154278, -0.00389587, -0.04596512]],

       [[ 0.02981793, -0.02618774,  0.04137352, -0.04249889,  0.00456919,
          0.04393572,  0.04139435,  0.04415271,  0.02636364, -0.04997493]],

       [[ 0.00947296, -0.01643104, -0.03241419, -0.01145032,  0.03437041,
          0.00386361, -0.03124221, -0.03837727, -0.04804075, -0.01442516]]])

Of course, one must not omit a similar algorithm called GloVe, developed by the Stanford NLP group. Their codes have been wrapped in both Python (package called glove) and R (library called text2vec).

Besides Word2Vec, there are other word embedding algorithms that try to complement Word2Vec, although many of them are more computationally costly. Previously, I introduced LDA2Vec in my previous entry, an algorithm that combines the locality of words and their global distribution in the corpus. And in fact, word embedding algorithms with a similar ideas are also invented by other scientists, as I have introduced in another entry.

However, there are word embedding algorithms coming out. Since most English words carry more than a single sense, different senses of a word might be best represented by different embedded vectors. Incorporating word sense disambiguation, a method called sense2vec has been introduced by Trask, Michalak, and Liu. (arXiv:1511.06388). Matthew Honnibal wrote a nice blog entry demonstrating its use.

There are also other related work, such as wang2vec that is more sensitive to word orders.

Big Bang Theory (Season 2, Episode 5): Euclid Alternative

DMV staff: Application?
Sheldon: I’m actually more or a theorist.

Note: feature image taken from Big Bang Theory (CBS).

Continue reading “Word Embedding Algorithms”

Local and Global: Words and Topics

Word2Vec has hit the NLP world for a while, as it is a nice method for word embeddings or word representations. Its use of skip-gram model and deep learning made a big impact too. It has been my favorite toy indeed. However, even though the words do have a correlation across a small segment of text, it is still a local coherence. On the other hand, topic models such as latent Dirichlet allocation (LDA) capture the distribution of words within a topic, and that of topics within a document etc. And it provides a representation of a new document in terms of a topic.

In my previous blog entry, I introduced Chris Moody’s LDA2Vec algorithm (see: his SlideShare). Unfortunately, not many papers or blogs have covered this new algorithm too much despite its potential. The API is not completely well documented yet, although you can see its example from its source code on its Github. In its documentation, it gives an example of deriving topics from an array of random numbers, in its lda2vec/lda2vec.py code:

from lda2vec import LDA2Vec
n_words = 10
n_docs = 15
n_hidden = 8
n_topics = 2
n_obs = 300
words = np.random.randint(n_words, size=(n_obs))
_, counts = np.unique(words, return_counts=True)
model = LDA2Vec(n_words, n_hidden, counts)
model.add_categorical_feature(n_docs, n_topics, name='document id')
doc_ids = np.arange(n_obs) % n_docs
loss = model.fit_partial(words, 1.0, categorical_features=doc_ids)

A more comprehensive example is in examples/twenty_newsgroup/lda.py .

Besides, LDA2Vec, there are some related research work on topical word embeddings too. A group of Australian and American scientists studied about the topic modeling with pre-trained Word2Vec (or GloVe) before performing LDA. (See: their paper and code) On the other hand, another group with Chinese and Singaporean scientists performs LDA, and then trains a Word2Vec model. (See: their paper and code) And LDA2Vec concatenates the Word2Vec and LDA representation, like an early fusion.

No matter what, representations with LDA models (or related topic modeling such as correlated topic models (CTM)) can be useful even outside NLP. I have found it useful at some intermediate layer of calculation lately.

Graphical Representations of LDA model. Source: http://salsahpc.indiana.edu/b649proj/images/proj3_LDA%20structure.png

Continue reading “Local and Global: Words and Topics”

Flesch-Kincaid Readability Measure

In 1970s, long before artificial intelligence and natural language processing becoming hot, there have already been metrics to measure the ease of reading, or readability, a certain text. these metrics were designed in order to limit the difficulty level of government, legal, and commercial documents.

Flesch-Kincaid readability measures, developed by the United States Navy, are some of the popular measures. There are two metrics under this umbrella, namely, Flesch readability ease, and Flesch-Kincaid grade level. Despite their distinction, the intuition of both measures are that a text is more difficult to read if 1) there are more words in a sentence on average, and 2) the words are longer, or have more syllables. It makes #words/#sentences and #syllables/#words important terms in both metrics. The formulae for both metrics are given as:

\text{Flesch readability ease} = 206.835 - 1.015 \frac{\text{number of words}}{\text{number of sentences}} - 84.6 \frac{\text{number of syllables}}{\text{number of words}},

\text{Flesch-Kincaid grade level} = 0.39 \frac{\text{number of words}}{\text{number of sentences}} + 11.8 \frac{\text{number of syllables}}{\text{number of words}} - 15.59.

Therefore, the more difficult the passage is, the lower its Flesch readability ease, and the higher its Flesch-Kincaid grade level.

With the packages of natural language processing, it is not at all difficult to calculate these metrics. We can apply the NLTK library in Python. To calculate the numbers of words and sentences in a text, we need the tokenizers, which can be imported easily.

from nltk.tokenize import sent_tokenize, word_tokenize

And the counts can be easily implemented with the following functions:

not_punctuation = lambda w: not (len(w)==1 and (not w.isalpha()))
get_word_count = lambda text: len(filter(not_punctuation, word_tokenize(text)))
get_sent_count = lambda text: len(sent_tokenize(text))

The first function, not_punctuation, is used to filter out tokens that are not English words. For the number of syllables, we need the Carnegie Mellon University (CMU) Pronouncing Dictionary, which is also included in NLTK:

from nltk.corpus import cmudict
prondict = cmudict.dict()

It would be helpful to go through some examples. This dictionary outputs the pronunciation. For example, by typing prondict[‘apple’], it gives:

[[u'AE1', u'P', u'AH0', u'L']]

Note that the vowels are with a digit at the end. By counting the number of these digits, we retrieve the number of syllables. It would be useful to go through an example of a word with more than one pronunciations, such as prondict[‘orange’] gives:

[[u'AO1', u'R', u'AH0', u'N', u'JH'], [u'AO1', u'R', u'IH0', u'N', u'JH']]

If the word is not in the dictionary, it throws a <pre>KeyError</pre>. We can implement the counting of syllables by the following code:

numsyllables_pronlist = lambda l: len(filter(lambda s: isdigit(s.encode('ascii', 'ignore').lower()[-1]), l))
def numsyllables(word):
    return list(set(map(numsyllables_pronlist, prondict[word.lower()])))
  except KeyError:
    return [0]

For simplicity, if there are more than one pronunciations, I take the largest number of syllables in subsequent calculations. Then the counts of words, sentences, and syllables can be summarized in the following function:

def text_statistics(text):
  word_count = get_word_count(text)
  sent_count = get_sent_count(text)
  syllable_count = sum(map(lambda w: max(numsyllables(w)), word_tokenize(text)))
  return word_count, sent_count, syllable_count

And the two metrics can be implemented as:

flesch_formula = lambda word_count, sent_count, syllable_count : 206.835 - 1.015*word_count/sent_count - 84.6*syllable_count/word_count
def flesch(text):
  word_count, sent_count, syllable_count = text_statistics(text)
  return flesch_formula(word_count, sent_count, syllable_count)

fk_formula = lambda word_count, sent_count, syllable_count : 0.39 * word_count / sent_count + 11.8 * syllable_count / word_count - 15.59
def flesch_kincaid(text):
  word_count, sent_count, syllable_count = text_statistics(text)
  return fk_formula(word_count, sent_count, syllable_count)

Let’s go through a few examples. We can access the text of MacBeth written by William Shakespeare by accessing the Gutenberg corpus:

from nltk.corpus import gutenberg
macbeth = gutenberg.raw('shakespeare-macbeth.txt')
print flesch(macbeth)
print flesch_kincaid(macbeth)

This prints 112.27804859129883, and 0.6579340562875089, respectively, indicating it is easy to understand. The next example is the King James Version (KJV) of the Holy Bible:

kjv = gutenberg.raw('bible-kjv.txt')
print flesch(kjv)
print flesch_kincaid(kjv)

This prints 79.64174894275615, and 9.008527536596926, respectively, implying that it is less easy to understand.

Other metrics include Gunning fox index.

Last month, Matthew Lipson wrote on his blog about the language used by the candidates of the 2016 Presidential Elections for the United States. The metrics introduced can be used as an indication of the literary level of the candidates. In the above two metrics, Hilary Clinton scores the most in readability, and Donald Trump the least.

Continue reading “Flesch-Kincaid Readability Measure”

LDA2Vec: a hybrid of LDA and Word2Vec

Both LDA (latent Dirichlet allocation) and Word2Vec are two important algorithms in natural language processing (NLP). LDA is a widely used topic modeling algorithm, which seeks to find the topic distribution in a corpus, and the corresponding word distributions within each topic, with a prior Dirichlet distribution. Word2Vec is a vector-representation model, trained from RNN (recurrent neural network), to seek a continuous representation for words.

They are both very useful, but LDA deals with words and documents globally, and Word2Vec locally (depending on adjacent words in the training data). A LDA vector is so sparse that the users can interpret the topic easily, but it is inflexible. Word2Vec’s representation is not human-interpretable, but it is easy to use. In his slides, Chris Moody recently devises a topic modeling algorithm, called LDA2Vec, which is a hybrid of the two, to get the best out of the two algorithms.

Honestly, I never used this algorithm. I rarely talk about something I didn’t even try, but I want to raise awareness so that more people know about it when I come to use it. To me, it looks like concatenating two vectors with some hyperparameters, but  the source codes rejects this claim. It is a topic model algorithm.

There are not many blogs or papers talking about LDA2Vec yet. I am looking forward to learning more about it when there are more awareness.

Jupyter Notebook for LDA2Vec Demonstration [link]
Continue reading “LDA2Vec: a hybrid of LDA and Word2Vec”

Computational Journalism

We “sensed” what has been the current hot issues in the past (and we still often do today.) Methods of “sensing,” or “detecting”, is now more sophisticated however as the computational technologies are now more advanced. The methods involved can be collected to a field called “computational journalism.”

Recently, there is a blog post by Jeiran about understanding the public impression about Iran using computational methods. She divided the question into the temporal and topical perspectives. The temporal perspective is about various time-varying patterns of the number of related news articles; the topical perspective is about the distribution of various topics, using latent Dirichlet allocation (LDA), and Bayes’ Theorem. The blog post is worth reading.

In February last year, there was a video clip online that Daeil Kim, a data scientist at New York Times, spoke at NYC Data Science Meetup. Honestly, I still have not watched it yet (but I think I should have.) What his work is also about computational journalism, on his algorithm, and LDA.

Of course, computational journalism is the application of natural language processing and machine learning on news articles… However, as a computational physicist has to know physics, a computational journalist has to know journalism. A data scientist has to be someone who knows the technology and the subject matter.

Continue reading “Computational Journalism”


Text mining can be applied on rap lyrics.

Today I attended an event organized by Data Science MD Meetup Group, a talk titled “Lose Yourself in Rapalytics,” by Abhay, a PhD student in University of Maryland, Baltimore County (UMBC). Rapalytics is an online tool analyzing raps.

It is another task of text mining and natural language processing. He mentioned a few common tools. However, he also specifically looked at rhymes (as rhyme is an important element of rap lyrics), and profanity (as rap music is commonly, or stereotypically, dirty).

Screen Shot 2015-11-19 at 11.51.33 PM.png
Analysis of the use of F*ck word in rap lyrics using Rapalytics

Play with it!

Continue reading “Rapalytics”

Toying with Word2Vec

One fascinating application of deep learning is the training of a model that outputs vectors representing words. A project written in Google, named Word2Vec, is one of the best tools regarding this. The vector representation captures the word contexts and relationships among words. This tool has been changing the landscape of natural language processing (NLP).

Let’s have some demonstration. To use Word2Vec in Python, you need to have the package gensim installed. (Installation instruction: here) And you have to download a trained model (GoogleNews-vectors-negative300.bin.gz), which is 3.6 GB big!! When you get into a Python shell (e.g., IPython), type

from gensim.models.word2vec import Word2Vec
model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

This model enables the user to extract vector representation of length 300 of an English word. So what is so special about this vector representation from the traditional bag-of-words representation? First, the representation is standard. Once trained, we can use it in future training or testing dataset. Second, it captures the context of the word in a way that the algebraic operation of these vectors has meanings.

Here I give 5 examples.

A Juvenile Cat

What is a juvenile cat? We know that a juvenile dog is a puppy. Then we can get it by carry out the algebraic calculation \text{puppy} - \text{dog} + \text{cat} by running

model.most_similar(positive=['puppy', 'cat'], negative=['dog'], topn=5)

This outputs:

[(u'kitten', 0.7634989619255066),
(u'puppies', 0.7110899686813354),
(u'pup', 0.6929495334625244),
(u'kittens', 0.6888389587402344),
(u'cats', 0.6796488761901855)]

which indicates that “kitten” is the answer (correctly!) The numbers are similarities of these words with the vector representation  \text{puppy} - \text{dog} + \text{cat} in descending order. You can verify it by calculating the cosine distance:

from scipy.spatial import distance
print (1-distance.cosine(model['kitten'], model['puppy']+model['cat']-model['dog']))

which outputs 0.763498957413.

Mogu, my cat, three years ago when she was still a kitten
Mogu, my cat, three years ago when she was still a kitten

This demonstration shows that in the model, \text{puppy}-\text{dog} and \text{kitten}-\text{cat} are of similar semantic relations.

Capital of Taiwan

Where is the capital of Taiwan? We can find it if we know the capital of another country. For example, we know that Beijing is the capital of China. Then we can run the following:

model.most_similar(positive=['Beijing', 'Taiwan'], negative=['China'], topn=5)

which outputs

[(u'Taipei', 0.7866502404212952),
(u'Taiwanese', 0.6805002093315125),
(u'Kaohsiung', 0.6034111976623535),
(u'Chen', 0.5905819535255432),
(u'Seoul', 0.5865181684494019)]

Obviously, the answer is “Taipei.” And interestingly, the model sees Taiwan in the same footing of China!

Taipei (taken from Airasia: http://www.airasia.com/mo/en/destinations/taipei.page)

Past Participle of “eat”

We can extract grammatical information too. We know that the past participle of “go” is “gone”. With this, we can find that of “eat” by running:

model.most_similar(positive=[‘gone’, ‘eat’], negative=[‘go’], topn=5)

which outputs:

[(u'eaten', 0.7462186217308044),
(u'eating', 0.6516293287277222),
(u'ate', 0.6457351446151733),
(u'overeaten', 0.5853317975997925),
(u'eats', 0.5830586552619934)]

Capital of the State of Maryland

However, this model does not always work. If it can find the capital of Taiwan, can it find those for any states in the United States? We know that the capital of California is Sacramento. How about Maryland? Let’s run:

model.most_similar(positive=['Sacramento', 'Maryland'], negative=['California'], topn=5)

which sadly outputs:

[(u'Towson', 0.7032245397567749),
(u'Baltimore', 0.6951349973678589),
(u'Hagerstown', 0.6367553472518921),
(u'Anne_Arundel', 0.5931429266929626),
(u'Oxon_Hill', 0.5879474878311157)]

But the correct answer should be Annapolis!

Downtown Annapolis (taken from Wikipedia)

Blue crabs (lunch in Cantler's Riverside Inn, Annapolis, MD)
Blue crabs (lunch in Cantler’s Riverside Inn, Annapolis, MD)

More About Word2Vec

Word2Vec was developed by Tomáš Mikolov. He previously worked for Microsoft Research. However, he switched to Google, and published a few influential works on Word2Vec. [Mikolov, Yih, Zweig 2013] [Mikolov, Sutskever, Chen, Corrado, Dean 2013] [Mikolov, Chen, Corrado, Dean 2013] Their conference paper in 2013 can be found on arXiv. He later published a follow-up work on a package called Doc2Vec that considers phrases. [Le, Mikolov 2014]

Earlier this year, I listened to a talk in DCNLP meetup spoken by Michael Czerny on his award-winning blog entry titled “Modern Methods for Sentiment Analysis.” He applied the vector representations of words by Word2Vec to perform sentiment analysis, assuming that similar sentiments cluster together in the vector space. (He took averages of the vectors in tweets to extract emotions.) [Czerny 2015] I highly recommend you to read his blog entry. On the other hand, Xin Rong wrote an explanation about how Word2Vec works too. [Rong 2014]

There seems to be no progress on the project Word2Vec anymore as Tomáš Mikolov no longer works in Google. However, the Stanford NLP Group recognized that Word2Vec captures the relations between words in their vector representation. They worked on a similar project, called GloVe (Global Vectors), which tackles the problem with matrix factorization. [Pennington, Socher, Manning 2014] Radim Řehůřek did some analysis comparing Word2Vec and GloVe. [Řehůřek 2014] GloVe can be implemented in Python too.

Continue reading “Toying with Word2Vec”

Talking Not So Deep About Deep Learning


On October 14, 2015, I attended the regular meeting of the DCNLP meetup group, a group on natural language processing (NLP) in Washington, DC area. The talk was titled “Deep Learning for Question Answering“, spoken by Mr. Mohit Iyyer, a Ph.D. student in Department of Computer Science, University of Maryland (my alma mater!). He is a very good speaker.

I have no experience on deep learning at all although I did write a blog post remotely related. I even didn’t start training my first neural network until the next day after the talk. However, Mr. Iyyer explained what recurrent neural network (RNN), recursive neural network, and deep averaging network (DAN) are. This helped me a lot in order to understanding more about the principles of the famous word2vec model (which is something I am going to write about soon!). You can refer to his slides for more details. There are really a lot of talents in College Park, like another expert, Joe Yue Hei Ng, who is exploiting deep learning a lot as well.

The applications are awesome: with external knowledge to factual question answering, reasoning-based question answering, and visual question answering, with increasing order of challenging levels.

Mr. Iyyer and the participants discussed a lot about different packages. Mr. Iyyer uses Theano, a Python package for deep learning, which is good for model building and other analytical work. Some prefer Caffe. Some people, who are Java developers, also use deeplearning4j.

Stetsons Famous Bar & Grill (photo from Yelp)

This meetup was a sacred one too, because it is the last time it was held in Stetsons Famous Bar & Grill at U Street, which is going to permanently close on Halloween this year. The group is eagerly looking for a new venue for the upcoming meetup. This meeting was a crowded one. I sincerely thank the organizers, Charlie Greenbacker and Liz Merkhofer, for hosting all these meetings, and Chris Phipps (a linguist from IBM Watson) for recording.


Continue reading “Talking Not So Deep About Deep Learning”

The Legacy of Entropy

Entropy is one of the most fascinating ideas in the history of mathematical sciences.

In Phenomenological Thermodynamics…

Entropy was introduced into thermodynamics in the 19th century. Like the free energies, it describes the state of a thermodynamic system. At the beginning, entropy is merely phenomenological. The physicists found it useful to incorporate the description using entropy in the second law of thermodynamics with clarity and simplicity, instead of describing it as convoluted heat flow (which is what it is originally about) among macroscopic systems (say, the heat flow from the hotter pot of water to the air of the room). It did not carry any statistical meaning at all until 1870s.

In Statistical Physics…

Ludwig Boltzmann (1844-1906)

The statistical meaning of entropy was developed by Ludwig Boltzmann, a pioneer of statistical physics, who studied the connection of the macroscopic thermodynamic behavior to the microscopic components of the system. For example, he described the temperature to be the average of the fluctuating kinetic energy of the particles. And he formulated the entropy to be

S = - k_B \sum_i p_i \log p_i,

where i is the label for each microstate, and k_B is the Boltzmann’s constant. And in a closed system, the total entropy never decreases.

Information Theory and Statistical Physics United

In statistical physics, Boltzmann’s assumption of equal a priori equilibrium properties is an important assumption. However, in 1957, E. T. Jaynes published a paper relating information theory and statistical physics in Physical Review indicating that merely the principle of maximum entropy is sufficient to describe equilibrium statistical system. [Jaynes 1957] In statistical physics, we are aware that systems can be described as canonical ensemble, or a softmax function (normalized exponential), i.e., p_i \propto \exp(-\beta E_i). This can be easily derived by the principle of maximum entropy and the conservation of energy. Or mathematically, the probabilities for all states i with energies E_i can be obtained by maximizing the entropy

S = -\sum_i p_i \log p_i,

under the constraints

\sum_i p_i = 1, and
\sum_i p_i E_i = E,

where E is a constant. The softmax distribution can be obtained by this simple optimization problem, using basic variational calculus (Euler-Lagrange equation) and Lagrange’s multipliers.

The principle of maximum entropy can be found in statistics too. For example, the form of Gaussian distribution can be obtained by maximizing the entropy

S = - \int dx \cdot p(x) \log p(x),

with the knowledge of the mean \mu and the variance \sigma^2, or mathematically speaking, under the constraints,

\int dx \cdot p(x) = 1,
\int dx \cdot x p(x) = \mu, and
\int dx \cdot (x-\mu)^2 p(x) = \sigma^2.

In any statistical systems, the probability distributions can be computed with the principle of maximum entropy, as Jaynes put it [Jaynes 1957]

It is the least biased estimate possible on the given information; i.e., it is maximally noncommittal with regard to missing information.

In statistical physics, entropy is roughly a measure how “chaotic” a system is. In information theory, entropy is a measure how surprising the information is. The smaller the entropy is, the more surprising the information is. And it assumes no additional information. Without constraints other than the normalization, the probability distribution is that all p_i‘s are equal, which is equivalent to the least surprise. Lê Nguyên Hoang, a scientist at Massachusetts Institute of Technology, wrote a good blog post about the meaning of entropy in information theory. [Hoang 2013] In information theory, the entropy is given by

S = -\sum_i p_i \log_2 p_i,

which is different from the thermodynamic entropy by the constant k_B and the coefficient \log 2. The entropies in information theory and statistical physics are equivalent.

Entropy in Natural Language Processing (NLP)

The principle of maximum entropy assumes nothing other than the given information to compute the most optimized probability distribution, which makes it a desirable algorithm in machine learning. It can be regarded as a supervised learning algorithm, with the features being {p, c}, where p is the property calculated, and c is the class. The probability for {p, c} is proportional to \exp(- \alpha \text{\#}({p, c})), where \alpha is the coefficient to be found during training. There are some technical note to compute all these coefficients, which essentially involves solving a system of algebraic equations numerically using techniques such as generalized iterative scaling (GIS).

Does it really assume no additional information? No. The way you construct the features is how you add information. But once the features are defined, the calculation depends on the training data only.

The classifier based on maximum entropy has found its application in part-of-speech (POS) tagging, machine translation (ML), speech recognition, and text mining. A good review was written by Berger and Della Pietra’s. [Berger, Della Pietra, Della Pietra 1996] A lot of open-source softwares provide maximum entropy classifiers, such as Python NLTK and Apache OpenNLP.

In Quantum Computation…

One last word, entropy is used to describe quantum entanglement. A composite bipartite quantum system is said to be entangled if its subsystems must be described in a mixed state, i.e., it must be statistical if one of the subsystems is only considered. Then the entanglement entropy is given by [Nielssen, Chuang 2011]

S = -\sum_i p_i \log p_i,

which is essentially the same formula. The more entangled the system is, the larger the entanglement entropy. However, composite quantum systems tend to decrease their entropy over time though.

Continue reading “The Legacy of Entropy”

Create a free website or blog at WordPress.com.

Up ↑