Previously, I wrote an entry on text mining in R and Python, and compared the two. The text mining package used for R was tm, which has some problems:
The syntax is not natural for experienced R users.
tm uses simple_triplet_matrix from the slam library for the document-term matrix (DTM) and the term co-occurrence matrix (TCM), which is not as widely used as dgCMatrix from the Matrix library.
Tommy Jones, a Ph.D. student at George Mason University and a data scientist at Impact Research, developed an alternative text mining package called textmineR. He presented it at a Stat Prog DC Meetup on April 27, 2016. It employs a better syntax, and dgCMatrix. All in all, it is a wrapper for a lot of existing R packages to facilitate the text mining process, like creating DTMs with stopwords or appropriate stemming/lemmatizing functions. Here is sample code to create a DTM with the example from the previous entry:
library(textmineR)

texts <- c('I love Python.',
           'R is good for analytics.',
           'Mathematics is fun.')

# build the document-term matrix (unigrams only, stop words removed)
dtm <- CreateDtm(texts,
                 doc_names = c(1:length(texts)),
                 ngram_window = c(1, 1),
                 stopword_vec = c(tm::stopwords('english'), tm::stopwords('SMART')),
                 lower = TRUE,
                 remove_punctuation = TRUE,
                 remove_numbers = TRUE)
The DTM is a sparse matrix:
3 x 6 sparse Matrix of class "dgCMatrix"
analytics fun mathematics good python love
1 . . . . 1 1
2 1 . . 1 . .
3 . 1 1 . . .
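For comparison, the same kind of document-term matrix can be sketched in plain Python. This is only an illustration and not how textmineR works internally: the tiny stop-word list and the helper name build_dtm are assumptions made for this example, and the result is a dense list of rows rather than a sparse matrix.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (the R example uses tm's lists).
STOPWORDS = {"i", "is", "for", "r"}


def build_dtm(texts, stopwords=STOPWORDS):
    """Return (vocabulary, rows), where rows[d][j] counts vocabulary[j] in texts[d]."""
    # Lowercase, keep alphabetic tokens only (drops punctuation and numbers),
    # then remove stop words.
    tokenized = [
        [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stopwords]
        for text in texts
    ]
    vocabulary = sorted({w for doc in tokenized for w in doc})
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        rows.append([counts[w] for w in vocabulary])
    return vocabulary, rows


texts = ["I love Python.", "R is good for analytics.", "Mathematics is fun."]
vocabulary, dtm = build_dtm(texts)
print(vocabulary)
for row in dtm:
    print(row)
```

The columns come out in alphabetical order here, so they are arranged differently from the R output, but each document has the same non-zero terms.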
On the other hand, it wraps text2vec, an R package that wraps the word-embedding algorithm named GloVe. And it wraps a number of topic modeling algorithms, such as latent Dirichlet allocation (LDA) and correlated topic models (CTM).
In addition, it contains a parallel computing loop function called TmParallelApply, analogous to the original R parallel loop function mclapply, but TmParallelApply works on Windows as well.
textmineR is an open-source project, with source code available on GitHub, which also contains his example code.
Embedding has been hot in recent years, partly due to the success of Word2Vec (see the demo in my previous entry), although the idea has been around in academia for more than a decade. The idea is to transform a vector of integers into continuous, or embedded, representations. Keras, a Python package that implements neural network models (including ANN, RNN, CNN, etc.) by wrapping Theano or TensorFlow, implements it, as shown in the example below (which maps each of 200 possible integer features to a continuous vector of length 10):
from keras.layers import Embedding
from keras.models import Sequential
# define and compile the embedding model
model = Sequential()
model.add(Embedding(200, 10, input_length=1))
model.compile('rmsprop', 'mse') # optimizer: rmsprop; loss function: mean-squared error
We can then convert any features from 0 to 199 into vectors of length 10, as shown below:
import numpy as np
model.predict(np.array([10, 90, 151]))
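To see what the embedding layer does conceptually, here is a minimal sketch in plain Python. It is not Keras's implementation: the names embedding_table and embed are made up for this illustration, and the rows are random numbers, whereas a real layer would adjust them during training.

```python
import random

random.seed(0)
VOCAB_SIZE, EMBED_DIM = 200, 10

# An embedding layer is essentially a lookup table:
# one row of EMBED_DIM floats for each integer feature from 0 to VOCAB_SIZE - 1.
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]


def embed(indices):
    """Map each integer index to its row in the table."""
    return [embedding_table[i] for i in indices]


vectors = embed([10, 90, 151])
print(len(vectors), len(vectors[0]))
```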
Of course, one must not omit a similar algorithm called GloVe, developed by the Stanford NLP group. Their code has been wrapped in both Python (a package called glove) and R (a library called text2vec).
Besides Word2Vec, there are other word embedding algorithms that try to complement Word2Vec, although many of them are more computationally costly. In my previous entry, I introduced LDA2Vec, an algorithm that combines the locality of words and their global distribution in the corpus. In fact, word embedding algorithms with similar ideas have also been invented by other scientists, as I have introduced in another entry.
However, more word embedding algorithms are coming out. Since most English words carry more than a single sense, different senses of a word might be best represented by different embedded vectors. Incorporating word sense disambiguation, a method called sense2vec has been introduced by Trask, Michalak, and Liu (arXiv:1511.06388). Matthew Honnibal wrote a nice blog entry demonstrating its use.
There is also other related work, such as wang2vec, which is more sensitive to word order.
Big Bang Theory (Season 2, Episode 5): Euclid Alternative
DMV staff: Application? Sheldon: I’m actually more of a theorist.
Note: feature image taken from Big Bang Theory (CBS).
Word2Vec has been making waves in the NLP world for a while, as it is a nice method for word embeddings or word representations. Its use of the skip-gram model and deep learning made a big impact too. It has been my favorite toy indeed. However, even though the words do have a correlation across a small segment of text, it is still a local coherence. On the other hand, topic models such as latent Dirichlet allocation (LDA) capture the distribution of words within a topic, and that of topics within a document, etc. And they provide a representation of a new document in terms of topics.
In my previous blog entry, I introduced Chris Moody’s LDA2Vec algorithm (see: his SlideShare). Unfortunately, not many papers or blogs have covered this new algorithm despite its potential. The API is not well documented yet, although you can see examples in its source code on GitHub. Its documentation gives an example of deriving topics from an array of random numbers, in its lda2vec/lda2vec.py code.
A more comprehensive example is in examples/twenty_newsgroup/lda.py.
Besides LDA2Vec, there is some related research on topical word embeddings too. A group of Australian and American scientists studied topic modeling with pre-trained Word2Vec (or GloVe) before performing LDA. (See: their paper and code.) On the other hand, another group of Chinese and Singaporean scientists performs LDA first, and then trains a Word2Vec model. (See: their paper and code.) And LDA2Vec concatenates the Word2Vec and LDA representations, like an early fusion.
No matter what, representations with LDA models (or related topic modeling such as correlated topic models (CTM)) can be useful even outside NLP. I have found it useful at some intermediate layer of calculation lately.
In the 1970s, long before artificial intelligence and natural language processing became hot, there were already metrics to measure the ease of reading, or readability, of a certain text. These metrics were designed in order to limit the difficulty level of government, legal, and commercial documents.
Flesch-Kincaid readability measures, developed by the United States Navy, are some of the most popular measures. There are two metrics under this umbrella, namely the Flesch reading ease and the Flesch-Kincaid grade level. Despite their differences, the intuition behind both measures is that a text is more difficult to read if 1) there are more words in a sentence on average, and 2) the words are longer, or have more syllables. This makes #words/#sentences and #syllables/#words the important terms in both metrics. The formulae for both metrics are given as:
$$\text{Flesch reading ease} = 206.835 - 1.015 \, \frac{\#\text{words}}{\#\text{sentences}} - 84.6 \, \frac{\#\text{syllables}}{\#\text{words}}$$
$$\text{Flesch-Kincaid grade level} = 0.39 \, \frac{\#\text{words}}{\#\text{sentences}} + 11.8 \, \frac{\#\text{syllables}}{\#\text{words}} - 15.59$$
Therefore, the more difficult the passage is, the lower its Flesch reading ease, and the higher its Flesch-Kincaid grade level.
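The two measures can be written as short functions of the three counts. This is a minimal sketch that uses the standard published coefficients; the function names and the sample counts are made up for illustration.

```python
def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Higher values mean the text is easier to read."""
    return 206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)


def flesch_kincaid_grade(n_words, n_sentences, n_syllables):
    """Approximate U.S. school grade needed to read the text."""
    return 0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59


# Hypothetical counts: short sentences with short words vs. long ones.
easy = (100, 10, 120)   # words, sentences, syllables
hard = (100, 4, 180)

for label, counts in (("easy", easy), ("hard", hard)):
    print(label, flesch_reading_ease(*counts), flesch_kincaid_grade(*counts))
```

With these made-up counts, the passage with longer sentences and longer words gets the lower reading ease and the higher grade level, as expected.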
With natural language processing packages, it is not at all difficult to calculate these metrics. We can use the NLTK library in Python. To count the words and sentences in a text, we need the tokenizers, which can be imported easily.
from nltk.tokenize import sent_tokenize, word_tokenize
And the counts can be easily implemented with the following functions:
not_punctuation = lambda w: not (len(w)==1 and (not w.isalpha()))
# list() is needed in Python 3, where filter returns an iterator
get_word_count = lambda text: len(list(filter(not_punctuation, word_tokenize(text))))
get_sent_count = lambda text: len(sent_tokenize(text))
from nltk.corpus import cmudict
prondict = cmudict.dict()
It would be helpful to go through some examples. This dictionary outputs the pronunciation. For example, typing prondict['apple'] gives:
[[u'AE1', u'P', u'AH0', u'L']]
Note that each vowel ends with a digit. By counting these digits, we get the number of syllables. It is also useful to look at a word with more than one pronunciation, such as prondict['orange'], which gives:
For simplicity, if there is more than one pronunciation, I take the largest number of syllables in subsequent calculations. Then the counts of words, sentences, and syllables can be summarized in the following function:
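A minimal sketch of such a counting function is given below. To keep it self-contained, it makes two assumptions: a simple regular expression stands in for the NLTK tokenizers, and the pronunciation dictionary is passed in as an argument, so that either cmudict.dict() or a small hand-written stand-in (toy_prondict here, which is not real CMU data) can be used. The function names are made up for this example.

```python
import re


def count_syllables(word, prondict):
    """Largest syllable count over all pronunciations; 0 if the word is unknown."""
    pronunciations = prondict.get(word.lower(), [])
    # Vowel phonemes end with a stress digit, so counting them counts syllables.
    return max(
        (sum(1 for phoneme in pron if phoneme[-1].isdigit()) for pron in pronunciations),
        default=0,
    )


def text_counts(text, prondict):
    """Return (number of words, number of sentences, number of syllables)."""
    # Simplified tokenization; NLTK's sent_tokenize and word_tokenize are more robust.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w, prondict) for w in words)
    return len(words), len(sentences), syllables


# Hand-written stand-in with the same shape as cmudict.dict().
toy_prondict = {
    "i": [["AY1"]],
    "love": [["L", "AH1", "V"]],
    "python": [["P", "AY1", "TH", "AA0", "N"]],
}
print(text_counts("I love Python.", toy_prondict))
```

Words missing from the dictionary contribute zero syllables in this sketch, which would bias the metrics for texts with many rare words.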
Last month, Matthew Lipson wrote on his blog about the language used by the candidates in the 2016 United States presidential election. The metrics introduced can be used as an indication of the literary level of the candidates. By the above two metrics, Hillary Clinton scores the highest in readability, and Donald Trump the lowest.
LDA (latent Dirichlet allocation) and Word2Vec are two important algorithms in natural language processing (NLP). LDA is a widely used topic modeling algorithm, which seeks to find the topic distribution in a corpus, and the corresponding word distributions within each topic, with a prior Dirichlet distribution. Word2Vec is a vector-representation model, trained with a shallow neural network, that seeks a continuous representation for words.
They are both very useful, but LDA deals with words and documents globally, and Word2Vec locally (depending on adjacent words in the training data). An LDA vector is so sparse that users can interpret the topic easily, but it is inflexible. Word2Vec’s representation is not human-interpretable, but it is easy to use. In his slides, Chris Moody recently devised a topic modeling algorithm, called LDA2Vec, which is a hybrid of the two, to get the best of both algorithms.
Honestly, I have never used this algorithm. I rarely talk about something I have not even tried, but I want to raise awareness so that more people know about it by the time I come to use it. To me, it looks like concatenating two vectors with some hyperparameters, but the source code does not support this view. It is a topic model algorithm.
There are not many blogs or papers talking about LDA2Vec yet. I am looking forward to learning more about it when there is more awareness.
We “sensed” what the hot issues were in the past (and we still often do today). Methods of “sensing,” or “detecting,” are now more sophisticated, however, as computational technologies are more advanced. The methods involved can be collected into a field called “computational journalism.”
Recently, there was a blog post by Jeiran about understanding the public impression of Iran using computational methods. She divided the question into the temporal and topical perspectives. The temporal perspective is about various time-varying patterns in the number of related news articles; the topical perspective is about the distribution of various topics, using latent Dirichlet allocation (LDA) and Bayes’ theorem. The blog post is worth reading.
In February last year, a video clip was posted online of Daeil Kim, a data scientist at the New York Times, speaking at the NYC Data Science Meetup. Honestly, I have not watched it yet (but I think I should). His work is also about computational journalism, covering his algorithm and LDA.
Of course, computational journalism is the application of natural language processing and machine learning on news articles… However, as a computational physicist has to know physics, a computational journalist has to know journalism. A data scientist has to be someone who knows the technology and the subject matter.
It is another task of text mining and natural language processing. He mentioned a few common tools. However, he also specifically looked at rhymes (as rhyme is an important element of rap lyrics), and profanity (as rap music is commonly, or stereotypically, dirty).
One fascinating application of deep learning is the training of a model that outputs vectors representing words. A project written at Google, named Word2Vec, is one of the best tools for this. The vector representation captures the word contexts and the relationships among words. This tool has been changing the landscape of natural language processing (NLP).
Let’s have a demonstration. To use Word2Vec in Python, you need to have the package gensim installed. (Installation instruction: here) And you have to download a trained model (GoogleNews-vectors-negative300.bin.gz), which is 3.6 GB big!! When you get into a Python shell (e.g., IPython), type
from gensim.models import KeyedVectors
# older gensim releases exposed this loader as Word2Vec.load_word2vec_format
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
This model enables the user to extract a vector representation of length 300 for an English word. So what is so special about this vector representation compared with the traditional bag-of-words representation? First, the representation is standard. Once trained, we can use it on future training or testing datasets. Second, it captures the context of the word in such a way that algebraic operations on these vectors are meaningful.
Here I give 5 examples.
A Juvenile Cat
What is a juvenile cat? We know that a juvenile dog is a puppy. Then we can find the answer by carrying out the algebraic calculation, by running
which indicates that “kitten” is the answer (correctly!). The numbers are the similarities of these words to the resulting vector, in descending order. You can verify it by calculating the cosine similarity:
from scipy.spatial import distance
print (1-distance.cosine(model['kitten'], model['puppy']+model['cat']-model['dog']))
which outputs 0.763498957413.
This demonstration shows that in the model, $\text{puppy} - \text{dog}$ and $\text{kitten} - \text{cat}$ are of similar semantic relations.
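The vector arithmetic behind this analogy can be illustrated without downloading the model. The sketch below uses hand-made two-dimensional vectors (an assumption for illustration only; real Word2Vec vectors have 300 learned dimensions) in which one axis loosely stands for "feline vs. canine" and the other for "juvenile vs. adult".

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors: [feline (+) / canine (-), juvenile (+) / adult (-)].
toy_vectors = {
    "dog": [-1.0, -1.0],
    "puppy": [-1.0, 1.0],
    "cat": [1.0, -1.0],
    "kitten": [1.0, 1.0],
    "wolf": [-1.5, -0.5],
}

inputs = ("puppy", "cat", "dog")
# puppy + cat - dog, computed component by component
query = [
    p + c - d
    for p, c, d in zip(toy_vectors["puppy"], toy_vectors["cat"], toy_vectors["dog"])
]

# Rank the remaining words by cosine similarity to the query vector.
ranked = sorted(
    (
        (cosine_similarity(query, vec), word)
        for word, vec in toy_vectors.items()
        if word not in inputs
    ),
    reverse=True,
)
print(ranked)
```

In this toy space the query vector coincides with the vector for "kitten", so it ranks first. In the real model the match is only approximate, which is why the similarity reported above is below 1.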
Capital of Taiwan
Where is the capital of Taiwan? We can find it if we know the capital of another country. For example, we know that Beijing is the capital of China. Then we can run the following:
However, this model does not always work. If it can find the capital of Taiwan, can it find the capitals of states in the United States? We know that the capital of California is Sacramento. How about Maryland? Let’s run:
Word2Vec was developed by Tomáš Mikolov. He previously worked for Microsoft Research. However, he switched to Google, and published a few influential works on Word2Vec. [Mikolov, Yih, Zweig 2013] [Mikolov, Sutskever, Chen, Corrado, Dean 2013] [Mikolov, Chen, Corrado, Dean 2013] Their conference paper in 2013 can be found on arXiv. He later published a follow-up work on a method called Doc2Vec that considers sentences and documents. [Le, Mikolov 2014]
Earlier this year, I listened to a talk at the DCNLP meetup given by Michael Czerny on his award-winning blog entry titled “Modern Methods for Sentiment Analysis.” He applied the vector representations of words by Word2Vec to perform sentiment analysis, assuming that similar sentiments cluster together in the vector space. (He took averages of the vectors in tweets to extract emotions.) [Czerny 2015] I highly recommend that you read his blog entry. On the other hand, Xin Rong wrote an explanation of how Word2Vec works too. [Rong 2014]
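The averaging step mentioned above can be sketched as follows. This is not Czerny's code: the function name, the two-dimensional toy vectors, and the choice to skip words without a vector are all assumptions made for illustration.

```python
def average_vector(tokens, vectors):
    """Average the vectors of the tokens that have one; None if none do."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[k] for v in known) / len(known) for k in range(dim)]


# Hypothetical 2-d word vectors; real Word2Vec vectors would have 300 dimensions.
toy_vectors = {"good": [1.0, 0.0], "great": [0.8, 0.2], "bad": [-1.0, 0.0]}

tweet = ["good", "great", "movie"]   # "movie" has no vector and is skipped
print(average_vector(tweet, toy_vectors))
```

The averaged vector can then be fed to any standard classifier as the feature vector of the tweet.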
There seems to be no progress on the Word2Vec project anymore, as Tomáš Mikolov no longer works at Google. However, the Stanford NLP Group recognized that Word2Vec captures the relations between words in their vector representation. They worked on a similar project, called GloVe (Global Vectors), which tackles the problem with matrix factorization. [Pennington, Socher, Manning 2014] Radim Řehůřek did some analysis comparing Word2Vec and GloVe. [Řehůřek 2014] GloVe can be used in Python too.
On October 14, 2015, I attended the regular meeting of the DCNLP meetup group, a group on natural language processing (NLP) in the Washington, DC area. The talk, titled “Deep Learning for Question Answering,” was given by Mr. Mohit Iyyer, a Ph.D. student in the Department of Computer Science, University of Maryland (my alma mater!). He is a very good speaker.
I have no experience with deep learning at all, although I did write a blog post remotely related to it. I did not even start training my first neural network until the day after the talk. However, Mr. Iyyer explained what recurrent neural networks (RNN), recursive neural networks, and deep averaging networks (DAN) are. This helped me a lot in understanding more about the principles of the famous word2vec model (which is something I am going to write about soon!). You can refer to his slides for more details. There is really a lot of talent in College Park, like another expert, Joe Yue Hei Ng, who is exploiting deep learning a lot as well.
The applications are awesome: factual question answering with external knowledge, reasoning-based question answering, and visual question answering, in increasing order of difficulty.
Mr. Iyyer and the participants discussed a lot about different packages. Mr. Iyyer uses Theano, a Python package for deep learning, which is good for model building and other analytical work. Some prefer Caffe. Some people, who are Java developers, also use deeplearning4j.
This meetup was a sacred one too, because it was the last time it was held at Stetsons Famous Bar & Grill on U Street, which is going to close permanently on Halloween this year. The group is eagerly looking for a new venue for the upcoming meetup. This meeting was a crowded one. I sincerely thank the organizers, Charlie Greenbacker and Liz Merkhofer, for hosting all these meetings, and Chris Phipps (a linguist from IBM Watson) for recording.
Entropy is one of the most fascinating ideas in the history of mathematical sciences.
In Phenomenological Thermodynamics…
Entropy was introduced into thermodynamics in the 19th century. Like the free energies, it describes the state of a thermodynamic system. At the beginning, entropy was merely phenomenological. Physicists found it useful to state the second law of thermodynamics in terms of entropy, with clarity and simplicity, instead of describing it as convoluted heat flow (which is what it was originally about) among macroscopic systems (say, the heat flow from a hotter pot of water to the air of the room). It did not carry any statistical meaning at all until the 1870s.
In Statistical Physics…
Ludwig Boltzmann (1844-1906)
The statistical meaning of entropy was developed by Ludwig Boltzmann, a pioneer of statistical physics, who studied the connection of the macroscopic thermodynamic behavior to the microscopic components of the system. For example, he described the temperature to be the average of the fluctuating kinetic energy of the particles. And he formulated the entropy to be
$$S = -k_B \sum_i p_i \log p_i,$$
where $i$ is the label for each microstate, $p_i$ is the probability of microstate $i$, and $k_B$ is Boltzmann’s constant. And in a closed system, the total entropy never decreases.
Information Theory and Statistical Physics United
In statistical physics, Boltzmann’s assumption of equal a priori probabilities is an important assumption. However, in 1957, E. T. Jaynes published a paper in Physical Review relating information theory and statistical physics, indicating that the principle of maximum entropy alone is sufficient to describe equilibrium statistical systems. [Jaynes 1957] In statistical physics, we are aware that systems can be described by the canonical ensemble, or a softmax function (normalized exponential), i.e., $p_i \propto \exp\left(-\frac{E_i}{k_B T}\right)$. This can be easily derived from the principle of maximum entropy and the conservation of energy. Or mathematically, the probabilities $p_i$ for all states $i$ with energies $E_i$ can be obtained by maximizing the entropy
$$S = -k_B \sum_i p_i \log p_i$$
under the constraints
$$\sum_i p_i = 1, \qquad \sum_i p_i E_i = E,$$
where $E$ is a constant. The softmax distribution can be obtained from this simple optimization problem, using basic variational calculus (Euler-Lagrange equation) and Lagrange multipliers.
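The optimization can be sketched in a few lines. The multipliers $\alpha$ and $\beta$ and the normalization $Z$ are symbols introduced here for the sketch, and $\beta$ is identified with $1/T$ only at the end, by comparison with the canonical ensemble.

```latex
% Lagrangian: entropy plus one multiplier per constraint
\mathcal{L} = -k_B \sum_i p_i \log p_i
              - \alpha \Big( \sum_i p_i - 1 \Big)
              - \beta \Big( \sum_i p_i E_i - E \Big)

% Stationarity with respect to each p_i
\frac{\partial \mathcal{L}}{\partial p_i}
  = -k_B \left( \log p_i + 1 \right) - \alpha - \beta E_i = 0
\quad \Longrightarrow \quad
p_i = \exp\!\left( -1 - \frac{\alpha}{k_B} \right) \exp\!\left( -\frac{\beta E_i}{k_B} \right)

% Normalization fixes alpha; comparing with the canonical ensemble gives beta = 1/T
p_i = \frac{1}{Z} \exp\!\left( -\frac{E_i}{k_B T} \right),
\qquad
Z = \sum_j \exp\!\left( -\frac{E_j}{k_B T} \right)
```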
The principle of maximum entropy can be found in statistics too. For example, the form of the Gaussian distribution can be obtained by maximizing the entropy
$$S = -\int p(x) \log p(x) \, dx$$
with the knowledge of the mean $\mu$ and the variance $\sigma^2$, or mathematically speaking, under the constraints
$$\int p(x) \, dx = 1, \qquad \int x \, p(x) \, dx = \mu, \qquad \int (x - \mu)^2 \, p(x) \, dx = \sigma^2.$$
In any statistical system, the probability distribution can be computed with the principle of maximum entropy, as Jaynes put it: [Jaynes 1957]
It is the least biased estimate possible on the given information; i.e., it is maximally noncommittal with regard to missing information.
In statistical physics, entropy is roughly a measure of how “chaotic” a system is. In information theory, entropy is a measure of how surprising the information is. The smaller the entropy is, the more surprising the information is. And it assumes no additional information. Without constraints other than the normalization, the probability distribution is that all $p_i$’s are equal, which is equivalent to the least surprise. Lê Nguyên Hoang, a scientist at Massachusetts Institute of Technology, wrote a good blog post about the meaning of entropy in information theory. [Hoang 2013] In information theory, the entropy is given by
$$H = -\sum_i p_i \log_2 p_i,$$
which differs from the thermodynamic entropy by the constant $k_B$ and the coefficient $\log 2$. The entropies in information theory and statistical physics are equivalent.
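As a small numerical illustration, the information-theoretic entropy in bits can be computed as below. The function name and the example distributions are made up for this sketch. It shows that, with normalization as the only constraint, the uniform distribution has the largest entropy among the examples.

```python
import math


def shannon_entropy(probabilities):
    """Entropy in bits; outcomes with zero probability contribute nothing."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]
certain = [1.0, 0.0, 0.0, 0.0]

for name, dist in (("uniform", uniform), ("skewed", skewed), ("certain", certain)):
    print(name, shannon_entropy(dist))
```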
Entropy in Natural Language Processing (NLP)
The principle of maximum entropy assumes nothing other than the given information to compute the most optimized probability distribution, which makes it a desirable algorithm in machine learning. It can be regarded as a supervised learning algorithm, with the features being $(p, c)$, where $p$ is the property calculated, and $c$ is the class. The probability for $(p, c)$ is proportional to $\exp\left(\alpha_{p, c}\right)$, where $\alpha_{p, c}$ is the coefficient to be found during training. There are some technical notes on computing all these coefficients, which essentially involves solving a system of algebraic equations numerically using techniques such as generalized iterative scaling (GIS).
Does it really assume no additional information? No. The way you construct the features is how you add information. But once the features are defined, the calculation depends on the training data only.
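To make the form of the classifier concrete, here is a minimal sketch of the prediction step only. The weights are set by hand and are purely hypothetical; in practice they would be estimated from training data (for example with GIS). The feature names and the function name are also made up for illustration.

```python
import math


def maxent_probabilities(properties, classes, weights):
    """P(c | properties) is proportional to exp(sum of weights[(p, c)] over active p)."""
    scores = {
        c: math.exp(sum(weights.get((p, c), 0.0) for p in properties))
        for c in classes
    }
    total = sum(scores.values())
    return {c: score / total for c, score in scores.items()}


# Hypothetical hand-set coefficients for (property, class) features.
weights = {
    ("ends_with_ing", "VERB"): 1.5,
    ("ends_with_ing", "NOUN"): 0.3,
    ("capitalized", "NOUN"): 1.0,
}

probs = maxent_probabilities(["ends_with_ing"], ["NOUN", "VERB"], weights)
print(probs)
```

Normalizing the exponentials over the classes gives the same softmax form as in statistical physics, with the coefficients playing the role of (negative) energies.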
The classifier based on maximum entropy has found applications in part-of-speech (POS) tagging, machine translation (MT), speech recognition, and text mining. A good review was written by Berger and the Della Pietra brothers. [Berger, Della Pietra, Della Pietra 1996] A lot of open-source software packages provide maximum entropy classifiers, such as Python NLTK and Apache OpenNLP.
In Quantum Computation…
One last word: entropy is also used to describe quantum entanglement. A composite bipartite quantum system is said to be entangled if its subsystems must be described by mixed states, i.e., the description must be statistical if only one of the subsystems is considered. Then the entanglement entropy is given by [Nielsen, Chuang 2011]
$$S = -\sum_i p_i \log p_i,$$
where the $p_i$ are the eigenvalues of the reduced density matrix of one subsystem,
which is essentially the same formula. The more entangled the system is, the larger the entanglement entropy. However, composite quantum systems tend to decrease their entropy over time.
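For the simplest case of two qubits in a pure state, the entanglement entropy can be computed by hand. The sketch below assumes the state is written as a|00> + b|01> + c|10> + d|11> and uses the fact that the reduced density matrix of one qubit is a 2 x 2 matrix with unit trace, so its eigenvalues follow from its determinant. The function name and the use of bits (log base 2) are choices made for this illustration.

```python
import math


def entanglement_entropy(a, b, c, d):
    """Entropy (in bits) of one qubit of the pure state a|00> + b|01> + c|10> + d|11>."""
    norm = abs(a) ** 2 + abs(b) ** 2 + abs(c) ** 2 + abs(d) ** 2
    # The reduced density matrix has trace 1 and determinant |ad - bc|^2 / norm^2,
    # so its two eigenvalues are (1 +/- sqrt(1 - 4 det)) / 2.
    det = abs(a * d - b * c) ** 2 / norm ** 2
    root = math.sqrt(max(0.0, 1.0 - 4.0 * det))  # clamp tiny negative round-off
    eigenvalues = [(1.0 + root) / 2.0, (1.0 - root) / 2.0]
    return sum(-p * math.log2(p) for p in eigenvalues if p > 0)


s = 1 / math.sqrt(2)
print(entanglement_entropy(1, 0, 0, 0))   # product state |00>
print(entanglement_entropy(s, 0, 0, s))   # Bell state (|00> + |11>) / sqrt(2)
```

The product state gives zero entropy, while the Bell state, which is maximally entangled, gives one bit (up to floating-point round-off).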