textmineR: a New Text Mining Package for R

Previously, I wrote an entry on text mining on R and Python, and did a comparison. However, the text mining package employed was tm for R. But it has some problems:

  1. The syntax is not natural for an experienced R users.
  2. tm uses simple_triplet_matrix from the slam library for document-term matrix (DTM) and term-occurrence matrix (TCM), which is not as widely used as dgCMatrix from the Matrix library.

Tommy Jones, a Ph.D. student in George Mason University, and a data scientist at Impact Research, developed an alternative text mining package called textmineR. He presented in a Stat Prog DC Meetup on April 27, 2016. It employed a better syntax, and dgCMatrix. All in all, it is a wrapper for a lot of existing R packages to facilitate the text mining process, like creating DTM matrices with stopwords or appropriate stemming/lemmatizing functions. Here is a sample code to create a DTM with the example from the previous entry:

library(tm)
library(textmineR)

texts <- c('I love Python.',
           'R is good for analytics.',
           'Mathematics is fun.')

dtm<-CreateDtm(texts,
               doc_names = c(1:length(texts)),
               ngram_window = c(1, 1),
               stopword_vec = c(tm::stopwords('english'), tm::stopwords('SMART')),
               lower = TRUE,
               remove_punctuation = TRUE,
               remove_numbers = TRUE
               )

The DTM is a sparse matrix:

3 x 6 sparse Matrix of class &amp;quot;dgCMatrix&amp;quot;
  analytics fun mathematics good python love
1         .   .           .    .      1    1
2         1   .           .    1      .    .
3         .   1           1    .      .    .

On the other hand, it wraps text2vec, an R package that wraps the word-embedding algorithm named gloVe. And it wraps a number of topic modeling algorithms, such as latent Dirichlet allocation (LDA) and correlated topic models (CTM).

In addition, it contains a parallel computing loop function called TmParallelApply, analogous to the original R parallel loop function mclapply, but TmParallelApply works on Windows as well.

textmineR is an open-source project, with source code available on github, which contains his example codes.

Continue reading “textmineR: a New Text Mining Package for R”

Word Embedding Algorithms

Embedding has been hot in recent years partly due to the success of Word2Vec, (see demo in my previous entry) although the idea has been around in academia for more than a decade. The idea is to transform a vector of integers into continuous, or embedded, representations. Keras, a Python package that implements neural network models (including the ANN, RNN, CNN etc.) by wrapping Theano or TensorFlow, implemented it, as shown in the example below (which converts a vector of 200 features into a continuous vector of 10):

from keras.layers import Embedding
from keras.models import Sequential

# define and compile the embedding model
model = Sequential()
model.add(Embedding(200, 10, input_length=1))
model.compile('rmsprop', 'mse')  # optimizer: rmsprop; loss function: mean-squared error

We can then convert any features from 0 to 199 into vectors of 20, as shown below:

import numpy as np

model.predict(np.array([10, 90, 151]))

It outputs:

array([[[ 0.02915354,  0.03084954, -0.04160764, -0.01752155, -0.00056815,
         -0.02512387, -0.02073313, -0.01154278, -0.00389587, -0.04596512]],

       [[ 0.02981793, -0.02618774,  0.04137352, -0.04249889,  0.00456919,
          0.04393572,  0.04139435,  0.04415271,  0.02636364, -0.04997493]],

       [[ 0.00947296, -0.01643104, -0.03241419, -0.01145032,  0.03437041,
          0.00386361, -0.03124221, -0.03837727, -0.04804075, -0.01442516]]])

Of course, one must not omit a similar algorithm called GloVe, developed by the Stanford NLP group. Their codes have been wrapped in both Python (package called glove) and R (library called text2vec).

Besides Word2Vec, there are other word embedding algorithms that try to complement Word2Vec, although many of them are more computationally costly. Previously, I introduced LDA2Vec in my previous entry, an algorithm that combines the locality of words and their global distribution in the corpus. And in fact, word embedding algorithms with a similar ideas are also invented by other scientists, as I have introduced in another entry.

However, there are word embedding algorithms coming out. Since most English words carry more than a single sense, different senses of a word might be best represented by different embedded vectors. Incorporating word sense disambiguation, a method called sense2vec has been introduced by Trask, Michalak, and Liu. (arXiv:1511.06388). Matthew Honnibal wrote a nice blog entry demonstrating its use.

There are also other related work, such as wang2vec that is more sensitive to word orders.

Big Bang Theory (Season 2, Episode 5): Euclid Alternative

DMV staff: Application?
Sheldon: I’m actually more or a theorist.

Note: feature image taken from Big Bang Theory (CBS).

Continue reading “Word Embedding Algorithms”

Computational Journalism

We “sensed” what has been the current hot issues in the past (and we still often do today.) Methods of “sensing,” or “detecting”, is now more sophisticated however as the computational technologies are now more advanced. The methods involved can be collected to a field called “computational journalism.”

Recently, there is a blog post by Jeiran about understanding the public impression about Iran using computational methods. She divided the question into the temporal and topical perspectives. The temporal perspective is about various time-varying patterns of the number of related news articles; the topical perspective is about the distribution of various topics, using latent Dirichlet allocation (LDA), and Bayes’ Theorem. The blog post is worth reading.

In February last year, there was a video clip online that Daeil Kim, a data scientist at New York Times, spoke at NYC Data Science Meetup. Honestly, I still have not watched it yet (but I think I should have.) What his work is also about computational journalism, on his algorithm, and LDA.

Of course, computational journalism is the application of natural language processing and machine learning on news articles… However, as a computational physicist has to know physics, a computational journalist has to know journalism. A data scientist has to be someone who knows the technology and the subject matter.

Continue reading “Computational Journalism”

Rapalytics

Text mining can be applied on rap lyrics.

Today I attended an event organized by Data Science MD Meetup Group, a talk titled “Lose Yourself in Rapalytics,” by Abhay, a PhD student in University of Maryland, Baltimore County (UMBC). Rapalytics is an online tool analyzing raps.

It is another task of text mining and natural language processing. He mentioned a few common tools. However, he also specifically looked at rhymes (as rhyme is an important element of rap lyrics), and profanity (as rap music is commonly, or stereotypically, dirty).

Screen Shot 2015-11-19 at 11.51.33 PM.png
Analysis of the use of F*ck word in rap lyrics using Rapalytics

Play with it!

Continue reading “Rapalytics”

The Legacy of Entropy

Entropy is one of the most fascinating ideas in the history of mathematical sciences.

In Phenomenological Thermodynamics…

Entropy was introduced into thermodynamics in the 19th century. Like the free energies, it describes the state of a thermodynamic system. At the beginning, entropy is merely phenomenological. The physicists found it useful to incorporate the description using entropy in the second law of thermodynamics with clarity and simplicity, instead of describing it as convoluted heat flow (which is what it is originally about) among macroscopic systems (say, the heat flow from the hotter pot of water to the air of the room). It did not carry any statistical meaning at all until 1870s.

In Statistical Physics…


Ludwig Boltzmann (1844-1906)

The statistical meaning of entropy was developed by Ludwig Boltzmann, a pioneer of statistical physics, who studied the connection of the macroscopic thermodynamic behavior to the microscopic components of the system. For example, he described the temperature to be the average of the fluctuating kinetic energy of the particles. And he formulated the entropy to be

S = - k_B \sum_i p_i \log p_i,

where i is the label for each microstate, and k_B is the Boltzmann’s constant. And in a closed system, the total entropy never decreases.

Information Theory and Statistical Physics United

In statistical physics, Boltzmann’s assumption of equal a priori equilibrium properties is an important assumption. However, in 1957, E. T. Jaynes published a paper relating information theory and statistical physics in Physical Review indicating that merely the principle of maximum entropy is sufficient to describe equilibrium statistical system. [Jaynes 1957] In statistical physics, we are aware that systems can be described as canonical ensemble, or a softmax function (normalized exponential), i.e., p_i \propto \exp(-\beta E_i). This can be easily derived by the principle of maximum entropy and the conservation of energy. Or mathematically, the probabilities for all states i with energies E_i can be obtained by maximizing the entropy

S = -\sum_i p_i \log p_i,

under the constraints

\sum_i p_i = 1, and
\sum_i p_i E_i = E,

where E is a constant. The softmax distribution can be obtained by this simple optimization problem, using basic variational calculus (Euler-Lagrange equation) and Lagrange’s multipliers.

The principle of maximum entropy can be found in statistics too. For example, the form of Gaussian distribution can be obtained by maximizing the entropy

S = - \int dx \cdot p(x) \log p(x),

with the knowledge of the mean \mu and the variance \sigma^2, or mathematically speaking, under the constraints,

\int dx \cdot p(x) = 1,
\int dx \cdot x p(x) = \mu, and
\int dx \cdot (x-\mu)^2 p(x) = \sigma^2.

In any statistical systems, the probability distributions can be computed with the principle of maximum entropy, as Jaynes put it [Jaynes 1957]

It is the least biased estimate possible on the given information; i.e., it is maximally noncommittal with regard to missing information.

In statistical physics, entropy is roughly a measure how “chaotic” a system is. In information theory, entropy is a measure how surprising the information is. The smaller the entropy is, the more surprising the information is. And it assumes no additional information. Without constraints other than the normalization, the probability distribution is that all p_i‘s are equal, which is equivalent to the least surprise. Lê Nguyên Hoang, a scientist at Massachusetts Institute of Technology, wrote a good blog post about the meaning of entropy in information theory. [Hoang 2013] In information theory, the entropy is given by

S = -\sum_i p_i \log_2 p_i,

which is different from the thermodynamic entropy by the constant k_B and the coefficient \log 2. The entropies in information theory and statistical physics are equivalent.

Entropy in Natural Language Processing (NLP)

The principle of maximum entropy assumes nothing other than the given information to compute the most optimized probability distribution, which makes it a desirable algorithm in machine learning. It can be regarded as a supervised learning algorithm, with the features being {p, c}, where p is the property calculated, and c is the class. The probability for {p, c} is proportional to \exp(- \alpha \text{\#}({p, c})), where \alpha is the coefficient to be found during training. There are some technical note to compute all these coefficients, which essentially involves solving a system of algebraic equations numerically using techniques such as generalized iterative scaling (GIS).

Does it really assume no additional information? No. The way you construct the features is how you add information. But once the features are defined, the calculation depends on the training data only.

The classifier based on maximum entropy has found its application in part-of-speech (POS) tagging, machine translation (ML), speech recognition, and text mining. A good review was written by Berger and Della Pietra’s. [Berger, Della Pietra, Della Pietra 1996] A lot of open-source softwares provide maximum entropy classifiers, such as Python NLTK and Apache OpenNLP.

In Quantum Computation…

One last word, entropy is used to describe quantum entanglement. A composite bipartite quantum system is said to be entangled if its subsystems must be described in a mixed state, i.e., it must be statistical if one of the subsystems is only considered. Then the entanglement entropy is given by [Nielssen, Chuang 2011]

S = -\sum_i p_i \log p_i,

which is essentially the same formula. The more entangled the system is, the larger the entanglement entropy. However, composite quantum systems tend to decrease their entropy over time though.

Continue reading “The Legacy of Entropy”

R or Python on Text Mining

rpythontextmining

I have seen more than enough debates about R or Python. While I do have a preference towards Python, I am happy with using R as well. I am not agnostic about languages, but we choose tools according to needs. The needs may be about effectiveness, efficiency, availability of tools, nature of problems, collaborations, etc. Yes, in a nutshell, it depends.

When dealing with text mining, although I still prefer Python, I have to fairly say that both languages have their own strengths and weaknesses. What do you do in text mining? Let me casually list the usual steps:

  1. Removing special characters,
  2. Removing numerals,
  3. Converting all alphabets to lower cases,
  4. Removing stop words, and
  5. Stemming the words (using Porter stemmer).

They are standard steps. But of course, sometimes we perform lemmatization instead of stemming. Sometimes we keep numerals. Or whatever. It is okay.

How do u do that in Python? Suppose you have a list of text documents stored in the variable texts, which is defined by

texts = ['I love Python.',
         'R is good for analytics.',
         'Mathematics is fun.']

. Then

# import all necessary libraries
from nltk.stem import PorterStemmer
from nltk.tokenize import SpaceTokenizer
from nltk.corpus import stopwords
from functools import partial
from gensim import corpora
from gensim.models import TfidfModel
import re

# initialize the instances for various NLP tools
tokenizer = SpaceTokenizer()
stemmer = PorterStemmer()

# define each steps
pipeline = [lambda s: re.sub('[^\w\s]', '', s),
            lambda s: re.sub('[\d]', '', s),
            lambda s: s.lower(),
            lambda s: ' '.join(filter(lambda s: not (s in stopwords.words()), tokenizer.tokenize(s))),
            lambda s: ' '.join(map(lambda t: stemmer.stem(t), tokenizer.tokenize(s)))
           ]

# function that carries out the pipeline step-by-step
def preprocess_text(text, pipeline):
    if len(pipeline)==0:
        return text
    else:
        return preprocess_text(pipeline[0](text), pipeline[1:])

# preprocessing
preprocessed_texts = map(partial(preprocess_text, pipeline=pipeline), texts)

# converting to feature vectors
documents = map(lambda s: tokenizer.tokenize(s), texts)
corpus = [dictionary.doc2bow(document) for document in documents]
tfidfmodel = TfidfModel(corpus)

We can train a classifier with the feature vectors output by tfidfmodel. To do the prediction, we can get the feature vector for a new text by calling:

bow = dictionary.doc2bow(tokenizer.tokenize(preprocess_text(text, pipeline)))

How about in R? To perform the preprocessing steps and extract the feature vectors, run:

library(RTextTools)
library(tm)

origmatrix<-create_matrix(textColumns = texts, language = 'english',
                          removeNumbers = TRUE, toLower = TRUE,
                          removeStopwords = 'TRUE', stemWords = TRUE,
                          weighting=tm::weightTfIdf, originalMatrix=NULL)

After we have a trained classifier, and we have a new text to preprocess, then we run:

matrix<-create_matrix(textColumns = newtexts, language = 'english',
                      removeNumbers = TRUE, toLower = TRUE,
                      removeStopwords = 'TRUE', stemWords = TRUE,
                      weighting=tm::weightTfIdf, originalMatrix=origmatrix)

Actually, from this illustration, a strength for R stands out: brevity. However, very often we want to preprocess in other ways, Python allows more flexibility without making it complicated. And Python syntax itself is intuitive enough.

And there are more natural language processing libraries in Python available, such as nltk and gensim, that are associated with its other libraries perfectly such as numpy, scipy and scikit-learn. But R is not far away in terms of this actually, as it has libraries such as tm and RTextTools, while R does not have numpy-like libraries because R itself is designed to perform calculations like this.

Python can be used to develop larger software projects by making the codes reusable, and it is obviously a weakness for R.

However, do perform analysis, R makes the task very efficient if we do not require something unconventional.

In the area of text mining, R or Python? My answer is: it depends.

Continue reading “R or Python on Text Mining”

Blog at WordPress.com.

Up ↑