R or Python on Text Mining


I have seen more than enough debates about R or Python. While I do have a preference towards Python, I am happy with using R as well. I am not agnostic about languages, but we choose tools according to needs. The needs may be about effectiveness, efficiency, availability of tools, nature of problems, collaborations, etc. Yes, in a nutshell, it depends.

When dealing with text mining, although I still prefer Python, I have to fairly say that both languages have their own strengths and weaknesses. What do you do in text mining? Let me casually list the usual steps:

  1. Removing special characters,
  2. Removing numerals,
  3. Converting all alphabets to lower cases,
  4. Removing stop words, and
  5. Stemming the words (using Porter stemmer).

They are standard steps. But of course, sometimes we perform lemmatization instead of stemming. Sometimes we keep numerals. Or whatever. It is okay.

How do u do that in Python? Suppose you have a list of text documents stored in the variable texts, which is defined by

texts = ['I love Python.',
         'R is good for analytics.',
         'Mathematics is fun.']

. Then

# import all necessary libraries
from nltk.stem import PorterStemmer
from nltk.tokenize import SpaceTokenizer
from nltk.corpus import stopwords
from functools import partial
from gensim import corpora
from gensim.models import TfidfModel
import re

# initialize the instances for various NLP tools
tokenizer = SpaceTokenizer()
stemmer = PorterStemmer()

# define each steps
pipeline = [lambda s: re.sub('[^\w\s]', '', s),
            lambda s: re.sub('[\d]', '', s),
            lambda s: s.lower(),
            lambda s: ' '.join(filter(lambda s: not (s in stopwords.words()), tokenizer.tokenize(s))),
            lambda s: ' '.join(map(lambda t: stemmer.stem(t), tokenizer.tokenize(s)))

# function that carries out the pipeline step-by-step
def preprocess_text(text, pipeline):
    if len(pipeline)==0:
        return text
        return preprocess_text(pipeline[0](text), pipeline[1:])

# preprocessing
preprocessed_texts = map(partial(preprocess_text, pipeline=pipeline), texts)

# converting to feature vectors
documents = map(lambda s: tokenizer.tokenize(s), texts)
corpus = [dictionary.doc2bow(document) for document in documents]
tfidfmodel = TfidfModel(corpus)

We can train a classifier with the feature vectors output by tfidfmodel. To do the prediction, we can get the feature vector for a new text by calling:

bow = dictionary.doc2bow(tokenizer.tokenize(preprocess_text(text, pipeline)))

How about in R? To perform the preprocessing steps and extract the feature vectors, run:


origmatrix<-create_matrix(textColumns = texts, language = 'english',
                          removeNumbers = TRUE, toLower = TRUE,
                          removeStopwords = 'TRUE', stemWords = TRUE,
                          weighting=tm::weightTfIdf, originalMatrix=NULL)

After we have a trained classifier, and we have a new text to preprocess, then we run:

matrix<-create_matrix(textColumns = newtexts, language = 'english',
                      removeNumbers = TRUE, toLower = TRUE,
                      removeStopwords = 'TRUE', stemWords = TRUE,
                      weighting=tm::weightTfIdf, originalMatrix=origmatrix)

Actually, from this illustration, a strength for R stands out: brevity. However, very often we want to preprocess in other ways, Python allows more flexibility without making it complicated. And Python syntax itself is intuitive enough.

And there are more natural language processing libraries in Python available, such as nltk and gensim, that are associated with its other libraries perfectly such as numpy, scipy and scikit-learn. But R is not far away in terms of this actually, as it has libraries such as tm and RTextTools, while R does not have numpy-like libraries because R itself is designed to perform calculations like this.

Python can be used to develop larger software projects by making the codes reusable, and it is obviously a weakness for R.

However, do perform analysis, R makes the task very efficient if we do not require something unconventional.

In the area of text mining, R or Python? My answer is: it depends.

Continue reading “R or Python on Text Mining”

Starting the Journey of Topological Data Analysis (TDA)

Topology has been around for centuries, but it did not catch the attention of many data analysts until recently. In an article published in Nature Scientific Reports, the authors demonstrated the power of topology in data analysis through examples including gene expression from breast rumors, voting data in the United States, and player performance data from the NBA. [Lum et. al. 2013]

As an introduction, they described topological methods “as a geometric approach to pattern or shape recognition within data.” It is true that in machine learning, we never care enough pattern recognition, but topology adds insights regarding the shapes of data that do not change with continuous deformation. For example, a circle and an ellipse have “the same topology.” The distances between data points are not as important as the shape. Traditional machine learning methods deal with feature vectors, distances, or classifications, but the topology of the data is usually discarded. Gunnar Carlsson demonstrated in a blog that a thin ellipse of data may be misrepresented as two straight parallel lines or one straight lines. [Carlsson 2015] Dimensionality reduction algorithms such as principal component analysis (PCA) often disregard the topology as well. (I heard that Kohenen’s self-organizing maps (SOM) [Kohonen 2000] retain the topology of higher dimensional data during the dimensionality reduction, but I am not confident enough to say that.)

Euler introduced the concept of topology in the 18th century. Topology has been a big subject in physics since 1950s. The string theory, as one of the many efforts in unifying gravity and other three fundamental forces, employs topological dimensions. In condensed matter physics, the fractional quantum Hall effect is a topological quantum effect. There are topological solitons [Rajaraman 1987] such as quantum vortices in superfluids, [Simula, Blakie 2006; Calzetta, Ho, Hu 2010] columns of topological solitons (believed to be Skyrmions) in helical magnets, [Mühlbauer et. al. 2009; Ho et. al. 2010; Ho 2012] hexagonal solitonic objects in smectic liquid crystals [Matsumoto et. al. 2009]… When a field becomes sophisticated, it becomes quantitative; when a quantitative field becomes sophisticated, it requires abstract mathematics such as topology for a general description. I believe analysis on any kinds of data is no exception.

There are some good reviews and readings about topological data analysis (TDA) out there, for example, the ones by Gunnar Carlsson [Carlsson 2009] and Afra Zomorodian [Zomorodian 2011]. While physicists talk about homotopy, data analysts talk about persistent homology as it is easier to compute. Data have to be described in a simplicial complex or a graph/network. Then the homology can be computed and represented in various ways such as barcodes. [Ghrist 2008] Then we extract insights about the data from it.

Topology has a steep learning curve. I am also a starter learning about this. This blog entry will not be the last talking about TDA. Therefore, I opened a new session called TDA for all of my blog entries about it. Let’s start the journey!

There is an R package called “TDA” that facilitates topological data analysis. [Fasy et. al. 2014] A taste of homology of a simplicial complex is also demonstrated in a Wolfram demo.

(Taken from TheGuardian)

Continue reading “Starting the Journey of Topological Data Analysis (TDA)”

Choices of Tools

When dealing with data analytics, what kind of things do we usually spend most of our time on?

I would say data cleaning and modeling.

Therefore, it is not merely software development. While we sometimes spend a lot of time in software architecture (which is important), before doing that, we have to explore what we want. Very often, data come in various formats, or we need to manually clean them. And very often we do not know which algorithms to use. We need to explore different ways to perform the experiments before determining what to include in the software project.

That’s why interactive programming comes into place for analytics project. R and MATLAB are these examples. However, they provide poor support for modularizing the codes. Python is a good tool that supports both modularization and interactive programming, but it takes an environment to run Python, which is very often a pain. Provided that a lot of good libraries are written in Java, having the need to perform both software development and data analytics, Scala, a JVM language that supports interactive programming, will be the next generation of programming language.


Useful Python Packages

(Taken from http://latticeqcd.org/pythonorg/static/images/antigravity.png, adapted from http://xkcd.com/353/)

Python is the basic programming languages if one wants to work on data nowadays. Its popularity comes with its intuitive syntax, its support of several programming paradigms, and the package numpy (Numerical Python). Yes, if you asked which package is a “must-have” outside the standard Python packages, I would certainly name numpy.

Let me list some useful packages that I have found useful:

  1. numpy: Numerical Python. Its basic data type is ndarray, which acts like a vector with vectorized calculation support. It makes Python to perform matrix calculation efficiently like MATLAB and Octave. It supports a lot of commonly used linear algebraic algorithms, such as eigenvalue problems, SVD etc. It is the basic of a lot of other Python packages that perform heavy numerical computation. It is such an important package that, in some operating systems, numpy comes with Python as well.
  2. scipy: Scientific Python. It needs numpy, but it supports also sparse matrices, special functions, statistics, numerical integration…
  3. matplotlib: Graph plotting.
  4. scikit-learn: machine learning library. It contains a number of supervised and unsupervised learning algorithms.
  5. nltk: natural language processing. It provides not only basic tools like stemmers, lemmatizers, but also some algorithms like maximum entropy, tf-idf vectorizer etc. It provides a few corpuses, and supports WordNet dictionary.
  6. gensim: another useful natural language processing package with an emphasis on topic modeling. It mainly supports Word2Vec, latent semantic indexing (LSI), and latent Dirichlet allocation (LDA). It is convenient to construct term-document matrices, and convert them to matrices in numpy or scipy.
  7. networkx: a package that supports both undirected and directed graphs. It provides basic algorithms used in graphs.
  8. sympy: Symbolic Python. I am not good at this package, but I know mathics and SageMath are both based on it.
  9. pandas: it supports data frame handling like R. (I have not used this package as I am a heavy R user.)

Of course, if you are a numerical developer, to save you a good life, install Anaconda.

There are some other packages that are useful, such as PyCluster (clustering), xlrd (Excel files read/write), PyGame (writing games)… But since I have not used them, I would rather mention it in this last paragraph, not to endorse but avoid devaluing it.

Don’t forget to type in your IPython Notebook:

import antigravity

Continue reading “Useful Python Packages”

Scala as the Next Influential Programming Language

I have been learning Scala. Some time ago, I doubted if it’s worth it as the learning curve is quite steep. But today I read the first chapter of my newly ordered book, titled Advanced Analytics Using Spark, a tool written in Scala for handling big data analytics, I reassured that I bet on the right thing.

I believe it will be the most common programming language the coming generation in this big data era because:

  1. It runs on JVM: a lot of libraries have been maintained as Java packages. Why do we discard Java if everything is getting more perfect from time to time? It is the same reason why we do not discard our old Fortran codes in scientific computing, but to wrap them in MATLAB or Python.
  2. It is an object-oriented: we learned about modularization and design patterns all the time. It keeps the strength of Java.
  3. It is functional: analytics involve functions. We want to handle functions flexibly. It shortens our codes, and makes our codes more readable (provided that we write appropriately). Mathematical manipulation is easier when we can handle operations with fewer codes. Lambda expressions are available.
  4. Interactive programming is available: what makes R and Python great is its availability to program interactively, especially handling data and mathematical models. And yes, this is also available in Scala.
  5. Parallel computing comes naturally: with actors or additional packages like Spark, Scala is well suited for scalable huge data computing. This is something that R and Python lack.


Continue reading “Scala as the Next Influential Programming Language”

Blog at WordPress.com.

Up ↑