I have seen more than enough debates about R versus Python. While I do have a preference for Python, I am happy with using R as well. I am not agnostic about languages, but we choose tools according to needs. The needs may be about effectiveness, efficiency, availability of tools, nature of problems, collaborations, etc. Yes, in a nutshell, it depends.
When it comes to text mining, I still prefer Python, but in fairness both languages have their own strengths and weaknesses. What do you do in text mining? Let me casually list the usual steps:
- Removing special characters,
- Removing numerals,
- Converting all letters to lowercase,
- Removing stop words, and
- Stemming the words (using Porter stemmer).
They are standard steps. But of course, sometimes we perform lemmatization instead of stemming. Sometimes we keep numerals. Or whatever. It is okay.
How do you do that in Python? Suppose you have a list of text documents stored in the variable texts, defined by
texts = ['I love Python.', 'R is good for analytics.', 'Mathematics is fun.']
# import all necessary libraries
import re
from functools import partial

from gensim import corpora
from gensim.models import TfidfModel
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import SpaceTokenizer

# initialize the instances for various NLP tools
tokenizer = SpaceTokenizer()
stemmer = PorterStemmer()
# load the English stop words once, as a set, for fast lookup
stop_words = set(stopwords.words('english'))

# define each step
pipeline = [lambda s: re.sub(r'[^\w\s]', '', s),   # remove special characters
            lambda s: re.sub(r'\d', '', s),        # remove numerals
            lambda s: s.lower(),                   # convert to lowercase
            lambda s: ' '.join(t for t in tokenizer.tokenize(s) if t not in stop_words),
            lambda s: ' '.join(stemmer.stem(t) for t in tokenizer.tokenize(s))]

# function that carries out the pipeline step-by-step
def preprocess_text(text, pipeline):
    if len(pipeline) == 0:
        return text
    else:
        # apply the first step, then recurse on the remaining steps
        return preprocess_text(pipeline[0](text), pipeline[1:])

# preprocessing
preprocessed_texts = list(map(partial(preprocess_text, pipeline=pipeline), texts))

# converting to feature vectors
documents = [tokenizer.tokenize(s) for s in preprocessed_texts]
dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(document) for document in documents]
tfidfmodel = TfidfModel(corpus)
We can train a classifier with the feature vectors output by tfidfmodel. To do the prediction, we can get the feature vector for a new text by calling:
bow = dictionary.doc2bow(tokenizer.tokenize(preprocess_text(text, pipeline)))
# apply the trained tf-idf weighting to the bag-of-words counts
features = tfidfmodel[bow]
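To make it clearer what happens to a new text at prediction time, here is a minimal pure-Python sketch of the idea: the vocabulary is fixed from the training documents, words never seen in training are dropped, and the remaining counts are weighted by tf-idf. The function names `fit_vocabulary` and `tfidf_vector` and the toy tokens are my own assumptions for illustration. As far as I know, gensim's default weighting is similar (log base 2, then unit-length normalization), but treat this as an illustration of the concept and not as a re-implementation of the library.

```python
import math
from collections import Counter

def fit_vocabulary(tokenized_docs):
    """Assign an integer id to each distinct token and count document frequencies."""
    vocab, doc_freq = {}, Counter()
    for tokens in tokenized_docs:
        # dict.fromkeys deduplicates while keeping first-seen order
        for token in dict.fromkeys(tokens):
            vocab.setdefault(token, len(vocab))
            doc_freq[token] += 1
    return vocab, doc_freq

def tfidf_vector(tokens, vocab, doc_freq, n_docs):
    """Return a sparse {token_id: weight} vector; unseen tokens are dropped."""
    counts = Counter(t for t in tokens if t in vocab)
    weights = {vocab[t]: tf * math.log2(n_docs / doc_freq[t])
               for t, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    # normalize to unit length; an all-zero vector stays empty
    return {i: w / norm for i, w in weights.items()} if norm > 0 else {}

# Hypothetical, already preprocessed training documents.
train_docs = [['love', 'python'],
              ['r', 'good', 'analyt'],
              ['mathemat', 'fun'],
              ['python', 'fun']]
vocab, doc_freq = fit_vocabulary(train_docs)

# 'java' never appeared in training, so it does not show up in the vector.
new_vector = tfidf_vector(['love', 'python', 'java'], vocab, doc_freq, len(train_docs))
print(new_vector)
```

In this toy example 'love' appears in only one training document while 'python' appears in two, so 'love' ends up with the larger weight.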
How about in R? To perform the preprocessing steps and extract the feature vectors, run:
library(RTextTools)
library(tm)

origmatrix <- create_matrix(textColumns = texts,
                            language = 'english',
                            removeNumbers = TRUE,
                            toLower = TRUE,
                            removeStopwords = TRUE,
                            stemWords = TRUE,
                            weighting = tm::weightTfIdf,
                            originalMatrix = NULL)
Once we have a trained classifier and a new text to preprocess, we run:
matrix <- create_matrix(textColumns = newtexts,
                        language = 'english',
                        removeNumbers = TRUE,
                        toLower = TRUE,
                        removeStopwords = TRUE,
                        stemWords = TRUE,
                        weighting = tm::weightTfIdf,
                        originalMatrix = origmatrix)
Actually, from this illustration, a strength of R stands out: brevity. However, very often we want to preprocess in other ways, and Python allows more flexibility without making things complicated. Python syntax itself is also intuitive enough.
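To make the flexibility point concrete, here is a small sketch of how preprocessing steps can be mixed and matched in Python. The `STEPS` table and the `build_pipeline` helper are hypothetical names of my own, and I use only the standard library here, so stop word removal and stemming (which need NLTK) are left out.

```python
import re
from functools import reduce

# Named steps, so a pipeline can be assembled by picking from this table.
STEPS = {
    'strip_punctuation': lambda s: re.sub(r'[^\w\s]', '', s),
    'strip_numerals': lambda s: re.sub(r'\d', '', s),
    'lowercase': str.lower,
    'collapse_spaces': lambda s: ' '.join(s.split()),
}

def build_pipeline(*names):
    """Compose the named steps, left to right, into a single function."""
    steps = [STEPS[name] for name in names]
    return lambda text: reduce(lambda acc, step: step(acc), steps, text)

# The usual steps, and a variant that keeps numerals.
default = build_pipeline('strip_punctuation', 'strip_numerals',
                         'lowercase', 'collapse_spaces')
keep_numerals = build_pipeline('strip_punctuation', 'lowercase',
                               'collapse_spaces')

print(default('Python 3 is fun!'))        # python is fun
print(keep_numerals('Python 3 is fun!'))  # python 3 is fun
```

Switching to a different preprocessing recipe is then a matter of listing different step names, and a new step is just another entry in the table.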
There are also more natural language processing libraries available in Python, such as nltk and gensim, and they work well with its other libraries such as numpy, scipy and scikit-learn. R is not far behind in this respect, though, as it has libraries such as tm and RTextTools. R does not need numpy-like libraries because the language itself is designed for this kind of calculation.
Python can be used to develop larger software projects by making the code reusable, and here R is clearly weaker.
However, to perform analysis, R makes the task very efficient if we do not require anything unconventional.
In the area of text mining, R or Python? My answer is: it depends.