Much about the use of word-embedding models such as Word2Vec and GloVe have been covered. However, how to measure the similarity between phrases or documents? One natural choice is the cosine similarity, as I have toyed with in a previous post. However, it smoothed out the influence of each word. Two years ago, a group in Washington University in St. Louis proposed the Word Mover’s Distance (WMD) in a PMLR paper that captures the relations between words, not simply by distance, but also the “transportation” from one phrase to another conveyed by each word. This Word Mover’s Distance (WMD) can be seen as a special case of Earth Mover’s Distance (EMD), or Wasserstein distance, the one people talked about in Wasserstein GAN. This is better than bag-of-words (BOW) model in a way that the word vectors capture the semantic similarities between words.

# Word Mover’s Distance (WMD)

The formulation of WMD is beautiful. Consider the embedded word vectors $\mathbf{X} \in R^{d \times n}$, where $d$ is the dimension of the embeddings, and $n$ is the number of words. For each phrase, there is a normalized BOW vector $d \in R^n$, and $d_i = \frac{c_i}{\sum_i c_i}$, where $i$‘s denote the word tokens. The distance between words are the Euclidean distance of their embedded word vectors, denoted by $c(i, j) = || \mathbf{x}_i - \mathbf{x}_j ||_2$, where $i$ and $j$ denote word tokens. The document distance, which is WMD here, is defined by $\sum_{i, j} \mathbf{T}_{i j} c(i, j)$, where $\mathbf{T}$ is a $n \times n$ matrix. Each element $\mathbf{T}_{ij} \geq 0$ denote how nuch of word $i$ in the first document (denoted by $\mathbf{d}$) travels to word $j$ in the new document (denoted by $\mathbf{d}'$).

Then the problem becomes the minimization of the document distance, or the WMD, and is formulated as:

$\text{min}_{\mathbf{T} \geq 0} \sum_{i, j=1}^n \mathbf{T}_{ij} c(i, j)$,

given the constraints:

$\sum_{j=1}^n \mathbf{T}_{ij} = d_i$, and

$\sum_{i=1}^n \mathbf{T}_{ij} = d_j'$.

This is essentially a simplified case of the Earth Mover’s distance (EMD), or the Wasserstein distance. (See the review by Gibbs and Su.)

# Using PuLP

The WMD is essentially a linear optimization problem. There are many optimization packages on the market, and my stance is that, for those common ones, there are no packages that are superior than others. In my job, I happened to handle a missing data problem, in turn becoming a non-linear optimization problem with linear constraints, and I chose limSolve, after I shop around. But I actually like a lot of other packages too. For WMD problem, I first tried out cvxopt first, which should actually solve the exact same problem, but the indexing is hard to maintain. Because I am dealing with words, it is good to have a direct hash map, or a dictionary. I can use the Dictionary class in gensim. But I later found out I should use PuLP, as it allows indices with words as a hash map (dict in Python), and WMD is a linear programming problem, making PuLP is a perfect choice, considering code efficiency.

An example of using PuLP can be demonstrated by the British 1997 UG Exam, as in the first problem of this link, with the Jupyter Notebook demonstrating this.

# Implementation of WMD using PuLP

The demonstration can be found in the Jupyter Notebook.

Load the necessary packages:

from itertools import product
from collections import defaultdict

import numpy as np
from scipy.spatial.distance import euclidean
import pulp
import gensim

Then define the functions the gives the BOW document vectors:

def tokens_to_fracdict(tokens):
cntdict = defaultdict(lambda : 0)
for token in tokens:
cntdict[token] += 1
totalcnt = sum(cntdict.values())
return {token: float(cnt)/totalcnt for token, cnt in cntdict.items()}

Then implement the core calculation. Note that PuLP is actually a symbolic computing package. This function return a pulp.LpProblem class:

def word_mover_distance_probspec(first_sent_tokens, second_sent_tokens, wvmodel, lpFile=None):
all_tokens = list(set(first_sent_tokens+second_sent_tokens))
wordvecs = {token: wvmodel[token] for token in all_tokens}

first_sent_buckets = tokens_to_fracdict(first_sent_tokens)
second_sent_buckets = tokens_to_fracdict(second_sent_tokens)

T = pulp.LpVariable.dicts('T_matrix', list(product(all_tokens, all_tokens)), lowBound=0)

prob = pulp.LpProblem('WMD', sense=pulp.LpMinimize)
prob += pulp.lpSum([T[token1, token2]*euclidean(wordvecs[token1], wordvecs[token2])
for token1, token2 in product(all_tokens, all_tokens)])
for token2 in second_sent_buckets:
prob += pulp.lpSum([T[token1, token2] for token1 in first_sent_buckets])==second_sent_buckets[token2]
for token1 in first_sent_buckets:
prob += pulp.lpSum([T[token1, token2] for token2 in second_sent_buckets])==first_sent_buckets[token1]

if lpFile!=None:
prob.writeLP(lpFile)

prob.solve()

return prob

To extract the value, just run pulp.value(prob.objective)

We use Google Word2Vec. Refer the $\mathbf{T}$ matrices in the Jupyter Notebook. Running this by a few examples:

1. document1 = President, talk, Chicago
document2 = President, speech, Illinois
WMD = 2.88587622936
2. document1 = physician, assistant
document2 = doctor
WMD = 2.8760048151
3. document1 = physician, assistant
document2 = doctor, assistant
WMD = 1.00465738773
(compare with example 2!)
4. document1 = doctors, assistant
document2 = doctor, assistant
WMD = 1.02825379372
(compare with example 3!)
5. document1 = doctor, assistant
document2 = doctor, assistant
WMD = 0.0
(totally identical; compare with example 3!)

There are more examples in the notebook.

# Conclusion

WMD is a good metric comparing two documents or sentences, by capturing the semantic meanings of the words. It is more powerful than BOW model as it captures the meaning similarities; it is more powerful than the cosine distance between average word vectors, as the transfer of meaning using words from one document to another is considered. But it is not immune to the problem of misspelling.

This algorithm works well for short texts. However, when the documents become large, this formulation will be computationally expensive. The author actually suggested a few modifications, such as the removal of constraints, and word centroid distances.

Example codes can be found in my Github repository: stephenhky/PyWMD.

On November 21, 2016, the Python package shorttext’ was published. Until today, more than seven versions have been published. There have been a drastic architecture change, but the overall purpose is still the same, as summarized in the first introduction entry:

This package shorttext‘ was designed to tackle all these problems… It contains the following features:

• example data provided (including subject keywords and NIH RePORT);
• text preprocessing;
• pre-trained word-embedding support;
• gensim topic models (LDA, LSI, Random Projections) and autoencoder;
• topic model representation supported for supervised learning using scikit-learn;
• cosine distance classification; and
• neural network classification (including ConvNet, and C-LSTM).

And since the first version, there have been updates, as summarized in the documention (News):

## Version 0.3.3 (Apr 19, 2017)

• Deleted CNNEmbedVecClassifier.
• Added script ShortTextWord2VecSimilarity.

## Version 0.3.2 (Mar 28, 2017)

• Bug fixed for gensim model I/O;
• Console scripts update;
• Neural networks up to Keras 2 standard (refer to this).

## Version 0.3.1 (Mar 14, 2017)

• Compact model I/O: all models are in single files;
• Implementation of stacked generalization using logistic regression.

## Version 0.2.1 (Feb 23, 2017)

• Removal attempts of loading GloVe model, as it can be run using gensim script;
• Confirmed compatibility of the package with tensorflow;
• Use of spacy for tokenization, instead of nltk;
• Use of stemming for Porter stemmer, instead of nltk;
• Removal of nltk dependencies;
• Simplifying the directory and module structures;
• Module packages updated.

Although there are still additions that I would love to add, but it would not change the overall architecture. I may add some more supervised learning algorithms, but under the same network. The upcoming big additions will be generative models or seq2seq models, but I do not see them coming in the short term. I will add corpuses.

I may add tutorials if I have time.

I am thankful that there is probably some external collaboration with other Python packages. Some people have already made some useful contributions. It will be updated if more things are confirmed.

There are situations that we deal with short text, probably messy, without a lot of training data. In that case, we need external semantic information. Instead of using the conventional bag-of-words (BOW) model, we should employ word-embedding models, such as Word2Vec, GloVe etc.

Suppose we want to perform supervised learning, with three subjects, described by the following Python dictionary:

classdict={'mathematics': ['linear algebra',
'topology',
'algebra',
'calculus',
'variational calculus',
'functional field',
'real analysis',
'complex analysis',
'differential equation',
'statistics',
'statistical optimization',
'probability',
'stochastic calculus',
'numerical analysis',
'differential geometry'],
'physics': ['renormalization',
'classical mechanics',
'quantum mechanics',
'statistical mechanics',
'functional field',
'path integral',
'quantum field theory',
'electrodynamics',
'condensed matter',
'particle physics',
'topological solitons',
'astrophysics',
'spontaneous symmetry breaking',
'atomic molecular and optical physics',
'quantum chaos'],
'theology': ['divine providence',
'soteriology',
'anthropology',
'pneumatology',
'Christology',
'Holy Trinity',
'eschatology',
'scripture',
'ecclesiology',
'predestination',
'divine degree',
'creedal confessionalism',
'scholasticism',
'prayer',
'eucharist']}


And we implemented Word2Vec here. To add external information, we use a pre-trained Word2Vec model from Google, downloaded here. We can use it with Python package gensim. To load it, enter

from gensim.models import Word2Vec


How do we represent a phrase in Word2Vec? How do we do the classification? Here I wrote two classes to do it.

#### Average

We can represent a sentence by summing the word-embedding representations of each word. The class, inside SumWord2VecClassification.py, is coded as follow:

from collections import defaultdict

import numpy as np
from nltk import word_tokenize
from scipy.spatial.distance import cosine

from utils import ModelNotTrainedException

class SumEmbeddedVecClassifier:
def __init__(self, wvmodel, classdict, vecsize=300):
self.wvmodel = wvmodel
self.classdict = classdict
self.vecsize = vecsize
self.trained = False

def train(self):
self.addvec = defaultdict(lambda : np.zeros(self.vecsize))
for classtype in self.classdict:
for shorttext in self.classdict[classtype]:
self.trained = True

def shorttext_to_embedvec(self, shorttext):
vec = np.zeros(self.vecsize)
tokens = word_tokenize(shorttext)
for token in tokens:
if token in self.wvmodel:
vec += self.wvmodel[token]
norm = np.linalg.norm(vec)
if norm!=0:
vec /= np.linalg.norm(vec)
return vec

def score(self, shorttext):
if not self.trained:
raise ModelNotTrainedException()
vec = self.shorttext_to_embedvec(shorttext)
scoredict = {}
for classtype in self.addvec:
try:
scoredict[classtype] = 1 - cosine(vec, self.addvec[classtype])
except ValueError:
scoredict[classtype] = np.nan
return scoredict


Here the exception ModelNotTrainedException is just an exception raised if the model has not been trained yet, but scoring function was called by the user. (Codes listed in my Github repository.) The similarity will be calculated by cosine similarity.

Such an implementation is easy to understand and carry out. It is good enough for a lot of application. However, it has the problem that it does not take the relation between words or word order into account.

#### Convolutional Neural Network

To tackle the problem of word relations, we have to use deeper neural networks. Yoon Kim published a well cited paper regarding this in EMNLP in 2014, titled “Convolutional Neural Networks for Sentence Classification.” The model architecture is as follow: (taken from his paper)

Each word is represented by an embedded vector, but neighboring words are related through the convolutional matrix. And MaxPooling and a dense neural network were implemented afterwards. His paper involves multiple filters with variable window sizes / spatial extent, but for our cases of short phrases, I just use one window of size 2 (similar to dealing with bigram). While Kim implemented using Theano (see his Github repository), I implemented using keras with Theano backend. The codes, inside CNNEmbedVecClassification.py, are as follow:

import numpy as np
from keras.layers import Convolution1D, MaxPooling1D, Flatten, Dense
from keras.models import Sequential
from nltk import word_tokenize

from utils import ModelNotTrainedException

class CNNEmbeddedVecClassifier:
def __init__(self,
wvmodel,
classdict,
n_gram,
vecsize=300,
nb_filters=1200,
maxlen=15):
self.wvmodel = wvmodel
self.classdict = classdict
self.n_gram = n_gram
self.vecsize = vecsize
self.nb_filters = nb_filters
self.maxlen = maxlen
self.trained = False

def convert_trainingdata_matrix(self):
classlabels = self.classdict.keys()
lblidx_dict = dict(zip(classlabels, range(len(classlabels))))

# tokenize the words, and determine the word length
phrases = []
indices = []
for label in classlabels:
for shorttext in self.classdict[label]:
category_bucket = [0]*len(classlabels)
category_bucket[lblidx_dict[label]] = 1
indices.append(category_bucket)
phrases.append(word_tokenize(shorttext))

# store embedded vectors
train_embedvec = np.zeros(shape=(len(phrases), self.maxlen, self.vecsize))
for i in range(len(phrases)):
for j in range(min(self.maxlen, len(phrases[i]))):
train_embedvec[i, j] = self.word_to_embedvec(phrases[i][j])
indices = np.array(indices, dtype=np.int)

return classlabels, train_embedvec, indices

def train(self):
# convert classdict to training input vectors
self.classlabels, train_embedvec, indices = self.convert_trainingdata_matrix()

# build the deep neural network model
model = Sequential()
filter_length=self.n_gram,
border_mode='valid',
activation='relu',
input_shape=(self.maxlen, self.vecsize)))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# train the model
model.fit(train_embedvec, indices)

# flag switch
self.model = model
self.trained = True

def word_to_embedvec(self, word):
return self.wvmodel[word] if word in self.wvmodel else np.zeros(self.vecsize)

def shorttext_to_matrix(self, shorttext):
tokens = word_tokenize(shorttext)
matrix = np.zeros((self.maxlen, self.vecsize))
for i in range(min(self.maxlen, len(tokens))):
matrix[i] = self.word_to_embedvec(tokens[i])
return matrix

def score(self, shorttext):
if not self.trained:
raise ModelNotTrainedException()

# retrieve vector
matrix = np.array([self.shorttext_to_matrix(shorttext)])

# classification using the neural network
predictions = self.model.predict(matrix)

# wrangle output result
scoredict = {}
for idx, classlabel in zip(range(len(self.classlabels)), self.classlabels):
scoredict[classlabel] = predictions[0][idx]
return scoredict


The output is a vector of length equal to the number of class labels, 3 in our example. The elements of the output vector add up to one, indicating its score, and a nature of probability.

#### Evaluation

A simple cross-validation to the example data set does not tell a difference between the two algorithms:

However, we can test the algorithm with a few examples:

Example 1: “renormalization”

• Average: {‘mathematics’: 0.54135105096749336, ‘physics’: 0.63665460856632494, ‘theology’: 0.31014049736087901}
• CNN: {‘mathematics’: 0.093827009201049805, ‘physics’: 0.85451591014862061, ‘theology’: 0.051657050848007202}

As renormalization was a strong word in the training data, it gives an easy result. CNN can distinguish much more clearly.

Example 2: “salvation”

• Average: {‘mathematics’: 0.14939650156482298, ‘physics’: 0.21692765541184023, ‘theology’: 0.5698233329716329}
• CNN: {‘mathematics’: 0.012395491823554039, ‘physics’: 0.022725773975253105, ‘theology’: 0.96487873792648315}

“Salvation” is not found in the training data, but it is closely related to “soteriology,” which means the doctrine of salvation. So it correctly identifies it with theology.

Example 3: “coffee”

• Average: {‘mathematics’: 0.096820211601723272, ‘physics’: 0.081567332119268032, ‘theology’: 0.15962682945135631}
• CNN: {‘mathematics’: 0.27321341633796692, ‘physics’: 0.1950736939907074, ‘theology’: 0.53171288967132568}

Coffee is not related to all subjects. The first architecture correctly indicates the fact, but CNN, with its probabilistic nature, has to roughly equally distribute it (but not so well.)

The code can be found in my Github repository: stephenhky/PyShortTextCategorization. (This repository has been updated since this article was published. The link shows the version of the code when this appeared online.)

The topic of word embedding algorithms has been one of the interests of this blog, as in this entry, with Word2Vec [Mikilov et. al. 2013] as one of the main examples. It is a great tool for text mining, (for example, see [Czerny 2015],) as it reduces the dimensions needed (compared to bag-of-words model). As an algorithm borrowed from computer vision, a lot of these algorithms use deep learning methods to train the model, while it was not exactly sure why it works. Despite that, there are many articles talking about how to train the model. [Goldberg & Levy 2014, Rong 2014 etc.] Addition and subtraction of the word vectors show amazing relationships that carry semantic values, as I have shown in my previous blog entry. [Ho 2015]

However, Tomas Mikolov is no longer working in Google, making the development of this algorithm discontinued. As a follow-up of their work, Stanford NLP group later proposed a model, called GloVe (Global Vectors), that embeds word vectors using probabilistic methods. [Pennington, Socher & Manning 2014] It can be implemented in the package glove-python in python, and text2vec in R (or see their CRAN post).  Their paper is neatly written, and a highly recommended read.

To explain the theory of GloVe, we start with some basic probabilistic picture in basic natural language processing (NLP). We suppose the relation between the words occur in certain text windows within a corpus, but the details are not important here. Assume that $i$, $j$, and $k$ are three words, and the conditional probability $P_{ik}$ is defined as

$P_{ij} = P(j | i) = \frac{X_{ij}}{X_i}$,

where $X$‘s are the counts, and similarly for $P_{jk}$. And we are interested in the following ratio:

$F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$.

The tilde means “context,” but we will later assume it is also a word. Citing the example from their paper, take $i$ as ice, and $j$ as steam. if $k$ is solid, then the ratio is expected to be large; or if $k$ is gas, then it is expected to be low. But if $k$ is water, which are related to both, or fashion, which is related to none, then the ratio is expected to be approximately 1.

And the addition and subtraction in Word2Vec is similar to this. We want the ratio to be like the subtraction as in Word2Vec (and multiplication as in addition), then we should modify the function $F$ such that,

$F(w_i - w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$.

On the other hand, the input arguments of $F$ are vectors, but the output is a scalar. We avoid the issue by making the input argument as a dot product,

$F( (w_i - w_j)^T \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$.

In NLP, the word-word co-occurrence matrices are symmetric, and our function $F$ should also be invariant under switching the labeling. We first require $F$ is be a homomorphism,

$F((w_i - w_j)^T \tilde{w}_k) = \frac{F(w_i^T \tilde{w}_k) }{ F(w_j^T \tilde{w}_k)}$,

where we define,

$F(w_i^T \tilde{w}_k) = P_{ik} = \frac{X_{ik}}{X_i}$.

It is clear that $F$ is an exponential function, but to ensure symmetry, we further define:

$w_i^T \tilde{w}_k + b_i + \tilde{b}_k = \log X_{ik}$.

As a result of this equation, the authors defined the following cost function to optimize for GloVe model:

$J = \sum_{i, j=1}^V f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ik} \right)^2$,

where $w_j$, $\tilde{w}_j$, $b_i$, and $\tilde{b}_j$ are parameters to learn. $f(x)$ is a weighting function. (Refer the details to the paper.) [Pennington, Socher & Manning 2014]

As Radim Řehůřek said in his blog entry, [Řehůřek 2014] it is a neat paper, but their evaluation is crappy.

This theory explained why certain similar relations can be achieved, such as Paris – France is roughly equal to Beijing – China, as both can be transformed to the ratio in the definition of $F$ above.

It is a neat paper, as it employs optimization theory and probability theory, without any dark box deep learning.

One fascinating application of deep learning is the training of a model that outputs vectors representing words. A project written in Google, named Word2Vec, is one of the best tools regarding this. The vector representation captures the word contexts and relationships among words. This tool has been changing the landscape of natural language processing (NLP).

Let’s have some demonstration. To use Word2Vec in Python, you need to have the package gensim installed. (Installation instruction: here) And you have to download a trained model (GoogleNews-vectors-negative300.bin.gz), which is 3.6 GB big!! When you get into a Python shell (e.g., IPython), type

from gensim.models.word2vec import Word2Vec


This model enables the user to extract vector representation of length 300 of an English word. So what is so special about this vector representation from the traditional bag-of-words representation? First, the representation is standard. Once trained, we can use it in future training or testing dataset. Second, it captures the context of the word in a way that the algebraic operation of these vectors has meanings.

Here I give 5 examples.

A Juvenile Cat

What is a juvenile cat? We know that a juvenile dog is a puppy. Then we can get it by carry out the algebraic calculation $\text{puppy} - \text{dog} + \text{cat}$ by running

model.most_similar(positive=['puppy', 'cat'], negative=['dog'], topn=5)


This outputs:

[(u'kitten', 0.7634989619255066),
(u'puppies', 0.7110899686813354),
(u'pup', 0.6929495334625244),
(u'kittens', 0.6888389587402344),
(u'cats', 0.6796488761901855)]


which indicates that “kitten” is the answer (correctly!) The numbers are similarities of these words with the vector representation  $\text{puppy} - \text{dog} + \text{cat}$ in descending order. You can verify it by calculating the cosine distance:

from scipy.spatial import distance
print (1-distance.cosine(model['kitten'], model['puppy']+model['cat']-model['dog']))


which outputs 0.763498957413.

This demonstration shows that in the model, $\text{puppy}-\text{dog}$ and $\text{kitten}-\text{cat}$ are of similar semantic relations.

Capital of Taiwan

Where is the capital of Taiwan? We can find it if we know the capital of another country. For example, we know that Beijing is the capital of China. Then we can run the following:

model.most_similar(positive=['Beijing', 'Taiwan'], negative=['China'], topn=5)


which outputs

[(u'Taipei', 0.7866502404212952),
(u'Taiwanese', 0.6805002093315125),
(u'Kaohsiung', 0.6034111976623535),
(u'Chen', 0.5905819535255432),
(u'Seoul', 0.5865181684494019)]


Obviously, the answer is “Taipei.” And interestingly, the model sees Taiwan in the same footing of China!

Taipei (taken from Airasia: http://www.airasia.com/mo/en/destinations/taipei.page)

Past Participle of “eat”

We can extract grammatical information too. We know that the past participle of “go” is “gone”. With this, we can find that of “eat” by running:

model.most_similar(positive=[‘gone’, ‘eat’], negative=[‘go’], topn=5)

which outputs:

[(u'eaten', 0.7462186217308044),
(u'eating', 0.6516293287277222),
(u'ate', 0.6457351446151733),
(u'overeaten', 0.5853317975997925),
(u'eats', 0.5830586552619934)]


Capital of the State of Maryland

However, this model does not always work. If it can find the capital of Taiwan, can it find those for any states in the United States? We know that the capital of California is Sacramento. How about Maryland? Let’s run:

model.most_similar(positive=['Sacramento', 'Maryland'], negative=['California'], topn=5)


[(u'Towson', 0.7032245397567749),
(u'Baltimore', 0.6951349973678589),
(u'Hagerstown', 0.6367553472518921),
(u'Anne_Arundel', 0.5931429266929626),
(u'Oxon_Hill', 0.5879474878311157)]


But the correct answer should be Annapolis!

Downtown Annapolis (taken from Wikipedia)

Word2Vec was developed by Tomáš Mikolov. He previously worked for Microsoft Research. However, he switched to Google, and published a few influential works on Word2Vec. [Mikolov, Yih, Zweig 2013] [Mikolov, Sutskever, Chen, Corrado, Dean 2013] [Mikolov, Chen, Corrado, Dean 2013] Their conference paper in 2013 can be found on arXiv. He later published a follow-up work on a package called Doc2Vec that considers phrases. [Le, Mikolov 2014]

Earlier this year, I listened to a talk in DCNLP meetup spoken by Michael Czerny on his award-winning blog entry titled “Modern Methods for Sentiment Analysis.” He applied the vector representations of words by Word2Vec to perform sentiment analysis, assuming that similar sentiments cluster together in the vector space. (He took averages of the vectors in tweets to extract emotions.) [Czerny 2015] I highly recommend you to read his blog entry. On the other hand, Xin Rong wrote an explanation about how Word2Vec works too. [Rong 2014]

There seems to be no progress on the project Word2Vec anymore as Tomáš Mikolov no longer works in Google. However, the Stanford NLP Group recognized that Word2Vec captures the relations between words in their vector representation. They worked on a similar project, called GloVe (Global Vectors), which tackles the problem with matrix factorization. [Pennington, Socher, Manning 2014] Radim Řehůřek did some analysis comparing Word2Vec and GloVe. [Řehůřek 2014] GloVe can be implemented in Python too.

I have seen more than enough debates about R or Python. While I do have a preference towards Python, I am happy with using R as well. I am not agnostic about languages, but we choose tools according to needs. The needs may be about effectiveness, efficiency, availability of tools, nature of problems, collaborations, etc. Yes, in a nutshell, it depends.

When dealing with text mining, although I still prefer Python, I have to fairly say that both languages have their own strengths and weaknesses. What do you do in text mining? Let me casually list the usual steps:

1. Removing special characters,
2. Removing numerals,
3. Converting all alphabets to lower cases,
4. Removing stop words, and
5. Stemming the words (using Porter stemmer).

They are standard steps. But of course, sometimes we perform lemmatization instead of stemming. Sometimes we keep numerals. Or whatever. It is okay.

How do u do that in Python? Suppose you have a list of text documents stored in the variable texts, which is defined by

texts = ['I love Python.',
'R is good for analytics.',
'Mathematics is fun.']


. Then

# import all necessary libraries
from nltk.stem import PorterStemmer
from nltk.tokenize import SpaceTokenizer
from nltk.corpus import stopwords
from functools import partial
from gensim import corpora
from gensim.models import TfidfModel
import re

# initialize the instances for various NLP tools
tokenizer = SpaceTokenizer()
stemmer = PorterStemmer()

# define each steps
pipeline = [lambda s: re.sub('[^\w\s]', '', s),
lambda s: re.sub('[\d]', '', s),
lambda s: s.lower(),
lambda s: ' '.join(filter(lambda s: not (s in stopwords.words()), tokenizer.tokenize(s))),
lambda s: ' '.join(map(lambda t: stemmer.stem(t), tokenizer.tokenize(s)))
]

# function that carries out the pipeline step-by-step
def preprocess_text(text, pipeline):
if len(pipeline)==0:
return text
else:
return preprocess_text(pipeline[0](text), pipeline[1:])

# preprocessing
preprocessed_texts = map(partial(preprocess_text, pipeline=pipeline), texts)

# converting to feature vectors
documents = map(lambda s: tokenizer.tokenize(s), texts)
corpus = [dictionary.doc2bow(document) for document in documents]
tfidfmodel = TfidfModel(corpus)


We can train a classifier with the feature vectors output by tfidfmodel. To do the prediction, we can get the feature vector for a new text by calling:

bow = dictionary.doc2bow(tokenizer.tokenize(preprocess_text(text, pipeline)))


How about in R? To perform the preprocessing steps and extract the feature vectors, run:

library(RTextTools)
library(tm)

origmatrix<-create_matrix(textColumns = texts, language = 'english',
removeNumbers = TRUE, toLower = TRUE,
removeStopwords = 'TRUE', stemWords = TRUE,
weighting=tm::weightTfIdf, originalMatrix=NULL)


After we have a trained classifier, and we have a new text to preprocess, then we run:

matrix<-create_matrix(textColumns = newtexts, language = 'english',
removeNumbers = TRUE, toLower = TRUE,
removeStopwords = 'TRUE', stemWords = TRUE,
weighting=tm::weightTfIdf, originalMatrix=origmatrix)


Actually, from this illustration, a strength for R stands out: brevity. However, very often we want to preprocess in other ways, Python allows more flexibility without making it complicated. And Python syntax itself is intuitive enough.

And there are more natural language processing libraries in Python available, such as nltk and gensim, that are associated with its other libraries perfectly such as numpy, scipy and scikit-learn. But R is not far away in terms of this actually, as it has libraries such as tm and RTextTools, while R does not have numpy-like libraries because R itself is designed to perform calculations like this.

Python can be used to develop larger software projects by making the codes reusable, and it is obviously a weakness for R.

However, do perform analysis, R makes the task very efficient if we do not require something unconventional.

In the area of text mining, R or Python? My answer is: it depends.