There are many tasks in natural language processing that are challenging. This blog entry is on text summarization, which briefly summarizes the survey article on this topic. (arXiv:1707.02268) The authors of the article defined the task to be

Automatic text summarization is the task of producing a concise and fluent summary while preserving key information content and overall meaning.

There are basically two approaches to this task:

• extractive summarization: identifying important sections of the text, and extracting them; and
• abstractive summarization: producing summary text in a new way.

Most algorithmic methods developed are of the extractive type, while most human writers summarize using abstractive approach. There are many methods in extractive approach, such as identifying given keywords, identifying sentences similar to the title, or wrangling the text at the beginning of the documents.

How do we instruct the machines to perform extractive summarization? The authors mentioned about two representations: topic and indicator. In topic representations, frequencies, tf-idf, latent semantic indexing (LSI), or topic models (such as latent Dirichlet allocation, LDA) are used. However, simply extracting these sentences out with these algorithms may not generate a readable summary. Employment of knowledge bases or considering contexts (from web search, e-mail conversation threads, scientific articles, author styles etc.) are useful.

In indicator representation, the authors mentioned the graph methods, inspired by PageRank. (see this) “Sentences form vertices of the graph and edges between the sentences indicate how similar the two sentences are.” And the key sentences are identified with ranking algorithms. Of course, machine learning methods can be used too.

Evaluation on the performance on text summarization is difficult. Human evaluation is unavoidable, but with manual approaches, some statistics can be calculated, such as ROUGE.

In text mining, it is important to create the document-term matrix (DTM) of the corpus we are interested in. A DTM is basically a matrix, with documents designated by rows and words by columns, that the elements are the counts or the weights (usually by tf-idf). Subsequent analysis is usually based creatively on DTM.

Exploring with DTM therefore becomes an important issues with a good text-mining tool. How do we perform exploratory data analysis on DTM using R and Python? We will demonstrate it using the data set of U. S. Presidents’ Inaugural Address, preprocessed, and can be downloaded here.

# R: textmineR

In R, we can use the package textmineR, which has been in introduced in a previous post. Together with other packages such as dplyr (for tidy data analysis) and snowBall (for stemming), load all of them at the beginning:

library(dplyr)
library(textmineR)
library(SnowballC)


usprez.df<- read.csv('inaugural.csv', stringsAsFactors = FALSE)


Then we create the DTM, while we remove all digits and punctuations, make all letters lowercase, and stem all words using Porter stemmer.

dtm<- CreateDtm(usprez.df$speech, doc_names = usprez.df$yrprez,
ngram_window = c(1, 1),
lower = TRUE,
remove_punctuation = TRUE,
remove_numbers = TRUE,
stem_lemma_function = wordStem)


Then defining a set of functions:

get.doc.tokens<- function(dtm, docid)
dtm[docid, ] %>% as.data.frame() %>% rename(count=".") %>%
mutate(token=row.names(.)) %>% arrange(-count)

get.token.occurrences<- function(dtm, token)
dtm[, token] %>% as.data.frame() %>% rename(count=".") %>%
mutate(token=row.names(.)) %>% arrange(-count)

get.total.freq<- function(dtm, token) dtm[, token] %>% sum

get.doc.freq<- function(dtm, token)
dtm[, token] %>% as.data.frame() %>% rename(count=".") %>%
filter(count>0) %>% pull(count) %>% length


Then we can happily extract information. For example, if we want to get the top-most common words in 2009’s Obama’s speech, enter:

dtm %>% get.doc.tokens('2009-Obama') %>% head(10)


Or which speeches have the word “change”: (but need to stem the word before extraction)

dtm %>% get.token.occurrences(wordStem('change')) %>% head(10)


You can also get the total number of occurrence of the words by:

dtm %>% get.doc.freq(wordStem('change'))   # gives 28


# Python: shorttext

In Python, similar things can be done using the package shorttext, described in a previous post. It uses other packages such as pandas and stemming. Load all packages first:

import shorttext
import numpy as np
import pandas as pd
from stemming.porter import stem

import re


And define the preprocessing pipelines:

pipeline = [lambda s: re.sub('[^\w\s]', '', s),
lambda s: re.sub('[\d]', '', s),
lambda s: s.lower(),
lambda s: ' '.join(map(stem, shorttext.utils.tokenize(s)))
]
txtpreproceesor = shorttext.utils.text_preprocessor(pipeline)


The function <code>txtpreprocessor</code> above perform the functions we talked about in R.

usprezdf = pd.read_csv('inaugural.csv')


The corpus needs to be preprocessed before putting into the DTM:

docids = list(usprezdf['yrprez'])    # defining document IDs
corpus = [txtpreproceesor(speech).split(' ') for speech in usprezdf['speech']]


Then create the DTM:

dtm = shorttext.utils.DocumentTermMatrix(corpus, docids=docids, tfidf=False)


Then we do the same thing as we have done above. To get the top-most common words in 2009’s Obama’s speech, enter:

dtm.get_doc_tokens('2009-Obama')


Or we look up which speeches have the word “change”:

dtm.get_token_occurences(stem('change'))


Or to get the document frequency of the word:

dtm.get_doc_frequency(stem('change'))


They Python and R codes give different document frequencies probably because the two stemmers work slightly differently.

The Python package for text mining shorttext has a new release: 0.5.4. It can be installed by typing in the command line:

pip install -U shorttext

For some people, you may need to install it from “root”, i.e., adding sudo in front of the command. Since the version 0.5 (including releases 0.5.1 and 0.5.4), there have been substantial addition of functionality, mostly about comparisons between short phrases without running a supervised or unsupervised machine learning algorithm, but calculating the “similarity” with various metrics, including:

• soft Jaccard score (the same kind of fuzzy scores based on edit distance in SOCcer),
• Word Mover’s distance (WMD, detailedly described in a previous post), and
• Jaccard index due to word-embedding model.

For the soft Jaccard score due to edit distance, we can call it by:

>>> from shorttext.metrics.dynprog import soft_jaccard_score
>>> soft_jaccard_score(['book', 'seller'], ['blok', 'sellers'])     # gives 0.6716417910447762
>>> soft_jaccard_score(['police', 'station'], ['policeman'])        # gives 0.2857142857142858

The core of this code was written in C, and interfaced to Python using SWIG.

For the Word Mover’s Distance (WMD), while the source codes are the same as my previous post, it can now be called directly. First, load the modules and the word-embedding model:

>>> from shorttext.metrics.wasserstein import word_mover_distance
>>> from shorttext.utils import load_word2vec_model
>>> wvmodel = load_word2vec_model('/path/to/model_file.bin')

And compute the WMD with a single function:

>>> word_mover_distance(['police', 'station'], ['policeman'], wvmodel)                      # gives 3.060708999633789
>>> word_mover_distance(['physician', 'assistant'], ['doctor', 'assistants'], wvmodel)      # gives 2.276337146759033

And the Jaccard index due to cosine distance in Word-embedding model can be called like this:

>>> from shorttext.metrics.embedfuzzy import jaccardscore_sents
>>> jaccardscore_sents('doctor', 'physician', wvmodel)   # gives 0.6401538990056869
>>> jaccardscore_sents('chief executive', 'computer cluster', wvmodel)   # gives 0.0022515450768836143
>>> jaccardscore_sents('topological data', 'data of topology', wvmodel)   # gives 0.67588977344632573

Most new functions can be found in this tutorial.

And there are some minor bugs fixed.

Much about the use of word-embedding models such as Word2Vec and GloVe have been covered. However, how to measure the similarity between phrases or documents? One natural choice is the cosine similarity, as I have toyed with in a previous post. However, it smoothed out the influence of each word. Two years ago, a group in Washington University in St. Louis proposed the Word Mover’s Distance (WMD) in a PMLR paper that captures the relations between words, not simply by distance, but also the “transportation” from one phrase to another conveyed by each word. This Word Mover’s Distance (WMD) can be seen as a special case of Earth Mover’s Distance (EMD), or Wasserstein distance, the one people talked about in Wasserstein GAN. This is better than bag-of-words (BOW) model in a way that the word vectors capture the semantic similarities between words.

# Word Mover’s Distance (WMD)

The formulation of WMD is beautiful. Consider the embedded word vectors $\mathbf{X} \in R^{d \times n}$, where $d$ is the dimension of the embeddings, and $n$ is the number of words. For each phrase, there is a normalized BOW vector $d \in R^n$, and $d_i = \frac{c_i}{\sum_i c_i}$, where $i$‘s denote the word tokens. The distance between words are the Euclidean distance of their embedded word vectors, denoted by $c(i, j) = || \mathbf{x}_i - \mathbf{x}_j ||_2$, where $i$ and $j$ denote word tokens. The document distance, which is WMD here, is defined by $\sum_{i, j} \mathbf{T}_{i j} c(i, j)$, where $\mathbf{T}$ is a $n \times n$ matrix. Each element $\mathbf{T}_{ij} \geq 0$ denote how nuch of word $i$ in the first document (denoted by $\mathbf{d}$) travels to word $j$ in the new document (denoted by $\mathbf{d}'$).

Then the problem becomes the minimization of the document distance, or the WMD, and is formulated as:

$\text{min}_{\mathbf{T} \geq 0} \sum_{i, j=1}^n \mathbf{T}_{ij} c(i, j)$,

given the constraints:

$\sum_{j=1}^n \mathbf{T}_{ij} = d_i$, and

$\sum_{i=1}^n \mathbf{T}_{ij} = d_j'$.

This is essentially a simplified case of the Earth Mover’s distance (EMD), or the Wasserstein distance. (See the review by Gibbs and Su.)

# Using PuLP

The WMD is essentially a linear optimization problem. There are many optimization packages on the market, and my stance is that, for those common ones, there are no packages that are superior than others. In my job, I happened to handle a missing data problem, in turn becoming a non-linear optimization problem with linear constraints, and I chose limSolve, after I shop around. But I actually like a lot of other packages too. For WMD problem, I first tried out cvxopt first, which should actually solve the exact same problem, but the indexing is hard to maintain. Because I am dealing with words, it is good to have a direct hash map, or a dictionary. I can use the Dictionary class in gensim. But I later found out I should use PuLP, as it allows indices with words as a hash map (dict in Python), and WMD is a linear programming problem, making PuLP is a perfect choice, considering code efficiency.

An example of using PuLP can be demonstrated by the British 1997 UG Exam, as in the first problem of this link, with the Jupyter Notebook demonstrating this.

# Implementation of WMD using PuLP

The demonstration can be found in the Jupyter Notebook.

Load the necessary packages:

from itertools import product
from collections import defaultdict

import numpy as np
from scipy.spatial.distance import euclidean
import pulp
import gensim

Then define the functions the gives the BOW document vectors:

def tokens_to_fracdict(tokens):
cntdict = defaultdict(lambda : 0)
for token in tokens:
cntdict[token] += 1
totalcnt = sum(cntdict.values())
return {token: float(cnt)/totalcnt for token, cnt in cntdict.items()}

Then implement the core calculation. Note that PuLP is actually a symbolic computing package. This function return a pulp.LpProblem class:

def word_mover_distance_probspec(first_sent_tokens, second_sent_tokens, wvmodel, lpFile=None):
all_tokens = list(set(first_sent_tokens+second_sent_tokens))
wordvecs = {token: wvmodel[token] for token in all_tokens}

first_sent_buckets = tokens_to_fracdict(first_sent_tokens)
second_sent_buckets = tokens_to_fracdict(second_sent_tokens)

T = pulp.LpVariable.dicts('T_matrix', list(product(all_tokens, all_tokens)), lowBound=0)

prob = pulp.LpProblem('WMD', sense=pulp.LpMinimize)
prob += pulp.lpSum([T[token1, token2]*euclidean(wordvecs[token1], wordvecs[token2])
for token1, token2 in product(all_tokens, all_tokens)])
for token2 in second_sent_buckets:
prob += pulp.lpSum([T[token1, token2] for token1 in first_sent_buckets])==second_sent_buckets[token2]
for token1 in first_sent_buckets:
prob += pulp.lpSum([T[token1, token2] for token2 in second_sent_buckets])==first_sent_buckets[token1]

if lpFile!=None:
prob.writeLP(lpFile)

prob.solve()

return prob

To extract the value, just run pulp.value(prob.objective)

We use Google Word2Vec. Refer the $\mathbf{T}$ matrices in the Jupyter Notebook. Running this by a few examples:

1. document1 = President, talk, Chicago
document2 = President, speech, Illinois
WMD = 2.88587622936
2. document1 = physician, assistant
document2 = doctor
WMD = 2.8760048151
3. document1 = physician, assistant
document2 = doctor, assistant
WMD = 1.00465738773
(compare with example 2!)
4. document1 = doctors, assistant
document2 = doctor, assistant
WMD = 1.02825379372
(compare with example 3!)
5. document1 = doctor, assistant
document2 = doctor, assistant
WMD = 0.0
(totally identical; compare with example 3!)

There are more examples in the notebook.

# Conclusion

WMD is a good metric comparing two documents or sentences, by capturing the semantic meanings of the words. It is more powerful than BOW model as it captures the meaning similarities; it is more powerful than the cosine distance between average word vectors, as the transfer of meaning using words from one document to another is considered. But it is not immune to the problem of misspelling.

This algorithm works well for short texts. However, when the documents become large, this formulation will be computationally expensive. The author actually suggested a few modifications, such as the removal of constraints, and word centroid distances.

Example codes can be found in my Github repository: stephenhky/PyWMD.

On 07/28/2017, shorttext published its release 0.4.1, with a few important updates. To install it, type the following in the OS X / Linux command line:

>>> pip install -U shorttext

The documentation in PythonHosted.org has been abandoned. It has been migrated to readthedocs.org. (URL: http://shorttext.readthedocs.io/ or http:// shorttext.rtfd.io)

## Exploiting the Word-Embedding Layer

This update is mainly due to an important update in gensim, motivated by earlier shorttext‘s effort in integrating scikit-learn and keras. And gensim also provides a keras layer, on the same footing as other neural networks, activation function, or dropout layers, for Word2Vec models. Because shorttext has been making use of keras layers for categorization, such advance in gensim in fact makes it a natural step to add an embedding layer of all neural networks provided in shorttext. How to do it? (See shorttext tutorial for “Deep Neural Networks with Word Embedding.”)

import shorttext
wvmodel = shorttext.utils.load_word2vec_model('/path/to/GoogleNews-vectors-negative300.bin.gz') &nbsp; # load the pre-trained Word2Vec model
trainclassdict = shorttext.data.subjectkeywords() &nbsp; # load an example data set


To train a model, you can do it the old way, or do it the new way with additional gensim function:

kmodel = shorttext.classifiers.frameworks.CNNWordEmbed(wvmodel=wvmodel, nb_labels=len(trainclassdict.keys()), vecsize=100, with_gensim=True) &nbsp; # keras model, setting with_gensim=True
classifier = shorttext.classifiers.VarNNEmbeddedVecClassifier(wvmodel, with_gensim=True, vecsize=100) &nbsp; # instantiate the classifier, setting with_gensim=True
classifier.train(trainclassdict, kmodel)


The parameters with_gensim in both CNNWordEmbed and VarNNEmbeddedVecClassifier are set to be False by default, because of backward compatibility. However, setting it to be True will enable it to use the new gensim Word2Vec layer.

These change in gensim and shorttext are the works mainly contributed by Chinmaya Pancholi, a very bright student at Indian Institute of Technology, Kharagpur, and a GSoC (Google Summer of Code) student in 2017. He revolutionized gensim by integrating scikit-learn and keras into gensim. He also used what he did in gensim to improve the pipelines of shorttext. He provided valuable technical suggestions. You can read his GSoC proposal, and his blog posts in RaRe Technologies, Inc. Chinmaya has been diligently mentored by Ivan Menshikh and Lev Konstantinovskiy of RaRe Technologies.

## Maxent Classifier

Another important update is the adding of maximum entropy (maxent) classifier. (See the corresponding tutorial on “Maximum Entropy (MaxEnt) Classifier.”) I will devote a separate entry on the theory, but it is very easy to use it,

import shorttext
from shorttext.classifiers import MaxEntClassifier

classifier = MaxEntClassifier()


Use the NIHReports dataset as the example:

classdict = shorttext.data.nihreports()
classifier.train(classdict, nb_epochs=1000)


The classification is just like other classifiers provided by shorttext:

classifier.score('cancer immunology') # NCI tops the score
classifier.score('children health') # NIAID tops the score
classifier.score('Alzheimer disease and aging') # NIAID tops the score


The package shorttext has received attention for the past two months. A new release is released yesterday for the following updates:

1. Removal attempts of loading GloVe model, as it can be run using gensim script;
2. Confirmed compatibility of the package with Tensorflow;
3. Use of spacy for tokenization, instead of nltk;
4. Use of stemming for Porter stemmer, instead of nltk;
5. Removal of nltk dependencies;
6. Simplifying the directory and module structures;
7. Module packages updated.

For #1, it actually removed a bug in the previous release. Instead, the users should convert the GloVe models into Word2Vec using the script provided by gensim.

For #3, #4, and #5, it is basically removing any nltk dependencies, because very few functionalities of nltk was used, and it is slow. For Porter stemmer, there is a light-weighted library stemming that performs the task perfectly. For tokenization, the tokenizer in spaCy is significantly faster than nltk, as shown in this Jupyter Notebook. We can do a simple test here, by first importing:

import time
import shorttext


Then load the NIH data:

nihdata = shorttext.data.nihreports()
nihtext = ' '.join(map(lambda item: ' '.join(item[1]), nihdata.items()))


Then find the time of using the tokenizer in nltk:

from nltk import word_tokenize

nltkt0 = time.time()
tokens = word_tokenize(nihtext)
nltkt1 = time.time()
print nltkt1-nltkt0, ' sec'   # output: 0.0224239826202 sec


On the other hand, using spaCy gives:

import spacy

spt0 = time.time()
doc = nlp(unicode(nihtext))
tokens1 = [token for token in doc]
tokens1 = map(str, tokens1)
spt1 = time.time()

print spt1-spt0, ' sec'   # output: 0.00799107551575 sec


Clearly, spaCy is three times faster.

#6 indicates a simplification of package structure. Previously, for example, the neural network framework was in shorttext.classifiers.embed.nnlib.frameworks, but now it is shorttext.classifiers.frameworks. But the old package structure is kept for backward compatibility.

There has been a lot of methods for natural language processing and text mining. However, in tweets, surveys, Facebook, or many online data, texts are short, lacking data to build enough information. Traditional bag-of-words (BOW) model gives sparse vector representation.

Semantic relations between words are important, because we usually do not have enough data to capture the similarity between words. We do not want “drive” and “drives,” or “driver” and “chauffeur” to be completely different.

The relation between or order of words become important as well. Or we want to capture the concepts that may be correlated in our training dataset.

We have to represent these texts in a special way and perform supervised learning with traditional machine learning algorithms or deep learning algorithms.

This package shorttext‘ was designed to tackle all these problems. It is not a completely new invention, but putting everything known together. It contains the following features:

• example data provided (including subject keywords and NIH RePORT);
• text preprocessing;
• pre-trained word-embedding support;
• gensim topic models (LDA, LSI, Random Projections) and autoencoder;
• topic model representation supported for supervised learning using scikit-learn;
• cosine distance classification; and
• neural network classification (including ConvNet, and C-LSTM).

Readers can refer this to the documentation.

There are situations that we deal with short text, probably messy, without a lot of training data. In that case, we need external semantic information. Instead of using the conventional bag-of-words (BOW) model, we should employ word-embedding models, such as Word2Vec, GloVe etc.

Suppose we want to perform supervised learning, with three subjects, described by the following Python dictionary:

classdict={'mathematics': ['linear algebra',
'topology',
'algebra',
'calculus',
'variational calculus',
'functional field',
'real analysis',
'complex analysis',
'differential equation',
'statistics',
'statistical optimization',
'probability',
'stochastic calculus',
'numerical analysis',
'differential geometry'],
'physics': ['renormalization',
'classical mechanics',
'quantum mechanics',
'statistical mechanics',
'functional field',
'path integral',
'quantum field theory',
'electrodynamics',
'condensed matter',
'particle physics',
'topological solitons',
'astrophysics',
'spontaneous symmetry breaking',
'atomic molecular and optical physics',
'quantum chaos'],
'theology': ['divine providence',
'soteriology',
'anthropology',
'pneumatology',
'Christology',
'Holy Trinity',
'eschatology',
'scripture',
'ecclesiology',
'predestination',
'divine degree',
'creedal confessionalism',
'scholasticism',
'prayer',
'eucharist']}


And we implemented Word2Vec here. To add external information, we use a pre-trained Word2Vec model from Google, downloaded here. We can use it with Python package gensim. To load it, enter

from gensim.models import Word2Vec


How do we represent a phrase in Word2Vec? How do we do the classification? Here I wrote two classes to do it.

#### Average

We can represent a sentence by summing the word-embedding representations of each word. The class, inside SumWord2VecClassification.py, is coded as follow:

from collections import defaultdict

import numpy as np
from nltk import word_tokenize
from scipy.spatial.distance import cosine

from utils import ModelNotTrainedException

class SumEmbeddedVecClassifier:
def __init__(self, wvmodel, classdict, vecsize=300):
self.wvmodel = wvmodel
self.classdict = classdict
self.vecsize = vecsize
self.trained = False

def train(self):
self.addvec = defaultdict(lambda : np.zeros(self.vecsize))
for classtype in self.classdict:
for shorttext in self.classdict[classtype]:
self.trained = True

def shorttext_to_embedvec(self, shorttext):
vec = np.zeros(self.vecsize)
tokens = word_tokenize(shorttext)
for token in tokens:
if token in self.wvmodel:
vec += self.wvmodel[token]
norm = np.linalg.norm(vec)
if norm!=0:
vec /= np.linalg.norm(vec)
return vec

def score(self, shorttext):
if not self.trained:
raise ModelNotTrainedException()
vec = self.shorttext_to_embedvec(shorttext)
scoredict = {}
for classtype in self.addvec:
try:
scoredict[classtype] = 1 - cosine(vec, self.addvec[classtype])
except ValueError:
scoredict[classtype] = np.nan
return scoredict


Here the exception ModelNotTrainedException is just an exception raised if the model has not been trained yet, but scoring function was called by the user. (Codes listed in my Github repository.) The similarity will be calculated by cosine similarity.

Such an implementation is easy to understand and carry out. It is good enough for a lot of application. However, it has the problem that it does not take the relation between words or word order into account.

#### Convolutional Neural Network

To tackle the problem of word relations, we have to use deeper neural networks. Yoon Kim published a well cited paper regarding this in EMNLP in 2014, titled “Convolutional Neural Networks for Sentence Classification.” The model architecture is as follow: (taken from his paper)

Each word is represented by an embedded vector, but neighboring words are related through the convolutional matrix. And MaxPooling and a dense neural network were implemented afterwards. His paper involves multiple filters with variable window sizes / spatial extent, but for our cases of short phrases, I just use one window of size 2 (similar to dealing with bigram). While Kim implemented using Theano (see his Github repository), I implemented using keras with Theano backend. The codes, inside CNNEmbedVecClassification.py, are as follow:

import numpy as np
from keras.layers import Convolution1D, MaxPooling1D, Flatten, Dense
from keras.models import Sequential
from nltk import word_tokenize

from utils import ModelNotTrainedException

class CNNEmbeddedVecClassifier:
def __init__(self,
wvmodel,
classdict,
n_gram,
vecsize=300,
nb_filters=1200,
maxlen=15):
self.wvmodel = wvmodel
self.classdict = classdict
self.n_gram = n_gram
self.vecsize = vecsize
self.nb_filters = nb_filters
self.maxlen = maxlen
self.trained = False

def convert_trainingdata_matrix(self):
classlabels = self.classdict.keys()
lblidx_dict = dict(zip(classlabels, range(len(classlabels))))

# tokenize the words, and determine the word length
phrases = []
indices = []
for label in classlabels:
for shorttext in self.classdict[label]:
category_bucket = [0]*len(classlabels)
category_bucket[lblidx_dict[label]] = 1
indices.append(category_bucket)
phrases.append(word_tokenize(shorttext))

# store embedded vectors
train_embedvec = np.zeros(shape=(len(phrases), self.maxlen, self.vecsize))
for i in range(len(phrases)):
for j in range(min(self.maxlen, len(phrases[i]))):
train_embedvec[i, j] = self.word_to_embedvec(phrases[i][j])
indices = np.array(indices, dtype=np.int)

return classlabels, train_embedvec, indices

def train(self):
# convert classdict to training input vectors
self.classlabels, train_embedvec, indices = self.convert_trainingdata_matrix()

# build the deep neural network model
model = Sequential()
filter_length=self.n_gram,
border_mode='valid',
activation='relu',
input_shape=(self.maxlen, self.vecsize)))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# train the model
model.fit(train_embedvec, indices)

# flag switch
self.model = model
self.trained = True

def word_to_embedvec(self, word):
return self.wvmodel[word] if word in self.wvmodel else np.zeros(self.vecsize)

def shorttext_to_matrix(self, shorttext):
tokens = word_tokenize(shorttext)
matrix = np.zeros((self.maxlen, self.vecsize))
for i in range(min(self.maxlen, len(tokens))):
matrix[i] = self.word_to_embedvec(tokens[i])
return matrix

def score(self, shorttext):
if not self.trained:
raise ModelNotTrainedException()

# retrieve vector
matrix = np.array([self.shorttext_to_matrix(shorttext)])

# classification using the neural network
predictions = self.model.predict(matrix)

# wrangle output result
scoredict = {}
for idx, classlabel in zip(range(len(self.classlabels)), self.classlabels):
scoredict[classlabel] = predictions[0][idx]
return scoredict
`

The output is a vector of length equal to the number of class labels, 3 in our example. The elements of the output vector add up to one, indicating its score, and a nature of probability.

#### Evaluation

A simple cross-validation to the example data set does not tell a difference between the two algorithms:

However, we can test the algorithm with a few examples:

Example 1: “renormalization”

• Average: {‘mathematics’: 0.54135105096749336, ‘physics’: 0.63665460856632494, ‘theology’: 0.31014049736087901}
• CNN: {‘mathematics’: 0.093827009201049805, ‘physics’: 0.85451591014862061, ‘theology’: 0.051657050848007202}

As renormalization was a strong word in the training data, it gives an easy result. CNN can distinguish much more clearly.

Example 2: “salvation”

• Average: {‘mathematics’: 0.14939650156482298, ‘physics’: 0.21692765541184023, ‘theology’: 0.5698233329716329}
• CNN: {‘mathematics’: 0.012395491823554039, ‘physics’: 0.022725773975253105, ‘theology’: 0.96487873792648315}

“Salvation” is not found in the training data, but it is closely related to “soteriology,” which means the doctrine of salvation. So it correctly identifies it with theology.

Example 3: “coffee”

• Average: {‘mathematics’: 0.096820211601723272, ‘physics’: 0.081567332119268032, ‘theology’: 0.15962682945135631}
• CNN: {‘mathematics’: 0.27321341633796692, ‘physics’: 0.1950736939907074, ‘theology’: 0.53171288967132568}

Coffee is not related to all subjects. The first architecture correctly indicates the fact, but CNN, with its probabilistic nature, has to roughly equally distribute it (but not so well.)

The code can be found in my Github repository: stephenhky/PyShortTextCategorization. (This repository has been updated since this article was published. The link shows the version of the code when this appeared online.)

There are many tasks that involve coding, for example, putting kids into groups according to their age, labeling the webpages about their kinds, or putting students in Hogwarts into four colleges… And researchers or lawyers need to code people, according to their filled-in information, into occupations. Melissa Friesen, an investigator in Division of Cancer Epidemiology and Genetics (DCEG), National Cancer Institute (NCI), National Institutes of Health (NIH), saw the need of large-scale coding. Many researchers are dealing with big data concerning epidemiology. She led a research project, in collaboration with Office of Intramural Research (OIR), Center for Information Technology (CIT), National Institutes of Health (NIH), to develop an artificial intelligence system to cope with the problem. This leads to a publicly available tool called SOCcer, an acronym for “Standardized Occupation Coding for Computer-assisted Epidemiological Research.” (URL: http://soccer.nci.nih.gov/soccer/)

The system was initially developed in an attempt to find the correlation between the onset of cancers and other diseases and the occupation. “The application is not intended to replace expert coders, but rather to prioritize which job descriptions would benefit most from expert review,” said Friesen in an interview. She mainly works with Daniel Russ in CIT.

SOCcer takes job title, industry codes (in terms of SIC, Standard Industrial Classification), and job duties, and gives an occupational code called SOC 2010 (Standard Occupational Classification), used by U. S. federal government agencies. The data involves short text, often messy. There are 840 codes in SOC 2010 systems. Conventional natural language processing (NLP) methods may not apply. Friesen, Russ, and Kwan-Yuet (Stephen) Ho (also in OIR, CIT; a CSRA staff) use fuzzy logic, and maximum entropy (maxent) methods, with some feature engineering, to build various classifiers. These classifiers are aggregated together, as in stacked generalization (see my previous entry), using logistic regression, to give a final score.

SOCcer has a companion software, called SOCAssign, for expert coders to prioritize the codings. It was awarded with DCEG Informatics Tool Challenge 2015. SOCcer itself was awarded in 2016. And the SOCcer team was awarded for Scientific Award of Merit by CIT/OCIO in 2016 as well (see this). Their work was published in Occup. Environ. Med.