Short Text Categorization using Deep Neural Networks and Word-Embedding Models

There are situations that we deal with short text, probably messy, without a lot of training data. In that case, we need external semantic information. Instead of using the conventional bag-of-words (BOW) model, we should employ word-embedding models, such as Word2Vec, GloVe etc.

Suppose we want to perform supervised learning, with three subjects, described by the following Python dictionary:

classdict={'mathematics': ['linear algebra',
           'topology',
           'algebra',
           'calculus',
           'variational calculus',
           'functional field',
           'real analysis',
           'complex analysis',
           'differential equation',
           'statistics',
           'statistical optimization',
           'probability',
           'stochastic calculus',
           'numerical analysis',
           'differential geometry'],
          'physics': ['renormalization',
           'classical mechanics',
           'quantum mechanics',
           'statistical mechanics',
           'functional field',
           'path integral',
           'quantum field theory',
           'electrodynamics',
           'condensed matter',
           'particle physics',
           'topological solitons',
           'astrophysics',
           'spontaneous symmetry breaking',
           'atomic molecular and optical physics',
           'quantum chaos'],
          'theology': ['divine providence',
           'soteriology',
           'anthropology',
           'pneumatology',
           'Christology',
           'Holy Trinity',
           'eschatology',
           'scripture',
           'ecclesiology',
           'predestination',
           'divine degree',
           'creedal confessionalism',
           'scholasticism',
           'prayer',
           'eucharist']}

And we implemented Word2Vec here. To add external information, we use a pre-trained Word2Vec model from Google, downloaded here. We can use it with Python package gensim. To load it, enter

from gensim.models import Word2Vec
wvmodel = Word2Vec.load_word2vec_format('<path-to>/GoogleNews-vectors-negative300.bin.gz', binary=True)

How do we represent a phrase in Word2Vec? How do we do the classification? Here I wrote two classes to do it.

Average

We can represent a sentence by summing the word-embedding representations of each word. The class, inside SumWord2VecClassification.py, is coded as follow:

from collections import defaultdict

import numpy as np
from nltk import word_tokenize
from scipy.spatial.distance import cosine

from utils import ModelNotTrainedException

class SumEmbeddedVecClassifier:
    def __init__(self, wvmodel, classdict, vecsize=300):
        self.wvmodel = wvmodel
        self.classdict = classdict
        self.vecsize = vecsize
        self.trained = False

    def train(self):
        self.addvec = defaultdict(lambda : np.zeros(self.vecsize))
        for classtype in self.classdict:
            for shorttext in self.classdict[classtype]:
                self.addvec[classtype] += self.shorttext_to_embedvec(shorttext)
            self.addvec[classtype] /= np.linalg.norm(self.addvec[classtype])
        self.addvec = dict(self.addvec)
        self.trained = True

    def shorttext_to_embedvec(self, shorttext):
        vec = np.zeros(self.vecsize)
        tokens = word_tokenize(shorttext)
        for token in tokens:
            if token in self.wvmodel:
                vec += self.wvmodel[token]
        norm = np.linalg.norm(vec)
        if norm!=0:
            vec /= np.linalg.norm(vec)
        return vec

    def score(self, shorttext):
        if not self.trained:
            raise ModelNotTrainedException()
        vec = self.shorttext_to_embedvec(shorttext)
        scoredict = {}
        for classtype in self.addvec:
            try:
                scoredict[classtype] = 1 - cosine(vec, self.addvec[classtype])
            except ValueError:
                scoredict[classtype] = np.nan
        return scoredict

Here the exception ModelNotTrainedException is just an exception raised if the model has not been trained yet, but scoring function was called by the user. (Codes listed in my Github repository.) The similarity will be calculated by cosine similarity.

Such an implementation is easy to understand and carry out. It is good enough for a lot of application. However, it has the problem that it does not take the relation between words or word order into account.

Convolutional Neural Network

To tackle the problem of word relations, we have to use deeper neural networks. Yoon Kim published a well cited paper regarding this in EMNLP in 2014, titled “Convolutional Neural Networks for Sentence Classification.” The model architecture is as follow: (taken from his paper)

cnn

Each word is represented by an embedded vector, but neighboring words are related through the convolutional matrix. And MaxPooling and a dense neural network were implemented afterwards. His paper involves multiple filters with variable window sizes / spatial extent, but for our cases of short phrases, I just use one window of size 2 (similar to dealing with bigram). While Kim implemented using Theano (see his Github repository), I implemented using keras with Theano backend. The codes, inside CNNEmbedVecClassification.py, are as follow:

import numpy as np
from keras.layers import Convolution1D, MaxPooling1D, Flatten, Dense
from keras.models import Sequential
from nltk import word_tokenize

from utils import ModelNotTrainedException

class CNNEmbeddedVecClassifier:
    def __init__(self,
                 wvmodel,
                 classdict,
                 n_gram,
                 vecsize=300,
                 nb_filters=1200,
                 maxlen=15):
        self.wvmodel = wvmodel
        self.classdict = classdict
        self.n_gram = n_gram
        self.vecsize = vecsize
        self.nb_filters = nb_filters
        self.maxlen = maxlen
        self.trained = False

    def convert_trainingdata_matrix(self):
        classlabels = self.classdict.keys()
        lblidx_dict = dict(zip(classlabels, range(len(classlabels))))

        # tokenize the words, and determine the word length
        phrases = []
        indices = []
        for label in classlabels:
            for shorttext in self.classdict[label]:
                category_bucket = [0]*len(classlabels)
                category_bucket[lblidx_dict[label]] = 1
                indices.append(category_bucket)
                phrases.append(word_tokenize(shorttext))

        # store embedded vectors
        train_embedvec = np.zeros(shape=(len(phrases), self.maxlen, self.vecsize))
        for i in range(len(phrases)):
            for j in range(min(self.maxlen, len(phrases[i]))):
                train_embedvec[i, j] = self.word_to_embedvec(phrases[i][j])
        indices = np.array(indices, dtype=np.int)

        return classlabels, train_embedvec, indices

    def train(self):
        # convert classdict to training input vectors
        self.classlabels, train_embedvec, indices = self.convert_trainingdata_matrix()

        # build the deep neural network model
        model = Sequential()
        model.add(Convolution1D(nb_filter=self.nb_filters,
                                filter_length=self.n_gram,
                                border_mode='valid',
                                activation='relu',
                                input_shape=(self.maxlen, self.vecsize)))
        model.add(MaxPooling1D(pool_length=self.maxlen-self.n_gram+1))
        model.add(Flatten())
        model.add(Dense(len(self.classlabels), activation='softmax'))
        model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

        # train the model
        model.fit(train_embedvec, indices)

        # flag switch
        self.model = model
        self.trained = True

    def word_to_embedvec(self, word):
        return self.wvmodel[word] if word in self.wvmodel else np.zeros(self.vecsize)

    def shorttext_to_matrix(self, shorttext):
        tokens = word_tokenize(shorttext)
        matrix = np.zeros((self.maxlen, self.vecsize))
        for i in range(min(self.maxlen, len(tokens))):
            matrix[i] = self.word_to_embedvec(tokens[i])
        return matrix

    def score(self, shorttext):
        if not self.trained:
            raise ModelNotTrainedException()

        # retrieve vector
        matrix = np.array([self.shorttext_to_matrix(shorttext)])

        # classification using the neural network
        predictions = self.model.predict(matrix)

        # wrangle output result
        scoredict = {}
        for idx, classlabel in zip(range(len(self.classlabels)), self.classlabels):
            scoredict[classlabel] = predictions[0][idx]
        return scoredict

The output is a vector of length equal to the number of class labels, 3 in our example. The elements of the output vector add up to one, indicating its score, and a nature of probability.

Evaluation

A simple cross-validation to the example data set does not tell a difference between the two algorithms:

rplot_acc1

However, we can test the algorithm with a few examples:

Example 1: “renormalization”

  • Average: {‘mathematics’: 0.54135105096749336, ‘physics’: 0.63665460856632494, ‘theology’: 0.31014049736087901}
  • CNN: {‘mathematics’: 0.093827009201049805, ‘physics’: 0.85451591014862061, ‘theology’: 0.051657050848007202}

As renormalization was a strong word in the training data, it gives an easy result. CNN can distinguish much more clearly.

Example 2: “salvation”

  • Average: {‘mathematics’: 0.14939650156482298, ‘physics’: 0.21692765541184023, ‘theology’: 0.5698233329716329}
  • CNN: {‘mathematics’: 0.012395491823554039, ‘physics’: 0.022725773975253105, ‘theology’: 0.96487873792648315}

“Salvation” is not found in the training data, but it is closely related to “soteriology,” which means the doctrine of salvation. So it correctly identifies it with theology.

Example 3: “coffee”

  • Average: {‘mathematics’: 0.096820211601723272, ‘physics’: 0.081567332119268032, ‘theology’: 0.15962682945135631}
  • CNN: {‘mathematics’: 0.27321341633796692, ‘physics’: 0.1950736939907074, ‘theology’: 0.53171288967132568}

Coffee is not related to all subjects. The first architecture correctly indicates the fact, but CNN, with its probabilistic nature, has to roughly equally distribute it (but not so well.)

The code can be found in my Github repository: stephenhky/PyShortTextCategorization. (This repository has been updated since this article was published. The link shows the version of the code when this appeared online.)

Continue reading “Short Text Categorization using Deep Neural Networks and Word-Embedding Models”

Probabilistic Theory of Word Embeddings: GloVe

The topic of word embedding algorithms has been one of the interests of this blog, as in this entry, with Word2Vec [Mikilov et. al. 2013] as one of the main examples. It is a great tool for text mining, (for example, see [Czerny 2015],) as it reduces the dimensions needed (compared to bag-of-words model). As an algorithm borrowed from computer vision, a lot of these algorithms use deep learning methods to train the model, while it was not exactly sure why it works. Despite that, there are many articles talking about how to train the model. [Goldberg & Levy 2014, Rong 2014 etc.] Addition and subtraction of the word vectors show amazing relationships that carry semantic values, as I have shown in my previous blog entry. [Ho 2015]

However, Tomas Mikolov is no longer working in Google, making the development of this algorithm discontinued. As a follow-up of their work, Stanford NLP group later proposed a model, called GloVe (Global Vectors), that embeds word vectors using probabilistic methods. [Pennington, Socher & Manning 2014] It can be implemented in the package glove-python in python, and text2vec in R (or see their CRAN post).  Their paper is neatly written, and a highly recommended read.

To explain the theory of GloVe, we start with some basic probabilistic picture in basic natural language processing (NLP). We suppose the relation between the words occur in certain text windows within a corpus, but the details are not important here. Assume that i, j, and k are three words, and the conditional probability P_{ik} is defined as

P_{ij} = P(j | i) = \frac{X_{ij}}{X_i},

where X‘s are the counts, and similarly for P_{jk}. And we are interested in the following ratio:

F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}.

The tilde means “context,” but we will later assume it is also a word. Citing the example from their paper, take i as ice, and j as steam. if k is solid, then the ratio is expected to be large; or if k is gas, then it is expected to be low. But if k is water, which are related to both, or fashion, which is related to none, then the ratio is expected to be approximately 1.

And the addition and subtraction in Word2Vec is similar to this. We want the ratio to be like the subtraction as in Word2Vec (and multiplication as in addition), then we should modify the function F such that,

F(w_i - w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}.

On the other hand, the input arguments of F are vectors, but the output is a scalar. We avoid the issue by making the input argument as a dot product,

F( (w_i - w_j)^T \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}.

In NLP, the word-word co-occurrence matrices are symmetric, and our function F should also be invariant under switching the labeling. We first require F is be a homomorphism,

F((w_i - w_j)^T \tilde{w}_k) = \frac{F(w_i^T \tilde{w}_k) }{ F(w_j^T \tilde{w}_k)},

where we define,

F(w_i^T \tilde{w}_k) = P_{ik} = \frac{X_{ik}}{X_i}.

It is clear that F is an exponential function, but to ensure symmetry, we further define:

w_i^T \tilde{w}_k + b_i + \tilde{b}_k = \log X_{ik}.

As a result of this equation, the authors defined the following cost function to optimize for GloVe model:

J = \sum_{i, j=1}^V f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ik} \right)^2,

where w_j, \tilde{w}_j, b_i, and \tilde{b}_j are parameters to learn. f(x) is a weighting function. (Refer the details to the paper.) [Pennington, Socher & Manning 2014]

As Radim Řehůřek said in his blog entry, [Řehůřek 2014] it is a neat paper, but their evaluation is crappy.

This theory explained why certain similar relations can be achieved, such as Paris – France is roughly equal to Beijing – China, as both can be transformed to the ratio in the definition of F above.

It is a neat paper, as it employs optimization theory and probability theory, without any dark box deep learning.

Continue reading “Probabilistic Theory of Word Embeddings: GloVe”

Create a free website or blog at WordPress.com.

Up ↑