Release of shorttext 0.3.3

On November 21, 2016, the Python package `shorttext’ was published. Until today, more than seven versions have been published. There have been a drastic architecture change, but the overall purpose is still the same, as summarized in the first introduction entry:

This package `shorttext‘ was designed to tackle all these problems… It contains the following features:

  • example data provided (including subject keywords and NIH RePORT);
  • text preprocessing;
  • pre-trained word-embedding support;
  • gensim topic models (LDA, LSI, Random Projections) and autoencoder;
  • topic model representation supported for supervised learning using scikit-learn;
  • cosine distance classification; and
  • neural network classification (including ConvNet, and C-LSTM).

And since the first version, there have been updates, as summarized in the documention (News):

Version 0.3.3 (Apr 19, 2017)

  • Deleted CNNEmbedVecClassifier.
  • Added script ShortTextWord2VecSimilarity.

Version 0.3.2 (Mar 28, 2017)

  • Bug fixed for gensim model I/O;
  • Console scripts update;
  • Neural networks up to Keras 2 standard (refer to this).

Version 0.3.1 (Mar 14, 2017)

  • Compact model I/O: all models are in single files;
  • Implementation of stacked generalization using logistic regression.

Version 0.2.1 (Feb 23, 2017)

  • Removal attempts of loading GloVe model, as it can be run using gensim script;
  • Confirmed compatibility of the package with tensorflow;
  • Use of spacy for tokenization, instead of nltk;
  • Use of stemming for Porter stemmer, instead of nltk;
  • Removal of nltk dependencies;
  • Simplifying the directory and module structures;
  • Module packages updated.

Although there are still additions that I would love to add, but it would not change the overall architecture. I may add some more supervised learning algorithms, but under the same network. The upcoming big additions will be generative models or seq2seq models, but I do not see them coming in the short term. I will add corpuses.

I may add tutorials if I have time.

I am thankful that there is probably some external collaboration with other Python packages. Some people have already made some useful contributions. It will be updated if more things are confirmed.

Continue reading “Release of shorttext 0.3.3”

Short Text Categorization using Deep Neural Networks and Word-Embedding Models

There are situations that we deal with short text, probably messy, without a lot of training data. In that case, we need external semantic information. Instead of using the conventional bag-of-words (BOW) model, we should employ word-embedding models, such as Word2Vec, GloVe etc.

Suppose we want to perform supervised learning, with three subjects, described by the following Python dictionary:

classdict={'mathematics': ['linear algebra',
           'topology',
           'algebra',
           'calculus',
           'variational calculus',
           'functional field',
           'real analysis',
           'complex analysis',
           'differential equation',
           'statistics',
           'statistical optimization',
           'probability',
           'stochastic calculus',
           'numerical analysis',
           'differential geometry'],
          'physics': ['renormalization',
           'classical mechanics',
           'quantum mechanics',
           'statistical mechanics',
           'functional field',
           'path integral',
           'quantum field theory',
           'electrodynamics',
           'condensed matter',
           'particle physics',
           'topological solitons',
           'astrophysics',
           'spontaneous symmetry breaking',
           'atomic molecular and optical physics',
           'quantum chaos'],
          'theology': ['divine providence',
           'soteriology',
           'anthropology',
           'pneumatology',
           'Christology',
           'Holy Trinity',
           'eschatology',
           'scripture',
           'ecclesiology',
           'predestination',
           'divine degree',
           'creedal confessionalism',
           'scholasticism',
           'prayer',
           'eucharist']}

And we implemented Word2Vec here. To add external information, we use a pre-trained Word2Vec model from Google, downloaded here. We can use it with Python package gensim. To load it, enter

from gensim.models import Word2Vec
wvmodel = Word2Vec.load_word2vec_format('<path-to>/GoogleNews-vectors-negative300.bin.gz', binary=True)

How do we represent a phrase in Word2Vec? How do we do the classification? Here I wrote two classes to do it.

Average

We can represent a sentence by summing the word-embedding representations of each word. The class, inside SumWord2VecClassification.py, is coded as follow:

from collections import defaultdict

import numpy as np
from nltk import word_tokenize
from scipy.spatial.distance import cosine

from utils import ModelNotTrainedException

class SumEmbeddedVecClassifier:
    def __init__(self, wvmodel, classdict, vecsize=300):
        self.wvmodel = wvmodel
        self.classdict = classdict
        self.vecsize = vecsize
        self.trained = False

    def train(self):
        self.addvec = defaultdict(lambda : np.zeros(self.vecsize))
        for classtype in self.classdict:
            for shorttext in self.classdict[classtype]:
                self.addvec[classtype] += self.shorttext_to_embedvec(shorttext)
            self.addvec[classtype] /= np.linalg.norm(self.addvec[classtype])
        self.addvec = dict(self.addvec)
        self.trained = True

    def shorttext_to_embedvec(self, shorttext):
        vec = np.zeros(self.vecsize)
        tokens = word_tokenize(shorttext)
        for token in tokens:
            if token in self.wvmodel:
                vec += self.wvmodel[token]
        norm = np.linalg.norm(vec)
        if norm!=0:
            vec /= np.linalg.norm(vec)
        return vec

    def score(self, shorttext):
        if not self.trained:
            raise ModelNotTrainedException()
        vec = self.shorttext_to_embedvec(shorttext)
        scoredict = {}
        for classtype in self.addvec:
            try:
                scoredict[classtype] = 1 - cosine(vec, self.addvec[classtype])
            except ValueError:
                scoredict[classtype] = np.nan
        return scoredict

Here the exception ModelNotTrainedException is just an exception raised if the model has not been trained yet, but scoring function was called by the user. (Codes listed in my Github repository.) The similarity will be calculated by cosine similarity.

Such an implementation is easy to understand and carry out. It is good enough for a lot of application. However, it has the problem that it does not take the relation between words or word order into account.

Convolutional Neural Network

To tackle the problem of word relations, we have to use deeper neural networks. Yoon Kim published a well cited paper regarding this in EMNLP in 2014, titled “Convolutional Neural Networks for Sentence Classification.” The model architecture is as follow: (taken from his paper)

cnn

Each word is represented by an embedded vector, but neighboring words are related through the convolutional matrix. And MaxPooling and a dense neural network were implemented afterwards. His paper involves multiple filters with variable window sizes / spatial extent, but for our cases of short phrases, I just use one window of size 2 (similar to dealing with bigram). While Kim implemented using Theano (see his Github repository), I implemented using keras with Theano backend. The codes, inside CNNEmbedVecClassification.py, are as follow:

import numpy as np
from keras.layers import Convolution1D, MaxPooling1D, Flatten, Dense
from keras.models import Sequential
from nltk import word_tokenize

from utils import ModelNotTrainedException

class CNNEmbeddedVecClassifier:
    def __init__(self,
                 wvmodel,
                 classdict,
                 n_gram,
                 vecsize=300,
                 nb_filters=1200,
                 maxlen=15):
        self.wvmodel = wvmodel
        self.classdict = classdict
        self.n_gram = n_gram
        self.vecsize = vecsize
        self.nb_filters = nb_filters
        self.maxlen = maxlen
        self.trained = False

    def convert_trainingdata_matrix(self):
        classlabels = self.classdict.keys()
        lblidx_dict = dict(zip(classlabels, range(len(classlabels))))

        # tokenize the words, and determine the word length
        phrases = []
        indices = []
        for label in classlabels:
            for shorttext in self.classdict[label]:
                category_bucket = [0]*len(classlabels)
                category_bucket[lblidx_dict[label]] = 1
                indices.append(category_bucket)
                phrases.append(word_tokenize(shorttext))

        # store embedded vectors
        train_embedvec = np.zeros(shape=(len(phrases), self.maxlen, self.vecsize))
        for i in range(len(phrases)):
            for j in range(min(self.maxlen, len(phrases[i]))):
                train_embedvec[i, j] = self.word_to_embedvec(phrases[i][j])
        indices = np.array(indices, dtype=np.int)

        return classlabels, train_embedvec, indices

    def train(self):
        # convert classdict to training input vectors
        self.classlabels, train_embedvec, indices = self.convert_trainingdata_matrix()

        # build the deep neural network model
        model = Sequential()
        model.add(Convolution1D(nb_filter=self.nb_filters,
                                filter_length=self.n_gram,
                                border_mode='valid',
                                activation='relu',
                                input_shape=(self.maxlen, self.vecsize)))
        model.add(MaxPooling1D(pool_length=self.maxlen-self.n_gram+1))
        model.add(Flatten())
        model.add(Dense(len(self.classlabels), activation='softmax'))
        model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

        # train the model
        model.fit(train_embedvec, indices)

        # flag switch
        self.model = model
        self.trained = True

    def word_to_embedvec(self, word):
        return self.wvmodel[word] if word in self.wvmodel else np.zeros(self.vecsize)

    def shorttext_to_matrix(self, shorttext):
        tokens = word_tokenize(shorttext)
        matrix = np.zeros((self.maxlen, self.vecsize))
        for i in range(min(self.maxlen, len(tokens))):
            matrix[i] = self.word_to_embedvec(tokens[i])
        return matrix

    def score(self, shorttext):
        if not self.trained:
            raise ModelNotTrainedException()

        # retrieve vector
        matrix = np.array([self.shorttext_to_matrix(shorttext)])

        # classification using the neural network
        predictions = self.model.predict(matrix)

        # wrangle output result
        scoredict = {}
        for idx, classlabel in zip(range(len(self.classlabels)), self.classlabels):
            scoredict[classlabel] = predictions[0][idx]
        return scoredict

The output is a vector of length equal to the number of class labels, 3 in our example. The elements of the output vector add up to one, indicating its score, and a nature of probability.

Evaluation

A simple cross-validation to the example data set does not tell a difference between the two algorithms:

rplot_acc1

However, we can test the algorithm with a few examples:

Example 1: “renormalization”

  • Average: {‘mathematics’: 0.54135105096749336, ‘physics’: 0.63665460856632494, ‘theology’: 0.31014049736087901}
  • CNN: {‘mathematics’: 0.093827009201049805, ‘physics’: 0.85451591014862061, ‘theology’: 0.051657050848007202}

As renormalization was a strong word in the training data, it gives an easy result. CNN can distinguish much more clearly.

Example 2: “salvation”

  • Average: {‘mathematics’: 0.14939650156482298, ‘physics’: 0.21692765541184023, ‘theology’: 0.5698233329716329}
  • CNN: {‘mathematics’: 0.012395491823554039, ‘physics’: 0.022725773975253105, ‘theology’: 0.96487873792648315}

“Salvation” is not found in the training data, but it is closely related to “soteriology,” which means the doctrine of salvation. So it correctly identifies it with theology.

Example 3: “coffee”

  • Average: {‘mathematics’: 0.096820211601723272, ‘physics’: 0.081567332119268032, ‘theology’: 0.15962682945135631}
  • CNN: {‘mathematics’: 0.27321341633796692, ‘physics’: 0.1950736939907074, ‘theology’: 0.53171288967132568}

Coffee is not related to all subjects. The first architecture correctly indicates the fact, but CNN, with its probabilistic nature, has to roughly equally distribute it (but not so well.)

The code can be found in my Github repository: stephenhky/PyShortTextCategorization. (This repository has been updated since this article was published. The link shows the version of the code when this appeared online.)

Continue reading “Short Text Categorization using Deep Neural Networks and Word-Embedding Models”

Toying with Word2Vec

One fascinating application of deep learning is the training of a model that outputs vectors representing words. A project written in Google, named Word2Vec, is one of the best tools regarding this. The vector representation captures the word contexts and relationships among words. This tool has been changing the landscape of natural language processing (NLP).

Let’s have some demonstration. To use Word2Vec in Python, you need to have the package gensim installed. (Installation instruction: here) And you have to download a trained model (GoogleNews-vectors-negative300.bin.gz), which is 3.6 GB big!! When you get into a Python shell (e.g., IPython), type

from gensim.models.word2vec import Word2Vec
model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

This model enables the user to extract vector representation of length 300 of an English word. So what is so special about this vector representation from the traditional bag-of-words representation? First, the representation is standard. Once trained, we can use it in future training or testing dataset. Second, it captures the context of the word in a way that the algebraic operation of these vectors has meanings.

Here I give 5 examples.

A Juvenile Cat

What is a juvenile cat? We know that a juvenile dog is a puppy. Then we can get it by carry out the algebraic calculation \text{puppy} - \text{dog} + \text{cat} by running

model.most_similar(positive=['puppy', 'cat'], negative=['dog'], topn=5)

This outputs:

[(u'kitten', 0.7634989619255066),
(u'puppies', 0.7110899686813354),
(u'pup', 0.6929495334625244),
(u'kittens', 0.6888389587402344),
(u'cats', 0.6796488761901855)]

which indicates that “kitten” is the answer (correctly!) The numbers are similarities of these words with the vector representation  \text{puppy} - \text{dog} + \text{cat} in descending order. You can verify it by calculating the cosine distance:

from scipy.spatial import distance
print (1-distance.cosine(model['kitten'], model['puppy']+model['cat']-model['dog']))

which outputs 0.763498957413.

Mogu, my cat, three years ago when she was still a kitten
Mogu, my cat, three years ago when she was still a kitten

This demonstration shows that in the model, \text{puppy}-\text{dog} and \text{kitten}-\text{cat} are of similar semantic relations.

Capital of Taiwan

Where is the capital of Taiwan? We can find it if we know the capital of another country. For example, we know that Beijing is the capital of China. Then we can run the following:

model.most_similar(positive=['Beijing', 'Taiwan'], negative=['China'], topn=5)

which outputs

[(u'Taipei', 0.7866502404212952),
(u'Taiwanese', 0.6805002093315125),
(u'Kaohsiung', 0.6034111976623535),
(u'Chen', 0.5905819535255432),
(u'Seoul', 0.5865181684494019)]

Obviously, the answer is “Taipei.” And interestingly, the model sees Taiwan in the same footing of China!

Taipei (taken from Airasia: http://www.airasia.com/mo/en/destinations/taipei.page)

Past Participle of “eat”

We can extract grammatical information too. We know that the past participle of “go” is “gone”. With this, we can find that of “eat” by running:

model.most_similar(positive=[‘gone’, ‘eat’], negative=[‘go’], topn=5)

which outputs:

[(u'eaten', 0.7462186217308044),
(u'eating', 0.6516293287277222),
(u'ate', 0.6457351446151733),
(u'overeaten', 0.5853317975997925),
(u'eats', 0.5830586552619934)]

Capital of the State of Maryland

However, this model does not always work. If it can find the capital of Taiwan, can it find those for any states in the United States? We know that the capital of California is Sacramento. How about Maryland? Let’s run:

model.most_similar(positive=['Sacramento', 'Maryland'], negative=['California'], topn=5)

which sadly outputs:

[(u'Towson', 0.7032245397567749),
(u'Baltimore', 0.6951349973678589),
(u'Hagerstown', 0.6367553472518921),
(u'Anne_Arundel', 0.5931429266929626),
(u'Oxon_Hill', 0.5879474878311157)]

But the correct answer should be Annapolis!

Downtown Annapolis (taken from Wikipedia)

Blue crabs (lunch in Cantler's Riverside Inn, Annapolis, MD)
Blue crabs (lunch in Cantler’s Riverside Inn, Annapolis, MD)

More About Word2Vec

Word2Vec was developed by Tomáš Mikolov. He previously worked for Microsoft Research. However, he switched to Google, and published a few influential works on Word2Vec. [Mikolov, Yih, Zweig 2013] [Mikolov, Sutskever, Chen, Corrado, Dean 2013] [Mikolov, Chen, Corrado, Dean 2013] Their conference paper in 2013 can be found on arXiv. He later published a follow-up work on a package called Doc2Vec that considers phrases. [Le, Mikolov 2014]

Earlier this year, I listened to a talk in DCNLP meetup spoken by Michael Czerny on his award-winning blog entry titled “Modern Methods for Sentiment Analysis.” He applied the vector representations of words by Word2Vec to perform sentiment analysis, assuming that similar sentiments cluster together in the vector space. (He took averages of the vectors in tweets to extract emotions.) [Czerny 2015] I highly recommend you to read his blog entry. On the other hand, Xin Rong wrote an explanation about how Word2Vec works too. [Rong 2014]

There seems to be no progress on the project Word2Vec anymore as Tomáš Mikolov no longer works in Google. However, the Stanford NLP Group recognized that Word2Vec captures the relations between words in their vector representation. They worked on a similar project, called GloVe (Global Vectors), which tackles the problem with matrix factorization. [Pennington, Socher, Manning 2014] Radim Řehůřek did some analysis comparing Word2Vec and GloVe. [Řehůřek 2014] GloVe can be implemented in Python too.

Continue reading “Toying with Word2Vec”

Create a free website or blog at WordPress.com.

Up ↑