Data Representation in Machine Learning

In implementing most of the machine learning algorithms, we represent each data point with a feature vector as the input. A vector is basically an array of numerics, or in physics, an object with magnitude and direction. How do we represent our business data in terms of a vector?

Primitive Feature Vector

Whether the data are measured observations, or images (pixels), free text, factors, or shapes, they can be categorized into four following types:

  1. Categorical data
  2. Binary data
  3. Numerical data
  4. Graphical data

The most primitive representation of a feature vector looks like this:

Screen Shot 2019-09-15 at 3.58.09 PM
A typical feature vector. (Source:

Numerical Data

Numerical data can be represented as individual elements above (like Tweet GRU, Query GRU), and I am not going to talk too much about it.

Categorical Data

However, for categorical data, how do we represent them? The first basic way is to use one-hot encoding:

Screen Shot 2019-09-15 at 4.02.51 PM
One-hot encoding of categorical data (Source:

For each type of categorical data, each category has an integer code. In the figure above, each color has a code (0 for red, 1 for orange etc.) and they will eventually be transformed to the feature vector on the right, with vector length being the total number of categories found in the data, and the element will be filled with 1 if it is of that category. This allows a natural way of dealing with missing data (with all elements 0) and multi-category (with multiple non-zeros).

In natural language processing, the bag-of-words model is often used to represent free-text data, which is the one-hot encoding above with words as the categories. It is a good way as long as the order of the words does not matter.

Binary Data

For binary data, it can be easily represented by one element, either 1 or 0.

Graphical Data

Graphical data are best represented in terms of graph Laplacian and adjacency matrix. Refer to a previous blog article for more information.


A feature vector can be a concatenation of various features in terms of all these types except graphical data.

However, such representation that concatenates all the categorical, binary, and numerical fields has a lot of shortcomings:

  1. Data with different categories are often seen as orthogonal, i.e., perfectly dissimilar.  It ignores the correlation between different variables. However, it is a very big assumption.
  2. The weights of different fields are not considered.
  3. Sometimes if the numerical values are very large, it outweighs other categorical data in terms of influence in computation.
  4. Data are very sparse, costing a lot of memory waste and computing time.
  5. It is unknown whether some of the data are irrelevant.

Modifying Feature Vectors

In light of the shortcomings, to modify the feature factors, there are three main ways of dealing with this:

  1. Rescaling: rescaling all of some of the elements, or reweighing, to adjust the influence from different variables.
  2. Embedding: condensing the information into vectors of smaller lengths.
  3. Sparse coding: deliberately extend the vectors to a larger length.


Rescaling means rescaling all or some of the elements in the vectors. Usually there are two ways:

  1. Normalization: normalizing all the categories of one feature to having the sum of 1.
  2. Term frequency-inverse document frequency (tf-idf): weighing the elements so that the weights are heavier if the frequency is higher and it appears in relatively few documents or class labels.


Embedding means condensing a sparse vector to a smaller vector. Many sparse elements disappear and information is encoded inside the elements. There are rich amount of work on this.

  1. Topic models: finding the topic models (latent Dirichlet allocation (LDA),  structural topic models (STM) etc.) and encode the vectors with topics instead;
  2. Global dimensionality reduction algorithms: reducing the dimensions by retaining the principal components of the vectors of all the data, e.g., principal component analysis (PCA), independent component analysis (ICA), multi-dimensional scaling (MDS) etc;
  3. Local dimensionality reduction algorithms: same as the global, but these are good for finding local patterns, where examples include t-Distributed Stochastic Neighbor Embedding (tSNE) and Uniform Manifold Approximation and Projection (UMAP);
  4. Representation learned from deep neural networks: embeddings learned from encoding using neural networks, such as auto-encoders, Word2Vec, FastText, BERT etc.
  5. Mixture Models: Gaussian mixture models (GMM), Dirichlet multinomial mixture (DMM) etc.
  6. Others: Tensor decomposition (Schmidt decomposition, Jennrich algorithm etc.), GloVe etc.

Sparse Coding

Sparse coding is good for finding basis vectors for dense vectors.

Continue reading “Data Representation in Machine Learning”

Neural-Network Representation of Quantum Many-Body States

There are many embeddings algorithm for representations. Sammon embedding is the oldest one, and we have Word2Vec, GloVe, FastText etc. for word-embedding algorithms. Embeddings are useful for dimensionality reduction.

Traditionally, quantum many-body states are represented by Fock states, which is useful when the excitations of quasi-particles are the concern. But to capture the quantum entanglement between many solitons or particles in a statistical systems, it is important not to lose the topological correlation between the states. It has been known that restricted Boltzmann machines (RBM) have been used to represent such states, but it has its limitation, which Xun Gao and Lu-Ming Duan have stated in their article published in Nature Communications:

There exist states, which can be generated by a constant-depth quantum circuit or expressed as PEPS (projected entangled pair states) or ground states of gapped Hamiltonians, but cannot be efficiently represented by any RBM unless the polynomial hierarchy collapses in the computational complexity theory.

PEPS is a generalization of matrix product states (MPS) to higher dimensions. (See this.)

However, Gao and Duan were able to prove that deep Boltzmann machine (DBM) can bridge the loophole of RBM, as stated in their article:

Any quantum state of n qubits generated by a quantum circuit of depth T can be represented exactly by a sparse DBM with O(nT) neurons.


(diagram adapted from Gao and Duan’s article)

Continue reading “Neural-Network Representation of Quantum Many-Body States”

Sammon Embedding with Tensorflow

Embedding algorithms, especially word-embedding algorithms, have been one of the recurrent themes of this blog. Word2Vec has been mentioned in a few entries (see this); LDA2Vec has been covered (see this); the mathematical principle of GloVe has been elaborated (see this); I haven’t even covered Facebook’s fasttext; and I have not explained the widely used t-SNE and Kohonen’s map (self-organizing map, SOM).

I have also described the algorithm of Sammon Embedding, (see this) which attempts to capture the likeliness of pairwise Euclidean distances, and I implemented it using Theano. This blog entry is about its implementation in Tensorflow as a demonstration.

Let’s recall the formalism of Sammon Embedding, as outlined in the previous entry:

Assume there are high dimensional data described by d-dimensional vectors, X_i where i=1, 2, \ldots, N. And they will be mapped into vectors Y_i, with dimensions 2 or 3. Denote the distances to be d_{ij}^{*} = \sqrt{| X_i - X_j|^2} and d_{ij} = \sqrt{| Y_i - Y_j|^2}. In this problem, Y_i are the variables to be learned. The cost function to minimize is

E = \frac{1}{c} \sum_{i<j} \frac{(d_{ij}^{*} - d_{ij})^2}{d_{ij}^{*}},

where c = \sum_{i<j} d_{ij}^{*}.

Unlike in previous entry and original paper, I am going to optimize it using first-order gradient optimizer. If you are not familiar with Tensorflow, take a look at some online articles, for example, “Tensorflow demystified.” This demonstration can be found in this Jupyter Notebook in Github.

First of all, import all the libraries required:

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

Like previously, we want to use the points clustered around at the four nodes of a tetrahedron as an illustration, which is expected to give equidistant clusters. We sample points around them, as shown:

tetrahedron_points = [np.array([0., 0., 0.]), np.array([1., 0., 0.]), np.array([np.cos(np.pi/3), np.sin(np.pi/3), 0.]), np.array([0.5, 0.5/np.sqrt(3), np.sqrt(2./3.)])]

sampled_points = np.concatenate([np.random.multivariate_normal(point, np.eye(3)*0.0001, 10) for point in tetrahedron_points])

init_points = np.concatenate([np.random.multivariate_normal(point[:2], np.eye(2)*0.0001, 10) for point in tetrahedron_points])

Retrieve the number of points, N, and the resulting dimension, d:

N = sampled_points.shape[0]
d = sampled_points.shape[1]

One of the most challenging technical difficulties is to calculate the pairwise distance. Inspired by this StackOverflow thread and Travis Hoppe’s entry on Thomson’s problem, we know it can be computed. Assuming Einstein’s convention of summation over repeated indices, given vectors a_{ik}, the distance matrix is:

D_{ij} = (a_{ik}-a_{jk}) (a_{ik} - a_{jk})^T = a_{ik} a_{ik} + a_{jk} a_{jk} - 2 a_{ik} a_{jk},

where the first and last terms are simply the norms of the vectors. After computing the matrix, we will flatten it to vectors, for technical reasons omitted to avoid gradient overflow:

X = tf.placeholder('float')
Xshape = tf.shape(X)

sqX = tf.reduce_sum(X*X, 1)
sqX = tf.reshape(sqX, [-1, 1])
sqDX = sqX - 2*tf.matmul(X, tf.transpose(X)) + tf.transpose(sqX)
sqDXarray = tf.stack([sqDX[i, j] for i in range(N) for j in range(i+1, N)])
DXarray = tf.sqrt(sqDXarray)

Y = tf.Variable(init_points, dtype='float')
sqY = tf.reduce_sum(Y*Y, 1)
sqY = tf.reshape(sqY, [-1, 1])
sqDY = sqY - 2*tf.matmul(Y, tf.transpose(Y)) + tf.transpose(sqY)
sqDYarray = tf.stack([sqDY[i, j] for i in range(N) for j in range(i+1, N)])
DYarray = tf.sqrt(sqDYarray)

And DXarray and DYarray are the vectorized pairwise distances. Then we defined the cost function according to the definition:

Z = tf.reduce_sum(DXarray)*0.5
numerator = tf.reduce_sum(tf.divide(tf.square(DXarray-DYarray), DXarray))*0.5
cost = tf.divide(numerator, Z)

As we said, we used first-order gradient optimizers. For unknown reasons, the usually well-performing Adam optimizer gives overflow. I then picked Adagrad:

update_rule = tf.assign(Y, Y-0.01*grad_cost/lapl_cost)
train = tf.train.AdamOptimizer(0.01).minimize(cost)
init = tf.global_variables_initializer()

The last line initializes all variables in the Tensorflow session when it is run. Then start a Tensorflow session, and initialize all variables globally:

sess = tf.Session()

Then run the algorithm:

nbsteps = 1000
c =, feed_dict={X: sampled_points})
print "epoch: ", -1, " cost = ", c
for i in range(nbsteps):, feed_dict={X: sampled_points})
    c =, feed_dict={X: sampled_points})
    print "epoch: ", i, " cost =

Then extract the points and close the Tensorflow session:

calculated_Y =, feed_dict={X: sampled_points})

Plot it using matplotlib:

embed1, embed2 = calculated_Y.transpose()
plt.plot(embed1, embed2, 'ro')

This gives, as expected,

download (5)

This code for Sammon Embedding has been incorporated into the Python package mogu, which is a collection of numerical routines. You can install it, and call:

from mogu.embed import sammon_embedding
calculated_Y = sammon_embedding(sampled_points, init_points)

Continue reading “Sammon Embedding with Tensorflow”

Short Text Categorization using Deep Neural Networks and Word-Embedding Models

There are situations that we deal with short text, probably messy, without a lot of training data. In that case, we need external semantic information. Instead of using the conventional bag-of-words (BOW) model, we should employ word-embedding models, such as Word2Vec, GloVe etc.

Suppose we want to perform supervised learning, with three subjects, described by the following Python dictionary:

classdict={'mathematics': ['linear algebra',
           'variational calculus',
           'functional field',
           'real analysis',
           'complex analysis',
           'differential equation',
           'statistical optimization',
           'stochastic calculus',
           'numerical analysis',
           'differential geometry'],
          'physics': ['renormalization',
           'classical mechanics',
           'quantum mechanics',
           'statistical mechanics',
           'functional field',
           'path integral',
           'quantum field theory',
           'condensed matter',
           'particle physics',
           'topological solitons',
           'spontaneous symmetry breaking',
           'atomic molecular and optical physics',
           'quantum chaos'],
          'theology': ['divine providence',
           'Holy Trinity',
           'divine degree',
           'creedal confessionalism',

And we implemented Word2Vec here. To add external information, we use a pre-trained Word2Vec model from Google, downloaded here. We can use it with Python package gensim. To load it, enter

from gensim.models import Word2Vec
wvmodel = Word2Vec.load_word2vec_format('<path-to>/GoogleNews-vectors-negative300.bin.gz', binary=True)

How do we represent a phrase in Word2Vec? How do we do the classification? Here I wrote two classes to do it.


We can represent a sentence by summing the word-embedding representations of each word. The class, inside, is coded as follow:

from collections import defaultdict

import numpy as np
from nltk import word_tokenize
from scipy.spatial.distance import cosine

from utils import ModelNotTrainedException

class SumEmbeddedVecClassifier:
    def __init__(self, wvmodel, classdict, vecsize=300):
        self.wvmodel = wvmodel
        self.classdict = classdict
        self.vecsize = vecsize
        self.trained = False

    def train(self):
        self.addvec = defaultdict(lambda : np.zeros(self.vecsize))
        for classtype in self.classdict:
            for shorttext in self.classdict[classtype]:
                self.addvec[classtype] += self.shorttext_to_embedvec(shorttext)
            self.addvec[classtype] /= np.linalg.norm(self.addvec[classtype])
        self.addvec = dict(self.addvec)
        self.trained = True

    def shorttext_to_embedvec(self, shorttext):
        vec = np.zeros(self.vecsize)
        tokens = word_tokenize(shorttext)
        for token in tokens:
            if token in self.wvmodel:
                vec += self.wvmodel[token]
        norm = np.linalg.norm(vec)
        if norm!=0:
            vec /= np.linalg.norm(vec)
        return vec

    def score(self, shorttext):
        if not self.trained:
            raise ModelNotTrainedException()
        vec = self.shorttext_to_embedvec(shorttext)
        scoredict = {}
        for classtype in self.addvec:
                scoredict[classtype] = 1 - cosine(vec, self.addvec[classtype])
            except ValueError:
                scoredict[classtype] = np.nan
        return scoredict

Here the exception ModelNotTrainedException is just an exception raised if the model has not been trained yet, but scoring function was called by the user. (Codes listed in my Github repository.) The similarity will be calculated by cosine similarity.

Such an implementation is easy to understand and carry out. It is good enough for a lot of application. However, it has the problem that it does not take the relation between words or word order into account.

Convolutional Neural Network

To tackle the problem of word relations, we have to use deeper neural networks. Yoon Kim published a well cited paper regarding this in EMNLP in 2014, titled “Convolutional Neural Networks for Sentence Classification.” The model architecture is as follow: (taken from his paper)


Each word is represented by an embedded vector, but neighboring words are related through the convolutional matrix. And MaxPooling and a dense neural network were implemented afterwards. His paper involves multiple filters with variable window sizes / spatial extent, but for our cases of short phrases, I just use one window of size 2 (similar to dealing with bigram). While Kim implemented using Theano (see his Github repository), I implemented using keras with Theano backend. The codes, inside, are as follow:

import numpy as np
from keras.layers import Convolution1D, MaxPooling1D, Flatten, Dense
from keras.models import Sequential
from nltk import word_tokenize

from utils import ModelNotTrainedException

class CNNEmbeddedVecClassifier:
    def __init__(self,
        self.wvmodel = wvmodel
        self.classdict = classdict
        self.n_gram = n_gram
        self.vecsize = vecsize
        self.nb_filters = nb_filters
        self.maxlen = maxlen
        self.trained = False

    def convert_trainingdata_matrix(self):
        classlabels = self.classdict.keys()
        lblidx_dict = dict(zip(classlabels, range(len(classlabels))))

        # tokenize the words, and determine the word length
        phrases = []
        indices = []
        for label in classlabels:
            for shorttext in self.classdict[label]:
                category_bucket = [0]*len(classlabels)
                category_bucket[lblidx_dict[label]] = 1

        # store embedded vectors
        train_embedvec = np.zeros(shape=(len(phrases), self.maxlen, self.vecsize))
        for i in range(len(phrases)):
            for j in range(min(self.maxlen, len(phrases[i]))):
                train_embedvec[i, j] = self.word_to_embedvec(phrases[i][j])
        indices = np.array(indices,

        return classlabels, train_embedvec, indices

    def train(self):
        # convert classdict to training input vectors
        self.classlabels, train_embedvec, indices = self.convert_trainingdata_matrix()

        # build the deep neural network model
        model = Sequential()
                                input_shape=(self.maxlen, self.vecsize)))
        model.add(Dense(len(self.classlabels), activation='softmax'))
        model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

        # train the model, indices)

        # flag switch
        self.model = model
        self.trained = True

    def word_to_embedvec(self, word):
        return self.wvmodel[word] if word in self.wvmodel else np.zeros(self.vecsize)

    def shorttext_to_matrix(self, shorttext):
        tokens = word_tokenize(shorttext)
        matrix = np.zeros((self.maxlen, self.vecsize))
        for i in range(min(self.maxlen, len(tokens))):
            matrix[i] = self.word_to_embedvec(tokens[i])
        return matrix

    def score(self, shorttext):
        if not self.trained:
            raise ModelNotTrainedException()

        # retrieve vector
        matrix = np.array([self.shorttext_to_matrix(shorttext)])

        # classification using the neural network
        predictions = self.model.predict(matrix)

        # wrangle output result
        scoredict = {}
        for idx, classlabel in zip(range(len(self.classlabels)), self.classlabels):
            scoredict[classlabel] = predictions[0][idx]
        return scoredict

The output is a vector of length equal to the number of class labels, 3 in our example. The elements of the output vector add up to one, indicating its score, and a nature of probability.


A simple cross-validation to the example data set does not tell a difference between the two algorithms:


However, we can test the algorithm with a few examples:

Example 1: “renormalization”

  • Average: {‘mathematics’: 0.54135105096749336, ‘physics’: 0.63665460856632494, ‘theology’: 0.31014049736087901}
  • CNN: {‘mathematics’: 0.093827009201049805, ‘physics’: 0.85451591014862061, ‘theology’: 0.051657050848007202}

As renormalization was a strong word in the training data, it gives an easy result. CNN can distinguish much more clearly.

Example 2: “salvation”

  • Average: {‘mathematics’: 0.14939650156482298, ‘physics’: 0.21692765541184023, ‘theology’: 0.5698233329716329}
  • CNN: {‘mathematics’: 0.012395491823554039, ‘physics’: 0.022725773975253105, ‘theology’: 0.96487873792648315}

“Salvation” is not found in the training data, but it is closely related to “soteriology,” which means the doctrine of salvation. So it correctly identifies it with theology.

Example 3: “coffee”

  • Average: {‘mathematics’: 0.096820211601723272, ‘physics’: 0.081567332119268032, ‘theology’: 0.15962682945135631}
  • CNN: {‘mathematics’: 0.27321341633796692, ‘physics’: 0.1950736939907074, ‘theology’: 0.53171288967132568}

Coffee is not related to all subjects. The first architecture correctly indicates the fact, but CNN, with its probabilistic nature, has to roughly equally distribute it (but not so well.)

The code can be found in my Github repository: stephenhky/PyShortTextCategorization. (This repository has been updated since this article was published. The link shows the version of the code when this appeared online.)

Continue reading “Short Text Categorization using Deep Neural Networks and Word-Embedding Models”

Probabilistic Theory of Word Embeddings: GloVe

The topic of word embedding algorithms has been one of the interests of this blog, as in this entry, with Word2Vec [Mikilov et. al. 2013] as one of the main examples. It is a great tool for text mining, (for example, see [Czerny 2015],) as it reduces the dimensions needed (compared to bag-of-words model). As an algorithm borrowed from computer vision, a lot of these algorithms use deep learning methods to train the model, while it was not exactly sure why it works. Despite that, there are many articles talking about how to train the model. [Goldberg & Levy 2014, Rong 2014 etc.] Addition and subtraction of the word vectors show amazing relationships that carry semantic values, as I have shown in my previous blog entry. [Ho 2015]

However, Tomas Mikolov is no longer working in Google, making the development of this algorithm discontinued. As a follow-up of their work, Stanford NLP group later proposed a model, called GloVe (Global Vectors), that embeds word vectors using probabilistic methods. [Pennington, Socher & Manning 2014] It can be implemented in the package glove-python in python, and text2vec in R (or see their CRAN post).  Their paper is neatly written, and a highly recommended read.

To explain the theory of GloVe, we start with some basic probabilistic picture in basic natural language processing (NLP). We suppose the relation between the words occur in certain text windows within a corpus, but the details are not important here. Assume that i, j, and k are three words, and the conditional probability P_{ik} is defined as

P_{ij} = P(j | i) = \frac{X_{ij}}{X_i},

where X‘s are the counts, and similarly for P_{jk}. And we are interested in the following ratio:

F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}.

The tilde means “context,” but we will later assume it is also a word. Citing the example from their paper, take i as ice, and j as steam. if k is solid, then the ratio is expected to be large; or if k is gas, then it is expected to be low. But if k is water, which are related to both, or fashion, which is related to none, then the ratio is expected to be approximately 1.

And the addition and subtraction in Word2Vec is similar to this. We want the ratio to be like the subtraction as in Word2Vec (and multiplication as in addition), then we should modify the function F such that,

F(w_i - w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}.

On the other hand, the input arguments of F are vectors, but the output is a scalar. We avoid the issue by making the input argument as a dot product,

F( (w_i - w_j)^T \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}.

In NLP, the word-word co-occurrence matrices are symmetric, and our function F should also be invariant under switching the labeling. We first require F is be a homomorphism,

F((w_i - w_j)^T \tilde{w}_k) = \frac{F(w_i^T \tilde{w}_k) }{ F(w_j^T \tilde{w}_k)},

where we define,

F(w_i^T \tilde{w}_k) = P_{ik} = \frac{X_{ik}}{X_i}.

It is clear that F is an exponential function, but to ensure symmetry, we further define:

w_i^T \tilde{w}_k + b_i + \tilde{b}_k = \log X_{ik}.

As a result of this equation, the authors defined the following cost function to optimize for GloVe model:

J = \sum_{i, j=1}^V f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ik} \right)^2,

where w_j, \tilde{w}_j, b_i, and \tilde{b}_j are parameters to learn. f(x) is a weighting function. (Refer the details to the paper.) [Pennington, Socher & Manning 2014]

As Radim Řehůřek said in his blog entry, [Řehůřek 2014] it is a neat paper, but their evaluation is crappy.

This theory explained why certain similar relations can be achieved, such as Paris – France is roughly equal to Beijing – China, as both can be transformed to the ratio in the definition of F above.

It is a neat paper, as it employs optimization theory and probability theory, without any dark box deep learning.

Continue reading “Probabilistic Theory of Word Embeddings: GloVe”

textmineR: a New Text Mining Package for R

Previously, I wrote an entry on text mining on R and Python, and did a comparison. However, the text mining package employed was tm for R. But it has some problems:

  1. The syntax is not natural for an experienced R users.
  2. tm uses simple_triplet_matrix from the slam library for document-term matrix (DTM) and term-occurrence matrix (TCM), which is not as widely used as dgCMatrix from the Matrix library.

Tommy Jones, a Ph.D. student in George Mason University, and a data scientist at Impact Research, developed an alternative text mining package called textmineR. He presented in a Stat Prog DC Meetup on April 27, 2016. It employed a better syntax, and dgCMatrix. All in all, it is a wrapper for a lot of existing R packages to facilitate the text mining process, like creating DTM matrices with stopwords or appropriate stemming/lemmatizing functions. Here is a sample code to create a DTM with the example from the previous entry:


texts <- c('I love Python.',
           'R is good for analytics.',
           'Mathematics is fun.')

               doc_names = c(1:length(texts)),
               ngram_window = c(1, 1),
               stopword_vec = c(tm::stopwords('english'), tm::stopwords('SMART')),
               lower = TRUE,
               remove_punctuation = TRUE,
               remove_numbers = TRUE

The DTM is a sparse matrix:

3 x 6 sparse Matrix of class &amp;quot;dgCMatrix&amp;quot;
  analytics fun mathematics good python love
1         .   .           .    .      1    1
2         1   .           .    1      .    .
3         .   1           1    .      .    .

On the other hand, it wraps text2vec, an R package that wraps the word-embedding algorithm named gloVe. And it wraps a number of topic modeling algorithms, such as latent Dirichlet allocation (LDA) and correlated topic models (CTM).

In addition, it contains a parallel computing loop function called TmParallelApply, analogous to the original R parallel loop function mclapply, but TmParallelApply works on Windows as well.

textmineR is an open-source project, with source code available on github, which contains his example codes.

Continue reading “textmineR: a New Text Mining Package for R”

Word Embedding Algorithms

Embedding has been hot in recent years partly due to the success of Word2Vec, (see demo in my previous entry) although the idea has been around in academia for more than a decade. The idea is to transform a vector of integers into continuous, or embedded, representations. Keras, a Python package that implements neural network models (including the ANN, RNN, CNN etc.) by wrapping Theano or TensorFlow, implemented it, as shown in the example below (which converts a vector of 200 features into a continuous vector of 10):

from keras.layers import Embedding
from keras.models import Sequential

# define and compile the embedding model
model = Sequential()
model.add(Embedding(200, 10, input_length=1))
model.compile('rmsprop', 'mse')  # optimizer: rmsprop; loss function: mean-squared error

We can then convert any features from 0 to 199 into vectors of 20, as shown below:

import numpy as np

model.predict(np.array([10, 90, 151]))

It outputs:

array([[[ 0.02915354,  0.03084954, -0.04160764, -0.01752155, -0.00056815,
         -0.02512387, -0.02073313, -0.01154278, -0.00389587, -0.04596512]],

       [[ 0.02981793, -0.02618774,  0.04137352, -0.04249889,  0.00456919,
          0.04393572,  0.04139435,  0.04415271,  0.02636364, -0.04997493]],

       [[ 0.00947296, -0.01643104, -0.03241419, -0.01145032,  0.03437041,
          0.00386361, -0.03124221, -0.03837727, -0.04804075, -0.01442516]]])

Of course, one must not omit a similar algorithm called GloVe, developed by the Stanford NLP group. Their codes have been wrapped in both Python (package called glove) and R (library called text2vec).

Besides Word2Vec, there are other word embedding algorithms that try to complement Word2Vec, although many of them are more computationally costly. Previously, I introduced LDA2Vec in my previous entry, an algorithm that combines the locality of words and their global distribution in the corpus. And in fact, word embedding algorithms with a similar ideas are also invented by other scientists, as I have introduced in another entry.

However, there are word embedding algorithms coming out. Since most English words carry more than a single sense, different senses of a word might be best represented by different embedded vectors. Incorporating word sense disambiguation, a method called sense2vec has been introduced by Trask, Michalak, and Liu. (arXiv:1511.06388). Matthew Honnibal wrote a nice blog entry demonstrating its use.

There are also other related work, such as wang2vec that is more sensitive to word orders.

Big Bang Theory (Season 2, Episode 5): Euclid Alternative

DMV staff: Application?
Sheldon: I’m actually more or a theorist.

Note: feature image taken from Big Bang Theory (CBS).

Continue reading “Word Embedding Algorithms”

Local and Global: Words and Topics

Word2Vec has hit the NLP world for a while, as it is a nice method for word embeddings or word representations. Its use of skip-gram model and deep learning made a big impact too. It has been my favorite toy indeed. However, even though the words do have a correlation across a small segment of text, it is still a local coherence. On the other hand, topic models such as latent Dirichlet allocation (LDA) capture the distribution of words within a topic, and that of topics within a document etc. And it provides a representation of a new document in terms of a topic.

In my previous blog entry, I introduced Chris Moody’s LDA2Vec algorithm (see: his SlideShare). Unfortunately, not many papers or blogs have covered this new algorithm too much despite its potential. The API is not completely well documented yet, although you can see its example from its source code on its Github. In its documentation, it gives an example of deriving topics from an array of random numbers, in its lda2vec/ code:

from lda2vec import LDA2Vec
n_words = 10
n_docs = 15
n_hidden = 8
n_topics = 2
n_obs = 300
words = np.random.randint(n_words, size=(n_obs))
_, counts = np.unique(words, return_counts=True)
model = LDA2Vec(n_words, n_hidden, counts)
model.add_categorical_feature(n_docs, n_topics, name='document id')
doc_ids = np.arange(n_obs) % n_docs
loss = model.fit_partial(words, 1.0, categorical_features=doc_ids)

A more comprehensive example is in examples/twenty_newsgroup/ .

Besides, LDA2Vec, there are some related research work on topical word embeddings too. A group of Australian and American scientists studied about the topic modeling with pre-trained Word2Vec (or GloVe) before performing LDA. (See: their paper and code) On the other hand, another group with Chinese and Singaporean scientists performs LDA, and then trains a Word2Vec model. (See: their paper and code) And LDA2Vec concatenates the Word2Vec and LDA representation, like an early fusion.

No matter what, representations with LDA models (or related topic modeling such as correlated topic models (CTM)) can be useful even outside NLP. I have found it useful at some intermediate layer of calculation lately.

Graphical Representations of LDA model. Source:

Continue reading “Local and Global: Words and Topics”

Toying with Word2Vec

One fascinating application of deep learning is the training of a model that outputs vectors representing words. A project written in Google, named Word2Vec, is one of the best tools regarding this. The vector representation captures the word contexts and relationships among words. This tool has been changing the landscape of natural language processing (NLP).

Let’s have some demonstration. To use Word2Vec in Python, you need to have the package gensim installed. (Installation instruction: here) And you have to download a trained model (GoogleNews-vectors-negative300.bin.gz), which is 3.6 GB big!! When you get into a Python shell (e.g., IPython), type

from gensim.models.word2vec import Word2Vec
model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

This model enables the user to extract vector representation of length 300 of an English word. So what is so special about this vector representation from the traditional bag-of-words representation? First, the representation is standard. Once trained, we can use it in future training or testing dataset. Second, it captures the context of the word in a way that the algebraic operation of these vectors has meanings.

Here I give 5 examples.

A Juvenile Cat

What is a juvenile cat? We know that a juvenile dog is a puppy. Then we can get it by carry out the algebraic calculation \text{puppy} - \text{dog} + \text{cat} by running

model.most_similar(positive=['puppy', 'cat'], negative=['dog'], topn=5)

This outputs:

[(u'kitten', 0.7634989619255066),
(u'puppies', 0.7110899686813354),
(u'pup', 0.6929495334625244),
(u'kittens', 0.6888389587402344),
(u'cats', 0.6796488761901855)]

which indicates that “kitten” is the answer (correctly!) The numbers are similarities of these words with the vector representation  \text{puppy} - \text{dog} + \text{cat} in descending order. You can verify it by calculating the cosine distance:

from scipy.spatial import distance
print (1-distance.cosine(model['kitten'], model['puppy']+model['cat']-model['dog']))

which outputs 0.763498957413.

Mogu, my cat, three years ago when she was still a kitten
Mogu, my cat, three years ago when she was still a kitten

This demonstration shows that in the model, \text{puppy}-\text{dog} and \text{kitten}-\text{cat} are of similar semantic relations.

Capital of Taiwan

Where is the capital of Taiwan? We can find it if we know the capital of another country. For example, we know that Beijing is the capital of China. Then we can run the following:

model.most_similar(positive=['Beijing', 'Taiwan'], negative=['China'], topn=5)

which outputs

[(u'Taipei', 0.7866502404212952),
(u'Taiwanese', 0.6805002093315125),
(u'Kaohsiung', 0.6034111976623535),
(u'Chen', 0.5905819535255432),
(u'Seoul', 0.5865181684494019)]

Obviously, the answer is “Taipei.” And interestingly, the model sees Taiwan in the same footing of China!

Taipei (taken from Airasia:

Past Participle of “eat”

We can extract grammatical information too. We know that the past participle of “go” is “gone”. With this, we can find that of “eat” by running:

model.most_similar(positive=[‘gone’, ‘eat’], negative=[‘go’], topn=5)

which outputs:

[(u'eaten', 0.7462186217308044),
(u'eating', 0.6516293287277222),
(u'ate', 0.6457351446151733),
(u'overeaten', 0.5853317975997925),
(u'eats', 0.5830586552619934)]

Capital of the State of Maryland

However, this model does not always work. If it can find the capital of Taiwan, can it find those for any states in the United States? We know that the capital of California is Sacramento. How about Maryland? Let’s run:

model.most_similar(positive=['Sacramento', 'Maryland'], negative=['California'], topn=5)

which sadly outputs:

[(u'Towson', 0.7032245397567749),
(u'Baltimore', 0.6951349973678589),
(u'Hagerstown', 0.6367553472518921),
(u'Anne_Arundel', 0.5931429266929626),
(u'Oxon_Hill', 0.5879474878311157)]

But the correct answer should be Annapolis!

Downtown Annapolis (taken from Wikipedia)

Blue crabs (lunch in Cantler's Riverside Inn, Annapolis, MD)
Blue crabs (lunch in Cantler’s Riverside Inn, Annapolis, MD)

More About Word2Vec

Word2Vec was developed by Tomáš Mikolov. He previously worked for Microsoft Research. However, he switched to Google, and published a few influential works on Word2Vec. [Mikolov, Yih, Zweig 2013] [Mikolov, Sutskever, Chen, Corrado, Dean 2013] [Mikolov, Chen, Corrado, Dean 2013] Their conference paper in 2013 can be found on arXiv. He later published a follow-up work on a package called Doc2Vec that considers phrases. [Le, Mikolov 2014]

Earlier this year, I listened to a talk in DCNLP meetup spoken by Michael Czerny on his award-winning blog entry titled “Modern Methods for Sentiment Analysis.” He applied the vector representations of words by Word2Vec to perform sentiment analysis, assuming that similar sentiments cluster together in the vector space. (He took averages of the vectors in tweets to extract emotions.) [Czerny 2015] I highly recommend you to read his blog entry. On the other hand, Xin Rong wrote an explanation about how Word2Vec works too. [Rong 2014]

There seems to be no progress on the project Word2Vec anymore as Tomáš Mikolov no longer works in Google. However, the Stanford NLP Group recognized that Word2Vec captures the relations between words in their vector representation. They worked on a similar project, called GloVe (Global Vectors), which tackles the problem with matrix factorization. [Pennington, Socher, Manning 2014] Radim Řehůřek did some analysis comparing Word2Vec and GloVe. [Řehůřek 2014] GloVe can be implemented in Python too.

Continue reading “Toying with Word2Vec”

Create a free website or blog at

Up ↑