In implementing most of the machine learning algorithms, we represent each data point with a feature vector as the input. A vector is basically an array of numerics, or in physics, an object with magnitude and direction. How do we represent our business data in terms of a vector?

# Primitive Feature Vector

Whether the data are measured observations, or images (pixels), free text, factors, or shapes, they can be categorized into four following types:

1. Categorical data
2. Binary data
3. Numerical data
4. Graphical data

The most primitive representation of a feature vector looks like this:

## Numerical Data

Numerical data can be represented as individual elements above (like Tweet GRU, Query GRU), and I am not going to talk too much about it.

## Categorical Data

However, for categorical data, how do we represent them? The first basic way is to use one-hot encoding:

For each type of categorical data, each category has an integer code. In the figure above, each color has a code (0 for red, 1 for orange etc.) and they will eventually be transformed to the feature vector on the right, with vector length being the total number of categories found in the data, and the element will be filled with 1 if it is of that category. This allows a natural way of dealing with missing data (with all elements 0) and multi-category (with multiple non-zeros).

In natural language processing, the bag-of-words model is often used to represent free-text data, which is the one-hot encoding above with words as the categories. It is a good way as long as the order of the words does not matter.

## Binary Data

For binary data, it can be easily represented by one element, either 1 or 0.

## Graphical Data

Graphical data are best represented in terms of graph Laplacian and adjacency matrix. Refer to a previous blog article for more information.

## Shortcomings

A feature vector can be a concatenation of various features in terms of all these types except graphical data.

However, such representation that concatenates all the categorical, binary, and numerical fields has a lot of shortcomings:

1. Data with different categories are often seen as orthogonal, i.e., perfectly dissimilar.  It ignores the correlation between different variables. However, it is a very big assumption.
2. The weights of different fields are not considered.
3. Sometimes if the numerical values are very large, it outweighs other categorical data in terms of influence in computation.
4. Data are very sparse, costing a lot of memory waste and computing time.
5. It is unknown whether some of the data are irrelevant.

# Modifying Feature Vectors

In light of the shortcomings, to modify the feature factors, there are three main ways of dealing with this:

1. Rescaling: rescaling all of some of the elements, or reweighing, to adjust the influence from different variables.
2. Embedding: condensing the information into vectors of smaller lengths.
3. Sparse coding: deliberately extend the vectors to a larger length.

## Rescaling

Rescaling means rescaling all or some of the elements in the vectors. Usually there are two ways:

1. Normalization: normalizing all the categories of one feature to having the sum of 1.
2. Term frequency-inverse document frequency (tf-idf): weighing the elements so that the weights are heavier if the frequency is higher and it appears in relatively few documents or class labels.

## Embedding

Embedding means condensing a sparse vector to a smaller vector. Many sparse elements disappear and information is encoded inside the elements. There are rich amount of work on this.

1. Topic models: finding the topic models (latent Dirichlet allocation (LDA),  structural topic models (STM) etc.) and encode the vectors with topics instead;
2. Global dimensionality reduction algorithms: reducing the dimensions by retaining the principal components of the vectors of all the data, e.g., principal component analysis (PCA), independent component analysis (ICA), multi-dimensional scaling (MDS) etc;
3. Local dimensionality reduction algorithms: same as the global, but these are good for finding local patterns, where examples include t-Distributed Stochastic Neighbor Embedding (tSNE) and Uniform Manifold Approximation and Projection (UMAP);
4. Representation learned from deep neural networks: embeddings learned from encoding using neural networks, such as auto-encoders, Word2Vec, FastText, BERT etc.
5. Mixture Models: Gaussian mixture models (GMM), Dirichlet multinomial mixture (DMM) etc.
6. Others: Tensor decomposition (Schmidt decomposition, Jennrich algorithm etc.), GloVe etc.

## Sparse Coding

Sparse coding is good for finding basis vectors for dense vectors.

Sebastian Ruder recently wrote an article on The Gradient and asserted that the oracle of natural language processing is emerging. While I am not sure such confident statement is overstated, I do look forward to the moment that we will download pre-trained embedded language models and transfer to our use cases, just like we are using pre-trained word-embedding models such as Word2Vec and FastText.

I do not think one can really draw a parallelism between computer vision and natural language processing. Computer vision is challenging, but natural language processing is even more difficult because the tasks regarding linguistics are not limited to object or meaning recognition, but also human psychology, cultures, and linguistic diversities. The objectives are far from being identical.

However, the transferrable use of embedded language models is definitely a big step forward. Ruder quoted three articles, which I would summarize below in a few words.

• Embeddings from Language Models (ELMo, arXiv:1802.05365): based on the successful bidirectional LSTM language models, the authors developed a deep contextualized embedded models by collapses all layers in the neural network architecture.
• Universal Language Model Fine-Tuning for Text Classification (ULMFiT, arXiv:1801.06146): the authors proposed a type of architectures that learn representations for specific tasks, which involve three steps in training: a) LM pre-training: learning through unlabeled corpus with abundant data; b) LM fine-tuning: learning through labeled corpus; and c) classifier fine-tuning: transferred training for specific classification tasks.
• OpenAI Transformer (article still in progress): the author proposed a simple generative language model with the three similar steps in ULMFit: a) unsupervised pre-training: training a language model that maximizes the likelihood of a sequence of tokens within a context window; b) supervised fine-tuning: a supervised classification training that maximizes the likelihood using the Bayesian approach; c) task-specific input transformations: training the classifiers on a specific task.

These three articles are intricately related to each other. Without abundant data and good hardware, it is almost impossible to produce the language models. As Ruder suggested, we will probably have a pre-trained model up to the second step of the ULMFit and OpenAI Transformer papers, but we train our own specific model for our use. We have been doing this for word-embedding models, and this approach has been common in computer vision too.

There are many embeddings algorithm for representations. Sammon embedding is the oldest one, and we have Word2Vec, GloVe, FastText etc. for word-embedding algorithms. Embeddings are useful for dimensionality reduction.

Traditionally, quantum many-body states are represented by Fock states, which is useful when the excitations of quasi-particles are the concern. But to capture the quantum entanglement between many solitons or particles in a statistical systems, it is important not to lose the topological correlation between the states. It has been known that restricted Boltzmann machines (RBM) have been used to represent such states, but it has its limitation, which Xun Gao and Lu-Ming Duan have stated in their article published in Nature Communications:

There exist states, which can be generated by a constant-depth quantum circuit or expressed as PEPS (projected entangled pair states) or ground states of gapped Hamiltonians, but cannot be efficiently represented by any RBM unless the polynomial hierarchy collapses in the computational complexity theory.

PEPS is a generalization of matrix product states (MPS) to higher dimensions. (See this.)

However, Gao and Duan were able to prove that deep Boltzmann machine (DBM) can bridge the loophole of RBM, as stated in their article:

Any quantum state of n qubits generated by a quantum circuit of depth T can be represented exactly by a sparse DBM with O(nT) neurons.

(diagram adapted from Gao and Duan’s article)

Embedding algorithms, especially word-embedding algorithms, have been one of the recurrent themes of this blog. Word2Vec has been mentioned in a few entries (see this); LDA2Vec has been covered (see this); the mathematical principle of GloVe has been elaborated (see this); I haven’t even covered Facebook’s fasttext; and I have not explained the widely used t-SNE and Kohonen’s map (self-organizing map, SOM).

I have also described the algorithm of Sammon Embedding, (see this) which attempts to capture the likeliness of pairwise Euclidean distances, and I implemented it using Theano. This blog entry is about its implementation in Tensorflow as a demonstration.

Let’s recall the formalism of Sammon Embedding, as outlined in the previous entry:

Assume there are high dimensional data described by $d$-dimensional vectors, $X_i$ where $i=1, 2, \ldots, N$. And they will be mapped into vectors $Y_i$, with dimensions 2 or 3. Denote the distances to be $d_{ij}^{*} = \sqrt{| X_i - X_j|^2}$ and $d_{ij} = \sqrt{| Y_i - Y_j|^2}$. In this problem, $Y_i$ are the variables to be learned. The cost function to minimize is

$E = \frac{1}{c} \sum_{i,

where $c = \sum_{i.

Unlike in previous entry and original paper, I am going to optimize it using first-order gradient optimizer. If you are not familiar with Tensorflow, take a look at some online articles, for example, “Tensorflow demystified.” This demonstration can be found in this Jupyter Notebook in Github.

First of all, import all the libraries required:

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf


Like previously, we want to use the points clustered around at the four nodes of a tetrahedron as an illustration, which is expected to give equidistant clusters. We sample points around them, as shown:

tetrahedron_points = [np.array([0., 0., 0.]), np.array([1., 0., 0.]), np.array([np.cos(np.pi/3), np.sin(np.pi/3), 0.]), np.array([0.5, 0.5/np.sqrt(3), np.sqrt(2./3.)])]

sampled_points = np.concatenate([np.random.multivariate_normal(point, np.eye(3)*0.0001, 10) for point in tetrahedron_points])

init_points = np.concatenate([np.random.multivariate_normal(point[:2], np.eye(2)*0.0001, 10) for point in tetrahedron_points])


Retrieve the number of points, N, and the resulting dimension, d:

N = sampled_points.shape[0]
d = sampled_points.shape[1]


One of the most challenging technical difficulties is to calculate the pairwise distance. Inspired by this StackOverflow thread and Travis Hoppe’s entry on Thomson’s problem, we know it can be computed. Assuming Einstein’s convention of summation over repeated indices, given vectors $a_{ik}$, the distance matrix is:

$D_{ij} = (a_{ik}-a_{jk}) (a_{ik} - a_{jk})^T = a_{ik} a_{ik} + a_{jk} a_{jk} - 2 a_{ik} a_{jk}$,

where the first and last terms are simply the norms of the vectors. After computing the matrix, we will flatten it to vectors, for technical reasons omitted to avoid gradient overflow:

X = tf.placeholder('float')
Xshape = tf.shape(X)

sqX = tf.reduce_sum(X*X, 1)
sqX = tf.reshape(sqX, [-1, 1])
sqDX = sqX - 2*tf.matmul(X, tf.transpose(X)) + tf.transpose(sqX)
sqDXarray = tf.stack([sqDX[i, j] for i in range(N) for j in range(i+1, N)])
DXarray = tf.sqrt(sqDXarray)

Y = tf.Variable(init_points, dtype='float')
sqY = tf.reduce_sum(Y*Y, 1)
sqY = tf.reshape(sqY, [-1, 1])
sqDY = sqY - 2*tf.matmul(Y, tf.transpose(Y)) + tf.transpose(sqY)
sqDYarray = tf.stack([sqDY[i, j] for i in range(N) for j in range(i+1, N)])
DYarray = tf.sqrt(sqDYarray)


And DXarray and DYarray are the vectorized pairwise distances. Then we defined the cost function according to the definition:

Z = tf.reduce_sum(DXarray)*0.5
numerator = tf.reduce_sum(tf.divide(tf.square(DXarray-DYarray), DXarray))*0.5
cost = tf.divide(numerator, Z)


update_rule = tf.assign(Y, Y-0.01*grad_cost/lapl_cost)
init = tf.global_variables_initializer()


The last line initializes all variables in the Tensorflow session when it is run. Then start a Tensorflow session, and initialize all variables globally:

sess = tf.Session()
sess.run(init)


Then run the algorithm:

nbsteps = 1000
c = sess.run(cost, feed_dict={X: sampled_points})
print "epoch: ", -1, " cost = ", c
for i in range(nbsteps):
sess.run(train, feed_dict={X: sampled_points})
c = sess.run(cost, feed_dict={X: sampled_points})
print "epoch: ", i, " cost =


Then extract the points and close the Tensorflow session:

calculated_Y = sess.run(Y, feed_dict={X: sampled_points})
sess.close()


Plot it using matplotlib:

embed1, embed2 = calculated_Y.transpose()
plt.plot(embed1, embed2, 'ro')


This gives, as expected,

This code for Sammon Embedding has been incorporated into the Python package mogu, which is a collection of numerical routines. You can install it, and call:

from mogu.embed import sammon_embedding
calculated_Y = sammon_embedding(sampled_points, init_points)


Word embedding has been a frequent theme of this blog. But the original embedding has been algorithms that perform a non-linear mapping of higher dimensional data to the lower one. This entry I will talk about one of the most oldest and widely used one: Sammon Embedding, published in 1969. This is an embedding algorithm that preserves the distances between all points. How is it achieved?

Assume there are high dimensional data described by $d$-dimensional vectors, $X_i$ where $i=1, 2, \ldots, N$. And they will be mapped into vectors $Y_i$, with dimensions 2 or 3. Denote the distances to be $d_{ij}^{*} = \sqrt{| X_i - X_j|^2}$ and $d_{ij} = \sqrt{| Y_i - Y_j|^2}$. In this problem, $Y_i$ are the variables to be learned. The cost function to minimize is

$E = \frac{1}{c} \sum_{i,

where $c = \sum_{i. To minimize this, use Newton's method by

$Y_{pq} (m+1) = Y_{pq} (m) - \alpha \Delta_{pq} (m)$,

where $\Delta_{pq} (m) = \frac{\partial E(m)}{\partial Y_{pq}(m)} / \left|\frac{\partial^2 E(m)}{\partial Y_{pq} (m)^2} \right|$, and $\alpha$ is the learning rate.

To implement it, use Theano package of Python to define the cost function for the sake of optimization, and then implement the learning with numpy. Define the cost function with the outline above:

import theano
import theano.tensor as T

# define variables
mf = T.dscalar('mf')         # magic factor / learning rate

# coordinate variables
Xmatrix = T.dmatrix('Xmatrix')
Ymatrix = T.dmatrix('Ymatrix')

# number of points and dimensions (user specify them)
N, d = Xmatrix.shape
_, td = Ymatrix.shape

# grid indices
n_grid = T.mgrid[0:N, 0:N]
ni = n_grid[0].flatten()
nj = n_grid[1].flatten()

# cost function
c_terms, _ = theano.scan(lambda i, j: T.switch(T.lt(i, j),
T.sqrt(T.sum(T.sqr(Xmatrix[i]-Xmatrix[j]))),
0),
sequences=[ni, nj])
c = T.sum(c_terms)

s_term, _ = theano.scan(lambda i, j: T.switch(T.lt(i, j),
T.sqr(T.sqrt(T.sum(T.sqr(Xmatrix[i]-Xmatrix[j])))-T.sqrt(T.sum(T.sqr(Ymatrix[i]-Ymatrix[j]))))/T.sqrt(T.sum(T.sqr(Xmatrix[i]-Xmatrix[j]))),
0),
sequences=[ni, nj])
s = T.sum(s_term)

E = s / c

# function compilation and optimization
Efcn = theano.function([Xmatrix, Ymatrix], E)


And implement the update algorithm with the following function:

import numpy

# training
def sammon_embedding(Xmat, initYmat, alpha=0.3, tol=1e-8, maxsteps=500, return_updates=False):
N, d = Xmat.shape
NY, td = initYmat.shape
if N != NY:
raise ValueError('Number of vectors in Ymat ('+str(NY)+') is not the same as Xmat ('+str(N)+')!')

# iteration
Efcn_X = lambda Ymat: Efcn(Xmat, Ymat)
step = 0
oldYmat = initYmat
oldE = Efcn_X(initYmat)
update_info = {'Ymat': [initYmat], 'cost': [oldE]}
converged = False
while (not converged) and step<=maxsteps:
newE = Efcn_X(newYmat)
if np.all(np.abs(newE-oldE)<tol):
converged = True
oldYmat = newYmat
oldE = newE
step += 1
print 'Step ', step, '\tCost = ', oldE
update_info['Ymat'].append(oldYmat)
update_info['cost'].append(oldE)

# return results
update_info['num_steps'] = step
return oldYmat, update_info
else:
return oldYmat


The above code is stored in the file sammon.py. We can test the algorithm with an example. Remember tetrahedron, a three-dimensional object with four points equidistant from one another. We expect the embedding will reflect this by sampling points around these four points. With the code tetrahedron.py, we implemented it this way:

import argparse

import numpy as np
import matplotlib.pyplot as plt

import sammon as sn

argparser = argparse.ArgumentParser('Embedding points around tetrahedron.')
default='embedded_tetrahedron.png',
help='file name of the output plot')

args = argparser.parse_args()

tetrahedron_points = [np.array([0., 0., 0.]),
np.array([1., 0., 0.]),
np.array([np.cos(np.pi/3), np.sin(np.pi/3), 0.]),
np.array([0.5, 0.5/np.sqrt(3), np.sqrt(2./3.)])]

sampled_points = np.concatenate([np.random.multivariate_normal(point, np.eye(3)*0.0001, 10)
for point in tetrahedron_points])

init_points = np.concatenate([np.random.multivariate_normal(point[:2], np.eye(2)*0.0001, 10)
for point in tetrahedron_points])

embed_points = sn.sammon_embedding(sampled_points, init_points, tol=1e-4)

X, Y = embed_points.transpose()
plt.plot(X, Y, 'x')
plt.savefig(args.output_figurename)


It outputs a graph:

There are other such non-linear mapping algorithms, such as t-SNE (t-distributed stochastic neighbor embedding) and Kohonen’s mapping (SOM, self-organizing map).

On August 1, my friends and I attended a meetup host by DC Data Science, titled “Predicting and Understanding Law with Machine Learning.” The speaker was John Nay, a Ph.D. candidate in Vanderbilt University. He presented his research which is at an application of natural language processing on legal enactment documents.

His talk was very interesting, from the similarity of presidents and the chambers, to the kind of topics each party focused on. He used a variety of techniques such as Word2Vec, STM (structural topic modeling), and some common textual and statistical analysis. It is quite a comprehensive study.

His work is demonstrated at predictgov.com. His work can be found in arXiv.

There are many learning algorithms that perform classification tasks. However, very often the situation is that one classifier is better on certain data points, but another is better on other. It would be nice if there are ways to combine the best of all these available classifiers.

# Voting

The simplest way of combining classifiers to improve the classification is democracy: voting. When there are n classifiers that output the same classes, the result can be simply cast by a democratic vote. This method works quite well in many problems. Sometimes, we may need to give various weights to different classifiers to improve the performance.

# Bagging and Boosting

Sometimes we can generate many classifiers with the handful amount of data available with bagging and boosting. By bagging and boosting, different classifiers are built with the same learning algorithm but with different datasets. “Bagging builds different versions of the training set by sampling with replacement,” and “boosting obtains the different training sets by focusing on the instances that are misclassified by the previously trained classifiers.” [Sesmero etal. 2015]

# Fusion

Performance of classifiers depends not only on the learning algorithms and the data, but also the set of features used. While feature generation itself is a bigger and a more important problem (not to be discussed), we do have various ways to combine different features. Sometimes we separate features into different classifiers in which the answers are to be combined, or combine all these features into one classifier. The former is called late fusion, while the latter early fusion.

# Stacking

We can also treat the prediction results of various classifiers as features of another classifiers. It is called stacking. [Wolpert 1992] “Stacking generates the members of the Stacking ensemble using several learning algorithms and subsequently uses another algorithm to learn how to combine their outputs.” [Sesmero etal. 2015] Some recent implementation in computational epidemiology employ stacking as well. [Russ et. al. 2016]

# Hidden Topics and Embedding

There is also a special type of feature generation of one classifier, using hidden topic or embedding as the latent vectors. We can generate a set of latent topics according to the data available using latent Dirichlet allocation (LDA) or correlated topic models (CTM), and describe each datasets using these topics as the input to another classifier. [Phan et. al. 2011] Another way is to represent the data using embedding vectors (such as time-series embedding, Word2Vec, or LDA2Vec etc.) as the input of another classifier. [Czerny 2015]

Embedding has been hot in recent years partly due to the success of Word2Vec, (see demo in my previous entry) although the idea has been around in academia for more than a decade. The idea is to transform a vector of integers into continuous, or embedded, representations. Keras, a Python package that implements neural network models (including the ANN, RNN, CNN etc.) by wrapping Theano or TensorFlow, implemented it, as shown in the example below (which converts a vector of 200 features into a continuous vector of 10):

from keras.layers import Embedding
from keras.models import Sequential

# define and compile the embedding model
model = Sequential()
model.compile('rmsprop', 'mse')  # optimizer: rmsprop; loss function: mean-squared error


We can then convert any features from 0 to 199 into vectors of 20, as shown below:

import numpy as np

model.predict(np.array([10, 90, 151]))


It outputs:

array([[[ 0.02915354,  0.03084954, -0.04160764, -0.01752155, -0.00056815,
-0.02512387, -0.02073313, -0.01154278, -0.00389587, -0.04596512]],

[[ 0.02981793, -0.02618774,  0.04137352, -0.04249889,  0.00456919,
0.04393572,  0.04139435,  0.04415271,  0.02636364, -0.04997493]],

[[ 0.00947296, -0.01643104, -0.03241419, -0.01145032,  0.03437041,
0.00386361, -0.03124221, -0.03837727, -0.04804075, -0.01442516]]])


Of course, one must not omit a similar algorithm called GloVe, developed by the Stanford NLP group. Their codes have been wrapped in both Python (package called glove) and R (library called text2vec).

Besides Word2Vec, there are other word embedding algorithms that try to complement Word2Vec, although many of them are more computationally costly. Previously, I introduced LDA2Vec in my previous entry, an algorithm that combines the locality of words and their global distribution in the corpus. And in fact, word embedding algorithms with a similar ideas are also invented by other scientists, as I have introduced in another entry.

However, there are word embedding algorithms coming out. Since most English words carry more than a single sense, different senses of a word might be best represented by different embedded vectors. Incorporating word sense disambiguation, a method called sense2vec has been introduced by Trask, Michalak, and Liu. (arXiv:1511.06388). Matthew Honnibal wrote a nice blog entry demonstrating its use.

There are also other related work, such as wang2vec that is more sensitive to word orders.

Big Bang Theory (Season 2, Episode 5): Euclid Alternative

DMV staff: Application?
Sheldon: I’m actually more or a theorist.

Note: feature image taken from Big Bang Theory (CBS).