Data Representation in Machine Learning

In implementing most of the machine learning algorithms, we represent each data point with a feature vector as the input. A vector is basically an array of numerics, or in physics, an object with magnitude and direction. How do we represent our business data in terms of a vector?

Primitive Feature Vector

Whether the data are measured observations, or images (pixels), free text, factors, or shapes, they can be categorized into four following types:

  1. Categorical data
  2. Binary data
  3. Numerical data
  4. Graphical data

The most primitive representation of a feature vector looks like this:

Screen Shot 2019-09-15 at 3.58.09 PM
A typical feature vector. (Source:

Numerical Data

Numerical data can be represented as individual elements above (like Tweet GRU, Query GRU), and I am not going to talk too much about it.

Categorical Data

However, for categorical data, how do we represent them? The first basic way is to use one-hot encoding:

Screen Shot 2019-09-15 at 4.02.51 PM
One-hot encoding of categorical data (Source:

For each type of categorical data, each category has an integer code. In the figure above, each color has a code (0 for red, 1 for orange etc.) and they will eventually be transformed to the feature vector on the right, with vector length being the total number of categories found in the data, and the element will be filled with 1 if it is of that category. This allows a natural way of dealing with missing data (with all elements 0) and multi-category (with multiple non-zeros).

In natural language processing, the bag-of-words model is often used to represent free-text data, which is the one-hot encoding above with words as the categories. It is a good way as long as the order of the words does not matter.

Binary Data

For binary data, it can be easily represented by one element, either 1 or 0.

Graphical Data

Graphical data are best represented in terms of graph Laplacian and adjacency matrix. Refer to a previous blog article for more information.


A feature vector can be a concatenation of various features in terms of all these types except graphical data.

However, such representation that concatenates all the categorical, binary, and numerical fields has a lot of shortcomings:

  1. Data with different categories are often seen as orthogonal, i.e., perfectly dissimilar.  It ignores the correlation between different variables. However, it is a very big assumption.
  2. The weights of different fields are not considered.
  3. Sometimes if the numerical values are very large, it outweighs other categorical data in terms of influence in computation.
  4. Data are very sparse, costing a lot of memory waste and computing time.
  5. It is unknown whether some of the data are irrelevant.

Modifying Feature Vectors

In light of the shortcomings, to modify the feature factors, there are three main ways of dealing with this:

  1. Rescaling: rescaling all of some of the elements, or reweighing, to adjust the influence from different variables.
  2. Embedding: condensing the information into vectors of smaller lengths.
  3. Sparse coding: deliberately extend the vectors to a larger length.


Rescaling means rescaling all or some of the elements in the vectors. Usually there are two ways:

  1. Normalization: normalizing all the categories of one feature to having the sum of 1.
  2. Term frequency-inverse document frequency (tf-idf): weighing the elements so that the weights are heavier if the frequency is higher and it appears in relatively few documents or class labels.


Embedding means condensing a sparse vector to a smaller vector. Many sparse elements disappear and information is encoded inside the elements. There are rich amount of work on this.

  1. Topic models: finding the topic models (latent Dirichlet allocation (LDA),  structural topic models (STM) etc.) and encode the vectors with topics instead;
  2. Global dimensionality reduction algorithms: reducing the dimensions by retaining the principal components of the vectors of all the data, e.g., principal component analysis (PCA), independent component analysis (ICA), multi-dimensional scaling (MDS) etc;
  3. Local dimensionality reduction algorithms: same as the global, but these are good for finding local patterns, where examples include t-Distributed Stochastic Neighbor Embedding (tSNE) and Uniform Manifold Approximation and Projection (UMAP);
  4. Representation learned from deep neural networks: embeddings learned from encoding using neural networks, such as auto-encoders, Word2Vec, FastText, BERT etc.
  5. Mixture Models: Gaussian mixture models (GMM), Dirichlet multinomial mixture (DMM) etc.
  6. Others: Tensor decomposition (Schmidt decomposition, Jennrich algorithm etc.), GloVe etc.

Sparse Coding

Sparse coding is good for finding basis vectors for dense vectors.

Continue reading “Data Representation in Machine Learning”

Neural-Network Representation of Quantum Many-Body States

There are many embeddings algorithm for representations. Sammon embedding is the oldest one, and we have Word2Vec, GloVe, FastText etc. for word-embedding algorithms. Embeddings are useful for dimensionality reduction.

Traditionally, quantum many-body states are represented by Fock states, which is useful when the excitations of quasi-particles are the concern. But to capture the quantum entanglement between many solitons or particles in a statistical systems, it is important not to lose the topological correlation between the states. It has been known that restricted Boltzmann machines (RBM) have been used to represent such states, but it has its limitation, which Xun Gao and Lu-Ming Duan have stated in their article published in Nature Communications:

There exist states, which can be generated by a constant-depth quantum circuit or expressed as PEPS (projected entangled pair states) or ground states of gapped Hamiltonians, but cannot be efficiently represented by any RBM unless the polynomial hierarchy collapses in the computational complexity theory.

PEPS is a generalization of matrix product states (MPS) to higher dimensions. (See this.)

However, Gao and Duan were able to prove that deep Boltzmann machine (DBM) can bridge the loophole of RBM, as stated in their article:

Any quantum state of n qubits generated by a quantum circuit of depth T can be represented exactly by a sparse DBM with O(nT) neurons.


(diagram adapted from Gao and Duan’s article)

Continue reading “Neural-Network Representation of Quantum Many-Body States”

Dream of Automation

It is a fantasy for a lot of entrepreneurs, scientists and engineers to develop a software project that can automatically perform feature generation, training, and prediction automatically.

Of course it is a wishful thinking. There is no free lunch.

In big companies that have abundant resources (training data, brains, clusters), they can probably so something like deep learning to get the relevant features, and build classification models. It is almost automatic. It virtually takes no manual addition of human knowledge. Some scientists and engineers are enjoying the strength of word2vec, but it takes a lot of computer resources to even train a word2vec model.

If we do not have enough training data or computing resources, to get a good classifier, we ought to add human knowledge to generate features. We might even need to impose some rules to convert the raw data to sensible features. The rules might be regular expressions, or some calculations, or some filters, or it involves a knowledge database (like WordNet). Things might be simplified if the problem we are dealing with is in a specific domain, that reduces the amount of human knowledge we need to add.

Create a free website or blog at

Up ↑