In 2019, Google published a new Python library called “tensornetwork” (arXiv:1905.01330) that facilitates the computation of… tensor networks. Tensor network is a tool from quantum many-body theory, widely used in condensed matter physics. There have been a lot of numerical packages for tensor computation, but this library takes it to the next level because of its distinctive framework.

What is a tensor network, though?

“A tensor network is a collection of tensors with indices connected according to a network pattern. It can be used to efficiently represent a many-body wave-function in an otherwise exponentially large Hilbert space.”

https://www.perimeterinstitute.ca/research/research-initiatives/tensor-networks-initiative

## Renormalization Group (RG)

It is not until recently that tensor networks have its application in machine learning. As stated in a previous post, a mathematical connection between restricted Boltzmann machine (RBM) and variational renormalization group (RG) was drawn. (arXiv:1410.1831) It shedded light to the understanding of interpretability of deep learning, which has been criticized to be a black box. However, RBM is just a type of unsupervised machine learning, but how about others?

Seeing this, Schwab, one of the authors of the RG paper, and Stoudenmire did some work to realize the use of RG in machine learning. Stoudenmore is a physicist, and he made use of density matrix renormalization group (DMRG) that he is familiar with, and invented a supervised learning algorithm, which is later renamed tensor network machine learning (TNML). The training is adapted from the sweeping algorithm, the standard of DMRG, that combining bipartite site one-by-one, updating it, and decomposing into two site by studying its quantum entanglment (using singular value decomposition, or Schmidt decomposition).

Instead of bringing interpretability to deep learning, this work in fact opened a new path of new machine learning algorithms with known techniques.

## What is RG?

Renormalization group (RG) is a formalism of “zooming out” in scale-invariant system, determining which terms to truncate in a model. It is an important formalism in high energy physics and statistical field theory. (See Ma’s book for reference.)

Density matrix renormalization group (RG) is a variational real-space numerical technique that look at collections of quantum bits (zoomed-out) as a block. It was invented by Steven White, and it is useful in studying strongly correlated electronic systems. (PRL 69 (19): 2863-2866 (1992)). However, the original DMRG paper is not very accessible, until it is rephrased using the tensor network notation (TNN), as shown in Schollwoeck’s article.

## Is Tensor Network Related to Quantum Computing?

This is not an easy question to answer. Tensor networks come from quantum physics, but quantum physics is usually not directly leading to quantum computing. In fact, classical computing hardwares have a lot of quantum physics in it. A simple answer to this question is no, as the algorithm using tensor network is implemented in classical computers.

There have been a lot of publications on quantum machine learning lately. A classic book on this topic is written by Peter Wittek. The book covers topics on basic machine learning and quantum computing, and then quantum machine learning algorithms. There is a quantum counterpart of each of the common machine learning algorithms in the book. However, we know it would be much more useful if there are new algorithms exploiting the advantages of quantum computing. Tensor network is a natural choice as it builds on qubits, and the representations and operations are naturally quantum.

## Next…

Tensor network is an interesting subject from both a theoretical and applicational perspective. In coming posts I will talk about its application on machine learning and a taste of codes.

• Chase Roberts, Ashley Milsted, Martin Ganahl, Adam Zalcman, Bruce Fontaine, Yijian Zou, Jack Hidary, Guifre Vidal, Stefan Leichenauer, “TensorNetwork: A Library for Physics and Machine Learning,” arXiv:1905.01330 (2019). [arXiv]
• “Google TensorNetwork Library Dramatically Accelerates ML & Physics Tasks,” Syncedreview. (2019) [Medium]
• Chase Roberts, “Introducing TensorNetwork, an Open Source Library for Efficient Tensor Calculations,” Google AI Blog. (2019) [GoogleAIBlog]
• “Tensor Networks and Density Matrix Renormalization Group,” Everything About Data Analytics. (2016) [WordPress]
• P. Mehta, D. J. Schwab, “An exact mapping between the Variational Renormalization Group and Deep Learning,” arXiv:1410.3831 (2014). [arXiv]
• Sheng-kang Ma, Modern Theory of Critical Phenomena, (New York, NY: Routledge, 2018). [Amazon]
• S. R. White, “Density matrix formulation for quantum renormalization groups,” Phys. Rev. Lett. 69, 2863 (1992). [APS]
• Ulrich Schollwoeck, “The density-matrix renormalization group,” Rev. Mod. Phys. 77, 259 (2005); arXiv:cond-mat/0409292. [arXiv]
• Ulrich Schollwoeck, “The density-matrix renormalization group in the age of matrix product states,” Annals of Physics 326, 96 (2011); arXiv:1008.3477. [arXiv]
• Peter Wittek, Quantum Machine Learning: What Quantum Computing Means to Data Mining (San Diego, CA: Academic Press, 2014). [Amazon] [PDF]
• Jacob Biamonte, Peter Wittek, Nicola Pancotti, Patrick Rebentrost, Nathan Wiebe, Seth Lloyd, “Quantum Machine Learning,” Nature 549, 195-202 (2017). [Nature]
• Tensor Networks: From Entangled Quantum Matter to Emergent Space Time, Perimeter Institute. [Perimeter]

Feature picture taken from Perimeter Institute.

Generative adversarial networks (GANs) have made a big impact to the world of machine learning. It is particularly useful for generating sample data when there are insufficient data for certain purposes. It is also useful for training using data with both labeled and unlabeled data, i. e., semi-supervised learning. (SSL)

The rise of GANs also lead to the re-emergence of adversarial learning regarding the handling of unbalanced data or sensitive data. (For example, see arXiv:1707.00075.)

GAN is particularly useful for computer vision problems. However, it is not very good for natural language problems as the data cannot be generated continuously. Under this context, a modification on GAN is developed, called discriminative adversarial networks (DAN, see arXiv:1707.02198.). Unlike GANs that has a discriminator to train a generator to produce good data, DAN has two discriminators: one discriminator, usually denoted as the predictor P, that predicts on the unlabeled data, and another, usually denoted as the judge J, that classifies whether the label is a human label or a machine-predicted label.

The loss function of DAN is very similar to that of GAN: minimizing the entropy difference for the judge J for labeled data, but minimizing that for predictions for unlabeled data for the predictor P.

However, GAN and DAN are not generative-disciminative pairs.

Deep learning has achieved a big success in the past few years, but its interpretive power is limited. They work largely because of the abundance of data. On the other hand, traditional machine learning algorithms are much better in interpretive power, but manual feature engineering costs a lot, due to the lack of data in earlier era. In light of this, a group of scientists initiated the work of graph networks, aiming at devising new artificial intelligence algorithms that exploits the advantages of two worlds, while still holding the principle of combinatorial generalization in constructing methods by using known building blocks to build new methods. Graph is good at interpretation as it is good for relational representation.

The use of graph networks is more than the graph convolutional neural networks (GCN) in the previous two blog entries. (part I and part II) However, to achieve relational inductive biases, an entity (an element with attributes), a relation, (a property between entities) and a rule. (a function that maps entities and relations to other entities and relations) This can be realized using graph, which is a mathematical structure that contains nodes and edges (that connect nodes.) To generalize the use of graph networks in various machine learning and deep learning methods, they reviewed the graph block, which is basically a function, or a mapping, from a graph to another graph, as shown in the algorithm below:

Works of graph networks are not non-existent; the authors listed previous works that can be seen as graph networks, for example:

• Message-passing neural network (MPNN) (2017);
• Non-local neural networks (NLNN) (2018).

The use of graph networks, I believe, is the next trend. There have been works regarding the graph-powered machine learning. (see Google AI blog, GraphAware Slideshare) I recently started an open-source project, a Python package called graphflow, to explore various algorithms using graphs, including PageRank, HITS, resistance, and non-linear resistance.

In the previous post, the convolution of the graph Laplacian is defined in its graph Fourier space as outlined in the paper of Bruna et. al. (arXiv:1312.6203) However, the eigenmodes of the graph Laplacian are not ideal because it makes the bases to be graph-dependent. A lot of works were done in order to solve this problem, with the help of various special functions to express the filter functions. Examples include Chebyshev polynomials and Cayley transform.

## ChebNet

Kipf and Welling proposed the ChebNet (arXiv:1609.02907) to approximate the filter using Chebyshev polynomial, using the result proved by Hammond et. al. (arXiv:0912.3848) With a convolutional layer $g_{\theta} \circ x$, instead of computing it directly by $\Phi^T g_{\theta} \Phi x$, it approximates the filter in terms of Chebyshev polynomial:

$g_{\theta} (\Lambda) \approx \sum_{k=0}^K \theta_k T_k(\Lambda)$,

where $T_k(x)$ are the Chebyshev polynomials. It has been proved by Hammond et. al. that by well truncating this representation, Chebyshev polynomials are very good to approximate the function. Then the convolution is:

$g_{\theta} \circ x \approx \sum_{k=0}^K \theta_k T_k (\Lambda) x$.

Here, $\theta$’s are the parameters to be trained. This fixed the bases of the representation, and it speeds up computation. The disadvantage is that eigenvalues are clusters in a few values with large gaps.

## CayleyNet

The problem of ChebNet led to the work of Levie et. al., (arXiv:1705.07664) who proposed another approximation is used using Cayley transform. They made use of the Cayley function:

$C(\lambda) = \frac{\lambda - i}{\lambda + i}$,

which is a bijective function from $\mathbb{R}^+$ to complex unit half-circle. Instead of Chebyshev polynomials, it approximates the filter as:

$g(\lambda) = c_0 + \sum_{j=1}^r \left[ c_j C^j (h \lambda) + c_j^{\ast} C^{j \ast} (h \lambda) \right]$,

where $c_0$ is real and other $c_j$’s are generally complex, and $h$ is a zoom parameter, and $\lambda$’s are the eigenvalues of the graph Laplacian. Tuning $h$ makes one find the best zoom that spread the top eigenvalues. $c$‘s are computed by training. This solves the problem of unfavorable clusters in ChebNet.

## MotifNet

All previous works are undirected graph. How do we deal with directed graph with an asymmetric graph Laplacian? Benson, Gleich, Leskovec published an important work on Science in 2016 (arXiv:1612.08447) to address this problem. Their approach is reducing a directed graph to a higher order structure called network motifs. There are 13 network motifs. For each network motif, one can define an adjacency matrix for that motif by $\mathcal{M}_k$, with elements $\mathcal{M}_{k, ij}$ being the number of motifs in the graph the the edge $(i, j)$ that it belongs to.

Then one computes 13 graph Laplacians from these 13 adjacency matrices. These graph Laplacians are symmetric, like those of undirected graphs. Then any filters can be approximated by the following multivariate matrix polynomial, as suggested by Monti, Otness, and Bronstein in their MotifNet paper (arXiv:1802.01572):

$f_{\Theta} (\Delta_1, \dots, \Delta_k) = \sum_{j=0}^P \sum_{k_1, \dots, k_j \in {1, \ldots, K}} \theta_{\theta_1, \ldots, \theta_k} \Delta_{k_1}, \ldots, \Delta_{k_j}$.

Applications of image processing, citation networks etc.

Suggested by some friends, I have been reading graph convolutional neural network. This is not a recent interest, as I have been interested in some graph-related problems (as I developed the graphflow package) and topological data analysis (TDA, as I developed the mogutda package). Graph theory is not a new field as well, but it is interesting to see how it is being applied in deep learning, as a lot of real-world relational data can be expressed in terms of graphs. (Examples: word relations, organizational structure, genes, internet etc.) With the increasing attention to neo4j, we know graph is coming to be hot.

It would be helpful to review some basic graph theory:

• A graph is represented by $\mathcal{G} = (V, E)$, where $V$ and $E$ are the nodes (vertices) and edges respectively.
• A graph can be directed or undirected. In graph convolutional neural network, they are undirected usually.
• The adjacency matrix $A$ describes how nodes are connected: $A_{ij} = 1$ if there is an edge connecting from node $i$ to node $j$, and $0$ otherwise. $A$ is a symmetric matrix for an undirected graph.
• The incidence matrix $B$ is another way to describe how nodes are connected: $B_{ij} = 1$ if a node $i$ is connected with edge $j$. This is useful for undirected graph.
• The degree matrix $D$ is a diagonal matrix, with elements $D_{ii}$ denotes the number of neighbors for node $i$ in undirected matrix.
• The function acting on the nodes is called the filter.
• The graph Laplacian, or Kirchhoff matrix, is defined by $L = D - A$, and the normalized graph Laplacian is $\tilde{L} = I - D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$.

The graph Laplacian is the most important matrix in graph convolutional neural network. It is analogous to the Laplacian operator in Euclidean space, $\nabla^2$. The reader can easily verify this by constructing a graph of 2D lattice and compute the graph Laplacian matrix, and find that it is the same as the discretized Laplacian operator.

We can also get some insights from the Euclidean analogue. In physics, the solution to the Laplacian equation is harmonic: the basis of the solution can be described in the spectral/Fourier space, as:

$\nabla^2 \psi = - \omega^2 \psi$,

And $\psi \propto e^{\pm i \omega x}$. In graph convolutional neural network, as Bruna et. al. suggested in 2013, the graph is calculated in the graph Fourier space, instead of directly dealing with the Laplacian matrix in all layers of network.

On the other hand, we know that for the convolution

$(f \circ g) (x) = \int dy f(x-y) g(y)$,

its Fourier transform is given by

$\tilde{( f \circ g)} (\omega) = \tilde{f} (\omega) \tilde{g} (\omega)$.

In Fourier space, the convolution of two functions are just their products. Similarly, in graph convolutional neural network, convolutions can be computed in the Fourier space as the mere product of two filters in the Fourier space. More specifically, for finding the convolution of the filters $f$ and $g$, with $\Phi$ being the unitary eigenmatrix,

$f \circ g = \Phi ((\Phi^T g) \circ (\Phi^T f)) = \Phi \textbf{diag}(g_1, g_2, \ldots, g_n) f$.

However, such general description is basis-dependent, and it is still computationally expensive. More work has been proposed to smooth the representation, which will be covered in the upcoming blogs.

A side note, the readers can verify themselves that

$\sum_{ij} A_{ij} | f_i - g_j |^2 = f \tilde{L} g$

Sebastian Ruder recently wrote an article on The Gradient and asserted that the oracle of natural language processing is emerging. While I am not sure such confident statement is overstated, I do look forward to the moment that we will download pre-trained embedded language models and transfer to our use cases, just like we are using pre-trained word-embedding models such as Word2Vec and FastText.

I do not think one can really draw a parallelism between computer vision and natural language processing. Computer vision is challenging, but natural language processing is even more difficult because the tasks regarding linguistics are not limited to object or meaning recognition, but also human psychology, cultures, and linguistic diversities. The objectives are far from being identical.

However, the transferrable use of embedded language models is definitely a big step forward. Ruder quoted three articles, which I would summarize below in a few words.

• Embeddings from Language Models (ELMo, arXiv:1802.05365): based on the successful bidirectional LSTM language models, the authors developed a deep contextualized embedded models by collapses all layers in the neural network architecture.
• Universal Language Model Fine-Tuning for Text Classification (ULMFiT, arXiv:1801.06146): the authors proposed a type of architectures that learn representations for specific tasks, which involve three steps in training: a) LM pre-training: learning through unlabeled corpus with abundant data; b) LM fine-tuning: learning through labeled corpus; and c) classifier fine-tuning: transferred training for specific classification tasks.
• OpenAI Transformer (article still in progress): the author proposed a simple generative language model with the three similar steps in ULMFit: a) unsupervised pre-training: training a language model that maximizes the likelihood of a sequence of tokens within a context window; b) supervised fine-tuning: a supervised classification training that maximizes the likelihood using the Bayesian approach; c) task-specific input transformations: training the classifiers on a specific task.

These three articles are intricately related to each other. Without abundant data and good hardware, it is almost impossible to produce the language models. As Ruder suggested, we will probably have a pre-trained model up to the second step of the ULMFit and OpenAI Transformer papers, but we train our own specific model for our use. We have been doing this for word-embedding models, and this approach has been common in computer vision too.

Google launches her AutoML project last year, in an effort to automate the process of seeking the most appropriate neural net designs for a particular classification problem. Designing neural networks have been time consuming, despite the use of TensorFlow / Keras or other deep learning architecture nowadays. Therefore, the Google Brain team devised the Neural Architecture Search (NAS) using a recurrent neural network to perform reinforcement learning. (See their blog entry.) It is used to find the neural networks for image classifiers. (See their blog entry.)

Apparently, with a state-of-the-art hardware, it is of Google’s advantage to perform such an experiment on the CIFAR-10 dataset using 450 GPUs for 3-4 days. But this makes the work inaccessible for small companies or personal computers.

Then it comes an improvement to NAS: the Efficient Neural Architecture Search via Parameter Sharing (ENAS), which is a much more efficient method to search for a neural networks, by narrowing down the search in a subgraph. It reduces the need of GPUs.

While I do not think it is a threat to machine learning engineers, it is a great algorithm to note. It looks to me a brute-force algorithm, but it needs scientists and engineers to gain insights. Still, I believe development of the theory behind neural networks is much needed.

Recently, Geoffrey Hinton and his colleagues made the article about capsules available. He has been known to heavily criticize the use of pooling and back propagation.

“A capsule is a group of neurons whose activity vector represents the instantiation parameters of a specific type of entity such as an object or object part.” The nodes of inputs and outputs are vectors, instead of scalars as in neural networks. A cheat sheet comparing the traditional neurons and capsules is as follow:

Based on the capsule, the authors suggested a new type of layer called CapsNet.

Huadong Liao implemented CapsNet with TensorFlow according to the paper. (Refer to his repository.)

The theory and the interpretability of deep neural networks have always been called into questions. In the recent few years, there have been several ideas uncovering the theory of neural networks.

# Renormalization Group (RG)

Mehta and Schwab analytically connected renormalization group (RG) with one particular type of deep learning networks, the restricted Boltzmann machines (RBM). (See their paper and a previous post.) RBM is similar to Heisenberg model in statistical physics. This weakness of this work is that it can only explain only one type of deep learning algorithms.

However, this insight gives rise to subsequent work, with the use of density matrix renormalization group (DMRG), entanglement renormalization (in quantum information), and tensor networks, a new supervised learning algorithm was invented. (See their paper and a previous post.)

# Neural Networks as Polynomial Approximation

Lin and Tegmark were not satisfied with the RG intuition, and pointed out a special case that RG does not explain. However, they argue that neural networks are good approximation to several polynomial and asymptotic behaviors of the physical universe, making neural networks work so well in predictive analytics. (See their paper, Lin’s reply on Quora, and a previous post.)

# Information Bottleneck (IB)

Tishby and his colleagues have been promoting information bottleneck as a backing theory of deep learning. (See previous post.) In recent papers such as arXiv:1612.00410, on top of his information bottleneck, they devised an algorithm using variation inference.

# Generalization

Recently, Kawaguchi, Kaelbling, and Bengio suggested that “deep model classes have an exponential advantage to represent certain natural target functions when compared to shallow model classes.” (See their paper and a previous post.) They provided their proof using generalization theory. With this, they introduced a new family of regularization methods.

# Geometric View on Generative Adversarial Networks (GAN)

Recently, Lei, Su, Cui, Yau, and Gu tried to offer a geometric view of generative adversarial networks (GAN), and provided a simpler method of training the discriminator and generator with a large class of transportation problems. However, I am still yet to understand their work, and their experimental results were done on low-dimensional feature spaces. (See their paper.) Their work is very mathematical.

In their paper, Kawaguchi, Kaelbling, and Bengio explored the theory of why generalization in deep learning is so good. Based on their theoretical insights, they proposed a new regularization method, called Directly Approximately Regularizing Complexity (DARC), in addition to commonly used Lp-regularization and dropout methods.

This paper explains why deep learning can generalize well, despite large capacity and possible algorithmic instability, nonrobustness, and sharp minima, effectively addressing an open problem in the literature. Based on our theoretical insight, this paper also proposes a family of new regularization methods. Its simplest member was empirically shown to improve base models and achieve state-of-the-art performance on MNIST and CIFAR-10 benchmarks. Moreover, this paper presents both data-dependent and data-independent generalization guarantees with improved convergence rates. Our results suggest several new open areas of research.