In 2019, Google published a new Python library called “tensornetwork” (arXiv:1905.01330) that facilitates the computation of… tensor networks. Tensor networks are a tool from quantum many-body theory, widely used in condensed matter physics. There have been many numerical packages for tensor computation, but this library takes things further because of its distinctive, general-purpose framework.

What is a tensor network, though?

“A tensor network is a collection of tensors with indices connected according to a network pattern. It can be used to efficiently represent a many-body wave-function in an otherwise exponentially large Hilbert space.”

https://www.perimeterinstitute.ca/research/research-initiatives/tensor-networks-initiative
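As a tiny concrete example of what “indices connected according to a network pattern” means computationally, here is a sketch in plain NumPy that contracts a three-tensor chain with `einsum`. The shapes and index labels are arbitrary illustrative choices; a library such as TensorNetwork automates and accelerates exactly this kind of contraction.

```python
import numpy as np

# Three tensors sharing "bond" indices a and b form a small tensor network:
#   A[i, a] -- B[a, j, b] -- C[b, k]
# Contracting over the shared indices a and b yields a single tensor T[i, j, k].
rng = np.random.default_rng(0)
A = rng.normal(size=(2, 4))      # indices (i, a)
B = rng.normal(size=(4, 3, 5))   # indices (a, j, b)
C = rng.normal(size=(5, 2))      # indices (b, k)

# einsum sums over the repeated bond indices; this summation is exactly what
# "connecting legs" in a tensor network diagram means.
T = np.einsum('ia,ajb,bk->ijk', A, B, C)
print(T.shape)  # (2, 3, 2)
```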

## Renormalization Group (RG)

It was not until recently that tensor networks found applications in machine learning. As stated in a previous post, a mathematical connection between the restricted Boltzmann machine (RBM) and the variational renormalization group (RG) was drawn. (arXiv:1410.3831) It shed light on the interpretability of deep learning, which has been criticized as a black box. However, the RBM is just one type of unsupervised machine learning; what about the others?

Seeing this, Schwab, one of the authors of the RG paper, and Stoudenmire did some work to realize the use of RG in machine learning. Stoudenmire is a physicist, and he made use of the density matrix renormalization group (DMRG) that he is familiar with to devise a supervised learning algorithm, later called tensor network machine learning (TNML). The training is adapted from the sweeping algorithm that is standard in DMRG: neighboring sites are combined into a single bond tensor one pair at a time, the bond tensor is updated, and it is then decomposed back into two sites by examining its quantum entanglement (via a singular value decomposition, i.e., a Schmidt decomposition).
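The mechanical core of such a sweep can be sketched in a few lines of NumPy: merge two neighboring tensors into one bond tensor, and (after whatever update the learning objective dictates, which is omitted here) split it back into two sites with a truncated SVD. The shapes and truncation threshold below are my own illustrative choices, not the ones used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
chi, d = 8, 2            # illustrative bond dimension and physical dimension

# Two neighboring MPS-style tensors: left[alpha, s1, beta], right[beta, s2, gamma]
left  = rng.normal(size=(chi, d, chi))
right = rng.normal(size=(chi, d, chi))

# 1) Merge the pair into a single bond tensor over the shared index beta.
bond = np.einsum('asb,btc->astc', left, right)        # (alpha, s1, s2, gamma)

# 2) In the actual algorithm the bond tensor would be updated here, e.g. by a
#    gradient step on the training loss; we skip that and only show the split.

# 3) Split back into two sites with an SVD, truncating tiny singular values.
mat = bond.reshape(chi * d, d * chi)
U, S, Vh = np.linalg.svd(mat, full_matrices=False)
keep = min(chi, int(np.sum(S > 1e-10)))               # truncation rule (illustrative)
U, S, Vh = U[:, :keep], S[:keep], Vh[:keep, :]

new_left  = U.reshape(chi, d, keep)                   # (alpha, s1, beta')
new_right = (np.diag(S) @ Vh).reshape(keep, d, chi)   # (beta', s2, gamma)
print(new_left.shape, new_right.shape)
```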

Rather than bringing interpretability to deep learning, this work in fact opened a new path to machine learning algorithms built from known physics techniques.

## What is RG?

Renormalization group (RG) is a formalism for “zooming out” in scale-invariant systems and for determining which terms in a model can be truncated. It is an important formalism in high energy physics and statistical field theory. (See Ma’s book for reference.)

Density matrix renormalization group (DMRG) is a variational, real-space numerical technique that treats collections of quantum sites (zoomed out) as a block. It was invented by Steven R. White, and it is useful for studying strongly correlated electronic systems. (PRL 69 (19): 2863-2866 (1992)) However, the original DMRG formulation was not very accessible until it was rephrased in tensor network notation (TNN), as in Schollwöck’s review articles.

## Is Tensor Network Related to Quantum Computing?

This is not an easy question to answer. Tensor networks come from quantum physics, but quantum physics does not automatically mean quantum computing; in fact, classical computing hardware relies on a lot of quantum physics. A simple answer to this question is no, since the tensor network algorithms discussed here are implemented on classical computers.

There have been a lot of publications on quantum machine learning lately. A classic book on this topic was written by Peter Wittek. The book covers basic machine learning and quantum computing, and then quantum machine learning algorithms, giving a quantum counterpart for each of the common machine learning algorithms. However, it would be much more useful if there were new algorithms that exploit the advantages of quantum computing from the start. Tensor networks are a natural candidate, as they are built from qubit-like degrees of freedom, and their representations and operations are naturally quantum.

## Next…

Tensor networks are an interesting subject from both a theoretical and an application perspective. In coming posts I will talk about their applications to machine learning and give a taste of the code.

• Chase Roberts, Ashley Milsted, Martin Ganahl, Adam Zalcman, Bruce Fontaine, Yijian Zou, Jack Hidary, Guifre Vidal, Stefan Leichenauer, “TensorNetwork: A Library for Physics and Machine Learning,” arXiv:1905.01330 (2019). [arXiv]
• “Google TensorNetwork Library Dramatically Accelerates ML & Physics Tasks,” Syncedreview. (2019) [Medium]
• Chase Roberts, “Introducing TensorNetwork, an Open Source Library for Efficient Tensor Calculations,” Google AI Blog. (2019) [GoogleAIBlog]
• “Tensor Networks and Density Matrix Renormalization Group,” Everything About Data Analytics. (2016) [WordPress]
• P. Mehta, D. J. Schwab, “An exact mapping between the Variational Renormalization Group and Deep Learning,” arXiv:1410.3831 (2014). [arXiv]
• Shang-keng Ma, Modern Theory of Critical Phenomena, (New York, NY: Routledge, 2018). [Amazon]
• S. R. White, “Density matrix formulation for quantum renormalization groups,” Phys. Rev. Lett. 69, 2863 (1992). [APS]
• Ulrich Schollwoeck, “The density-matrix renormalization group,” Rev. Mod. Phys. 77, 259 (2005); arXiv:cond-mat/0409292. [arXiv]
• Ulrich Schollwoeck, “The density-matrix renormalization group in the age of matrix product states,” Annals of Physics 326, 96 (2011); arXiv:1008.3477. [arXiv]
• Peter Wittek, Quantum Machine Learning: What Quantum Computing Means to Data Mining (San Diego, CA: Academic Press, 2014). [Amazon] [PDF]
• Jacob Biamonte, Peter Wittek, Nicola Pancotti, Patrick Rebentrost, Nathan Wiebe, Seth Lloyd, “Quantum Machine Learning,” Nature 549, 195-202 (2017). [Nature]
• Tensor Networks: From Entangled Quantum Matter to Emergent Space Time, Perimeter Institute. [Perimeter]

Feature picture taken from Perimeter Institute.

A while ago, Mehta and Schwab drew a connection between the restricted Boltzmann machine (RBM), a type of deep learning algorithm, and the renormalization group (RG), a theoretical tool in physics applied to critical phenomena. [Mehta & Schwab, 2014; see previous entry] Can RG be related to other deep learning algorithms?

Schwab, together with Stoudenmire, wrote a paper on a new machine learning algorithm that directly exploits a type of RG used in physics: the density matrix renormalization group (DMRG). DMRG is used in condensed matter physics for low-dimensional (d = 1 or 2) lattice systems. DMRG was invented by Steven R. White, originally using diagonalization of the reduced density matrices of blocks of sites. [White 1992] Nowadays, however, it is usually formulated in terms of singular value decompositions applied to successive pairs of lattice sites.

DMRG is closely related to quantum entanglement: for a bipartite quantum system, the entanglement between the two halves can be characterized by the reduced density matrix of either half. DMRG, however, deals with reduced density matrices across all the sites of a lattice. The simplest description of such a many-body system is an unentangled product state, written as a ket:

$|\Psi \rangle = \sum_{\sigma_1 \ldots \sigma_L} c^{\sigma_1} \ldots c^{\sigma_L} |\sigma_1 \ldots \sigma_L \rangle$.

These c‘s are c-numbers, so this form carries no entanglement. To describe the entanglement of more general states while remaining numerically convenient, it is desirable to promote these c-numbers to matrices: [Schollwöck 2011]

$c^{\sigma_1} \ldots c^{\sigma_L} \rightarrow M^{\sigma_1} \ldots M^{\sigma_L}$.
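To make this arrow concrete, here is a NumPy sketch (with illustrative sizes) that takes the full coefficient tensor of a small spin chain and rewrites it exactly as a chain of matrices, one $M^{\sigma_i}$ per site, via successive singular value decompositions; truncating small singular values at each step is what would make the representation compact.

```python
import numpy as np

rng = np.random.default_rng(2)
L, d = 6, 2                              # number of sites, local dimension (spin-1/2)

# Full coefficient tensor c[s1, ..., sL]: 2**L amplitudes, exponentially many in L.
c = rng.normal(size=(d,) * L)
c /= np.linalg.norm(c)

# Peel off one site at a time with an SVD; each U becomes the matrix M^{sigma_i}.
# (No truncation here, so for a generic state the bond dimensions grow to their maximum.)
matrices, rest, bond = [], c.reshape(d, -1), 1
for site in range(L - 1):
    rest = rest.reshape(bond * d, -1)
    U, S, Vh = np.linalg.svd(rest, full_matrices=False)
    matrices.append(U.reshape(bond, d, -1))   # M^{sigma_site}, indexed (left, sigma, right)
    bond = S.size
    rest = np.diag(S) @ Vh                    # carry the remainder to the next site
matrices.append(rest.reshape(bond, d, 1))     # last site

print([M.shape for M in matrices])
```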

And these matrix product states are tensor networks. DMRG aims at finding a good description of a state in terms of such a tensor network. Tensor networks also have a nice graphical representation, as shown in the appendix of the paper by Stoudenmire and Schwab, and the training procedure is described elegantly in their paper using these tensor network diagrams. Their new algorithm proves to be a good new machine learning method, probably best suited to modest amounts of data with complicated features. It is a direct application of real-space RG in a machine learning algorithm. Stoudenmire wrote on Quora about the value of this work:

“In our work… we reached state-of-the-art accuracy for the MNIST dataset without needing extra techniques such as convolutional layers. One exciting aspect of these proposals is that their cost scales at most linearly in the number of training examples, versus quadratically for most kernel methods. Representing parameters by a tensor network gives them a structure that can be analyzed to better understand the model and what it has learned. Also tensor network optimization methods are adaptive, automatically selecting the minimum number of parameters necessary for the optimal solution within a certain tensor network class.” – Miles Stoudenmire, in Quora

There are extensions beyond DMRG, such as the multiscale entanglement renormalization ansatz (MERA), developed by Vidal and his colleagues. [Vidal 2008]

Steve R. White (adapted from his faculty homepage)

Tensor Diagram of the Training of this New Algorithm. (Taken from arXiv:1605.05775)

Ever since Mehta and Schwab mathematically laid out the relationship between restricted Boltzmann machines (RBM), a building block of deep learning, and the renormalization group (see my previous entry), scientists have been discussing why deep learning works so well. Recently, Henry Lin and Max Tegmark put a preprint on arXiv (arXiv:1608.08225), arguing that deep learning works because it captures a few essential physical laws and properties. Tegmark is a cosmologist.

Physical laws are simple in the sense that a few properties, such as locality, symmetry, and hierarchy, lead to large-scale, universal, and often complex phenomena. Many machine learning algorithms, including deep learning algorithms, have deep relations with the formalisms of statistical mechanics.

Many machine learning algorithms are basically applied probability theory. Lin and Tegmark outlined a few classes of algorithms in terms of the probabilities they estimate, and they related those probabilities to Hamiltonians in many-body systems.
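The dictionary they rely on is essentially the Boltzmann form: any (strictly positive) probability distribution can be written in terms of an energy function, or Hamiltonian,

$p(\mathbf{x}) = \frac{1}{Z} e^{-H(\mathbf{x})}, \qquad H(\mathbf{x}) = -\ln p(\mathbf{x}) + \text{const}$,

so statements about the distributions a network has to model become statements about Hamiltonians.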

They argued why neural networks can approximate functions (polynomials) so well, exhibiting a simple neural network that performs multiplication. Invoking the central limit theorem or Jaynes’ arguments (see my previous entry), they said that the Hamiltonians encountered in practice can typically be approximated by low-order polynomials, much as many many-body systems can be approximated by a fourth-order Landau-Ginzburg-Wilson (LGW) functional.
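As a toy check of the multiplication claim, the sketch below uses four neurons with a softplus nonlinearity (standing in for any smooth activation whose second derivative at the origin is nonzero); the specific weights are my own illustrative choice, not taken from their paper. For small inputs the network output approximates the product $xy$.

```python
import numpy as np

def softplus(u):
    # Smooth activation with nonzero curvature at 0: softplus''(0) = 1/4.
    return np.log1p(np.exp(u))

def approx_multiply(x, y):
    # Taylor-expanding softplus around 0 gives
    #   softplus(u) + softplus(-u) ≈ ln(4) + u**2 / 4,
    # so the combination below is ≈ ((x+y)**2 - (x-y)**2) / 4 = x*y for small x, y.
    return (softplus(x + y) + softplus(-x - y)
            - softplus(x - y) - softplus(y - x))

for x, y in [(0.1, 0.2), (0.05, -0.1), (0.3, 0.3)]:
    print(x * y, approx_multiply(x, y))
```

The quadratic terms of the expansion survive the sum while the constant and odd terms cancel, leaving the product up to fourth-order corrections.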

Properties such as locality reduce the number of parameters needed, because interactions are restricted to close neighbors. Symmetry reduces them further, and lowers the computational complexity as well. Symmetry and second-order phase transitions make the scaling hypothesis possible, leading to the use of tools such as the renormalization group (RG). As many people have argued, deep learning resembles RG because it filters out unnecessary information and keeps the crucial features. Lin and Tegmark use classifying cats vs. dogs as an example, analogous to retrieving the temperature of a many-body system through an RG procedure. They also gave a counterexample to Schwab’s paper, in which the probabilities are not preserved by the RG procedure; while the counterexample is sound, preserving those probabilities is not the point of the RG procedure anyway.

They also discussed no-flattening theorems for neural networks.

The lack of interpretability of deep learning has bothered a lot of scientists and engineers, despite its powerful applications in data cleaning, natural language processing, playing Go, computer vision, etc. A while ago, as stated in my previous blog entry, Mehta and Schwab discussed the mathematical equivalence between the renormalization group (RG) and restricted Boltzmann machines (RBM), a type of deep learning algorithm. [Mehta & Schwab, 2014] I find it insightful: in each round of calculation, irrelevant information is filtered out by diminishing weights, so each step is somewhat like an RG step. However, this work has two weaknesses: 1) it is restricted to one specific type of deep learning, namely the RBM; 2) it does not provide insight into how to choose the hyperparameters. It offers an insightful explanation, but not a practically useful one.

A few weeks ago, one of my friends introduced me to the work of Tishby and his colleagues. It not only provides insight into why deep learning works, but also sheds light on how to choose hyperparameters. It makes use of the concept of the information bottleneck (IB). The information bottleneck is a technique in information theory that aims at capturing the information in the input variables x that is relevant for predicting the output variable y as accurately as possible. The technique was derived by Tishby and collaborators, [Tishby & Pereira, 1999] and it has been proposed as a way to choose the hyperparameters of deep neural networks (DNN). [Tishby & Zaslavsky, 2015] The idea is to find a mapping, the DNN itself in this context, that captures the information in x most relevant to producing the output y. So instead of coarse-graining information step by step as in RG, the algorithm aims for the most compact form before it is even trained. It is not only insightful, but sounds practical.
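In symbols, the information bottleneck looks for a compressed representation $T$ of the input $X$ that remains predictive of the output $Y$, by trading off two mutual informations:

$\min_{p(t|x)} \; I(X;T) - \beta I(T;Y)$,

where $\beta$ sets how much predictive information about $Y$ is worth keeping relative to the cost of compression.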

But its practicality needs to be tested over time.

Leo Kadanoff passed away on October 26, 2015.

Leo Kadanoff was an American physicist at the University of Chicago. His most prominent contribution is the idea of block spins and coarse-graining in statistical physics. [Kadanoff 1966] His work had an enormous impact on the study of second-order phase transitions and critical phenomena, built on the notions of scaling and universality. His idea was further developed into the renormalization group (RG), [Wilson 1983] which led to Ken Wilson being awarded the Nobel Prize in Physics in 1982.

The concept of RG has also been used to explain how deep learning works, [Mehta, Schwab 2014] which you can read more about in my previous blog entry and in their paper. While only the equivalence between RG and the restricted Boltzmann machine was rigorously shown, it offers a lot of insight into how deep learning works, and I believe it is roughly what happens. Without the concepts that Kadanoff developed, it would have been impossible for Mehta and Schwab to make such a connection between critical phenomena and neural networks.

He also made contributions to computational physics, urban planning, computer science, hydrodynamics, biology, applied mathematics, and geophysics. He was awarded the Wolf Prize in Physics (1980), the Elliott Cresson Medal (1986), the Lars Onsager Prize (1998), the Lorentz Medal (2006), and the Isaac Newton Medal (2011).

His work had a significant impact on statistical physics, including problems of second-order phase transitions, percolation, various condensed matter systems (such as conventional superconductors, superfluids, low-dimensional systems, and helimagnets), quantum phase transitions, self-organized criticality, etc. To learn more about it, I highly recommend Shang-keng Ma’s Modern Theory of Critical Phenomena [Ma 1976] and Mehran Kardar’s Statistical Physics of Fields. [Kardar 2007]

Rest In Peace!

Deep learning, a collection of related neural network algorithms, has proved successful in certain types of machine learning tasks in computer vision, speech recognition, data cleaning, and natural language processing (NLP). [Mikolov et al. 2013] However, it has been unclear why deep learning can be so successful. It looks like a black box with messy inputs and excellent outputs. So why is it so successful?

A friend of mine showed me this preprint (arXiv:1410.3831) [Mehta & Schwab 2014] last year, which mathematically shows the equivalence of deep learning and the renormalization group (RG). RG is a concept in theoretical physics that has been widely applied to different problems, including critical phenomena, self-organized criticality, particle physics, polymer physics, and strongly correlated electronic systems. And now, Mehta and Schwab have shown that an explanation of the performance of deep learning is available through RG.

So what is RG? Before RG, Leo Kadanoff, a physics professor at the University of Chicago, proposed the idea of coarse-graining for studying many-body problems in 1966. [Kadanoff 1966] In 1972, Kenneth Wilson and Michael Fisher succeeded in applying the ɛ-expansion in perturbative RG to explain the critical exponents of the Landau-Ginzburg-Wilson (LGW) Hamiltonian. [Wilson & Fisher 1972] This work has become standard material in graduate physics courses. In 1974, Kenneth Wilson applied RG to solve the Kondo problem; his work on RG earned him the Nobel Prize in Physics in 1982. [Wilson 1983]

RG assumes scale invariance, which means the system looks similar at whatever scale you view it. One example is the chaotic system in Fig. 1: the system looks the same when you zoom in. We call such a scale-invariant system self-similar, and physical systems close to a phase transition are self-similar. If a system is self-similar, Kadanoff’s idea of coarse-graining becomes applicable, as in Fig. 2: four spins can be viewed as one spin that “summarizes” the four spins in that block without changing the description of the physical system. This is somewhat like zooming out of a picture in Photoshop or a web browser.

[Taken from [Singh 2014]]
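To make the block-spin picture concrete, here is a small NumPy sketch (the lattice size and the tie-breaking rule are my own illustrative choices): each 2×2 block of ±1 Ising spins is replaced by a single spin via majority rule, which is one coarse-graining step.

```python
import numpy as np

rng = np.random.default_rng(3)
L = 8
spins = rng.choice([-1, 1], size=(L, L))   # a configuration of Ising spins

def block_spin(config):
    # Group the lattice into 2x2 blocks and sum the four spins in each block.
    blocks = config.reshape(config.shape[0] // 2, 2, config.shape[1] // 2, 2)
    block_sums = blocks.sum(axis=(1, 3))
    # Majority rule: the block spin is the sign of the sum; break ties as +1.
    return np.where(block_sums >= 0, 1, -1)

coarse = block_spin(spins)          # one coarse-graining (RG) step: 8x8 -> 4x4
coarser = block_spin(coarse)        # a second step: 4x4 -> 2x2
print(spins.shape, coarse.shape, coarser.shape)
```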

So what’s the point of zooming out? Physicists care about the Helmholtz free energies of physical systems, which play a role similar to the cost functions of computer scientists and machine learning specialists: both are to be minimized. However, at whatever scale we view the system, its free energy should be scale-invariant. Therefore, as we zoom out, the system “changes” yet “looks the same” due to self-similarity, and the free energy stays the same. The form of the model is unchanged, but its parameters change as the scale changes.

This is important, because this process tells us which parameters are relevant and which are irrelevant. Why? Think of it this way: we have an awesome computer to simulate a glass of water that contains $10^{23}$ water molecules. To describe the system, you have all the parameters, including the positions of the molecules, the strength of the Van der Waals force, the orbital angular momentum of each atom, the strength of the covalent bonds, the velocities of the molecules… You might have $10^{25}$ parameters. However, this awesome computer cannot handle a system with so many parameters. Then you try to coarse-grain the system, discarding some parameters in each step of coarse-graining. After numerous steps, it turns out that the temperature and the pressure are the only relevant parameters.

RG helps you identify the relevant parameters.

And that is exactly what happens in deep learning. In each convolutional layer, features that are not important are gradually discarded, and those that are important are kept and enhanced. Indeed, in computer vision and NLP the data are so noisy that they carry a lot of unnecessary information, and deep learning gradually discards it. As Mehta and Schwab stated, [Mehta & Schwab 2014]

Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.

So what is the point of understanding this? Unlike with other machine learning algorithms, we did not know how deep learning works, which sometimes makes model building very difficult because we have no idea how to adjust the parameters. I believe understanding its equivalence to RG helps guide us to build models that work.

Charles Martin also wrote a blog entry with more discussion of the equivalence of deep learning and RG. [Martin 2015]