Michael Kosterlitz, Duncan Haldane, and David J. Thouless are the laureates of the 2016 Nobel Prize in Physics, “for theoretical discoveries of topological phase transitions and topological phases of matter.” Before Thouless's work, topology was largely unfamiliar to the physics community. Nowadays, however, it is basic knowledge.

I am particularly familiar with the Berezinskii-Kosterlitz-Thouless (BKT) phase transition. What is it? Before it was proposed, phase transitions had been studied within the framework of spontaneous symmetry breaking, using the tools of field theory and the renormalization group. Matter can be either in a disordered state, in which the symmetry is unbroken, or in an ordered state, in which a particular continuous symmetry is broken. Near the critical point, correlation functions are found to become long-ranged, decaying as a power law such as $C(r) \sim \frac{1}{r}$, a behavior so universal that all physical systems described by the same Landau-Ginzburg-Wilson (LGW) model obey it. But in low dimensions such as d=2, as proved by Mermin, Wagner, and Hohenberg, an ordered state that breaks a continuous symmetry is not stable at any nonzero temperature because of its large fluctuations.

The absence of an ordered state does not exclude the possibility of a phase transition. Berezinskii in 1971, and Kosterlitz and Thouless in 1973, proposed a phase transition driven by the proliferation of topological objects. At high temperature the correlation must be short-ranged, i.e., $C(r) \sim e^{-\frac{r}{\xi}}$, but a conventional description using the LGW model in d=2 does not permit that unless vortices are taken into account. Below a certain temperature, however, it is energetically favorable for these vortices to be bound in pairs. The material then exhibits quasi-long-range order, i.e., $C(r) \sim \frac{1}{r^{\alpha}}$. This change in the correlation function is a phase transition induced not by spontaneous symmetry breaking but by the proliferation of isolated topological solitons (free vortices).
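
The energetic argument above can be made concrete with the standard heuristic free-energy estimate for a single free vortex. This is a sketch under assumptions not stated in the text: $J$ is the spin stiffness of a 2D XY-type model, $L$ is the linear system size, $a$ is the vortex core size, and $k_B$ is Boltzmann's constant.

$$
E_{\text{vortex}} \approx \pi J \ln\frac{L}{a}, \qquad
S_{\text{vortex}} \approx k_B \ln\frac{L^2}{a^2} = 2 k_B \ln\frac{L}{a},
$$

$$
F = E_{\text{vortex}} - T S_{\text{vortex}} \approx \left(\pi J - 2 k_B T\right)\ln\frac{L}{a}.
$$

For $T < \pi J / (2 k_B)$ the free energy of an isolated vortex grows with the system size, so vortices stay bound in pairs. Above that temperature, free vortices lower the free energy and proliferate. This is only a rough estimate of the transition temperature, since it ignores how bound pairs renormalize $J$.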

Their work started the study of topology in condensed matter systems. Not long afterwards came the study of topological defects in various condensed matter systems, and then the fractional quantum Hall effect, topological insulators, topological quantum computing, the A phase in liquid crystals and helimagnets, etc. “Topological” is a buzzword in the condensed matter physics community nowadays. Recently, a preprint appeared connecting machine learning and topological states of matter. (See: arXiv:1609.09060.)

In machine learning, deep learning is the buzzword. However, to understand how these things work, we may need a theory, or we may need to construct our own features when a large amount of data is not available. Topological data analysis (TDA) may then become as important in machine learning as topology has become in condensed matter physics.

Deep learning, a collection of related neural network algorithms, has proved successful in certain types of machine learning tasks in computer vision, speech recognition, data cleaning, and natural language processing (NLP). [Mikolov et al. 2013] However, it has been unclear why deep learning is so successful. It looks like a black box with messy inputs and excellent outputs. So why does it work so well?

A friend of mine showed me a preprint (arXiv:1410.3831) [Mehta & Schwab 2014] last year, which shows mathematically that deep learning can be mapped onto the renormalization group (RG). RG is a concept in theoretical physics that has been widely applied to different problems, including critical phenomena, self-organized criticality, particle physics, polymer physics, and strongly correlated electronic systems. Mehta and Schwab have now shown that RG offers an explanation for the performance of deep learning.

So what is RG? Before RG, Leo Kadanoff, a physics professor at the University of Chicago, proposed the idea of coarse-graining for studying many-body problems in 1966. [Kadanoff 1966] In 1972, Kenneth Wilson and Michael Fisher succeeded in applying the ɛ-expansion in perturbative RG to explain the critical exponents of the Landau-Ginzburg-Wilson (LGW) Hamiltonian. [Wilson & Fisher 1972] This work has become standard material in graduate physics courses. In 1974, Kenneth Wilson applied RG to solve the Kondo problem. He was awarded the Nobel Prize in Physics in 1982 for his theory of critical phenomena in connection with phase transitions. [Wilson 1983]

RG assumes scale invariance, which means the system looks similar at whatever scale you view it. One example is the chaotic system in Fig. 1: it looks the same when you zoom in. We call such a scale-invariant system self-similar. Physical systems close to a phase transition are self-similar, and if a system is self-similar, Kadanoff’s idea of coarse-graining is applicable, as in Fig. 2. Four spins can be viewed as one spin that “summarizes” the four spins in that block without changing the description of the physical system. This is somewhat like “zooming out” of a picture in Photoshop or a web browser.

[Taken from [Singh 2014]]
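
To make the block-spin picture concrete, here is a minimal sketch of one common coarse-graining choice, the majority rule, applied to a random lattice of ±1 spins. The function name `block_spin`, the 8×8 lattice, and the tie-breaking convention are assumptions for illustration only. Kadanoff's argument does not depend on this particular rule.

```python
import random


def block_spin(lattice, b=2):
    """Coarse-grain a square lattice of +1/-1 spins using b x b blocks.

    Each block is replaced by a single spin chosen by majority rule.
    Ties are broken by the top-left spin of the block (an arbitrary
    convention chosen for this sketch).
    """
    n = len(lattice)
    if n % b != 0:
        raise ValueError("lattice size must be divisible by the block size")
    coarse = []
    for i in range(0, n, b):
        row = []
        for j in range(0, n, b):
            total = sum(lattice[i + di][j + dj]
                        for di in range(b) for dj in range(b))
            if total > 0:
                row.append(1)
            elif total < 0:
                row.append(-1)
            else:
                row.append(lattice[i][j])  # tie-break
        coarse.append(row)
    return coarse


random.seed(0)
lattice = [[random.choice((-1, 1)) for _ in range(8)] for _ in range(8)]
coarse = block_spin(lattice)    # 8x8 -> 4x4: each spin summarizes four
coarser = block_spin(coarse)    # 4x4 -> 2x2: zooming out once more
```

Each call halves the linear size of the lattice, which is the "zoom out" step. Applying it repeatedly gives the sequence of coarser and coarser descriptions that RG studies.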

So what’s the point of zooming out? Physicists care about the Helmholtz free energies of physical systems, which play a role similar to that of cost functions for computer scientists and machine learning specialists. Both are to be minimized. Whatever scale we view the system at, however, its energy should be scale-invariant. Therefore, as we zoom out, the system “changes” yet “looks the same” due to self-similarity, while the energy stays the same. The form of the model is unchanged, but the parameters change as the scale changes.
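
A textbook case where the model keeps its form while its parameter flows is the one-dimensional Ising chain in zero field, which is not discussed in this post and is used here only as an illustration. Summing over every other spin gives a chain of the same form with a new dimensionless coupling $K' = \frac{1}{2}\ln\cosh(2K)$, where $K = J/(k_B T)$. The sketch below iterates that recursion. The names `decimate` and `flow` and the starting value are hypothetical.

```python
import math


def decimate(K):
    """One RG step for the zero-field 1D Ising chain.

    Summing over every other spin leaves a nearest-neighbour Ising
    chain with the renormalized coupling K' = 0.5 * ln(cosh(2K)).
    """
    return 0.5 * math.log(math.cosh(2.0 * K))


def flow(K0, steps):
    """Return the list of couplings obtained by repeated decimation."""
    couplings = [K0]
    for _ in range(steps):
        couplings.append(decimate(couplings[-1]))
    return couplings


couplings = flow(1.0, 6)
for step, K in enumerate(couplings):
    print(f"step {step}: K = {K:.6f}")
```

The coupling shrinks at every step and flows toward $K = 0$, the disordered fixed point, which is how RG expresses the fact that the 1D chain has no ordered phase at nonzero temperature.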

This is important, because this process tells us which parameters are relevant and which are irrelevant. Why? Think of it this way: we have an awesome computer to simulate a glass of water that contains $10^{23}$ water molecules. To describe the system, you have all the parameters, including the positions of the molecules, the strength of the van der Waals force, the orbital angular momentum of each atom, the strength of the covalent bonds, the velocities of the molecules… You might have $10^{25}$ parameters. However, even this awesome computer cannot handle a system with so many parameters. So you try to coarse-grain the system, discarding some parameters in each step of coarse-graining. After numerous steps, it turns out that the temperature and the pressure are the only relevant parameters.

RG helps you identify the relevant parameters.

And this is exactly what happens in deep learning. In each convolutional cycle, features that are not important are gradually discarded, and those that are important are kept and enhanced. Indeed, in computer vision and NLP, the data are so noisy that they contain a lot of unnecessary information. Deep learning gradually discards this information. As Mehta and Schwab stated: [Mehta & Schwab 2014]

Our results suggests that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.
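
As a loose illustration of discarding unimportant detail, the sketch below applies max pooling, a step commonly placed after a convolution, to two signals that differ only in small fluctuations. This assumes that "convolutional cycle" refers to convolution followed by pooling. It is not the restricted Boltzmann machine construction that Mehta and Schwab analyze, and the function `max_pool_1d` and the sample values are made up for the example.

```python
def max_pool_1d(values, width=2):
    """Keep only the largest value in each non-overlapping window."""
    return [max(values[i:i + width])
            for i in range(0, len(values) - width + 1, width)]


# Two inputs with the same large-scale structure but different small details.
signal_a = [0, 9, 1, 8, 0, 7, 2, 9]
signal_b = [1, 9, 0, 8, 2, 7, 1, 9]

pooled_a = max_pool_1d(signal_a)
pooled_b = max_pool_1d(signal_b)
print(pooled_a, pooled_b)
```

Both inputs are mapped to the same shorter output: the small differences are thrown away and the dominant values survive, much as irrelevant parameters drop out under coarse-graining.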

So what is the point of understanding this? Unlike other machine learning algorithms, deep learning has been poorly understood, which sometimes makes model building very difficult because we have no idea how to adjust the parameters. I believe that understanding its equivalence to RG will help guide us in building models that work.

Charles Martin also wrote a blog entry with a more detailed demonstration of the equivalence of deep learning and RG. [Martin 2015]