Ever since Mehta and Schwab laid out the mathematical relationship between restricted Boltzmann machines (RBM) and deep learning (see my previous entry), scientists have been discussing why deep learning works so well. Recently, Henry Lin and Max Tegmark put a preprint on arXiv (arXiv:1608.08225), arguing that deep learning works because it captures a few essential physical laws and properties. Tegmark is a cosmologist.

Physical laws are simple in the sense that a few properties, such as locality, symmetry, and hierarchy, lead to large-scale, universal, and often complex phenomena. Many machine learning algorithms, including deep learning algorithms, are closely related to the formalisms of statistical mechanics.

Many machine learning algorithms are basically probability theory. Lin and Tegmark outlined a few types of algorithms, each of which seeks a different type of probability, and related these probabilities to Hamiltonians in many-body systems.

They explained why neural networks can approximate functions (polynomials) so well, giving a simple neural network that performs multiplication. With the central limit theorem or Jaynes’ arguments (see my previous entry), they said, a lot of multiplications can be approximated by a low-order polynomial Hamiltonian. This is like the many many-body systems that can be approximated by a fourth-order Landau-Ginzburg-Wilson (LGW) functional.
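
To make the multiplication claim concrete, here is a minimal sketch of a four-neuron multiplication gate. It follows from Taylor-expanding a smooth nonlinearity around zero. I chose softplus as the nonlinearity, and the function names and the `scale` parameter are my own, for illustration only; this is not necessarily the exact construction in the preprint.

```python
import math

def softplus(x):
    """Smooth nonlinearity log(1 + exp(x)); its second derivative at 0 is 1/4."""
    return math.log1p(math.exp(x))

def approx_multiply(u, v, scale=0.01):
    """Approximate u*v with four softplus 'neurons'.

    Inputs are shrunk by `scale` (a hypothetical knob) so that the Taylor
    expansion around 0 is accurate; the output is rescaled afterwards.
    """
    a, b = scale * u, scale * v
    # Odd-order terms cancel, and the quadratic terms leave 4 * sigma2 * a * b.
    hidden = softplus(a + b) + softplus(-a - b) - softplus(a - b) - softplus(-a + b)
    sigma2 = 0.25  # second derivative of softplus at 0
    return hidden / (4.0 * sigma2 * scale ** 2)

print(approx_multiply(3.0, 4.0))   # close to 12
print(approx_multiply(-2.0, 5.0))  # close to -10
```

The leading error grows with the size of the scaled inputs, so a smaller `scale` gives a better approximation until floating-point cancellation takes over.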

Properties such as locality reduce the number of parameters needed, because interactions are restricted to close neighbors. Symmetry reduces the number further, along with the computational complexity. Symmetry and second-order phase transitions make the scaling hypothesis possible, which leads to tools such as the renormalization group (RG). As many people have argued, deep learning resembles RG because it filters out unnecessary information and maps out the crucial features. Lin and Tegmark use classifying cats vs. dogs as an example, and compare it to retrieving the temperature of a many-body system with an RG procedure. They gave a counter-example to Mehta and Schwab’s paper, showing that the probabilities cannot be preserved by the RG procedure. The counter-example is sound, but preserving probabilities is not the point of the RG procedure anyway.
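
To illustrate what "filtering out unnecessary information" means in an RG step, here is a small sketch of block-spin coarse-graining with a majority rule. The function name, the block size, and the toy lattice are my own choices for illustration; they are not taken from either paper.

```python
def block_spin(lattice, b=3):
    """Coarse-grain a square lattice of +1/-1 spins with a majority rule.

    Each b-by-b block is replaced by the sign of its total spin.
    b must be odd, so ties cannot occur.
    """
    n = len(lattice)
    if b % 2 == 0 or n % b != 0:
        raise ValueError("b must be odd and divide the lattice size")
    coarse = []
    for i in range(0, n, b):
        row = []
        for j in range(0, n, b):
            total = sum(lattice[i + di][j + dj] for di in range(b) for dj in range(b))
            row.append(1 if total > 0 else -1)
        coarse.append(row)
    return coarse

# A 6x6 lattice: all spins "up" except a few flipped ones (noise).
lattice = [[1] * 6 for _ in range(6)]
lattice[0][0] = lattice[2][4] = lattice[5][1] = -1
print(block_spin(lattice))  # [[1, 1], [1, 1]]
```

The isolated flipped spins disappear after one step, while the large-scale feature (the lattice is mostly "up") survives.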

They also discussed the no-flattening theorems for neural networks.

Leo Kadanoff passed away on October 26, 2015.

Leo Kadanoff was an American physicist at the University of Chicago. His most prominent work is the idea of block spins and coarse-graining in statistical physics. [Kadanoff 1966] His work has had an enormous impact on the study of second-order phase transitions and critical phenomena, based on the ideas of scaling and universality. His idea was further developed into the renormalization group (RG), [Wilson 1983] for which Ken Wilson was awarded the Nobel Prize in Physics in 1982.

The concept of RG has also been used to explain how deep learning works, [Mehta, Schwab 2014] which you can read more about in my previous blog entry and in their paper. While only the equivalence between RG and the restricted Boltzmann machine was rigorously proved, it sheds a lot of light on how deep learning works, and I believe it is roughly what happens. Without the concepts that Kadanoff developed, it would have been impossible for Mehta and Schwab to make such a connection between critical phenomena and neural networks.

He also contributed to computational physics, urban planning, computer science, hydrodynamics, biology, applied mathematics, and geophysics. He was awarded the Wolf Prize in Physics (1980), the Elliott Cresson Medal (1986), the Lars Onsager Prize (1998), the Lorentz Medal (2006), and the Isaac Newton Medal (2011).

His work has had a significant impact on statistical physics, including problems of second-order phase transitions, percolation, various condensed matter systems (such as conventional superconductors, superfluids, low-dimensional systems, and helimagnets), quantum phase transitions, and self-organized criticality. To learn more about it, I highly recommend Shang-keng Ma’s Modern Theory of Critical Phenomena [Ma 1976] and Mehran Kardar’s Statistical Physics of Fields. [Kardar 2007]

Rest In Peace!

There is no doubt that everyone in the so-called big data industry must know some statistics. However, statistics means different things to different people.

Statistics is an old field that was developed in the 18th century. In those times, people had to draw conclusions about vast amounts of data that were virtually unavailable, or very costly to obtain. For example, to know the average salary of the whole population, the census staff would have had to survey everyone in the population, which was expensive in the old days. Therefore, sampling techniques were devised, so that the wanted quantities could be estimated from a sample using an appropriate statistic.
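
Here is a minimal sketch of that idea: estimating the mean salary of a population from a small sample, together with its standard error. The population is synthetic (log-normally distributed numbers that I made up), so only the procedure matters, not the figures.

```python
import random
import statistics

random.seed(0)

# Hypothetical population: 100,000 log-normally distributed "salaries".
population = [random.lognormvariate(10.5, 0.6) for _ in range(100_000)]
true_mean = statistics.fmean(population)  # what a full census would give

# Survey only 1,000 people, drawn without replacement.
sample = random.sample(population, 1_000)
estimate = statistics.fmean(sample)
std_error = statistics.stdev(sample) / len(sample) ** 0.5

print(f"census mean: {true_mean:.0f}")
print(f"sample mean: {estimate:.0f} +/- {std_error:.0f}")
```

Surveying 1% of the population already pins down the mean to within a few standard errors, which is the whole point of sampling.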

Or consider a scientific experiment in which even one data point cost a few million dollars. Such experiments had to be designed so that the scientists could extract the wanted information from only a few data points.

Or in testing some hypotheses, one needs to know only how to accept or reject a hypothesis using the statistical information available.

Hence, traditional statistics is a body of knowledge that infers information about a whole population from a limited amount of data in a sample.

Theoretical Statistical Physics

There is a branch of physics called statistical physics, which originated in the 19th century. It became especially useful after Albert Einstein published his paper on Brownian motion in 1905. Nowadays the methods of statistical physics are applied not only in solid state physics or condensed matter physics, but also in biophysics (e.g., diffusion), econophysics (e.g., fairness and wealth distribution, see this previous blog post), and quantitative finance (e.g., the binomial model, and its relation with the Black-Scholes equation).
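
As an illustration of the last example, here is a sketch that prices a European call option on a binomial tree and compares it with the closed-form Black-Scholes price. I use the Cox-Ross-Rubinstein parametrization of the tree, which is one common choice, and the option parameters are made up for the demonstration.

```python
import math

def black_scholes_call(s0, k, r, sigma, t):
    """Closed-form Black-Scholes price of a European call."""
    d1 = (math.log(s0 / k) + (r + 0.5 * sigma ** 2) * t) / (sigma * math.sqrt(t))
    d2 = d1 - sigma * math.sqrt(t)
    cdf = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))  # standard normal CDF
    return s0 * cdf(d1) - k * math.exp(-r * t) * cdf(d2)

def binomial_call(s0, k, r, sigma, t, steps):
    """European call on a Cox-Ross-Rubinstein binomial tree."""
    dt = t / steps
    up = math.exp(sigma * math.sqrt(dt))
    down = 1.0 / up
    q = (math.exp(r * dt) - down) / (up - down)  # risk-neutral up probability
    discount = math.exp(-r * dt)
    # Payoffs at maturity, indexed by the number of up moves j.
    values = [max(s0 * up ** j * down ** (steps - j) - k, 0.0) for j in range(steps + 1)]
    # Roll back through the tree one time step at a time.
    for _ in range(steps):
        values = [discount * (q * values[j + 1] + (1.0 - q) * values[j])
                  for j in range(len(values) - 1)]
    return values[0]

params = dict(s0=100.0, k=100.0, r=0.05, sigma=0.2, t=1.0)
print("Black-Scholes:", black_scholes_call(**params))
for steps in (10, 100, 500):
    print(steps, "steps:", binomial_call(steps=steps, **params))
```

As the number of steps grows, the discrete random walk on the tree approaches Brownian motion, and the tree price approaches the Black-Scholes value.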

The techniques involved in statistical physics include probability theory and stochastic calculus (such as Ito calculus). Of course, it is also how entropy, a concept from thermodynamics, entered probability theory and information theory. The extracted quantities are mostly expectation values and correlations, which are of interest to theorists.

This is very different from traditional statistics. When people learn that I am a statistical physicist, they expect me to be familiar with the t-test, which is not really the case. (Very often I have to look it up every time I use it.)

Statistics in the Computing World

Unlike in traditional statistics or statistical physics, nowadays we often get the statistical information directly from a vast amount of available data, thanks to the advance of technology and the falling cost of access to it. You can easily calculate the average salary of a population with a single line of R or Python. Hence, statistics is no longer only about extracting information from a limited amount of data, but also from a vast amount of data.

On the other hand, mathematical modeling is still important, but in a different sense. Models in statistical physics describe the world, but in information retrieval, models are built according to what we need.

P.S.: Philipp Janert wrote something similar in Chapter 10 (“What You Really Need to Know About Classical Statistics”) of his “Data Analysis with Open Source Tools”:

The basic statistical methods that we know today were developed in the late 19th and early 20th centuries, mostly in Great Britain, by a very small group of people. Of those, one worked for the Guinness brewing company and another—the most influential one of them—worked at an agricultural research lab (trying to increase crop yields and the like). This bit of historical context tells us something about their working conditions and primary challenges.

No computational capabilities: All computations had to be performed with paper and pencil.

No graphing capabilities, either: All graphs had to be generated with pencil, paper, and a ruler. (And complicated graphs—such as those requiring prior transformations or calculations using the data—were especially cumbersome.)

Very small and very expensive data sets: Data sets were small (often not more than four to five points) and could be obtained only with great difficulty. (When it always takes a full growing season to generate a new data set, you try very hard to make do with the data you already have!)

In other words, their situation was almost entirely the opposite of our situation today:

• Computational power that is essentially free (within reason)
• Interactive graphing and visualization capabilities on every desktop
• Often huge amounts of data

It should therefore come as no surprise that the methods developed by those early researchers seem so out of place to us: they spent a great amount of effort and ingenuity solving problems we simply no longer have! This realization goes a long way toward explaining why classical statistics is the way it is and why it often seems so strange to us today.

P.S.: The graph at the beginning of this blog entry was plotted in Mathematica, by running the following:

Plot[Evaluate@Table[PDF[MaxwellDistribution[σ], x], {σ, {1, 2, 3}}], {x, 0, 10}, Filling -> Axis]

Entropy is one of the most fascinating ideas in the history of mathematical sciences.

In Phenomenological Thermodynamics…

Entropy was introduced into thermodynamics in the 19th century. Like the free energies, it describes the state of a thermodynamic system. At the beginning, entropy was merely phenomenological. Physicists found that it let them state the second law of thermodynamics with clarity and simplicity, instead of describing it in terms of convoluted heat flows (which is what it was originally about) among macroscopic systems (say, the heat flow from a hot pot of water to the air of the room). It did not carry any statistical meaning at all until the 1870s.

In Statistical Physics…

Ludwig Boltzmann (1844-1906)

The statistical meaning of entropy was developed by Ludwig Boltzmann, a pioneer of statistical physics, who studied the connection between the macroscopic thermodynamic behavior of a system and its microscopic components. For example, he described the temperature as a measure of the average fluctuating kinetic energy of the particles. And he formulated the entropy to be

$S = - k_B \sum_i p_i \log p_i$,

where i labels each microstate, $p_i$ is the probability of microstate i, and $k_B$ is Boltzmann’s constant. In an isolated system, the total entropy never decreases.

Information Theory and Statistical Physics United

In statistical physics, Boltzmann’s assumption of equal a priori probabilities in equilibrium is an important assumption. However, in 1957, E. T. Jaynes published a paper in Physical Review relating information theory and statistical physics, indicating that the principle of maximum entropy alone is sufficient to describe an equilibrium statistical system. [Jaynes 1957] In statistical physics, we know that such systems can be described by the canonical ensemble, which is a softmax function (normalized exponential), i.e., $p_i \propto \exp(-\beta E_i)$. This can be easily derived from the principle of maximum entropy and the conservation of energy. Mathematically, the probabilities for all states i with energies $E_i$ can be obtained by maximizing the entropy

$S = -\sum_i p_i \log p_i$,

under the constraints

$\sum_i p_i = 1$, and
$\sum_i p_i E_i = E$,

where E is a constant. The softmax distribution is the solution of this simple constrained optimization problem, which can be solved with Lagrange multipliers.
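
For readers who want to see the steps, here is a short sketch of the derivation. The multipliers $\alpha$ and $\beta$ and the normalization $Z$ are the usual textbook symbols, introduced here for illustration.

```latex
% Lagrangian: entropy plus the two constraints
\mathcal{L} = -\sum_i p_i \log p_i
              - \alpha \Big( \sum_i p_i - 1 \Big)
              - \beta \Big( \sum_i p_i E_i - E \Big)

% Stationarity with respect to each p_i
\frac{\partial \mathcal{L}}{\partial p_i}
  = -\log p_i - 1 - \alpha - \beta E_i = 0
\quad \Longrightarrow \quad
p_i = e^{-1-\alpha} \, e^{-\beta E_i}

% The normalization constraint fixes the prefactor
p_i = \frac{e^{-\beta E_i}}{Z},
\qquad
Z = \sum_j e^{-\beta E_j}
```

The remaining multiplier $\beta$ is fixed by the energy constraint $\sum_i p_i E_i = E$, and in statistical physics it is identified with the inverse temperature.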

The principle of maximum entropy can be found in statistics too. For example, the form of Gaussian distribution can be obtained by maximizing the entropy

$S = - \int dx \cdot p(x) \log p(x)$,

with the knowledge of the mean $\mu$ and the variance $\sigma^2$, or mathematically speaking, under the constraints,

$\int dx \cdot p(x) = 1$,
$\int dx \cdot x p(x) = \mu$, and
$\int dx \cdot (x-\mu)^2 p(x) = \sigma^2$.
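
As a quick sanity check (not a proof), the sketch below compares the closed-form differential entropies of three distributions with the same variance. The choice of the Laplace and uniform distributions as competitors is mine.

```python
import math

sigma = 1.0  # common standard deviation (arbitrary choice)

# Closed-form differential entropies (in nats) at fixed variance sigma^2.
entropy = {
    "gaussian": 0.5 * math.log(2 * math.pi * math.e * sigma ** 2),
    "laplace": 1.0 + math.log(2 * sigma / math.sqrt(2)),  # scale b = sigma / sqrt(2)
    "uniform": math.log(sigma * math.sqrt(12)),           # width = sigma * sqrt(12)
}
for name, value in entropy.items():
    print(f"{name:8s} {value:.4f}")
```

The Gaussian comes out on top, as the maximum entropy argument says it must for a fixed mean and variance.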

In any statistical system, the probability distribution can be computed with the principle of maximum entropy. As Jaynes put it: [Jaynes 1957]

It is the least biased estimate possible on the given information; i.e., it is maximally noncommittal with regard to missing information.

In statistical physics, entropy is roughly a measure of how “chaotic” a system is. In information theory, entropy measures how surprising the information is on average: an outcome with a small probability is more surprising, and the entropy is the expected surprise over all outcomes. The principle of maximum entropy assumes no additional information. Without constraints other than the normalization, it gives the distribution in which all $p_i$’s are equal, which is the state of maximal uncertainty. Lê Nguyên Hoang, a scientist at Massachusetts Institute of Technology, wrote a good blog post about the meaning of entropy in information theory. [Hoang 2013] In information theory, the entropy is given by

$S = -\sum_i p_i \log_2 p_i$,

which differs from the thermodynamic entropy only by the constant $k_B$ and a factor of $\log 2$ coming from the base of the logarithm. In this sense, the entropies in information theory and statistical physics are equivalent.
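
A few lines of Python make the formula and the change of base concrete. The two example distributions are made up.

```python
import math

def entropy(probs, base=math.e):
    """Entropy -sum p log p; terms with p == 0 contribute nothing."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

fair = [0.25, 0.25, 0.25, 0.25]   # all outcomes equally likely
skewed = [0.7, 0.1, 0.1, 0.1]     # one outcome dominates

print(entropy(fair, base=2))         # 2 bits, the maximum for four outcomes
print(entropy(skewed, base=2))       # fewer than 2 bits
print(entropy(fair) / math.log(2))   # nats converted to bits
```

The uniform distribution has the largest entropy, and dividing the entropy in nats by $\log 2$ gives the entropy in bits.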

Entropy in Natural Language Processing (NLP)

The principle of maximum entropy assumes nothing other than the given information when computing the optimal probability distribution, which makes it a desirable basis for a machine learning algorithm. It can be regarded as a supervised learning algorithm, with the features being $\{p, c\}$, where p is the property calculated, and c is the class. The probability for $\{p, c\}$ is proportional to $\exp(- \alpha \text{\#}(\{p, c\}))$, where $\alpha$ is the coefficient to be found during training. There are some technicalities in computing all these coefficients, which essentially involves solving a system of algebraic equations numerically using techniques such as generalized iterative scaling (GIS).
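
Here is a toy sketch of such a classifier in plain Python. To keep it short, I train it with plain gradient ascent on the log-likelihood instead of GIS, and the weights `w` play the role of $-\alpha$. The function names and the four-document training set are hypothetical, for illustration only.

```python
import math
from collections import defaultdict

def predict(weights, feats, classes):
    """Softmax over classes of the summed weights of the active features."""
    scores = {c: sum(weights[(f, c)] for f in feats) for c in classes}
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}

def train_maxent(data, classes, epochs=200, rate=0.5):
    """Fit weights for (feature, class) pairs by gradient ascent (not GIS)."""
    weights = defaultdict(float)
    for _ in range(epochs):
        grad = defaultdict(float)
        for feats, gold in data:
            probs = predict(weights, feats, classes)
            for f in feats:
                grad[(f, gold)] += 1.0        # empirical feature count
                for c in classes:
                    grad[(f, c)] -= probs[c]  # expected count under the model
        for key, g in grad.items():
            weights[key] += rate * g / len(data)
    return weights

# Hypothetical toy data: bags of words with a sentiment label.
data = [
    ({"great", "film"}, "pos"),
    ({"awful", "film"}, "neg"),
    ({"great", "plot"}, "pos"),
    ({"awful", "plot"}, "neg"),
]
classes = ["pos", "neg"]
w = train_maxent(data, classes)
print(predict(w, {"great", "film"}, classes))
```

The gradient is the difference between the empirical and the expected feature counts, so training stops changing the weights exactly when the model reproduces the feature statistics of the training data. That is the maximum entropy constraint.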

Does it really assume no additional information? No. The way you construct the features is how you add information. But once the features are defined, the calculation depends on the training data only.

Classifiers based on maximum entropy have found applications in part-of-speech (POS) tagging, machine translation (MT), speech recognition, and text mining. A good review was written by Berger and the Della Pietras. [Berger, Della Pietra, Della Pietra 1996] A lot of open-source software packages provide maximum entropy classifiers, such as Python NLTK and Apache OpenNLP.

In Quantum Computation…

One last word: entropy is also used to describe quantum entanglement. A composite bipartite quantum system is said to be entangled if its subsystems must be described by mixed states, i.e., the description must be statistical if only one of the subsystems is considered. With $p_i$ being the eigenvalues of the reduced density matrix of that subsystem, the entanglement entropy is given by [Nielsen, Chuang 2011]

$S = -\sum_i p_i \log p_i$,

which is essentially the same formula. The more entangled the system is, the larger the entanglement entropy. However, composite quantum systems tend to decrease their entropy over time.
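
For two qubits the entanglement entropy can be computed by hand, which makes for a compact sketch. The function name and the amplitude convention are my own; the only physics used is that the reduced density matrix of one qubit is a 2x2 matrix with unit trace.

```python
import math

def entanglement_entropy(a, b, c, d):
    """Entropy of one qubit for the pure state a|00> + b|01> + c|10> + d|11>.

    The reduced density matrix is 2x2 with unit trace, so its eigenvalues are
    (1 +/- sqrt(1 - 4 det)) / 2, with det = |a*d - b*c|^2 after normalization.
    """
    norm = abs(a) ** 2 + abs(b) ** 2 + abs(c) ** 2 + abs(d) ** 2
    det = abs(a * d - b * c) ** 2 / norm ** 2
    root = math.sqrt(max(0.0, 1.0 - 4.0 * det))
    eigenvalues = [(1.0 + root) / 2.0, (1.0 - root) / 2.0]
    return sum(-p * math.log(p) for p in eigenvalues if p > 0)

s = 1 / math.sqrt(2)
print(entanglement_entropy(s, 0, 0, s))  # Bell state: about log 2 = 0.693
print(entanglement_entropy(1, 0, 0, 0))  # product state |00>: zero
```

A product state has zero entanglement entropy, and the Bell state reaches the maximum possible value for a qubit, $\log 2$.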

Wall Street is not only a place that facilitates the flow of money, but also a playground for scientists.

When I was young, I saw one of my uncles plotting stock prices to perform technical analysis. When I was in college, my friends often talked about investing in financial futures and options. When I was doing my graduate degree in physics, we studied John Hull’s famous textbook [Hull 2011] on quantitative finance to learn about financial modeling. A few of my classmates went to Wall Street to become quantitative analysts or financial software developers. There are ups and downs in the financial markets, but as long as we live in a capitalist society, finance is a subject we can never ignore. However, scientists have not reached a consensus about the nature of a financial market.

Agent-Based Models

Economists believe that individuals in a market are rational beings who always aim to maximize their profits. They often apply agent-based models, which employ complex system theories or game theory.

Random Processes and Statistical Physics

However, a lot of mathematicians on Wall Street (including quantitative analysts and econophysicists) see stock prices as undergoing Brownian motion. [Hull 2011, Baaquie 2007] They employ tools from statistical physics and stochastic processes to study the pricing of various financial derivatives. Therefore, the random-process and econophysical approaches have little to say about stock price prediction (despite the fact that they do need a “return rate” in their models). Random processes are unpredictable.
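
To show what this looks like in practice, here is a minimal sketch that simulates geometric Brownian motion, the standard textbook model for a stock price. The drift, volatility, and starting price are made-up numbers.

```python
import math
import random

def simulate_gbm(s0, mu, sigma, t, steps, rng):
    """One geometric-Brownian-motion path, sampled at `steps` equal intervals."""
    dt = t / steps
    path = [s0]
    for _ in range(steps):
        shock = rng.gauss(0.0, 1.0)
        # Exact log-normal update for dS = mu*S*dt + sigma*S*dW.
        growth = (mu - 0.5 * sigma ** 2) * dt + sigma * math.sqrt(dt) * shock
        path.append(path[-1] * math.exp(growth))
    return path

rng = random.Random(42)
finals = [simulate_gbm(100.0, 0.05, 0.2, 1.0, 20, rng)[-1] for _ in range(20_000)]
mean_final = sum(finals) / len(finals)
print(mean_final, 100.0 * math.exp(0.05))  # simulated vs. expected mean
```

Each individual path is unpredictable, yet the average over many paths follows the return rate `mu`. This is why such models are good for pricing but not for forecasting a single stock.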

However, some sorts of prediction are of great value. For example, when there is overhype or a bubble in the market, we want to know when it will burst. There are models that predict defaults and bubble bursts in a market using the log-periodic power law (LPPL). [Wosnitza, Denz 2013] In addition, there has been research showing the leverage effect in the stock markets of developed countries such as Germany (cf. the fluctuation-dissipation theorem in statistical physics), and the anti-leverage effect in China (Shanghai and Shenzhen). [Qiu, Zhen, Ren, Trimper 2006]

Reconciling Intelligence and Randomness

There is some value in both views. It is hard to believe that stock prices are completely random, as the economic environment and public opinion must affect them. People can be neither completely rational nor completely random.

There have been some studies reconciling game theory and random processes, in an attempt to bring economists and mathematicians together. In this theoretical framework, financial systems still seek to attain the maximum entropy (randomness), but the “particles” in the system behave intelligently. [Venkatasubramanian, Luo, Sethuraman 2015] (See my other blog entry: MathAnalytics (1) – Beautiful Mind, Physical Nature and Economic Inequality) We are not sure how successful this attempt will be at this point.

Sentiment Analysis

As people have been talking about big data in recent years, there have been attempts to apply machine learning algorithms in finance. However, scientists tend not to use machine learning algorithms for pricing, because these algorithms mostly perform classification. Still, there are attempts, using natural language processing (NLP) techniques, to predict stock prices by detecting public emotions (or sentiments) in social media such as Twitter. [Bollen, Mao, Zeng 2010] It was reported that measuring the public mood in a few dimensions (including Calm, Alert, Sure, Vital, Kind, and Happy) allows scientists to predict the trend of the Dow Jones Industrial Average (DJIA) with good accuracy. However, some hackers take advantage of sentiment analysis on Twitter. In 2013, there was a rumor on Twitter saying that the White House had been bombed. The computers responded instantly and automatically by trading, causing the stock market to fall immediately. The market recovered quickly after it was discovered that the news was fake. (Fig. 1)

Fig. 1: DJIA fell because of a rumor of the White House being bombed, but recovered when the news was discovered to be fake (taken from http://www.rt.com/news/syrian-electronic-army-ap-twitter-349/)

P.S.: While I was writing this, I saw an interesting statement in the paper about leverage effect. [Qiu, Zhen, Ren, Trimper 2006] The authors said that:

Why do the German and Chinese markets exhibit different return-volatility correlations? Germany is a developed country. To some extent, people show risk aversion, and therefore, may be nervous in trading as the stock price is falling. This induces a higher volatility. When the price is rising, people feel safe and are inactive in trading. Thus, the stock price tends to be stable. This should be the social origin of the leverage effect. However, China just experiences the first stage of capitalism, and people are somewhat excessive speculative in the financial markets. Therefore, people rush for trading as the stock price increases. When the price drops, people stay inactive in trading and wait for rising up of the stock price. That explains the antileverage effect.

Does this paragraph, written in 2006, give a hint of what happened in China in 2015? (Fig. 2)

Fig. 2: The fall of Chinese stock market in 2015 (taken from http://www.economicpolicyjournal.com/2015/06/breaking-biggest-chinese-stock-market.html)

Taken from the movie “A Beautiful Mind”

John Nash’s death on May 23, 2015 on the New Jersey Turnpike was a tragedy. However, his contribution to mathematics and economics is everlasting. His contribution to game theory led to his sharing the 1994 Nobel Memorial Prize in Economic Sciences.

Coincidentally, three weeks before his accidental death, an econophysics paper was published that employed his idea of Nash equilibrium. Econophysics has been an interdisciplinary quantitative field since the 1990s. Victor Yakovenko, a physics professor at the University of Maryland, applied the techniques of classical statistical mechanics, and concluded that the wealth of the bottom 95% of the population follows a Boltzmann-Gibbs exponential distribution, while that of the top 5% follows a Pareto distribution. [Dragulescu & Yakovenko 2000] This approach assumes agents to have nearly “zero intelligence,” behaving randomly with no intent or purpose, contrary to the conventional assumption in economics that agents are perfectly rational, with the purpose of maximizing utility or profit.
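
The “zero intelligence” picture is easy to simulate. Below is a sketch of a random pairwise exchange model in which two randomly chosen agents pool their money and split it at random, so that the total money is conserved like energy. The specific exchange rule and all the numbers are my assumptions for illustration, not necessarily the exact model in the paper.

```python
import math
import random

def exchange_money(n_agents=2000, n_trades=200_000, start=100.0, seed=1):
    """Random pairwise exchanges that conserve the total amount of money."""
    rng = random.Random(seed)
    money = [start] * n_agents
    for _ in range(n_trades):
        i = rng.randrange(n_agents)
        j = rng.randrange(n_agents)
        if i == j:
            continue  # an agent cannot trade with itself
        pot = money[i] + money[j]
        share = rng.random()  # random split of the pooled money
        money[i], money[j] = share * pot, (1.0 - share) * pot
    return money

money = exchange_money()
mean = sum(money) / len(money)
below = sum(m < mean for m in money) / len(money)
print(f"fraction below the mean: {below:.3f}")
print(f"exponential prediction:  {1 - math.exp(-1):.3f}")
```

Everybody starts with the same amount, yet most agents end up below the mean. For an exponential (Boltzmann-Gibbs) distribution, the fraction below the mean is $1 - e^{-1}$, and the simulation lands close to that value.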

This paper, written by Venkat Venkatasubramanian and co-authors, describes an approach that aims at reconciling econophysics and conventional economics, using ideas from game theory. [Venkatasubramanian, Luo & Sethuraman 2015] Like statistical mechanics, it treats the agents as particles. Money plays the role of energy, just as in other econophysics theories. The equilibrium state is the state with maximum entropy. However, it employs the ideas of game theory, adding that the agents are intelligent and are playing a game, unlike molecules in traditional statistical mechanics. The equilibrium state is not simply the state of maximum entropy, but also the Nash equilibrium. This reconciles econophysics and conventional economics. It even further argues that, unlike equilibrium in thermodynamics, which is probabilistic in nature, this economic equilibrium is deterministic. The expected distribution is the log-normal distribution. (This log-normal distribution is hard to fit, which is another obstacle for economists in accepting physical approaches to economics.)

With this framework, Venkatasubramanian discussed income inequality. Income inequality has aroused debates in the past few years, especially after the detrimental financial crisis in 2008. Is capitalism not working now? Does capitalism produce unfairness? He connected entropy with the concept of fairness, or fairest inequality: the state with maximum entropy is the fairest state, and, of course, its wealth distribution is the log-normal distribution. His study showed that: [http://phys.org/news/2015-05-fair-theory-income-inequality.html]

“Scandinavian countries and, to a lesser extent, Switzerland, Netherlands, and Australia have managed, in practice, to get close to the ideal distribution for the bottom 99% of the population, while the U.S. and U.K. remain less fair at the other extreme. Other European countries such as France and Germany, and Japan and Canada, are in the middle.”

See the figure at the end of this post for the discrepancy between the economies of a few countries and the maximum-entropy state, or ideality. The authors also noted: [Venkatasubramanian, Luo & Sethuraman 2015]

“Even the US economy operated a lot closer to ideality, during ∼1945–75, than it does now. It is important to emphasize that in those three decades US performed extremely well economically, dominating the global economy in almost every sector.”

They even argued that these insights in economics might shed light on traditional statistical thermodynamics.

I have to say that I love this work because it not only explains a real-world problem, but also links physics and economics in a beautiful way.