Ever since Mehta and Schwab laid out the relationship between restricted Boltzmann machines (RBM) and deep learning mathematically (see my previous entry), scientists have been discussing why deep learning works so well. Recently, Henry Lin and Max Tegmark put a preprint on arXiv (arXiv:1609.09225), arguing that deep learning works because it captures a few essential physical laws and properties. Tegmark is a cosmologist.
Physical laws are simple in a way that a few properties, such as locality, symmetry, hierarchy etc., lead to large-scale, universal, and often complex phenomena. A lot of machine learning algorithms, including deep learning algorithms, have deep relations with formalisms outlined in statistical mechanics.
A lot of machine learning algorithms are basically probability theory. They outlined a few types of algorithms that seek various types of probabilities. They related the probabilities to Hamiltonians in many-body systems.
They argued why neural networks can approximate functions (polynomials) so well, giving a simple neural network performing multiplication. With central limit theorem or Jaynes’ arguments (see my previous entry), a lot of multiplications, they said, can be approximated by low-order polynomial Hamiltonian. This is like a lot of many-body systems that can be approximated by 4-th order Landau-Ginzburg-Wilson (LGW) functional.
Properties such as locality reduces the number of hyper-parameters needed because it restricts to interactions among close proximities. Symmetry further reduces it, and also computational complexities. Symmetry and second order phase transition make scaling hypothesis possible, leading to the use of the tools such as renormalization group (RG). As many people have been arguing, deep learning resembles RG because it filters out unnecessary information and maps out the crucial features. Tegmark use classifying cats vs. dogs as an example, as in retrieving temperatures of a many-body systems using RG procedure. They gave a counter-example to Schwab’s paper with the probabilities cannot be preserved by RG procedure, but while it is sound, but it is not the point of the RG procedure anyway.
They also discussed about the no-flattening theorems for neural networks.
There are many tasks that involve coding, for example, putting kids into groups according to their age, labeling the webpages about their kinds, or putting students in Hogwarts into four colleges… And researchers or lawyers need to code people, according to their filled-in information, into occupations. Melissa Friesen, an investigator in Division of Cancer Epidemiology and Genetics (DCEG), National Cancer Institute (NCI), National Institutes of Health (NIH), saw the need of large-scale coding. Many researchers are dealing with big data concerning epidemiology. She led a research project, in collaboration with Office of Intramural Research (OIR), Center for Information Technology (CIT), National Institutes of Health (NIH), to develop an artificial intelligence system to cope with the problem. This leads to a publicly available tool called SOCcer, an acronym for “Standardized Occupation Coding for Computer-assisted Epidemiological Research.” (URL: http://soccer.nci.nih.gov/soccer/)
The system was initially developed in an attempt to find the correlation between the onset of cancers and other diseases and the occupation. “The application is not intended to replace expert coders, but rather to prioritize which job descriptions would benefit most from expert review,” said Friesen in an interview. She mainly works with Daniel Russ in CIT.
SOCcer takes job title, industry codes (in terms of SIC, Standard Industrial Classification), and job duties, and gives an occupational code called SOC 2010 (Standard Occupational Classification), used by U. S. federal government agencies. The data involves short text, often messy. There are 840 codes in SOC 2010 systems. Conventional natural language processing (NLP) methods may not apply. Friesen, Russ, and Kwan-Yuet (Stephen) Ho (also in OIR, CIT; a CSRA staff) use fuzzy logic, and maximum entropy (maxent) methods, with some feature engineering, to build various classifiers. These classifiers are aggregated together, as in stacked generalization (see my previous entry), using logistic regression, to give a final score.
SOCcer has a companion software, called SOCAssign, for expert coders to prioritize the codings. It was awarded with DCEG Informatics Tool Challenge 2015. SOCcer itself was awarded in 2016. And the SOCcer team was awarded for Scientific Award of Merit by CIT/OCIO in 2016 as well (see this). Their work was published in Occup. Environ. Med.
Justin Trudeau, the Canadian Prime Minister, rocked the world recently by talking about quantum computers in front of reporters at Perimeter Institute…
No fault can be found in his answer, although his answer cannot be more layman. A few physicists were challenged to give a layman introduction about quantum computers, and they are equally well. (See MacLean’s article.)
The research about quantum computers has been for decades, and it is not until recently some commercial products are available in the market.
Qubits and Quantum Entanglement
Trudeau is right that in quantum computers, a bit, called a qubit, is more than just 0 and 1, but a complicated combination of them. It can be characterized by a quantum state , where and are complex numbers. Such a state can be geometrically demonstrated as a Bloch sphere, where the possibility of a state is infinitely more than simply up and down.
Besides qubits, quantum entanglement makes quantum computing unique, because with entanglement it allows algorithms to be calculated in parallel at the same time by its nature. Quantum entanglement concerns the correlation between two qubits in a way that the quantum state of the two qubits cannot be separated into two independent quantum states, each for one qubit. There are many ways to quantify the entanglement of a bipartite state (consisting of two qubits), such as entanglement entropy (similar to Shannon entropy, see this, where the probability is on the partial density matrix calculated by Schmidt decomposition), and negativity. [Peres, 1996]
There are two types of quantum computers in general, namely, the universal quantum computer (UQC), and non-universal adiabatic quantum computer (AQC).
Universal Quantum Computer (UQC)
UQC, the quantum computers that exploit quantum properties to perform computations that are classically infeasible, has been studied for decades in academia. Not only the quantum algorithms, the realization of hardware has always been one of the hot topics of scientific research. Although much time and money has been invested, it is still far from industrial and commercial use.
The first UQC on earth was made in MIT in 1998, by Isaac Chuang and his colleagues, using NMR techniques. [Chuang et. al., 1998] There are also other experimental realization of qubits, such as optical cavity, Josephson junction, quantum dots, nanomechanical resonator etc. Representations depend on the physics systems.
If this is successfully realized in industrialized use, it speeds up a lot of computations by fully applying the quantum algorithms.
Adiabatic Quantum Computer (AQC)
AQC does not fully exploit the quantum properties. While there are qubits, the algorithms may not be fully applied. Instead, it partially exploits quantum properties to accelerate current algorithms. It takes advantage of adiabatic theorem, which states that a slow perturbation to a quantum system remains the instantaneous eigenstate. This means, if the quantum system starts at a ground state, it stays at the ground state after the slow perturbation due to some external field. This process possibly involves quantum tunneling. AQC is actually a type of quantum annealing.
There are already some commercial applications of AQC. D-Wave Systems made their first AQC in the world, and a number of big companies such as Lockheed Martin and Google purchased them. D-Wave systems are based on Josephson junctions. It is programmable, and most suited to optimization problems. One can write their own optimization problem in terms of the Ising model:
The system will start at an initial system with its ground state, and slowly changing the interactions to this Hamiltonian. The state will be the ground state, or the optimized solution, of the problem.
We can immediately see that, in the era of big data, this is very useful as machine learning problems are optimization problems. For example, QxBranch, a startup based on Washington, DC, developed tools using D-Wave Systems to perform some machine learning algorithms.
Topological Quantum Computer
In some sense, an AQC is only a partial quantum computer, which does not employ full quantum properties. However, a UQC is hard to achieve because of quantum decoherence. A state can exist for only a very short time. There exist ways to avoid this, for example, optimal dynamic decoupling. [Fu et. al., 2009] Another fascinating idea to overcome this is topological quantum computing.
A topological quantum computer is a theoretical quantum computer that is based on anyons, a two-dimensional quasi-particles with world lines crossing over in three-dimensional world. It was first introduced in 1997 by Alexei Kitaev. The idea is to exploit the ideas of topology in anyons to preserve the state, because it is extremely costly to change the topology of a system. In 2005, Das Sarma, Freedman, and Nayak proposed using fractional quantum Hall system to achieve this. [Das Sarma et. al., 2006] Some ideas such as Majorana fermions have been proposed too. All in all, topological quantum computing is still a theoretical idea.
My cousin in China bought me this book from China.
The title, Shu Xue Zhi Mei, can be translated literally to “The Beauty of Math,” but the content is on information theory and data mining. The author, Jun Wu, was a scientist in Google at its early stage. He graduated from Tsinghua University and Johns Hopkins University. He is an expert of natural language processing and search engines.
I just started reading this book. But I would like to share the very first section that I read and found very interesting. He told a story about a combination of entropy and information theory beautifully.
A function of languages is to convey information (while the theologians further say that language is related to act, in speech-act theory, in the doctrine of Scripture. See this.) Ancient Egyptians and Chinese invented hieroglyphs, a language system that represents information, which can be seen as clustering in the sense of machine learning. Indeed, a character or a symbol in Chinese do represent an area of meaning. And when we have more concepts, we introduce more characters, or equivalently, add more clusters. It is indeed what has been happening: the Chinese invented new words to cover new knowledge.
Thanks to the Phoenicians, phonetic languages actually reduce the problem of introducing new clusters that require much effort for human to learn. A combination of a small number of letters (or alphabets, or aleph-bets…), together with a set of grammar rules, can represent complicated enough concepts.
Later John von Neumann introduced the concept of information entropy, which is essentially the number of bits (0 or 1) that are required to represent a variety of concepts. See my previous post on entropy. Bit might be the most compact way of representing information, but redundancy in all languages is necessary in case of loss in transmission.
Entropy is one of the most fascinating ideas in the history of mathematical sciences.
In Phenomenological Thermodynamics…
Entropy was introduced into thermodynamics in the 19th century. Like the free energies, it describes the state of a thermodynamic system. At the beginning, entropy is merely phenomenological. The physicists found it useful to incorporate the description using entropy in the second law of thermodynamics with clarity and simplicity, instead of describing it as convoluted heat flow (which is what it is originally about) among macroscopic systems (say, the heat flow from the hotter pot of water to the air of the room). It did not carry any statistical meaning at all until 1870s.
In Statistical Physics…
Ludwig Boltzmann (1844-1906)
The statistical meaning of entropy was developed by Ludwig Boltzmann, a pioneer of statistical physics, who studied the connection of the macroscopic thermodynamic behavior to the microscopic components of the system. For example, he described the temperature to be the average of the fluctuating kinetic energy of the particles. And he formulated the entropy to be
where i is the label for each microstate, and is the Boltzmann’s constant. And in a closed system, the total entropy never decreases.
Information Theory and Statistical Physics United
In statistical physics, Boltzmann’s assumption of equal a priori equilibrium properties is an important assumption. However, in 1957, E. T. Jaynes published a paper relating information theory and statistical physics in Physical Review indicating that merely the principle of maximum entropy is sufficient to describe equilibrium statistical system. [Jaynes 1957] In statistical physics, we are aware that systems can be described as canonical ensemble, or a softmax function (normalized exponential), i.e., . This can be easily derived by the principle of maximum entropy and the conservation of energy. Or mathematically, the probabilities for all states i with energies can be obtained by maximizing the entropy
under the constraints
where E is a constant. The softmax distribution can be obtained by this simple optimization problem, using basic variational calculus (Euler-Lagrange equation) and Lagrange’s multipliers.
The principle of maximum entropy can be found in statistics too. For example, the form of Gaussian distribution can be obtained by maximizing the entropy
with the knowledge of the mean and the variance , or mathematically speaking, under the constraints,
In any statistical systems, the probability distributions can be computed with the principle of maximum entropy, as Jaynes put it [Jaynes 1957]
It is the least biased estimate possible on the given information; i.e., it is maximally noncommittal with regard to missing information.
In statistical physics, entropy is roughly a measure how “chaotic” a system is. In information theory, entropy is a measure how surprising the information is. The smaller the entropy is, the more surprising the information is. And it assumes no additional information. Without constraints other than the normalization, the probability distribution is that all ‘s are equal, which is equivalent to the least surprise. Lê Nguyên Hoang, a scientist at Massachusetts Institute of Technology, wrote a good blog post about the meaning of entropy in information theory. [Hoang 2013] In information theory, the entropy is given by
which is different from the thermodynamic entropy by the constant and the coefficient . The entropies in information theory and statistical physics are equivalent.
Entropy in Natural Language Processing (NLP)
The principle of maximum entropy assumes nothing other than the given information to compute the most optimized probability distribution, which makes it a desirable algorithm in machine learning. It can be regarded as a supervised learning algorithm, with the features being , where p is the property calculated, and c is the class. The probability for is proportional to , where is the coefficient to be found during training. There are some technical note to compute all these coefficients, which essentially involves solving a system of algebraic equations numerically using techniques such as generalized iterative scaling (GIS).
Does it really assume no additional information? No. The way you construct the features is how you add information. But once the features are defined, the calculation depends on the training data only.
The classifier based on maximum entropy has found its application in part-of-speech (POS) tagging, machine translation (ML), speech recognition, and text mining. A good review was written by Berger and Della Pietra’s. [Berger, Della Pietra, Della Pietra 1996] A lot of open-source softwares provide maximum entropy classifiers, such as Python NLTK and Apache OpenNLP.
In Quantum Computation…
One last word, entropy is used to describe quantum entanglement. A composite bipartite quantum system is said to be entangled if its subsystems must be described in a mixed state, i.e., it must be statistical if one of the subsystems is only considered. Then the entanglement entropy is given by [Nielssen, Chuang 2011]
which is essentially the same formula. The more entangled the system is, the larger the entanglement entropy. However, composite quantum systems tend to decrease their entropy over time though.
Wall Street is not only a place of facilitating the money flow, but also a playground for scientists.
When I was young, I saw one of my uncles plotting prices for stocks to perform technical analysis. When I was in college, my friends often talked about investing in a few financial futures and options. When I was doing my graduate degree in physics, we studied John Hull’s famous textbook [Hull 2011] on quantitative finance to learn about financial modeling. A few of my classmates went to Wall Street to become quantitative analysts or financial software developers. There are ups and downs in the financial markets. But as long as we are in a capitalist society, finance is a subject we never ignore. However, scientists have not come up with a consensus about the nature of a financial market.
Economists believe that individuals in a market are rational being who always aim at maximizing their profits. They often apply agent-based models, which employs complex system theories or game theory.
Random Processes and Statistical Physics
However, a lot of mathematicians in Wall Street (including quantitative analysts and econophysicists) see the stock prices as undergoing Brownian motion. [Hull 2011, Baaquie 2007] They employ tools in statistical physics and stochastic processes to study the pricings of various financial derivatives. Therefore, the random-process and econophysical approaches have nothing much about stock price prediction (despite the fact that they do need a “return rate” in their model.) Random processes are unpredictable.
However, some sort of predictions carry great values. For example, when there is overhypes or bubbles in the market, we want to know when it will burst. There are models that predict defaults and bubble burst in a market using the log-periodic power law (LPPL). [Wosnitza, Denz 2013] In addition, there has been research showing the leverage effect in stock markets in developed countries such as Germany (c.f. fluctuation-dissipation theorem in statistical physics), and anti-leverage effect in China (Shanghai and Shenzhen). [Qiu, Zhen, Ren, Trimper 2006]
Reconciling Intelligence and Randomness
There are some values to both views. It is hard to believe that stock prices are completely random, as the economic environment and the public opinions must affect the stock prices. People can neither be completely rational nor completely random.
There has been some study in reconciling game theory and random processes, in an attempt to bring economists and mathematicians together. In this theoretical framework, financial systems still sought to attain the maximum entropy (randomness), but the “particles” in the system behaves intelligently. [Venkatasubramanian, Luo, Sethuraman 2015] (See my another blog entry: MathAnalytics (1) – Beautiful Mind, Physical Nature and Economic Inequality) We are not sure how successful this attempt will be at this point.
As people are talking about big data in recent years, there have been attempts to apply machine learning algorithms in finance. However, scientists tend not to price using machine learning algorithms because these algorithms mostly perform classification. However, there are attempts, with natural language processing (NLP) techniques, to predict the stock prices by detecting the public emotions (or sentiments) in social media such as Twitter. [Bollen, Mao, Zeng 2010] It has been found that measuring the public mood in a few dimensions (including Calm, Alert, Sure, Vital, Kind, and Happy) allows scientists to accurately predict the trend of Dow Jones Industrial Average (DJIA). However, some hackers take advantage on the sentiment analysis on Twitter. In 2013, there was a rumor on Twitter saying the White House being bombed, The computers responded instantly and automatically by performing trading, causing the stock market to fall immediately. But the market restored quickly after it was discovered that the news was fake. (Fig. 1)
P.S.: While I was writing this, I saw an interesting statement in the paper about leverage effect. [Qiu, Zhen, Ren, Trimper 2006] The authors said that:
Why do the German and Chinese markets exhibit different return-volatility correlations? Germany is a developed country. To some extent, people show risk aversion, and therefore, may be nervous in trading as the stock price is falling. This induces a higher volatility. When the price is rising, people feel safe and are inactive in trading. Thus, the stock price tends to be stable. This should be the social origin of the leverage effect. However, China just experiences the first stage of capitalism, and people are somewhat excessive speculative in the financial markets. Therefore, people rush for trading as the stock price increases. When the price drops, people stay inactive in trading and wait for rising up of the stock price. That explains the antileverage effect.
Does this paragraph written in 2006 give a hint of what happened in China in 2015 now? (Fig. 2)
John Nash’s death on May 23, 2015 on the New Jersey Turnpike was a tragedy. However, his contribution to mathematics and economics is everlasting. His contribution to game theory led to his sharing the 1994 Nobel Memorial Prize for Economical Sciences.
Coincidentally, three weeks before his accidental death, there was an econophysics paper that employed his ideas of Nash equilibrium. Econophysics has been an inter-disciplinary quantitative field since 1990s. Victor Yakovenko, a physics professor in University of Maryland, applied the techniques of classical statistical mechanics, and concluded that the wealth of bottom 95% population follows Boltzmann-Gibbs exponential distribution, while the top a Pareto distribution. [Dragulescu & Yakovenko 2000] This approach assumes agents to have nearly “zero intelligence,” and behave randomly with no intent and purpose, contrary to the conventional assumption in economics that agents are perfectly rational, with purpose to maximize utility or profit.
This paper, written by Venkat Venkatasubramanian, described an approach aiming at reconciling econophysics and conventional economics, using the ideas in game theory. [Venkatasubramanian, Luo & Sethuraman 2015] Like statistical mechanics, it assumes the agents to be particles. Money plays the role of energy, just like other econophysics theory. The equilibrium state is the state with maximum entropy. However, it employed the idea of game theory, adding that the agents are intelligent and in a game, unlike molecules in traditional statistical mechanics. The equilibrium state is not simply the maximum entropic state, but also the Nash equilibrium. This reconciles econophysics and conventional economics. And it even further argues that, unlike equilibrium in thermodynamics being probabilistic in nature, this economical equilibrium is deterministic. And the expected distribution is log-normal distribution. (This log-normal distribution is hard to fit, which is another obstacles for economists to accept physical approach to economics.)
With this framework, Venkatasubramanian discussed about income inequality. Income inequality has aroused debates in the recent few years, especially after the detrimental financial crisis in 2008. Is capitalism not working now? Does capitalism produce unfairness? He connected entropy with the concept of fairness, or fairest inequality. And the state with maximum entropy is the fairest state. And, of course, the wealth distribution is the log-normal distribution. His study showed that:[http://phys.org/news/2015-05-fair-theory-income-inequality.html]
“Scandinavian countries and, to a lesser extent, Switzerland, Netherlands, and Australia have managed, in practice, to get close to the ideal distribution for the bottom 99% of the population, while the U.S. and U.K. remain less fair at the other extreme. Other European countries such as France and Germany, and Japan and Canada, are in the middle.”
See the figure at the end of this post about the discrepancy of the economies of a few countries to the maximum entropic state, or ideality. And [Venkatasubramanian, Luo & Sethuraman 2015]
“Even the US economy operated a lot closer to ideality, during ∼1945–75, than it does now. It is important to emphasize that in those three decades US performed extremely well economically, dominating the global economy in almost every sector.”
They even argued that these insights in economics might shed light to traditional statistical thermodynamics.
I have to say that I love this work because not only it explains real-world problem, but also links physics and economics in a beautiful way.