There is no doubt that everyone who are in the so-called big data industry must know some statistics. However, statistics means differently to different peoples.

Statistics is an old field that was developed in the 18th century. In those times, people were urged to make conclusions out of a vast amount of data which were virtually not available, or were very costly to obtain. For example, someone wanted to know the average salary of the whole population, which required the census staff to survey the information from everyone in the population. It was something expensive to do in the old days. Therefore, sampling techniques were devised, and the wanted quantities can be estimated using an appropriate statistic.

Or when the scientists performed an experiment, even one data point costs a few million dollars. The experiments had to be designed in a way that the scientists extract the wanted information by looking at a few data points.

Or in testing some hypotheses, one needs to know only how to accept or reject a hypothesis using the statistical information available.

Hence, the traditional statistics is a body of knowledge that deduce the information of a whole population from a limited amount of data from a sample.

Theoretical Statistical Physics

There is a branch in physics called statistical physics, which originated from the 19th century. Later it became useful since Albert Einstein published its paper on Brownian motion in 1905. And now the methods in statistical physics is not only applied in solid state physics or condensed matter physics, but also in biophysics (e.g., diffusion), econophysics (e.g., the fairness and wealth distribution, see this previous blog post), and quantitative finance (e.g., binomial model, and its relation with Black-Scholes equation).

The techniques involved in statistical physics includes is the knowledge of probability theory and stochastic calculus (such as Ito calculus). Of course, it is how entropy, a concept from thermodynamics, entered probability theory and information theory. Extracted quantity are mostly expectation values and correlations, which are of interest to theorists.

This is very different from traditional statistics. When people know that I am a statistical physicist, they expect me to be familiar with t-test, which is not really the case. (Very often I have to look up every time I used them.)

Statistics in the Computing World

Unlike in traditional statistics or statistical physics, nowadays, we often get the statistical information directly from a vast amount of available data, thanks to the advance of technology and the reducing cost to access the technology. You can easily calculate the average salary of a population by a single command line on R or Python. Hence, statistics is no longer about extracting information from a limited amount of data, but a vast amount of data.

On the other hand, mathematical modeling is still important, but in a different sense. Models in statistical physics describes the world, but in information retrieval, models are built according to what we need.

P.S.: Philipp Janert wrote something similar in his Chapter 10 (“What You Really Need to Know About Classical Statistics”) in his “Data Analysis Using Open Source Tools“:

The basic statistical methods that we know today were developed in the late 19th and early 20th centuries, mostly in Great Britain, by a very small group of people. Of those, one worked for the Guinness brewing company and another—the most influential one of them—worked at an agricultural research lab (trying to increase crop yields and the like). This bit of historical context tells us something about their working conditions and primary challenges.

No computational capabilities All computations had to be performed with paper and pencil.

No graphing capabilities, either All graphs had to be generated with pencil, paper, and a ruler. (And complicated graphs—such as those requiring prior transformations or calculations using the data—were especially cumbersome.)

Very small and very expensive data sets Data sets were small (often not more than four to five points) and could be obtained only with great difficulty. (When it always takes a full growing season to generate a new data set, you try very hard to make do with the data you already have!)

In other words, their situation was almost entirely the opposite of our situation today:

• Computational power that is essentially free (within reason)
• Interactive graphing and visualization capabilities on every desktop
• Often huge amounts of data

It should therefore come as no surprise that the methods developed by those early researchers seem so out of place to us: they spent a great amount of effort and ingenuity solving problems we simply no longer have! This realization goes a long way toward explaining why classical statistics is the way it is and why it often seems so strange to us today.

P.S.: The graph at the beginning of this blog entry was plotted in Mathematica, by running the following:

Plot[Evaluate@Table[PDF[MaxwellDistribution[σ], x], {σ, {1, 2, 3}}], {x, 0, 10}, Filling -> Axis]

In my very first class to introductory college physics, I was told that a good scientific model have a general descriptive power. While it is true, I found that it is a oversimplified statement after I was exposed to computational science and other fields outside physics.

Descriptive power is important, as in Shelling model in economics; but to many scientists and engineers, a model is devised because of its predictive power, which can be seen as one aspect of its descriptive power. Predictive power is a useful feature. All physics and engineering models have predictive power in a quantitative sense.

Physics models have to be descriptive in a sense that the models are describing physical things; but in the big data era, a lot of machine learning models are like black boxes, which means we care only about the meaning of inputs and outputs, but the content of the models do not necessarily carry a meaning. SVM and deep learning are good examples. (This in fact bothers some people.) But of course, in physics, there are a lot of phenomenological models that are between descriptive models and black boxes, such as the Ginzburg-Landau-Wilson (GLW) model widely used in magnets, superfluids, helimagnets, superconductors, liquid crystals etc. However, to be fair, some machine learning models are quite descriptive, such as clustering, Gaussian mixtures, MaxEnt etc.

A lot of traditional physics models are equation-based, but computational models are not. The reasons are evident. And of course, traditional physics models are mostly continuous while computational models are sometimes discrete. If the computational models are continuous and equation-based, they will be translated to discrete version for the computing machines to handle correctly.

However, whichever forms the models take, we, human beings, are essential to give the models meaning so that they are useful.

Wall Street is not only a place of facilitating the money flow, but also a playground for scientists.

When I was young, I saw one of my uncles plotting prices for stocks to perform technical analysis. When I was in college, my friends often talked about investing in a few financial futures and options. When I was doing my graduate degree in physics, we studied John Hull’s famous textbook [Hull 2011] on quantitative finance to learn about financial modeling. A few of my classmates went to Wall Street to become quantitative analysts or financial software developers. There are ups and downs in the financial markets. But as long as we are in a capitalist society, finance is a subject we never ignore. However, scientists have not come up with a consensus about the nature of a financial market.

Agent-Based Models

Economists believe that individuals in a market are rational being who always aim at maximizing their profits. They often apply agent-based models, which employs complex system theories or game theory.

Random Processes and Statistical Physics

However, a lot of mathematicians in Wall Street (including quantitative analysts and econophysicists) see the stock prices as undergoing Brownian motion. [Hull 2011, Baaquie 2007] They employ tools in statistical physics and stochastic processes to study the pricings of various financial derivatives. Therefore, the random-process and econophysical approaches have nothing much about stock price prediction (despite the fact that they do need a “return rate” in their model.) Random processes are unpredictable.

However, some sort of predictions carry great values. For example, when there is overhypes or bubbles in the market, we want to know when it will burst. There are models that predict defaults and bubble burst in a market using the log-periodic power law (LPPL). [Wosnitza, Denz 2013] In addition, there has been research showing the leverage effect in stock markets in developed countries such as Germany (c.f. fluctuation-dissipation theorem in statistical physics), and anti-leverage effect in China (Shanghai and Shenzhen). [Qiu, Zhen, Ren, Trimper 2006]

Reconciling Intelligence and Randomness

There are some values to both views. It is hard to believe that stock prices are completely random, as the economic environment and the public opinions must affect the stock prices. People can neither be completely rational nor completely random.

There has been some study in reconciling game theory and random processes, in an attempt to bring economists and mathematicians together. In this theoretical framework, financial systems still sought to attain the maximum entropy (randomness), but the “particles” in the system behaves intelligently. [Venkatasubramanian, Luo, Sethuraman 2015] (See my another blog entry: MathAnalytics (1) – Beautiful Mind, Physical Nature and Economic Inequality) We are not sure how successful this attempt will be at this point.

Sentiment Analysis

As people are talking about big data in recent years, there have been attempts to apply machine learning algorithms in finance. However, scientists tend not to price using machine learning algorithms because these algorithms mostly perform classification. However, there are attempts, with natural language processing (NLP) techniques, to predict the stock prices by detecting the public emotions (or sentiments) in social media such as Twitter. [Bollen, Mao, Zeng 2010] It has been found that measuring the public mood in a few dimensions (including Calm, Alert, Sure, Vital, Kind, and Happy) allows scientists to accurately predict the trend of Dow Jones Industrial Average (DJIA). However, some hackers take advantage on the sentiment analysis on Twitter. In 2013, there was a rumor on Twitter saying the White House being bombed, The computers responded instantly and automatically by performing trading, causing the stock market to fall immediately. But the market restored quickly after it was discovered that the news was fake. (Fig. 1)

Fig. 1: DJIA fell because of a rumor of the White House being bombed, but recovered when discovered the news was fake (taken from http://www.rt.com/news/syrian-electronic-army-ap-twitter-349/)

P.S.: While I was writing this, I saw an interesting statement in the paper about leverage effect. [Qiu, Zhen, Ren, Trimper 2006] The authors said that:

Why do the German and Chinese markets exhibit different return-volatility correlations? Germany is a developed country. To some extent, people show risk aversion, and therefore, may be nervous in trading as the stock price is falling. This induces a higher volatility. When the price is rising, people feel safe and are inactive in trading. Thus, the stock price tends to be stable. This should be the social origin of the leverage effect. However, China just experiences the first stage of capitalism, and people are somewhat excessive speculative in the financial markets. Therefore, people rush for trading as the stock price increases. When the price drops, people stay inactive in trading and wait for rising up of the stock price. That explains the antileverage effect.

Does this paragraph written in 2006 give a hint of what happened in China in 2015 now? (Fig. 2)

Fig. 2: The fall of Chinese stock market in 2015 (taken from http://www.economicpolicyjournal.com/2015/06/breaking-biggest-chinese-stock-market.html)

Taken from the movie “Beautiful Mind”

John Nash’s death on May 23, 2015 on the New Jersey Turnpike was a tragedy. However, his contribution to mathematics and economics is everlasting. His contribution to game theory led to his sharing the 1994 Nobel Memorial Prize for Economical Sciences.

Coincidentally, three weeks before his accidental death, there was an econophysics paper that employed his ideas of Nash equilibrium. Econophysics has been an inter-disciplinary quantitative field since 1990s. Victor Yakovenko, a physics professor in University of Maryland, applied the techniques of classical statistical mechanics, and concluded that the wealth of bottom 95% population follows Boltzmann-Gibbs exponential distribution, while the top a Pareto distribution. [Dragulescu & Yakovenko 2000] This approach assumes agents  to have nearly “zero intelligence,” and behave randomly with no intent and purpose, contrary to the conventional assumption in economics that agents are perfectly rational, with purpose to maximize utility or profit.

This paper, written by Venkat Venkatasubramanian, described an approach aiming at reconciling econophysics and conventional economics, using the ideas in game theory. [Venkatasubramanian, Luo  & Sethuraman 2015] Like statistical mechanics, it assumes the agents to be particles. Money plays the role of energy, just like other econophysics theory. The equilibrium state is the state with maximum entropy. However, it employed the idea of game theory, adding that the agents are intelligent and in a game, unlike molecules in traditional statistical mechanics. The equilibrium state is not simply the maximum entropic state, but also the Nash equilibrium. This reconciles econophysics and conventional economics. And it even further argues that, unlike equilibrium in thermodynamics being probabilistic in nature, this economical equilibrium is deterministic. And the expected distribution is log-normal distribution. (This log-normal distribution is hard to fit, which is another obstacles for economists to accept physical approach to economics.)

With this framework, Venkatasubramanian discussed about income inequality. Income inequality has aroused debates in the recent few years, especially after the detrimental financial crisis in 2008. Is capitalism not working now? Does capitalism produce unfairness? He connected entropy with the concept of fairness, or fairest inequality. And the state with maximum entropy is the fairest state. And, of course, the wealth distribution is the log-normal distribution. His study showed that:[http://phys.org/news/2015-05-fair-theory-income-inequality.html]

“Scandinavian countries and, to a lesser extent, Switzerland, Netherlands, and Australia have managed, in practice, to get close to the ideal distribution for the bottom 99% of the population, while the U.S. and U.K. remain less fair at the other extreme. Other European countries such as France and Germany, and Japan and Canada, are in the middle.”

See the figure at the end of this post about the discrepancy of the economies of a few countries to the maximum entropic state, or ideality. And [Venkatasubramanian, Luo  & Sethuraman 2015]

“Even the US economy operated a lot closer to ideality, during ∼1945–75, than it does now. It is important to emphasize that in those three decades US performed extremely well economically, dominating the global economy in almost every sector.”

They even argued that these insights in economics might shed light to traditional statistical thermodynamics.

I have to say that I love this work because not only it explains real-world problem, but also links physics and economics in a beautiful way.

This is an age of quantification, meaning that we want to give everything, even qualitative, a number. In schools, teachers measure how good their students master mathematics by grading, or scoring their homework. The funding agencies measure how good a scientist is by counting the number of his publications, the citations, and the impact factors. We measure how successful a person is by his annual income. We can question all these approaches of measurement. Yet however good or bad the measures are, we look for a metric to measure.

Original PageRank Algorithm

We measure webpages too. In the early ages of Internet, people performed searching on sites such as Yahoo or AltaVista. The keywords they entered are the main information for the browser to do the searching. However, a big problem was that a large number of low quality or irrelevant webpages showed up in search results. Some were due to malicious manipulation of keyword tricks. Therefore, it gave rise a need to rank the webpages. Larry Page and Sergey Brin, the founders of Google, tackled this problem as a thesis topic in Stanford University. But this got commercialized, and Brin never received his Ph.D. They published their algorithm, called PageRank, named after Larry Page, at the Seventh International World Wide Web Conference (WWW7) in April 1998. [Brin & Page 1998] This algorithm is regarded as one of the top ten algorithms in data mining by a survey paper published in the IEEE International Conference on Data Mining (ICDM) in December 2006. [Wu et. al. 2008]

Larry Page and Sergey Brin (source)

The idea of the PageRank algorithm is very simple. It regards each webpage as a node, and each link in the webpage as a directional edge from the source to the target webpage. This forms a network, or a directed graph, of webpages connected by their links. A link is seen as a vote to the target homepage, and if the source homepage ranks high, it enhances the target homepage’s ranking as well. Mathematically it involves solving a large matrix using Newton-Raphson’s method. (Technologies involving handling the large matrix led to the MapReduce programming paradigm, another big data trend nowadays.)

Example (made by Python with packages networkx and matplotlib)

Let’s have an intuition through an example. In the network, we can easily see that “Big Data 1” has the highest rank because it has the most edges pointing to it. However, there are pages such as “Big Data Fake 1,” which looks like a big data page, but in fact it points to “Porn 1.” After running the PageRank algorithm, it does not have a high rank. The sample of the output is:

```[('Big Data 1', 0.00038399273501500979),
('Artificial Intelligence', 0.00034612564364377323),
('Deep Learning 1', 0.00034221161094691966),
('Machine Learning 1', 0.00034177713235138173),
('Porn 1', 0.00033859136614724074),
('Big Data 2', 0.00033182629176238337),
('Spark', 0.0003305912073357307),
('Dow-Jones 1', 0.00032368956852396916),
('Big Data 3', 0.00030969537721207128),
('Porn 2', 0.00030969537721207128),
('Big Data Fake 1', 0.00030735245262038724),
('Dow-Jones 2', 0.00030461420169420618),
('Machine Learning 2', 0.0003011838672138951),
('Deep Learning 2', 0.00029899313444392865),
('Econophysics', 0.00029810944592071552),
('Big Data Fake 2', 0.00029248837867043803),
('Wall Street', 0.00029248837867043803),
('Deep Learning 3', 0.00029248837867043803)]```

You can see those pornographic webpages that pretend to be big data webpages do not have rank as high as those authentic ones. PageRank fights against spam and irrelevant webpages. Google later further improved the algorithm to combat more advanced tricks of spam pages.

You can refer other details in various sources and textbooks. [Rajaraman and Ullman 2011, Wu et. al. 2008]

Use in Social Media and Forums

Mathematically, the PageRank algorithm deals with a directional graph. As one can imagine, any systems that can be modeled as directional graph allow rooms for applying the PageRank algorithm. One extension of PageRank is ExpertiseRank.

Jun Zhang, Mark Ackerman and Lada Adamic published a conference paper in the International World Wide Web (WWW7) in May 2007. [Zhang, Ackerman & Adamic 2007] They investigated into a Java forum, by connecting users to posts and anyone replying to it as a directional graph. With an algorithm closely resembled PageRank, they found the experts and influential people in the forum.

Graphs in ExpertiseRank (take from [Zhang, Ackerman & Adamic 2007])

There are other algorithms like HITS (Hypertext induced topic selection) that does similar things. And social media such as Quora (and its Chinese counterpart, Zhihu) applied a link analysis algorithm (probabilistic topic network, see this.) to perform topic network building. Similar ideas are also applied to identify high-quality content in Yahoo! Answers. [Agichtein, Castillo, Donato, Gionis & Mishne 2008]

Use in Finance and Econophysics

PageRank algorithm is also applied outside information technology fields. Financial engineers and econophysicists applied an algorithm, called DebtRank, which is very similar to PageRank, to determine the systemically important financial institutions in a financial network. This work is published in Nature Scientific Reports. [Battiston, Puliga, Kaushik, Tasca & Caldarelli 2012] In their study, each node represents a financial institution, and a directional edge means the estimated potential impact of an institution to another one. Using DebtRank, we are able to identify the centrally important institutions that potentially impacted other institutions in the network once a financial crisis occurs.

D
ebtRank network, taken from [Battiston, Puliga, Kaushik, Tasca & Caldarelli 2012])