Entropy is one of the most fascinating ideas in the history of mathematical sciences.

In Phenomenological Thermodynamics…

Entropy was introduced into thermodynamics in the 19th century. Like the free energies, it describes the state of a thermodynamic system. At the beginning, entropy is merely phenomenological. The physicists found it useful to incorporate the description using entropy in the second law of thermodynamics with clarity and simplicity, instead of describing it as convoluted heat flow (which is what it is originally about) among macroscopic systems (say, the heat flow from the hotter pot of water to the air of the room). It did not carry any statistical meaning at all until 1870s.

In Statistical Physics… Ludwig Boltzmann (1844-1906)

The statistical meaning of entropy was developed by Ludwig Boltzmann, a pioneer of statistical physics, who studied the connection of the macroscopic thermodynamic behavior to the microscopic components of the system. For example, he described the temperature to be the average of the fluctuating kinetic energy of the particles. And he formulated the entropy to be $S = - k_B \sum_i p_i \log p_i$,

where i is the label for each microstate, and $k_B$ is the Boltzmann’s constant. And in a closed system, the total entropy never decreases.

Information Theory and Statistical Physics United

In statistical physics, Boltzmann’s assumption of equal a priori equilibrium properties is an important assumption. However, in 1957, E. T. Jaynes published a paper relating information theory and statistical physics in Physical Review indicating that merely the principle of maximum entropy is sufficient to describe equilibrium statistical system. [Jaynes 1957] In statistical physics, we are aware that systems can be described as canonical ensemble, or a softmax function (normalized exponential), i.e., $p_i \propto \exp(-\beta E_i)$. This can be easily derived by the principle of maximum entropy and the conservation of energy. Or mathematically, the probabilities for all states i with energies $E_i$ can be obtained by maximizing the entropy $S = -\sum_i p_i \log p_i$,

under the constraints $\sum_i p_i = 1$, and $\sum_i p_i E_i = E$,

where E is a constant. The softmax distribution can be obtained by this simple optimization problem, using basic variational calculus (Euler-Lagrange equation) and Lagrange’s multipliers.

The principle of maximum entropy can be found in statistics too. For example, the form of Gaussian distribution can be obtained by maximizing the entropy $S = - \int dx \cdot p(x) \log p(x)$,

with the knowledge of the mean $\mu$ and the variance $\sigma^2$, or mathematically speaking, under the constraints, $\int dx \cdot p(x) = 1$, $\int dx \cdot x p(x) = \mu$, and $\int dx \cdot (x-\mu)^2 p(x) = \sigma^2$.

In any statistical systems, the probability distributions can be computed with the principle of maximum entropy, as Jaynes put it [Jaynes 1957]

It is the least biased estimate possible on the given information; i.e., it is maximally noncommittal with regard to missing information.

In statistical physics, entropy is roughly a measure how “chaotic” a system is. In information theory, entropy is a measure how surprising the information is. The smaller the entropy is, the more surprising the information is. And it assumes no additional information. Without constraints other than the normalization, the probability distribution is that all $p_i$‘s are equal, which is equivalent to the least surprise. Lê Nguyên Hoang, a scientist at Massachusetts Institute of Technology, wrote a good blog post about the meaning of entropy in information theory. [Hoang 2013] In information theory, the entropy is given by $S = -\sum_i p_i \log_2 p_i$,

which is different from the thermodynamic entropy by the constant $k_B$ and the coefficient $\log 2$. The entropies in information theory and statistical physics are equivalent.

Entropy in Natural Language Processing (NLP)

The principle of maximum entropy assumes nothing other than the given information to compute the most optimized probability distribution, which makes it a desirable algorithm in machine learning. It can be regarded as a supervised learning algorithm, with the features being ${p, c}$, where p is the property calculated, and c is the class. The probability for ${p, c}$ is proportional to $\exp(- \alpha \text{\#}({p, c}))$, where $\alpha$ is the coefficient to be found during training. There are some technical note to compute all these coefficients, which essentially involves solving a system of algebraic equations numerically using techniques such as generalized iterative scaling (GIS).

Does it really assume no additional information? No. The way you construct the features is how you add information. But once the features are defined, the calculation depends on the training data only.

The classifier based on maximum entropy has found its application in part-of-speech (POS) tagging, machine translation (ML), speech recognition, and text mining. A good review was written by Berger and Della Pietra’s. [Berger, Della Pietra, Della Pietra 1996] A lot of open-source softwares provide maximum entropy classifiers, such as Python NLTK and Apache OpenNLP.

In Quantum Computation…

One last word, entropy is used to describe quantum entanglement. A composite bipartite quantum system is said to be entangled if its subsystems must be described in a mixed state, i.e., it must be statistical if one of the subsystems is only considered. Then the entanglement entropy is given by [Nielssen, Chuang 2011] $S = -\sum_i p_i \log p_i$,

which is essentially the same formula. The more entangled the system is, the larger the entanglement entropy. However, composite quantum systems tend to decrease their entropy over time though.

In my previous blog post, I introduced the newly emerged topological data analysis (TDA). Unlike most of the other data analytic algorithms, TDA, concerning the topology as its name tells, cares for the connectivity of points, instead of the distance (according to a metric, whether it is Euclidean, Manhattan, Minkowski or any other). What is the best tools to describe topology?

Physicists use a lot homotopy. But for the sake of computation, it is better to use a scheme that are suited for discrete computation. It turns out that there are useful tools in algebraic topology: homology. But to understand homology, we need to understand what a simplicial complex is.

Gunnar Carlsson [Carlsson 2009] and Afra Zomorodian [Zomorodian 2011] wrote good reviews about them, although from a different path in introducing the concept. I first followed Zomorodian’s review [Zomorodian 2011], then his book [Zomorodian 2009] that filled in a lot of missing links in his review, to a certain point. I recently started reading Carlsson’s review.

One must first understand what a simplicial complex is. Without giving too much technical details, simplicial complex is basically a shape connecting points together. A line is a 1-simplex, connecting two points. A triangle is a 2-simplex. A tetrahedron is a 3-complex. There are other more complicated and unnamed complexes. Any subsets of a simplicial complex are faces. For example, the sides of the triangle are faces. The faces and the sides are the faces of the tetrahedron. (Refer to Wolfram MathWorld for more details. There are a lot of good tutorials online.)

Implementing Simplicial Complex

We can easily encoded this into a python code. I wrote a class SimplicialComplex in Python to implement this. We first import necessary libraries:

import numpy as np
from itertools import combinations
from scipy.sparse import dok_matrix


The first line imports the numpy library, the second the iteration tools necessary for extracting the faces for simplicial complex, the third the sparse matrix implementation in the scipy library (applied on something that I will not go over in this blog entry), and the fourth for some reduce operation.

We want to describe the simplicial complexes in the order of some labels (which can be anything, such as integers or strings). If it is a point, then it can be represented as tuples, as below:

 (1,)

Or if it is a line (a 1-simplex), then

 (1, 2)

Or a 2-simplex as a triangle, then

 (1, 2, 3)

I think you get the gist. The integers 1, 2, or 3 here are simply labels. We can easily store this in the class:

class SimplicialComplex:
def __init__(self, simplices=[]):
self.import_simplices(simplices=simplices)

def import_simplices(self, simplices=[]):
self.simplices = map(lambda simplex: tuple(sorted(simplex)), simplices)
self.face_set = self.faces()


You might observe the last line of the codes above. And it is for calculating all the faces of this complex, and it is implemented in this way:

  def faces(self):
faceset = set()
for simplex in self.simplices:
numnodes = len(simplex)
for r in range(numnodes, 0, -1):
for face in combinations(simplex, r):
return faceset


The faces are intuitively sides of a 2D shape (2-simplex), or planes of a 3D shape (3-simplex). But the faces of a 3-simplex includes the faces of all its faces. All the faces are saved in a field called faceset. If the user wants to retrieve the faces in a particular dimension, they can call this method:

  def n_faces(self, n):
return filter(lambda face: len(face)==n+1, self.face_set)


There are other methods that I am not going over in this blog entry. Now let us demonstrate how to use the class by implementing a tetrahedron.

sc = SimplicialComplex([('a', 'b', 'c', 'd')])


If we want to extract the faces, then enter:

sc.faces()


which outputs:

{('a',),
('a', 'b'),
('a', 'b', 'c'),
('a', 'b', 'c', 'd'),
('a', 'b', 'd'),
('a', 'c'),
('a', 'c', 'd'),
('a', 'd'),
('b',),
('b', 'c'),
('b', 'c', 'd'),
('b', 'd'),
('c',),
('c', 'd'),
('d',)}


We have gone over the basis of simplicial complex, which is the foundation of TDA. We appreciate that the simplicial complex deals only with the connectivity of points instead of the distances between the points. And the homology groups will be calculated based on this. However, how do we obtain the simplicial complex from the discrete data we have? Zomorodian’s review [Zomorodian 2011] gave a number of examples, but I will only go through two of them only. And from this, you can see that to establish the connectivity between points, we still need to apply some sort of distance metrics.

Alpha Complex

An alpha complex is the nerve of the cover of the restricted Voronoi regions. (Refer the details to Zomorodian’s review [Zomorodian 2011], this Wolfram MathWorld entry, or this Wolfram Demonstration.) We can extend the class SimplicialComplex to get a class AlphaComplex:

from scipy.spatial import Delaunay, distance
from operator import or_
from functools import partial

def facesiter(simplex):
for i in range(len(simplex)):
yield simplex[:i]+simplex[(i+1):]

def flattening_simplex(simplices):
for simplex in simplices:
for point in simplex:
yield point

def get_allpoints(simplices):
return set(flattening_simplex(simplices))

def contain_detachededges(simplex, distdict, epsilon):
if len(simplex)==2:
return (distdict[simplex, simplex] &gt; 2*epsilon)
else:
return reduce(or_, map(partial(contain_detachededges, distdict=distdict, epsilon=epsilon), facesiter(simplex)))

class AlphaComplex(SimplicialComplex):
def __init__(self, points, epsilon, labels=None, distfcn=distance.euclidean):
self.pts = points
self.labels = range(len(self.pts)) if labels==None or len(labels)!=len(self.pts) else labels
self.epsilon = epsilon
self.distfcn = distfcn
self.import_simplices(self.construct_simplices(self.pts, self.labels, self.epsilon, self.distfcn))

def calculate_distmatrix(self, points, labels, distfcn):
distdict = {}
for i in range(len(labels)):
for j in range(len(labels)):
distdict[(labels[i], labels[j])] = distfcn(points[i], points[j])
return distdict

def construct_simplices(self, points, labels, epsilon, distfcn):
delaunay = Delaunay(points)
delaunay_simplices = map(tuple, delaunay.simplices)
distdict = self.calculate_distmatrix(points, labels, distfcn)

simplices = []
for simplex in delaunay_simplices:
faces = list(facesiter(simplex))
detached = map(partial(contain_detachededges, distdict=distdict, epsilon=epsilon), faces)
if reduce(or_, detached):
if len(simplex)&gt;2:
for face, notkeep in zip(faces, detached):
if not notkeep:
simplices.append(face)
else:
simplices.append(simplex)
simplices = map(lambda simplex: tuple(sorted(simplex)), simplices)
simplices = list(set(simplices))

allpts = get_allpoints(simplices)
for point in (set(labels)-allpts):
simplices += [(point,)]

return simplices


The scipy package already has a package to calculate Delaunay region. The function contain_detachededges is for constructing the restricted Voronoi region from the calculated Delaunay region.

This class demonstrates how an Alpha Complex is constructed, but this runs slowly once the number of points gets big!

Vietoris-Rips (VR) Complex

Another commonly used complex is called the Vietoris-Rips (VR) Complex, which connects points as the edge of a graph if they are close enough. (Refer to Zomorodian’s review [Zomorodian 2011] or this Wikipedia page for details.) To implement this, import the famous networkx originally designed for network analysis.

import networkx as nx
from scipy.spatial import distance
from itertools import product

class VietorisRipsComplex(SimplicialComplex):
def __init__(self, points, epsilon, labels=None, distfcn=distance.euclidean):
self.pts = points
self.labels = range(len(self.pts)) if labels==None or len(labels)!=len(self.pts) else labels
self.epsilon = epsilon
self.distfcn = distfcn
self.network = self.construct_network(self.pts, self.labels, self.epsilon, self.distfcn)
self.import_simplices(map(tuple, list(nx.find_cliques(self.network))))

def construct_network(self, points, labels, epsilon, distfcn):
g = nx.Graph()
zips = zip(points, labels)
for pair in product(zips, zips):
if pair!=pair:
dist = distfcn(pair, pair)
if dist&lt;epsilon:
return g


The intuitiveness and efficiencies are the reasons that VR complexes are widely used.

For more details about the Alpha Complexes, VR Complexes and the related Čech Complexes, refer to this page.

More…

There are other commonly used complexes used, including Witness Complex, Cubical Complex etc., which I leave no introductions. Upon building the complexes, we can analyze the topology by calculating their homology groups, Betti numbers, the persistent homology etc. I wish to write more about it soon. 