Quantum computing has been picking up momentum, and many startups and scholars are discussing quantum machine learning. A basic knowledge of quantum two-level computation is worth acquiring.

Recently, Rigetti, a Bay Area startup offering quantum computing services, announced that they have opened their cloud server to the public, so that users can simulate programs written in their quantum instruction language, as described in their blog and their White Paper. It is free.

Go to their homepage, http://rigetti.com/, click on “Get Started,” and fill in your information and e-mail address. You will then be e-mailed the keys for your cloud account. Copy that information into a file named .pyquil_config, and add the following line to your .bash_profile:

export PYQUIL_CONFIG="$HOME/.pyquil_config"

More information can be found in their Installation tutorial. Then install the Python package pyquil by typing on the command line:

pip install -U pyquil

Some of you may need root privileges (add sudo in front).

Then we can go ahead and open Python, IPython, or a Jupyter notebook to play with it. For the time being, let me play with creating an entangled singlet state, $\frac{1}{\sqrt{2}} (|01\rangle - |10\rangle )$. The corresponding quantum circuit is like this:

First of all, import all necessary libraries:

import numpy as np

from pyquil.quil import Program
import pyquil.api as api
from pyquil.gates import H, X, Z, CNOT


You can see that the package includes a lot of quantum gates. First, we need to instantiate a quantum simulator:

# starting the quantum simulator
quantum_simulator = api.SyncConnection()


Then we implement the quantum circuit with a “program” as follows:

# generating singlet state
# 1. Hadamard gate
# 2. Pauli-Z
# 3. CNOT
# 4. NOT
p = Program(H(0), Z(0), CNOT(0, 1), X(1))
wavefunc, _ = quantum_simulator.wavefunction(p)


The last line gives the final wavefunction after running the quantum circuit, or “program.” In the ket, the rightmost qubit is qubit 0, the one to its left is qubit 1, and so on. Therefore, in the first gate of the program, H, the Hadamard gate, acts on qubit 0, i.e., the rightmost qubit. Running a simple print statement:

print wavefunc


gives

(-0.7071067812+0j)|01> + (0.7071067812+0j)|10>

The coefficients are complex, with the imaginary part denoted by j. You can extract them as a numpy array:

wavefunc.amplitudes


If we want to calculate metrics of entanglement, we can use the Python package pyqentangle, which can be installed by running on the console:

pip install -U pyqentangle

Import them:

from pyqentangle import schmidt_decomposition
from pyqentangle.schmidt import bipartitepurestate_reduceddensitymatrix
from pyqentangle.metrics import entanglement_entropy, negativity


Because pyqentangle does not interpret the coefficients in the same way as pyquil, but instead treats each element as the coefficient of $|j i \rangle$, we need to reshape the final state first:

tensorcomp = wavefunc.amplitudes.reshape((2, 2))


Then perform the Schmidt decomposition (the Schmidt modes are actually trivial in this example):

# Schmidt decomposition
schmidt_modes = schmidt_decomposition(tensorcomp)
for prob, modeA, modeB in schmidt_modes:
    print prob, ' : ', modeA, ' ', modeB


This outputs:

0.5  :  [ 0.+0.j  1.+0.j]   [ 1.+0.j  0.+0.j]
0.5  :  [-1.+0.j  0.+0.j]   [ 0.+0.j  1.+0.j]

Calculate the entanglement entropy and negativity from its reduced density matrix:

print 'Entanglement entropy = ', entanglement_entropy(bipartitepurestate_reduceddensitymatrix(tensorcomp, 0))
print 'Negativity = ', negativity(bipartitepurestate_reduceddensitymatrix(tensorcomp, 0))


which prints:

Entanglement entropy =  0.69314718056
Negativity =  -1.11022302463e-16

The calculation can be found in this thesis.

P.S.: The circuit was drawn using the tool on this website, introduced in Marco Cezero’s blog post. The corresponding JSON for the circuit is:

{"gate":[],{"gate":[], "circuit": [{"type":"h", "time":0, "targets":[0], "controls":[]},     {"type":"z", "time":1, "targets":[0], "controls":[]},     {"type":"x", "time":2, "targets":[1], "controls":[0]},     {"type":"x", "time":3, "targets":[1], "controls":[]}], "qubits":2,"input":[0,0]}

There are situations where we deal with short, possibly messy text without a lot of training data. In that case, we need external semantic information. Instead of using the conventional bag-of-words (BOW) model, we should employ word-embedding models such as Word2Vec, GloVe, etc.

Suppose we want to perform supervised learning, with three subjects, described by the following Python dictionary:

classdict={'mathematics': ['linear algebra',
'topology',
'algebra',
'calculus',
'variational calculus',
'functional field',
'real analysis',
'complex analysis',
'differential equation',
'statistics',
'statistical optimization',
'probability',
'stochastic calculus',
'numerical analysis',
'differential geometry'],
'physics': ['renormalization',
'classical mechanics',
'quantum mechanics',
'statistical mechanics',
'functional field',
'path integral',
'quantum field theory',
'electrodynamics',
'condensed matter',
'particle physics',
'topological solitons',
'astrophysics',
'spontaneous symmetry breaking',
'atomic molecular and optical physics',
'quantum chaos'],
'theology': ['divine providence',
'soteriology',
'anthropology',
'pneumatology',
'Christology',
'Holy Trinity',
'eschatology',
'scripture',
'ecclesiology',
'predestination',
'divine degree',
'creedal confessionalism',
'scholasticism',
'prayer',
'eucharist']}


And we implement Word2Vec here. To add external information, we use a pre-trained Word2Vec model from Google, downloaded here. We can use it with the Python package gensim. To load it, enter

from gensim.models import Word2Vec
wvmodel = Word2Vec.load_word2vec_format('<path-to>/GoogleNews-vectors-negative300.bin.gz', binary=True)
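
Loading the model takes a while and a few gigabytes of memory. Once loaded, each word in its vocabulary maps to a 300-dimensional vector, and we can do a quick sanity check (the words below are chosen just for illustration):

print wvmodel['physics'].shape                        # (300,), the dimension of the embedding
print 'salvation' in wvmodel                          # whether the word is in the vocabulary
print wvmodel.similarity('physics', 'mathematics')    # cosine similarity between the two word vectors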


How do we represent a phrase in Word2Vec? How do we do the classification? Here I wrote two classes to do it.

#### Average

We can represent a sentence by summing the word-embedding representations of its words. The class, inside SumWord2VecClassification.py, is coded as follows:

from collections import defaultdict

import numpy as np
from nltk import word_tokenize
from scipy.spatial.distance import cosine

from utils import ModelNotTrainedException

class SumEmbeddedVecClassifier:
    def __init__(self, wvmodel, classdict, vecsize=300):
        self.wvmodel = wvmodel
        self.classdict = classdict
        self.vecsize = vecsize
        self.trained = False

    def train(self):
        self.addvec = defaultdict(lambda : np.zeros(self.vecsize))
        for classtype in self.classdict:
            for shorttext in self.classdict[classtype]:
                self.addvec[classtype] += self.shorttext_to_embedvec(shorttext)
            self.addvec[classtype] /= np.linalg.norm(self.addvec[classtype])
        self.addvec = dict(self.addvec)
        self.trained = True

    def shorttext_to_embedvec(self, shorttext):
        vec = np.zeros(self.vecsize)
        tokens = word_tokenize(shorttext)
        for token in tokens:
            if token in self.wvmodel:
                vec += self.wvmodel[token]
        norm = np.linalg.norm(vec)
        if norm != 0:
            vec /= np.linalg.norm(vec)
        return vec

    def score(self, shorttext):
        if not self.trained:
            raise ModelNotTrainedException()
        vec = self.shorttext_to_embedvec(shorttext)
        scoredict = {}
        for classtype in self.addvec:
            try:
                scoredict[classtype] = 1 - cosine(vec, self.addvec[classtype])
            except ValueError:
                scoredict[classtype] = np.nan
        return scoredict


Here ModelNotTrainedException is just an exception raised if the scoring function is called before the model has been trained. (The code is listed in my GitHub repository.) The similarity is calculated by cosine similarity.

Such an implementation is easy to understand and carry out. It is good enough for a lot of applications. However, it has the problem that it does not take relations between words or word order into account.

#### Convolutional Neural Network

To tackle the problem of word relations, we have to use deeper neural networks. Yoon Kim published a well-cited paper on this at EMNLP in 2014, titled “Convolutional Neural Networks for Sentence Classification.” The model architecture is as follows (taken from his paper):

Each word is represented by an embedding vector, and neighboring words are related through the convolutional matrix. MaxPooling and a dense neural network are applied afterwards. His paper involves multiple filters with variable window sizes / spatial extents, but for our case of short phrases, I just use one window of size 2 (similar to dealing with bigrams). While Kim implemented it in Theano (see his GitHub repository), I implemented it in keras with a Theano backend. The code, inside CNNEmbedVecClassification.py, is as follows:

import numpy as np
from keras.layers import Convolution1D, MaxPooling1D, Flatten, Dense
from keras.models import Sequential
from nltk import word_tokenize

from utils import ModelNotTrainedException

class CNNEmbeddedVecClassifier:
    def __init__(self,
                 wvmodel,
                 classdict,
                 n_gram,
                 vecsize=300,
                 nb_filters=1200,
                 maxlen=15):
        self.wvmodel = wvmodel
        self.classdict = classdict
        self.n_gram = n_gram
        self.vecsize = vecsize
        self.nb_filters = nb_filters
        self.maxlen = maxlen
        self.trained = False

    def convert_trainingdata_matrix(self):
        classlabels = self.classdict.keys()
        lblidx_dict = dict(zip(classlabels, range(len(classlabels))))

        # tokenize the words, and determine the word length
        phrases = []
        indices = []
        for label in classlabels:
            for shorttext in self.classdict[label]:
                category_bucket = [0]*len(classlabels)
                category_bucket[lblidx_dict[label]] = 1
                indices.append(category_bucket)
                phrases.append(word_tokenize(shorttext))

        # store embedded vectors
        train_embedvec = np.zeros(shape=(len(phrases), self.maxlen, self.vecsize))
        for i in range(len(phrases)):
            for j in range(min(self.maxlen, len(phrases[i]))):
                train_embedvec[i, j] = self.word_to_embedvec(phrases[i][j])
        indices = np.array(indices, dtype=np.int)

        return classlabels, train_embedvec, indices

    def train(self):
        # convert classdict to training input vectors
        self.classlabels, train_embedvec, indices = self.convert_trainingdata_matrix()

        # build the deep neural network model
        model = Sequential()
        model.add(Convolution1D(nb_filter=self.nb_filters,
                                filter_length=self.n_gram,
                                border_mode='valid',
                                activation='relu',
                                input_shape=(self.maxlen, self.vecsize)))
        model.add(MaxPooling1D(pool_length=self.maxlen-self.n_gram+1))
        model.add(Flatten())
        model.add(Dense(len(self.classlabels), activation='softmax'))
        model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

        # train the model
        model.fit(train_embedvec, indices)

        # flag switch
        self.model = model
        self.trained = True

    def word_to_embedvec(self, word):
        return self.wvmodel[word] if word in self.wvmodel else np.zeros(self.vecsize)

    def shorttext_to_matrix(self, shorttext):
        tokens = word_tokenize(shorttext)
        matrix = np.zeros((self.maxlen, self.vecsize))
        for i in range(min(self.maxlen, len(tokens))):
            matrix[i] = self.word_to_embedvec(tokens[i])
        return matrix

    def score(self, shorttext):
        if not self.trained:
            raise ModelNotTrainedException()

        # retrieve vector
        matrix = np.array([self.shorttext_to_matrix(shorttext)])

        # classification using the neural network
        predictions = self.model.predict(matrix)

        # wrangle output result
        scoredict = {}
        for idx, classlabel in zip(range(len(self.classlabels)), self.classlabels):
            scoredict[classlabel] = predictions[0][idx]
        return scoredict


The output is a vector whose length equals the number of class labels, 3 in our example. The elements of the output vector add up to one, so the scores can be interpreted as probabilities.
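
To make the usage concrete, here is a minimal sketch of how the two classifiers above would be trained and queried, assuming wvmodel and classdict have been loaded as above and the two class files are importable:

from SumWord2VecClassification import SumEmbeddedVecClassifier
from CNNEmbedVecClassification import CNNEmbeddedVecClassifier

# train the average-embedding classifier
avg_classifier = SumEmbeddedVecClassifier(wvmodel, classdict, vecsize=300)
avg_classifier.train()

# train the CNN classifier with a window of size 2
cnn_classifier = CNNEmbeddedVecClassifier(wvmodel, classdict, n_gram=2, vecsize=300)
cnn_classifier.train()

# score a new short phrase with both classifiers
print avg_classifier.score('renormalization')
print cnn_classifier.score('renormalization')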

#### Evaluation

A simple cross-validation on the example data set does not reveal a difference between the two algorithms:

However, we can test the algorithm with a few examples:

Example 1: “renormalization”

• Average: {'mathematics': 0.54135105096749336, 'physics': 0.63665460856632494, 'theology': 0.31014049736087901}
• CNN: {'mathematics': 0.093827009201049805, 'physics': 0.85451591014862061, 'theology': 0.051657050848007202}

As “renormalization” appears verbatim in the training data, this is an easy case. The CNN distinguishes the classes much more sharply.

Example 2: “salvation”

• Average: {'mathematics': 0.14939650156482298, 'physics': 0.21692765541184023, 'theology': 0.5698233329716329}
• CNN: {'mathematics': 0.012395491823554039, 'physics': 0.022725773975253105, 'theology': 0.96487873792648315}

“Salvation” is not found in the training data, but it is closely related to “soteriology,” which means the doctrine of salvation. So it correctly identifies it with theology.

Example 3: “coffee”

• Average: {'mathematics': 0.096820211601723272, 'physics': 0.081567332119268032, 'theology': 0.15962682945135631}
• CNN: {'mathematics': 0.27321341633796692, 'physics': 0.1950736939907074, 'theology': 0.53171288967132568}

Coffee is not related to any of the subjects. The averaging approach correctly indicates this with low scores across the board, but the CNN, whose outputs must sum to one, has to distribute the probability among the classes (and does so rather unevenly).

The code can be found in my Github repository: stephenhky/PyShortTextCategorization. (This repository has been updated since this article was published. The link shows the version of the code when this appeared online.)

Ever since Mehta and Schwab laid out the relationship between restricted Boltzmann machines (RBM) and deep learning mathematically (see my previous entry), scientists have been discussing why deep learning works so well. Recently, Henry Lin and Max Tegmark put a preprint on arXiv (arXiv:1609.09225), arguing that deep learning works because it captures a few essential physical laws and properties. Tegmark is a cosmologist.

Physical laws are simple in the sense that a few properties, such as locality, symmetry, and hierarchy, lead to large-scale, universal, and often complex phenomena. A lot of machine learning algorithms, including deep learning algorithms, have deep relations with formalisms outlined in statistical mechanics.

A lot of machine learning algorithms are basically probability theory. They outlined a few types of algorithms that seek various types of probabilities. They related the probabilities to Hamiltonians in many-body systems.

They argued why neural networks can approximate functions (polynomials) so well, giving a simple neural network that performs multiplication. With the central limit theorem or Jaynes’ arguments (see my previous entry), they said, a lot of such Hamiltonians can be approximated by low-order polynomials. This is like many many-body systems that can be approximated by a fourth-order Landau-Ginzburg-Wilson (LGW) functional.

Properties such as locality reduce the number of hyper-parameters needed because they restrict interactions to close proximity. Symmetry reduces it further, along with the computational complexity. Symmetry and second-order phase transitions make the scaling hypothesis possible, leading to the use of tools such as the renormalization group (RG). As many people have been arguing, deep learning resembles RG because it filters out unnecessary information and maps out the crucial features. Lin and Tegmark use classifying cats vs. dogs as an example, analogous to retrieving the temperature of a many-body system through an RG procedure. They gave a counter-example to Schwab’s paper in which the probabilities cannot be preserved by the RG procedure; while it is sound, it is not the point of the RG procedure anyway.

They also discussed the no-flattening theorems for neural networks.

A lot of Americans are addicted to coffee. I have one cup a day. I know someone who has a few cups every day. Weirdly enough, I know someone who needs coffee to go to bed.

Coffee is for sipping, but we do want to enjoy it as soon as possible at the right temperature. Whether we add sugar cubes at the beginning or at the end can make a lot of difference. Empirically, coffee cooling follows Newton’s law of cooling:

$\frac{dT}{dt} = -k (T-T_r)$,

where $T_r$ is the room temperature and $k$ is the cooling parameter. It is a simple ordinary differential equation (ODE) with an analytical solution. However, let’s solve it numerically while demonstrating NumPy/SciPy capabilities. The function we are using is odeint. Let’s import the necessary packages:

import numpy as np
from scipy.integrate import odeint
from functools import partial


We will need the partial function from functools shortly. Let’s define the right-hand-side of the ODE above as the cooling function in terms of a lambda expression:

coolingfcn = lambda temp, t0, k, roomtemp: -k*(temp-roomtemp)


And we need a set of cooling parameters that makes sense from our daily experience:

cooling_parameters = {'init_temp': 94.0,     # initial temperature of the coffee (degrees Celsius)
                      'cube_dectemp': 5.0,   # temperature decrease due to the cube
                      'roomtemp': 25.0,      # room temperature
                      'k': 1.0/15.0          # cooling parameter (inverse minutes)
                      }


The freshly brewed coffee is at 94 degrees Celsius. The room temperature is 25 degrees Celsius. Whenever a whole cube is added to the coffee, its temperature drops by 5 degrees Celsius, which is not a good assumption when the coffee has totally cooled down to room temperature. The cooling parameter is $k=\frac{1}{15} \text{min}^{-1}$, meaning that in 15 minutes the temperature difference from the environment shrinks by a factor of $e$.
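
For reference, the analytical solution of the cooling equation is $T(t) = T_r + (T(0) - T_r) e^{-kt}$, which makes the factor-of-$e$ statement explicit; the numerical curves below can be checked against it.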

From the SciPy documentation of odeint, the first three parameters are important:

scipy.integrate.odeint(func, y0, t, args=(), Dfun=None, col_deriv=0, full_output=0, ml=None, mu=None, rtol=None, atol=None, tcrit=None, h0=0.0, hmax=0.0, hmin=0.0, ixpr=0, mxstep=0, mxhnil=0, mxordn=12, mxords=5, printmessg=0)

The first parameter is the cooling function, the second is the initial value of the dependent variable (the temperature in this case), and the third is the ndarray of times at which the temperatures are to be calculated. We can easily write functions that return the ndarray of temperatures for the cases of adding the sugar cube before and after the cooling process, respectively:

# adding sugar cube before cooling
def cube_before_cooling(t, init_temp, cube_dectemp, roomtemp, k):
    y0 = init_temp - cube_dectemp
    temps = odeint(partial(coolingfcn, k=k, roomtemp=roomtemp), y0, t)
    return temps

# adding sugar cube after cooling
def cube_after_cooling(t, init_temp, cube_dectemp, roomtemp, k):
    temps = odeint(partial(coolingfcn, k=k, roomtemp=roomtemp), init_temp, t)
    temps[-1] -= cube_dectemp
    return temps


We can plot the temperature changes on graphs using matplotlib.

import matplotlib.pyplot as plt


And we are interested in the first 20 minutes:

times = np.linspace(0, 20, 201)


For adding a sugar cube before cooling,

temps_beforecooling = cube_before_cooling(times, **cooling_parameters)
plt.plot(times, temps_beforecooling)
plt.xlabel('Time (min)')
plt.ylabel('Coffee Temperature (degrees Celsius)')


(If you are typing in an IPython shell, you need to call plt.show() to show the plot.) This gives the following plot:

And for adding the sugar cube after cooling,

temps_aftercooling = cube_after_cooling(times, **cooling_parameters)
plt.plot(times, temps_aftercooling)
plt.xlabel('Time (min)')
plt.ylabel('Coffee Temperature (degrees Celsius)')


This gives the following plot:

Obviously, adding the sugar cube after cooling gives a lower temperature.
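
This agrees with the analytical solution: with the cube added first, $T_{\text{before}}(t) = T_r + (T_0 - \Delta T - T_r) e^{-kt}$, while with the cube added at time $t$, $T_{\text{after}}(t) = T_r + (T_0 - T_r) e^{-kt} - \Delta T$, so the difference $T_{\text{before}}(t) - T_{\text{after}}(t) = \Delta T (1 - e^{-kt})$ is positive for any $t > 0$.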

#### Advanced Gimmicks

What if we add the sugar continuously throughout the 20-minute time span? We need to adjust our differential equation to be:

$\frac{dT}{dt} = -k (T-T_r) - \frac{T_{\text{drop}}}{\delta t}$,

where $T_{\text{drop}}$ is the total temperature drop due to the sugar and $\delta t$ is the total time span, which can be implemented by the following code:

# adding sugar continuously
def continuous_sugar_cooling(t, init_temp, cube_dectemp, roomtemp, k):
    timespan = t[-1] - t[0]
    newcoolingfcn = lambda temp, t0: coolingfcn(temp, t0, k, roomtemp) - cube_dectemp / timespan
    temps = odeint(newcoolingfcn, init_temp, t)
    return temps


Or we can divide the cube into portions added to the cup at a few specified times, which can be implemented by the following code that employs our previously defined functions:

# adding sugar at a specified time(s)
def sugar_specifiedtime_cooling(t, sugar_time_indices, init_temp, cube_dectemp, roomtemp, k):
    sorted_sugar_time_indices = np.sort(sugar_time_indices)
    num_portions = len(sorted_sugar_time_indices)

    temps = np.array([])
    temp = init_temp
    t_segments = np.split(t, sorted_sugar_time_indices)
    num_segments = len(t_segments)
    for i in range(num_segments):
        temp_segment = cube_after_cooling(t_segments[i],
                                          temp,
                                          float(cube_dectemp)/num_portions,
                                          roomtemp,
                                          k)
        temps = np.append(temps, temp_segment)
        temp = temp_segment[-1]
    temps[-1] += float(cube_dectemp)/num_portions

    return temps


Let’s calculate all the temperatures:

temps_cont = continuous_sugar_cooling(times, **cooling_parameters)
temps_n = sugar_specifiedtime_cooling(times, [50, 100, 150], **cooling_parameters)


And plot all the graphs together:

plt.plot(times, temps_beforecooling, label='cube before cooling')
plt.plot(times, temps_aftercooling, label='cube after cooling')
plt.plot(times, temps_cont, label='continuous')
plt.plot(times, temps_n, label='3 times of dropping cube')
plt.xlabel('Time (min)')
plt.ylabel('Coffee Temperature (degrees Celsius)')
plt.legend()


A complete code demonstration can be found on my GitHub (GitHub/stephenhky/CoffeeCooling).

Justin Trudeau, the Canadian Prime Minister, rocked the world recently by talking about quantum computers in front of reporters at Perimeter Institute…

No fault can be found in his answer, although it could not be more layman-level. A few physicists were challenged to give a layman’s introduction to quantum computers, and they did about equally well. (See MacLean’s article.)

Research on quantum computers has been going on for decades, and it is only recently that some commercial products have become available on the market.

### Qubits and Quantum Entanglement

Trudeau is right that in a quantum computer, a bit, called a qubit, is more than just 0 or 1, but a complicated combination of them. It can be characterized by a quantum state $|\psi \rangle = \alpha |0\rangle + \beta |1\rangle$, where $\alpha$ and $\beta$ are complex numbers. Such a state can be represented geometrically on a Bloch sphere, where the possible states are infinitely richer than simply up and down.
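
As a small numerical illustration (the amplitudes below are arbitrary), a qubit state can be stored as two complex numbers and mapped to the polar and azimuthal angles of the Bloch sphere:

import numpy as np

# an arbitrary, illustrative qubit state alpha|0> + beta|1>
alpha = 1.0/np.sqrt(3)
beta = np.sqrt(2.0/3)*np.exp(1j*np.pi/4)

print np.abs(alpha)**2 + np.abs(beta)**2   # normalization check, should be 1

# Bloch sphere angles: theta is the polar angle, phi the azimuthal angle
theta = 2*np.arccos(np.abs(alpha))
phi = np.angle(beta) - np.angle(alpha)
print theta, phi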

Besides qubits, quantum entanglement makes quantum computing unique, because entanglement by its nature allows parts of an algorithm to be evaluated in parallel. Quantum entanglement concerns the correlation between two qubits such that the quantum state of the pair cannot be separated into two independent quantum states, one for each qubit. There are many ways to quantify the entanglement of a bipartite state (consisting of two qubits), such as the entanglement entropy (similar to the Shannon entropy, see this, where the probabilities come from the reduced density matrix calculated by Schmidt decomposition), and the negativity. [Peres, 1996]
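
For the singlet state simulated with pyquil in the first section, the Schmidt decomposition gives two modes with equal probability $p_1 = p_2 = \frac{1}{2}$, so the entanglement entropy is $S = -\sum_i p_i \ln p_i = \ln 2 \approx 0.693$, exactly the value printed by pyqentangle above; this is the maximal value for a pair of qubits.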

The qubits and entanglement make quantum algorithms possible. It is a big research field, bordering physics and computer science. See “Quantum Algorithm Zoo” cataloged by NIST for them. Shor’s algorithm and Grover’s algorithm are some of the famous ones.

There are, in general, two types of quantum computers, namely the universal quantum computer (UQC) and the non-universal adiabatic quantum computer (AQC).

### Universal Quantum Computer (UQC)

The UQC, a quantum computer that exploits quantum properties to perform computations that are classically infeasible, has been studied for decades in academia. Besides the quantum algorithms, the hardware realization has always been one of the hot topics of scientific research. Although much time and money has been invested, it is still far from industrial and commercial use.

The first UQC was made at MIT in 1998 by Isaac Chuang and his colleagues, using NMR techniques. [Chuang et. al., 1998] There are also other experimental realizations of qubits, such as optical cavities, Josephson junctions, quantum dots, nanomechanical resonators, etc. The representation depends on the physical system.

If this is successfully realized for industrial use, it will speed up a lot of computations by fully applying quantum algorithms.

### Adiabatic Quantum Computer (AQC)

The AQC does not fully exploit quantum properties. While there are qubits, the quantum algorithms may not be fully applicable. Instead, it partially exploits quantum properties to accelerate current algorithms. It takes advantage of the adiabatic theorem, which states that a quantum system subject to a slow perturbation remains in its instantaneous eigenstate. This means that if the quantum system starts in the ground state, it stays in the ground state as the perturbation due to some external field is slowly applied. This process possibly involves quantum tunneling. AQC is actually a type of quantum annealing.

There are already some commercial applications of AQC. D-Wave Systems made the world’s first AQC, and a number of big companies such as Lockheed Martin and Google have purchased them. D-Wave systems are based on Josephson junctions. They are programmable, and are most suited to optimization problems. One can write one’s own optimization problem in terms of the Ising model:

$H = \sum_i h_i x_i + \sum_{\langle i, j \rangle} J_{ij} x_i x_j$,

The system starts in the ground state of an initial Hamiltonian, and the interactions are slowly changed into this Hamiltonian. The final state will then be the ground state, or the optimized solution, of the problem.
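
To make the type of problem concrete, here is a minimal classical sketch (not D-Wave’s actual programming interface) that brute-forces the ground state of a tiny Ising instance of the above form; the fields and couplings are arbitrary illustrative numbers:

import itertools
import numpy as np

# an arbitrary small Ising instance: fields h_i and couplings J_ij
h = np.array([0.5, -0.2, 0.3])
J = {(0, 1): -1.0, (1, 2): 0.7, (0, 2): 0.4}

def ising_energy(x):
    # x is a tuple of spins, each +1 or -1
    energy = np.dot(h, x)
    for (i, j), coupling in J.items():
        energy += coupling * x[i] * x[j]
    return energy

# exhaustive search over all spin configurations; this is only feasible for tiny
# systems, and an annealer is meant to find such ground states for much larger ones
configs = list(itertools.product([-1, 1], repeat=len(h)))
groundstate = min(configs, key=ising_energy)
print groundstate, ising_energy(groundstate)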

We can immediately see that, in the era of big data, this is very useful, as many machine learning problems are optimization problems. For example, QxBranch, a startup based in Washington, DC, developed tools using D-Wave systems to perform some machine learning algorithms.

### Topological Quantum Computer

In some sense, an AQC is only a partial quantum computer, which does not employ full quantum properties. However, a UQC is hard to achieve because of quantum decoherence. A state can exist for only a very short time. There exist ways to avoid this, for example, optimal dynamic decoupling. [Fu et. al., 2009] Another fascinating idea to overcome this is topological quantum computing.

A topological quantum computer is a theoretical quantum computer based on anyons, two-dimensional quasi-particles whose world lines cross over one another in a three-dimensional world. It was first introduced in 1997 by Alexei Kitaev. The idea is to exploit the topology of anyons to preserve the state, because it is extremely costly to change the topology of a system. In 2005, Das Sarma, Freedman, and Nayak proposed using a fractional quantum Hall system to achieve this. [Das Sarma et. al., 2006] Ideas involving Majorana fermions have been proposed too. All in all, topological quantum computing is still a theoretical idea.

Topology has been shown to reveal important information about geometry and shape from data, [Carlsson 2015][Carlsson 2009] as I have talked about in various TDA blog entries. I have also demonstrated how to describe the topology of discrete data points by constructing simplicial complexes, and then calculating the homology and Betti numbers. (I will talk about persistent homology in the future.) When dealing with discrete data points on our digital storage devices, homology is the best way to describe the topology.

But if you come from a physics background, you may be more familiar with the concepts of homotopy and the fundamental group. Some physicists deal with topology without digging into advanced mathematical tools, simply through solitons. There is a well-written introduction in this blog. In the physical world, an object is said to be topological if:

• there is a singular point that cannot be removed by a continuous deformation of field; [Mermin 1979]
• it has a saddle-point equation of the model that is different from another object of another topology, [Rajaraman 1987] inducing different kinds of physical dynamics; [Bray 1994]
• it can only be removed by crossing an energy barrier, which can be described by an instanton; [Calzetta, Ho, Hu 2010]
• it can proliferate through a Kosterlitz-Thouless (BKT) phase transition; [Kosterlitz, Thouless 1973]
• it can form in a system through a second-order phase transition at a finite rate, a process known as Kibble-Zurek mechanism (KZM); [Kibble 1976] [Zurek 1985] and
• its topology can be described by a winding number. (c.f. Betti numbers in homology)

Topological objects include vortices in magnets, superfluids, and superconductors, and Skyrmions in helimagnets. [Mühlbauer et. al. 2009] [Ho et. al. 2010] They may come in honeycomb order, like Abrikosov vortices in type-II superconductors [Abrikosov 1957] and helical nanofilaments in smectics. [Matsumoto et. al. 2009] Topology is also widely used in the fractional quantum Hall effect [Tsui et. al. 1982] and topological insulators (a lot of references can be found…). They can all be described using homotopy and winding numbers. We can see that topology is useful for describing the complexities and patterns of the physical world. There are ideas in string-net theory to use topology to describe the emergence of patterns and new phases of quantum matter. [Zeng et. al. 2015] Of course, I must not omit topological quantum computing, which makes the qubits immune to environmental noise. [Das Sarma, Freedman, Nayak 2005]

However, in data analytics, we do not use homotopy, despite its beauty and usefulness in the physical world. Here are some of the reasons:

• In using homotopy, sometimes it takes decades for a lot of brains to figure out which homotopy groups to use. But in data analysis, we want to grasp the topology simply from data.
• Homotopy deals with continuous mappings, but data are discrete. Simplicial homology captures it more easily.
• In a physical system, we deal with usually one type of homotopy groups. But in data, we often deal with various topologies which we are not aware of in advance. Betti numbers can describe the topology easily by looking at data.
• Of course, homotopy is difficult to compute numerically.

Afra Zomorodian argued the use of homology over homotopy in his book as well. [Zomorodian 2009]

In my very first introductory college physics class, I was told that a good scientific model has general descriptive power. While that is true, I found it an oversimplified statement after I was exposed to computational science and other fields outside physics.

Descriptive power is important, as in the Schelling model in economics; but to many scientists and engineers, a model is devised for its predictive power, which can be seen as one aspect of its descriptive power. Predictive power is a useful feature. All physics and engineering models have predictive power in a quantitative sense.

Physics models have to be descriptive in the sense that they describe physical things; but in the big data era, a lot of machine learning models are like black boxes, meaning that we care only about the meaning of the inputs and outputs, while the contents of the models do not necessarily carry a meaning. SVM and deep learning are good examples. (This in fact bothers some people.) Of course, in physics there are also many phenomenological models that sit between descriptive models and black boxes, such as the Ginzburg-Landau-Wilson (GLW) model widely used in magnets, superfluids, helimagnets, superconductors, liquid crystals, etc. However, to be fair, some machine learning models are quite descriptive, such as clustering, Gaussian mixtures, MaxEnt, etc.

A lot of traditional physics models are equation-based, but computational models are not. The reasons are evident. And of course, traditional physics models are mostly continuous, while computational models are sometimes discrete. If the computational models are continuous and equation-based, they have to be translated to a discrete version for the computing machines to handle correctly.

However, whichever forms the models take, we, human beings, are essential to give the models meaning so that they are useful.

Deep learning, a collection of related neural network algorithms, has proved successful in certain types of machine learning tasks in computer vision, speech recognition, data cleaning, and natural language processing (NLP). [Mikolov et. al. 2013] However, it has been unclear why deep learning can be so successful. It looks like a black box with messy inputs and excellent outputs. So why is it so successful?

A friend of mine showed me this preprint (arXiv:1410.3831) [Mehta & Schwab 2014] last year, which mathematically shows the equivalence of deep learning and the renormalization group (RG). RG is a concept in theoretical physics that has been widely applied to different problems, including critical phenomena, self-organized criticality, particle physics, polymer physics, and strongly correlated electronic systems. And now, Mehta and Schwab showed that an explanation of the performance of deep learning is available through RG.

So what is RG? Before RG, Leo Kadanoff, a physics professor at the University of Chicago, proposed the idea of coarse-graining for studying many-body problems in 1966. [Kadanoff 1966] In 1972, Kenneth Wilson and Michael Fisher succeeded in applying the ɛ-expansion in perturbative RG to explain the critical exponents of the Landau-Ginzburg-Wilson (LGW) Hamiltonian. [Wilson & Fisher 1972] This work has become standard material in graduate physics courses. In 1974, Kenneth Wilson applied RG to explain the Kondo problem, which led to his Nobel Prize in Physics in 1982. [Wilson 1983]

RG assumes scale invariance, which means the system looks similar at whatever scale you view it. One example is the chaotic system in Fig. 1. The system looks the same when you zoom in. We call such a scale-invariant system self-similar. Physical systems close to a phase transition are self-similar. And if a system is self-similar, Kadanoff’s idea of coarse-graining is applicable, as in Fig. 2: four spins can be viewed as one spin that “summarizes” the four spins in that block without changing the description of the physical system. This is somewhat like “zooming out” a picture in Photoshop or a web browser.

[Taken from [Singh 2014]]
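
A minimal numerical toy of this block-spin idea (with a majority rule as the coarse-graining map and a randomly drawn lattice just for illustration) looks like this:

import numpy as np

# a toy 8x8 lattice of Ising spins (+1 or -1), randomly drawn for illustration
np.random.seed(0)
spins = np.random.choice([-1, 1], size=(8, 8))

def block_spin(lattice):
    # Kadanoff-style coarse-graining: replace each 2x2 block by the sign of its sum
    # (majority rule; ties are broken towards +1, an arbitrary convention)
    nrows, ncols = lattice.shape
    blocksums = lattice.reshape(nrows//2, 2, ncols//2, 2).sum(axis=(1, 3))
    return np.where(blocksums >= 0, 1, -1)

coarse = block_spin(spins)      # 8x8 -> 4x4
coarser = block_spin(coarse)    # 4x4 -> 2x2
print coarser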

So what’s the point of zooming out? Physicists care about the Helmholtz free energies of physical systems, which are similar to the cost functions of computer scientists and machine learning specialists. Both are to be minimized. However, at whatever scale we view it, the energy of the system should be scale-invariant. Therefore, as we zoom out, the system “changes” yet “looks the same” due to self-similarity, but the energy stays the same. The form of the model is unchanged, but the parameters change as the scale changes.

This is important, because this process tells us which parameters are relevant, and which others are irrelevant. Why? Think of it this way: we have an awesome computer to simulate a glass of water that contains $10^{23}$ water molecules. To describe the system, you have all the parameters, including the positions of the molecules, the strength of the Van der Waals force, the orbital angular momentum of each atom, the strength of the covalent bonds, the velocities of the molecules… You might have $10^{25}$ parameters. However, this awesome computer cannot handle such a system with so many parameters. Then you try to coarse-grain the system, and you discard some parameters in each step of coarse-graining. After numerous steps, it turns out that the temperature and the pressure are the only relevant parameters.

RG helps you identify the relevant parameters.

And this is exactly what happens in deep learning. In each convolutional cycle, features that are not important are gradually discarded, and those that are important are kept and enhanced. Indeed, in computer vision and NLP, the data are so noisy that there is a lot of unnecessary information. Deep learning gradually discards this information. As Mehta and Schwab stated, [Mehta & Schwab 2014]

Our results suggests that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.

So what is the point of understanding this? Unlike with other machine learning algorithms, we did not know how deep learning works, which sometimes makes model building very difficult because we have no idea how to adjust the parameters. I believe understanding its equivalence to RG helps guide us in building a model that works.

Charles Martin also wrote a blog entry with more demonstration about the equivalence of deep learning and RG. [Martin 2015]

Taken from the movie “A Beautiful Mind”

John Nash’s death on May 23, 2015 on the New Jersey Turnpike was a tragedy. However, his contributions to mathematics and economics are everlasting. His contribution to game theory led to his sharing the 1994 Nobel Memorial Prize in Economic Sciences.

Coincidentally, three weeks before his accidental death, an econophysics paper appeared that employed his idea of the Nash equilibrium. Econophysics has been an interdisciplinary quantitative field since the 1990s. Victor Yakovenko, a physics professor at the University of Maryland, applied the techniques of classical statistical mechanics and concluded that the wealth of the bottom 95% of the population follows a Boltzmann-Gibbs exponential distribution, while that of the top follows a Pareto distribution. [Dragulescu & Yakovenko 2000] This approach assumes agents to have nearly “zero intelligence” and to behave randomly with no intent or purpose, contrary to the conventional assumption in economics that agents are perfectly rational, aiming to maximize utility or profit.

This paper, written by Venkat Venkatasubramanian, describes an approach aimed at reconciling econophysics and conventional economics using ideas from game theory. [Venkatasubramanian, Luo & Sethuraman 2015] Like statistical mechanics, it treats the agents as particles. Money plays the role of energy, just as in other econophysics theories. The equilibrium state is the state of maximum entropy. However, it adds the game-theoretic notion that the agents are intelligent and are in a game, unlike molecules in traditional statistical mechanics. The equilibrium state is not simply the maximum-entropy state, but also a Nash equilibrium. This reconciles econophysics and conventional economics. It further argues that, unlike equilibrium in thermodynamics, which is probabilistic in nature, this economic equilibrium is deterministic. And the expected distribution is log-normal. (This log-normal distribution is hard to fit, which is another obstacle to economists accepting a physical approach to economics.)

With this framework, Venkatasubramanian discussed income inequality. Income inequality has aroused debate in recent years, especially after the detrimental financial crisis of 2008. Is capitalism not working now? Does capitalism produce unfairness? He connected entropy with the concept of fairness, or the fairest inequality. The state with maximum entropy is the fairest state, and, of course, the wealth distribution is the log-normal distribution. His study showed that: [http://phys.org/news/2015-05-fair-theory-income-inequality.html]

“Scandinavian countries and, to a lesser extent, Switzerland, Netherlands, and Australia have managed, in practice, to get close to the ideal distribution for the bottom 99% of the population, while the U.S. and U.K. remain less fair at the other extreme. Other European countries such as France and Germany, and Japan and Canada, are in the middle.”

See the figure at the end of this post for the discrepancy between the economies of a few countries and the maximum-entropy state, or ideality. And [Venkatasubramanian, Luo & Sethuraman 2015]

“Even the US economy operated a lot closer to ideality, during ∼1945–75, than it does now. It is important to emphasize that in those three decades US performed extremely well economically, dominating the global economy in almost every sector.”

They even argued that these insights in economics might shed light on traditional statistical thermodynamics.

I have to say that I love this work because it not only explains a real-world problem, but also links physics and economics in a beautiful way.