Almost three years ago, I wrote a blog entry titled Useful Python Packages, which listed the essential packages that I deemed important. How has the list been changed over the past three years?

First of all, three years ago, most people were still writing Python 2.7. But now there is a trend to switch to Python 3. I admitted that I still have not started the switch yet, but in the short term, I will have no choice and I will.

What are some of the essential packages?
Numerical Packages

• numpy: numerical Python, containing most basic numerical routines such as matrix manipulation, linear algebra, random sampling, numerical integration etc. There is a built-in wrapper for Fortran as well. Actually, numpy is so important that some Linux system includes it with Python.
• scipy: scientific Python, containing some functions useful for scientific computing, such as sparse matrices, numerical differential equations, advanced linear algebra, special functions etc.
• networkx: package that handles various types of networks
• PuLP: linear programming
• cvxopt: convex optimization

Data Visualization

• matplotlib: basic plotting.
• ggplot2: the ggplot2 counterpart in Python for producing quality publication plots.

Data Manipulation

• pandas: data manipulation, working with data frames in Python, and save/load of various formats such as CSV and Excel

Machine Learning

• scikit-learn: machine-learning library in Python, containing classes and functions for supervised and unsupervised learning

Probabilistic Programming

• PyMC: Metropolis-Hasting algorithm
• Edward: deep probabilistic programing

Deep Learning Frameworks

• TensorFlow: because of Google’s marketing effort, TensorFlow is now the industrial standard for building deep learning networks, with rich source of mathematical functions, esp. for neural network cells, with GPU capability
• Keras: containing routines of high-level layers for deep learning neural networks, with TensorFlow, Theano, or CNTK as the backbone
• PyTorch: a rivalry against TensorFlow

Natural Language Processing

• nltk: natural language processing toolkit for Python, containing bag-of-words model, tokenizer, stemmers, chunker, lemmatizers, part-of-speech taggers etc.
• gensim: a useful natural language processing package useful for topic modeling, word-embedding, latent semantic indexing etc., running in a fast fashion
• shorttext: text mining package good for handling short sentences, that provide high-level routines for training neural network classifiers, or generating feature represented by topic models or autoencodings.
• spacy: industrial standard for natural language processing common tools

GUI

I can probably list more, but I think I covered most of them. If you do not find something useful, it is probably time for you to write a brand new package.

A lot of Americans are addicted to coffee. I have one cup a day. I know someone has a few cups every day. Weird enough, I know someone who needs coffee to go to bed.

Coffee is for sipping, but we do want to enjoy it as soon as possible at a right temperature. Whether we add sugar cubes at the beginning or the end can make a lot of difference. In general, coffee cooling follows empirically the Newton’s law of cooling:

$\frac{dT}{dt} = -k (T-T_r)$,

where $T_r$ is the room temperature, and $k$ is the cooling parameter. It is a simple ordinary differential equation (ODE) that bears analytical solution. However, let’s solve it numerically while we demonstrate the NumPy/SciPy capabilities. The function we are using is odeint. Let’s import necessary packages:

import numpy as np
from scipy.integrate import odeint
from functools import partial


We will need the partial function from functools shortly. Let’s define the right-hand-side of the ODE above as the cooling function in terms of a lambda expression:

coolingfcn = lambda temp, t0, k, roomtemp: -k*(temp-roomtemp)


And we need a set of cooling parameters that makes sense from our daily experience:

cooling_parameters = {'init_temp': 94.0, # initial temperature of the coffee (degrees Celcius)
'cube_dectemp': 5.0, # temperature decrease due to the cube
'roomtemp': 25.0, # room temperature
'k': 1.0/15.0 # cooling parameter (inverse degrees Celcius)
}


The fresh-brewed coffee is of 94 degrees Celcius. The room temperature is 25 degrees Celcius. Whenever a whole cube is added to the coffee, its temperature drops by 5 degrees Celcius, which is not a good assumption when the coffee is totally cooled down to room temperature. The cooling parameter $k=\frac{1}{15} \text{min}^{-1}$, meaning in 15 minutes, the coffee was cooled by $e^{-1}$ of its temperature difference from the environment.

From the SciPy documentation of odeint, the first three parameters are important:

scipy.integrate.odeint(func, y0, t, args=(), Dfun=None, col_deriv=0, full_output=0, ml=None, mu=None, rtol=None,atol=None, tcrit=None, h0=0.0, hmax=0.0, hmin=0.0, ixpr=0, mxstep=0, mxhnil=0, mxordn=12, mxords=5,printmessg=0)

The first parameter is the cooling function, and the second is the initial value of the dependent function (temperature in this case), and the third is the ndarray of times of the temperatures to be calculated. We can easily write the functions that return the ndarray of the temperatures for the cases of adding sugar cube before and after the cooling process respectively:

# adding sugar cube before cooling
def cube_before_cooling(t, init_temp, cube_dectemp, roomtemp, k):
y0 = init_temp - cube_dectemp
temps = odeint(partial(coolingfcn, k=k, roomtemp=roomtemp), y0, t)
return temps

# adding sugar cube after cooling
def cube_after_cooling(t, init_temp, cube_dectemp, roomtemp, k):
temps = odeint(partial(coolingfcn, k=k, roomtemp=roomtemp), init_temp, t)
temps[-1] -= cube_dectemp
return temps


We can plot the temperature changes on graphs using matplotlib.

import matplotlib.pyplot as plt


And we are interested in the first 20 minutes:

times = np.linspace(0, 20, 201)


For adding a sugar cube before cooling,

temps_beforecooling = cube_before_cooling(times, **cooling_parameters)
plt.plot(times, temps_beforecooling)
plt.xlabel('Time (min)')
plt.ylabel('Coffee Temperature (degrees Celcius)')


(If you are typing in an iPython shell, you need to put plt.show() to show the plot.) This gives the following plot:

And for adding the sugar cube after cooling,

temps_aftercooling = cube_after_cooling(times, **cooling_parameters)
plt.plot(times, temps_aftercooling)
plt.xlabel('Time (min)')
plt.ylabel('Coffee Temperature (degrees Celcius)')


This gives the following plot:

Obviously, adding the sugar cube after cooling gives a lower temperature.

How about we add sugar continuously throughout the 20-minute time span?  We need to adjust our differential equation to be:

$\frac{dT}{dt} = -k (T-T_r) - \frac{T_{\text{drop}}}{\delta t}$,

which can be implemented by the following code:

# adding sugar continuously
def continuous_sugar_cooling(t, init_temp, cube_dectemp, roomtemp, k):
timespan = t[-1] - t[0]
newcoolingfcn = lambda temp, t0: coolingfcn(temp, t0, k, roomtemp) - cube_dectemp / timespan
temps = odeint(newcoolingfcn, init_temp, t)
return temps


or we can divide the cube to be added to the cup at a few time spots, which can be implemented by the following code that employs our previously defined functions:

# adding sugar at a specified time(s)
def sugar_specifiedtime_cooling(t, sugar_time_indices, init_temp, cube_dectemp, roomtemp, k):
sorted_sugar_time_indices = np.sort(sugar_time_indices)
num_portions = len(sorted_sugar_time_indices)

temps = np.array([])
temp = init_temp
t_segments = np.split(t, sorted_sugar_time_indices)
num_segments = len(t_segments)
for i in range(num_segments):
temp_segment = cube_after_cooling(t_segments[i],
temp,
float(cube_dectemp)/num_portions,
roomtemp,
k)
temps = np.append(temps, temp_segment)
temp = temp_segment[-1]
temps[-1] += float(cube_dectemp)/num_portions

return temps


Let’s calculate all the temperatures:

temps_cont = continuous_sugar_cooling(times, **cooling_parameters)
temps_n = sugar_specifiedtime_cooling(times, [50, 100, 150], **cooling_parameters)


And plot all the graphs together:

plt.plot(times, temps_beforecooling, label='cube before cooling')
plt.plot(times, temps_aftercooling, label='cube after cooling')
plt.plot(times, temps_cont, label='continuous')
plt.plot(times, temps_n, label='3 times of dropping cube')
plt.xlabel('Time (min)')
plt.ylabel('Coffee Temperature (degrees Celcius)')
plt.legend()


Complete code demonstration can be found in my github (GitHub/stephenhky/CoffeeCooling)

In my previous blog post, I introduced the newly emerged topological data analysis (TDA). Unlike most of the other data analytic algorithms, TDA, concerning the topology as its name tells, cares for the connectivity of points, instead of the distance (according to a metric, whether it is Euclidean, Manhattan, Minkowski or any other). What is the best tools to describe topology?

Physicists use a lot homotopy. But for the sake of computation, it is better to use a scheme that are suited for discrete computation. It turns out that there are useful tools in algebraic topology: homology. But to understand homology, we need to understand what a simplicial complex is.

Gunnar Carlsson [Carlsson 2009] and Afra Zomorodian [Zomorodian 2011] wrote good reviews about them, although from a different path in introducing the concept. I first followed Zomorodian’s review [Zomorodian 2011], then his book [Zomorodian 2009] that filled in a lot of missing links in his review, to a certain point. I recently started reading Carlsson’s review.

One must first understand what a simplicial complex is. Without giving too much technical details, simplicial complex is basically a shape connecting points together. A line is a 1-simplex, connecting two points. A triangle is a 2-simplex. A tetrahedron is a 3-complex. There are other more complicated and unnamed complexes. Any subsets of a simplicial complex are faces. For example, the sides of the triangle are faces. The faces and the sides are the faces of the tetrahedron. (Refer to Wolfram MathWorld for more details. There are a lot of good tutorials online.)

Implementing Simplicial Complex

We can easily encoded this into a python code. I wrote a class SimplicialComplex in Python to implement this. We first import necessary libraries:

import numpy as np
from itertools import combinations
from scipy.sparse import dok_matrix


The first line imports the numpy library, the second the iteration tools necessary for extracting the faces for simplicial complex, the third the sparse matrix implementation in the scipy library (applied on something that I will not go over in this blog entry), and the fourth for some reduce operation.

We want to describe the simplicial complexes in the order of some labels (which can be anything, such as integers or strings). If it is a point, then it can be represented as tuples, as below:

 (1,)

Or if it is a line (a 1-simplex), then

 (1, 2)

Or a 2-simplex as a triangle, then

 (1, 2, 3)

I think you get the gist. The integers 1, 2, or 3 here are simply labels. We can easily store this in the class:

class SimplicialComplex:
def __init__(self, simplices=[]):
self.import_simplices(simplices=simplices)

def import_simplices(self, simplices=[]):
self.simplices = map(lambda simplex: tuple(sorted(simplex)), simplices)
self.face_set = self.faces()


You might observe the last line of the codes above. And it is for calculating all the faces of this complex, and it is implemented in this way:

  def faces(self):
faceset = set()
for simplex in self.simplices:
numnodes = len(simplex)
for r in range(numnodes, 0, -1):
for face in combinations(simplex, r):
return faceset


The faces are intuitively sides of a 2D shape (2-simplex), or planes of a 3D shape (3-simplex). But the faces of a 3-simplex includes the faces of all its faces. All the faces are saved in a field called faceset. If the user wants to retrieve the faces in a particular dimension, they can call this method:

  def n_faces(self, n):
return filter(lambda face: len(face)==n+1, self.face_set)


There are other methods that I am not going over in this blog entry. Now let us demonstrate how to use the class by implementing a tetrahedron.

sc = SimplicialComplex([('a', 'b', 'c', 'd')])


If we want to extract the faces, then enter:

sc.faces()


which outputs:

{('a',),
('a', 'b'),
('a', 'b', 'c'),
('a', 'b', 'c', 'd'),
('a', 'b', 'd'),
('a', 'c'),
('a', 'c', 'd'),
('a', 'd'),
('b',),
('b', 'c'),
('b', 'c', 'd'),
('b', 'd'),
('c',),
('c', 'd'),
('d',)}


We have gone over the basis of simplicial complex, which is the foundation of TDA. We appreciate that the simplicial complex deals only with the connectivity of points instead of the distances between the points. And the homology groups will be calculated based on this. However, how do we obtain the simplicial complex from the discrete data we have? Zomorodian’s review [Zomorodian 2011] gave a number of examples, but I will only go through two of them only. And from this, you can see that to establish the connectivity between points, we still need to apply some sort of distance metrics.

Alpha Complex

An alpha complex is the nerve of the cover of the restricted Voronoi regions. (Refer the details to Zomorodian’s review [Zomorodian 2011], this Wolfram MathWorld entry, or this Wolfram Demonstration.) We can extend the class SimplicialComplex to get a class AlphaComplex:

from scipy.spatial import Delaunay, distance
from operator import or_
from functools import partial

def facesiter(simplex):
for i in range(len(simplex)):
yield simplex[:i]+simplex[(i+1):]

def flattening_simplex(simplices):
for simplex in simplices:
for point in simplex:
yield point

def get_allpoints(simplices):
return set(flattening_simplex(simplices))

def contain_detachededges(simplex, distdict, epsilon):
if len(simplex)==2:
return (distdict[simplex[0], simplex[1]] &gt; 2*epsilon)
else:
return reduce(or_, map(partial(contain_detachededges, distdict=distdict, epsilon=epsilon), facesiter(simplex)))

class AlphaComplex(SimplicialComplex):
def __init__(self, points, epsilon, labels=None, distfcn=distance.euclidean):
self.pts = points
self.labels = range(len(self.pts)) if labels==None or len(labels)!=len(self.pts) else labels
self.epsilon = epsilon
self.distfcn = distfcn
self.import_simplices(self.construct_simplices(self.pts, self.labels, self.epsilon, self.distfcn))

def calculate_distmatrix(self, points, labels, distfcn):
distdict = {}
for i in range(len(labels)):
for j in range(len(labels)):
distdict[(labels[i], labels[j])] = distfcn(points[i], points[j])
return distdict

def construct_simplices(self, points, labels, epsilon, distfcn):
delaunay = Delaunay(points)
delaunay_simplices = map(tuple, delaunay.simplices)
distdict = self.calculate_distmatrix(points, labels, distfcn)

simplices = []
for simplex in delaunay_simplices:
faces = list(facesiter(simplex))
detached = map(partial(contain_detachededges, distdict=distdict, epsilon=epsilon), faces)
if reduce(or_, detached):
if len(simplex)&gt;2:
for face, notkeep in zip(faces, detached):
if not notkeep:
simplices.append(face)
else:
simplices.append(simplex)
simplices = map(lambda simplex: tuple(sorted(simplex)), simplices)
simplices = list(set(simplices))

allpts = get_allpoints(simplices)
for point in (set(labels)-allpts):
simplices += [(point,)]

return simplices


The scipy package already has a package to calculate Delaunay region. The function contain_detachededges is for constructing the restricted Voronoi region from the calculated Delaunay region.

This class demonstrates how an Alpha Complex is constructed, but this runs slowly once the number of points gets big!

Vietoris-Rips (VR) Complex

Another commonly used complex is called the Vietoris-Rips (VR) Complex, which connects points as the edge of a graph if they are close enough. (Refer to Zomorodian’s review [Zomorodian 2011] or this Wikipedia page for details.) To implement this, import the famous networkx originally designed for network analysis.

import networkx as nx
from scipy.spatial import distance
from itertools import product

class VietorisRipsComplex(SimplicialComplex):
def __init__(self, points, epsilon, labels=None, distfcn=distance.euclidean):
self.pts = points
self.labels = range(len(self.pts)) if labels==None or len(labels)!=len(self.pts) else labels
self.epsilon = epsilon
self.distfcn = distfcn
self.network = self.construct_network(self.pts, self.labels, self.epsilon, self.distfcn)
self.import_simplices(map(tuple, list(nx.find_cliques(self.network))))

def construct_network(self, points, labels, epsilon, distfcn):
g = nx.Graph()
zips = zip(points, labels)
for pair in product(zips, zips):
if pair[0][1]!=pair[1][1]:
dist = distfcn(pair[0][0], pair[1][0])
if dist&lt;epsilon:
return g


The intuitiveness and efficiencies are the reasons that VR complexes are widely used.

For more details about the Alpha Complexes, VR Complexes and the related Čech Complexes, refer to this page.

More…

There are other commonly used complexes used, including Witness Complex, Cubical Complex etc., which I leave no introductions. Upon building the complexes, we can analyze the topology by calculating their homology groups, Betti numbers, the persistent homology etc. I wish to write more about it soon.

(Taken from http://latticeqcd.org/pythonorg/static/images/antigravity.png, adapted from http://xkcd.com/353/)

Python is the basic programming languages if one wants to work on data nowadays. Its popularity comes with its intuitive syntax, its support of several programming paradigms, and the package numpy (Numerical Python). Yes, if you asked which package is a “must-have” outside the standard Python packages, I would certainly name numpy.

Let me list some useful packages that I have found useful:

1. numpy: Numerical Python. Its basic data type is ndarray, which acts like a vector with vectorized calculation support. It makes Python to perform matrix calculation efficiently like MATLAB and Octave. It supports a lot of commonly used linear algebraic algorithms, such as eigenvalue problems, SVD etc. It is the basic of a lot of other Python packages that perform heavy numerical computation. It is such an important package that, in some operating systems, numpy comes with Python as well.
2. scipy: Scientific Python. It needs numpy, but it supports also sparse matrices, special functions, statistics, numerical integration…
3. matplotlib: Graph plotting.
4. scikit-learn: machine learning library. It contains a number of supervised and unsupervised learning algorithms.
5. nltk: natural language processing. It provides not only basic tools like stemmers, lemmatizers, but also some algorithms like maximum entropy, tf-idf vectorizer etc. It provides a few corpuses, and supports WordNet dictionary.
6. gensim: another useful natural language processing package with an emphasis on topic modeling. It mainly supports Word2Vec, latent semantic indexing (LSI), and latent Dirichlet allocation (LDA). It is convenient to construct term-document matrices, and convert them to matrices in numpy or scipy.
7. networkx: a package that supports both undirected and directed graphs. It provides basic algorithms used in graphs.
8. sympy: Symbolic Python. I am not good at this package, but I know mathics and SageMath are both based on it.
9. pandas: it supports data frame handling like R. (I have not used this package as I am a heavy R user.)

Of course, if you are a numerical developer, to save you a good life, install Anaconda.

There are some other packages that are useful, such as PyCluster (clustering), xlrd (Excel files read/write), PyGame (writing games)… But since I have not used them, I would rather mention it in this last paragraph, not to endorse but avoid devaluing it.

Don’t forget to type in your IPython Notebook:

import antigravity