Constructing Connectivities

In my previous blog post, I introduced the newly emerged topological data analysis (TDA). Unlike most of the other data analytic algorithms, TDA, concerning the topology as its name tells, cares for the connectivity of points, instead of the distance (according to a metric, whether it is Euclidean, Manhattan, Minkowski or any other). What is the best tools to describe topology?

Physicists use a lot homotopy. But for the sake of computation, it is better to use a scheme that are suited for discrete computation. It turns out that there are useful tools in algebraic topology: homology. But to understand homology, we need to understand what a simplicial complex is.

Gunnar Carlsson [Carlsson 2009] and Afra Zomorodian [Zomorodian 2011] wrote good reviews about them, although from a different path in introducing the concept. I first followed Zomorodian’s review [Zomorodian 2011], then his book [Zomorodian 2009] that filled in a lot of missing links in his review, to a certain point. I recently started reading Carlsson’s review.

One must first understand what a simplicial complex is. Without giving too much technical details, simplicial complex is basically a shape connecting points together. A line is a 1-simplex, connecting two points. A triangle is a 2-simplex. A tetrahedron is a 3-complex. There are other more complicated and unnamed complexes. Any subsets of a simplicial complex are faces. For example, the sides of the triangle are faces. The faces and the sides are the faces of the tetrahedron. (Refer to Wolfram MathWorld for more details. There are a lot of good tutorials online.)

Implementing Simplicial Complex

We can easily encoded this into a python code. I wrote a class SimplicialComplex in Python to implement this. We first import necessary libraries:

import numpy as np
from itertools import combinations
from scipy.sparse import dok_matrix
from operator import add

The first line imports the numpy library, the second the iteration tools necessary for extracting the faces for simplicial complex, the third the sparse matrix implementation in the scipy library (applied on something that I will not go over in this blog entry), and the fourth for some reduce operation.

We want to describe the simplicial complexes in the order of some labels (which can be anything, such as integers or strings). If it is a point, then it can be represented as tuples, as below:

 (1,) 

Or if it is a line (a 1-simplex), then

 (1, 2) 

Or a 2-simplex as a triangle, then

 (1, 2, 3) 

I think you get the gist. The integers 1, 2, or 3 here are simply labels. We can easily store this in the class:

class SimplicialComplex:
  def __init__(self, simplices=[]):
    self.import_simplices(simplices=simplices)

  def import_simplices(self, simplices=[]):
    self.simplices = map(lambda simplex: tuple(sorted(simplex)), simplices)
    self.face_set = self.faces()

You might observe the last line of the codes above. And it is for calculating all the faces of this complex, and it is implemented in this way:

  def faces(self):
    faceset = set()
    for simplex in self.simplices:
      numnodes = len(simplex)
      for r in range(numnodes, 0, -1):
        for face in combinations(simplex, r):
          faceset.add(face)
    return faceset

The faces are intuitively sides of a 2D shape (2-simplex), or planes of a 3D shape (3-simplex). But the faces of a 3-simplex includes the faces of all its faces. All the faces are saved in a field called faceset. If the user wants to retrieve the faces in a particular dimension, they can call this method:

  def n_faces(self, n):
    return filter(lambda face: len(face)==n+1, self.face_set)

There are other methods that I am not going over in this blog entry. Now let us demonstrate how to use the class by implementing a tetrahedron.

sc = SimplicialComplex([('a', 'b', 'c', 'd')])

If we want to extract the faces, then enter:

sc.faces()

which outputs:

{('a',),
 ('a', 'b'),
 ('a', 'b', 'c'),
 ('a', 'b', 'c', 'd'),
 ('a', 'b', 'd'),
 ('a', 'c'),
 ('a', 'c', 'd'),
 ('a', 'd'),
 ('b',),
 ('b', 'c'),
 ('b', 'c', 'd'),
 ('b', 'd'),
 ('c',),
 ('c', 'd'),
 ('d',)}

We have gone over the basis of simplicial complex, which is the foundation of TDA. We appreciate that the simplicial complex deals only with the connectivity of points instead of the distances between the points. And the homology groups will be calculated based on this. However, how do we obtain the simplicial complex from the discrete data we have? Zomorodian’s review [Zomorodian 2011] gave a number of examples, but I will only go through two of them only. And from this, you can see that to establish the connectivity between points, we still need to apply some sort of distance metrics.

Alpha Complex

An alpha complex is the nerve of the cover of the restricted Voronoi regions. (Refer the details to Zomorodian’s review [Zomorodian 2011], this Wolfram MathWorld entry, or this Wolfram Demonstration.) We can extend the class SimplicialComplex to get a class AlphaComplex:

from scipy.spatial import Delaunay, distance
from operator import or_
from functools import partial

def facesiter(simplex):
  for i in range(len(simplex)):
    yield simplex[:i]+simplex[(i+1):]

def flattening_simplex(simplices):
  for simplex in simplices:
    for point in simplex:
      yield point

def get_allpoints(simplices):
  return set(flattening_simplex(simplices))

def contain_detachededges(simplex, distdict, epsilon):
  if len(simplex)==2:
    return (distdict[simplex[0], simplex[1]] > 2*epsilon)
  else:
    return reduce(or_, map(partial(contain_detachededges, distdict=distdict, epsilon=epsilon), facesiter(simplex)))

class AlphaComplex(SimplicialComplex):
  def __init__(self, points, epsilon, labels=None, distfcn=distance.euclidean):
    self.pts = points
    self.labels = range(len(self.pts)) if labels==None or len(labels)!=len(self.pts) else labels
    self.epsilon = epsilon
    self.distfcn = distfcn
    self.import_simplices(self.construct_simplices(self.pts, self.labels, self.epsilon, self.distfcn))

  def calculate_distmatrix(self, points, labels, distfcn):
    distdict = {}
    for i in range(len(labels)):
      for j in range(len(labels)):
        distdict[(labels[i], labels[j])] = distfcn(points[i], points[j])
    return distdict

  def construct_simplices(self, points, labels, epsilon, distfcn):
    delaunay = Delaunay(points)
    delaunay_simplices = map(tuple, delaunay.simplices)
    distdict = self.calculate_distmatrix(points, labels, distfcn)

    simplices = []
    for simplex in delaunay_simplices:
      faces = list(facesiter(simplex))
      detached = map(partial(contain_detachededges, distdict=distdict, epsilon=epsilon), faces)
      if reduce(or_, detached):
        if len(simplex)>2:
          for face, notkeep in zip(faces, detached):
            if not notkeep:
              simplices.append(face)
      else:
        simplices.append(simplex)
    simplices = map(lambda simplex: tuple(sorted(simplex)), simplices)
    simplices = list(set(simplices))

    allpts = get_allpoints(simplices)
    for point in (set(labels)-allpts):
      simplices += [(point,)]

    return simplices

The scipy package already has a package to calculate Delaunay region. The function contain_detachededges is for constructing the restricted Voronoi region from the calculated Delaunay region.

This class demonstrates how an Alpha Complex is constructed, but this runs slowly once the number of points gets big!

Vietoris-Rips (VR) Complex

Another commonly used complex is called the Vietoris-Rips (VR) Complex, which connects points as the edge of a graph if they are close enough. (Refer to Zomorodian’s review [Zomorodian 2011] or this Wikipedia page for details.) To implement this, import the famous networkx originally designed for network analysis.

import networkx as nx
from scipy.spatial import distance
from itertools import product

class VietorisRipsComplex(SimplicialComplex):
  def __init__(self, points, epsilon, labels=None, distfcn=distance.euclidean):
    self.pts = points
    self.labels = range(len(self.pts)) if labels==None or len(labels)!=len(self.pts) else labels
    self.epsilon = epsilon
    self.distfcn = distfcn
    self.network = self.construct_network(self.pts, self.labels, self.epsilon, self.distfcn)
    self.import_simplices(map(tuple, list(nx.find_cliques(self.network))))

  def construct_network(self, points, labels, epsilon, distfcn):
    g = nx.Graph()
    g.add_nodes_from(labels)
    zips = zip(points, labels)
    for pair in product(zips, zips):
      if pair[0][1]!=pair[1][1]:
        dist = distfcn(pair[0][0], pair[1][0])
        if dist<epsilon:
          g.add_edge(pair[0][1], pair[1][1])
    return g

The intuitiveness and efficiencies are the reasons that VR complexes are widely used.

For more details about the Alpha Complexes, VR Complexes and the related Čech Complexes, refer to this page.

More…

There are other commonly used complexes used, including Witness Complex, Cubical Complex etc., which I leave no introductions. Upon building the complexes, we can analyze the topology by calculating their homology groups, Betti numbers, the persistent homology etc. I wish to write more about it soon.

Taken from Wolfram Mathworld
Taken from Wolfram Mathworld

Continue reading “Constructing Connectivities”

Starting the Journey of Topological Data Analysis (TDA)

Topology has been around for centuries, but it did not catch the attention of many data analysts until recently. In an article published in Nature Scientific Reports, the authors demonstrated the power of topology in data analysis through examples including gene expression from breast rumors, voting data in the United States, and player performance data from the NBA. [Lum et. al. 2013]

As an introduction, they described topological methods “as a geometric approach to pattern or shape recognition within data.” It is true that in machine learning, we never care enough pattern recognition, but topology adds insights regarding the shapes of data that do not change with continuous deformation. For example, a circle and an ellipse have “the same topology.” The distances between data points are not as important as the shape. Traditional machine learning methods deal with feature vectors, distances, or classifications, but the topology of the data is usually discarded. Gunnar Carlsson demonstrated in a blog that a thin ellipse of data may be misrepresented as two straight parallel lines or one straight lines. [Carlsson 2015] Dimensionality reduction algorithms such as principal component analysis (PCA) often disregard the topology as well. (I heard that Kohenen’s self-organizing maps (SOM) [Kohonen 2000] retain the topology of higher dimensional data during the dimensionality reduction, but I am not confident enough to say that.)

Euler introduced the concept of topology in the 18th century. Topology has been a big subject in physics since 1950s. The string theory, as one of the many efforts in unifying gravity and other three fundamental forces, employs topological dimensions. In condensed matter physics, the fractional quantum Hall effect is a topological quantum effect. There are topological solitons [Rajaraman 1987] such as quantum vortices in superfluids, [Simula, Blakie 2006; Calzetta, Ho, Hu 2010] columns of topological solitons (believed to be Skyrmions) in helical magnets, [Mühlbauer et. al. 2009; Ho et. al. 2010; Ho 2012] hexagonal solitonic objects in smectic liquid crystals [Matsumoto et. al. 2009]… When a field becomes sophisticated, it becomes quantitative; when a quantitative field becomes sophisticated, it requires abstract mathematics such as topology for a general description. I believe analysis on any kinds of data is no exception.

There are some good reviews and readings about topological data analysis (TDA) out there, for example, the ones by Gunnar Carlsson [Carlsson 2009] and Afra Zomorodian [Zomorodian 2011]. While physicists talk about homotopy, data analysts talk about persistent homology as it is easier to compute. Data have to be described in a simplicial complex or a graph/network. Then the homology can be computed and represented in various ways such as barcodes. [Ghrist 2008] Then we extract insights about the data from it.

Topology has a steep learning curve. I am also a starter learning about this. This blog entry will not be the last talking about TDA. Therefore, I opened a new session called TDA for all of my blog entries about it. Let’s start the journey!

There is an R package called “TDA” that facilitates topological data analysis. [Fasy et. al. 2014] A taste of homology of a simplicial complex is also demonstrated in a Wolfram demo.

New-big-data-firm-to-pion-008
(Taken from TheGuardian)

Continue reading “Starting the Journey of Topological Data Analysis (TDA)”

Blog at WordPress.com.

Up ↑