Graph Convolutional Neural Network (Part I)

At the suggestion of some friends, I have been reading about graph convolutional neural networks. This is not a sudden interest: I have long been interested in graph-related problems (I developed the graphflow package) and topological data analysis (TDA; I developed the mogutda package). Graph theory is not a new field either, but it is interesting to see how it is being applied in deep learning, as a lot of real-world relational data can be expressed in terms of graphs. (Examples: word relations, organizational structures, genes, the Internet, etc.) With the increasing attention paid to neo4j, we know that graphs are becoming a hot topic.

It would be helpful to review some basic graph theory:

  • A graph is represented by \mathcal{G} = (V, E), where V and E are the nodes (vertices) and edges respectively.
  • A graph can be directed or undirected. In graph convolutional neural networks, graphs are usually undirected.
  • The adjacency matrix A describes how nodes are connected: A_{ij} = 1 if there is an edge connecting from node i to node j, and 0 otherwise. A is a symmetric matrix for an undirected graph.
  • The incidence matrix B is another way to describe how nodes are connected: B_{ij} = 1 if node i is incident to edge j. This is also useful for undirected graphs.
  • The degree matrix D is a diagonal matrix, whose element D_{ii} is the number of neighbors of node i in an undirected graph.
  • The function acting on the nodes is called the filter.
  • The graph Laplacian, or Kirchhoff matrix, is defined by L = D - A, and the normalized graph Laplacian is \tilde{L} = I - D^{-\frac{1}{2}} A D^{-\frac{1}{2}}.

The graph Laplacian is the most important matrix in graph convolutional neural networks. It is analogous to the Laplacian operator in Euclidean space, \nabla^2. The reader can easily verify this by constructing the graph of a 2D lattice, computing its graph Laplacian matrix, and finding that it is the same as the discretized Laplacian operator, as sketched below.
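Here is a minimal sketch of this check with networkx (the lattice size and the choice of interior node are arbitrary):

import networkx as nx
import numpy as np

# Build a 4x4 lattice graph and its graph Laplacian L = D - A.
G = nx.grid_2d_graph(4, 4)
L = nx.laplacian_matrix(G).toarray()
nodes = list(G.nodes())

# For an interior node, the corresponding row of L reproduces the 5-point
# stencil of the discretized Laplacian: 4 on the diagonal, -1 for each of
# the four neighbors.
i = nodes.index((1, 1))
print(L[i, i])                                                # 4
print([L[i, nodes.index(nb)] for nb in G.neighbors((1, 1))])  # [-1, -1, -1, -1]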

We can also get some insights from the Euclidean analogue. In physics, the solution to the Laplacian equation is harmonic: the basis of the solution can be described in the spectral/Fourier space, as:

\nabla^2 \psi = - \omega^2 \psi,

and \psi \propto e^{\pm i \omega x}. In graph convolutional neural networks, as Bruna et al. suggested in 2013, the convolution is computed in the graph Fourier space, instead of dealing directly with the Laplacian matrix in all layers of the network.

On the other hand, we know that for the convolution

(f \circ g) (x) = \int dy f(x-y) g(y),

its Fourier transform is given by

\tilde{( f \circ g)} (\omega) = \tilde{f} (\omega) \tilde{g} (\omega).

In Fourier space, the convolution of two functions is just their product. Similarly, in graph convolutional neural networks, convolutions can be computed in the graph Fourier space as the mere product of two filters there. More specifically, for the convolution of the filters f and g, with \Phi being the unitary matrix of eigenvectors of the graph Laplacian and \hat{g} = \Phi^T g,

f \circ g = \Phi ((\Phi^T g) \odot (\Phi^T f)) = \Phi \, \textbf{diag}(\hat{g}_1, \hat{g}_2, \ldots, \hat{g}_n) \, \Phi^T f,

where \odot denotes the elementwise product.
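As a sanity check, here is a small numerical sketch of this formula on a toy graph (the graph and the signals are arbitrary choices):

import networkx as nx
import numpy as np

# Normalized Laplacian of a small cycle graph; the columns of Phi form the
# graph Fourier basis.
G = nx.cycle_graph(6)
L = nx.normalized_laplacian_matrix(G).toarray()
_, Phi = np.linalg.eigh(L)   # L is symmetric, so Phi is orthogonal

f = np.random.rand(6)   # a signal on the nodes
g = np.random.rand(6)   # a filter on the nodes

# Graph convolution: transform to the eigenbasis, multiply elementwise,
# and transform back.
conv = Phi @ ((Phi.T @ g) * (Phi.T @ f))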

However, such a general description is basis-dependent, and it is still computationally expensive. More work has been proposed to smooth the representation, which will be covered in upcoming blog posts.

As a side note, the readers can verify themselves that

\frac{1}{2} \sum_{ij} A_{ij} (f_i - f_j)(g_i - g_j) = f^T L g.
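A quick numerical check of this identity (on an arbitrary random graph with arbitrary signals):

import networkx as nx
import numpy as np

G = nx.erdos_renyi_graph(8, 0.5, seed=42)
A = nx.to_numpy_array(G)
L = nx.laplacian_matrix(G).toarray()   # L = D - A

f = np.random.rand(8)
g = np.random.rand(8)

lhs = 0.5 * sum(A[i, j] * (f[i] - f[j]) * (g[i] - g[j])
                for i in range(8) for j in range(8))
print(np.isclose(lhs, f @ L @ g))   # True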


Essential Python Packages

Almost three years ago, I wrote a blog entry titled Useful Python Packages, which listed the packages I deemed essential. How has the list changed over the past three years?

First of all, three years ago most people were still writing Python 2.7, but now there is a trend of switching to Python 3. I admit that I have not started the switch yet, but in the short term I will have no choice, and I will.

What are some of the essential packages?
Numerical Packages

  • numpy: numerical Python, containing the most basic numerical routines, such as matrix manipulation, linear algebra, random sampling, numerical integration, etc. There is a built-in wrapper for Fortran as well. Actually, numpy is so important that some Linux systems include it with Python.
  • scipy: scientific Python, containing some functions useful for scientific computing, such as sparse matrices, numerical differential equations, advanced linear algebra, special functions etc.
  • networkx: package that handles various types of networks
  • PuLP: linear programming
  • cvxopt: convex optimization

Data Visualization

  • matplotlib: basic plotting.
  • ggplot: the Python counterpart of R's ggplot2 for producing publication-quality plots.

Data Manipulation

  • pandas: data manipulation, working with data frames in Python, and saving/loading various formats such as CSV and Excel

Machine Learning

  • scikit-learn: machine-learning library in Python, containing classes and functions for supervised and unsupervised learning

Probabilistic Programming

  • PyMC: Markov chain Monte Carlo sampling, e.g., with the Metropolis-Hastings algorithm
  • Edward: deep probabilistic programming

Deep Learning Frameworks

  • TensorFlow: thanks to Google's marketing effort, TensorFlow is now the industry standard for building deep learning networks, with a rich collection of mathematical functions, especially for neural network cells, and with GPU capability
  • Keras: containing high-level routines of layers for deep learning neural networks, with TensorFlow, Theano, or CNTK as the backend
  • PyTorch: a rival to TensorFlow

Natural Language Processing

  • nltk: the natural language processing toolkit for Python, containing bag-of-words models, tokenizers, stemmers, chunkers, lemmatizers, part-of-speech taggers, etc.
  • gensim: a useful natural language processing package for topic modeling, word embedding, latent semantic indexing, etc., which runs fast
  • shorttext: a text mining package good for handling short sentences, providing high-level routines for training neural network classifiers and for generating features represented by topic models or autoencoders
  • spacy: an industrial-strength library of common natural language processing tools


I can probably list more, but I think I have covered most of them. If you cannot find something you need, it is probably time for you to write a brand-new package.

Terrorism, Polarization, and Social Influences

Eiffel Tower, Paris (taken from history.com)

The Paris attacks on Nov 13, 2015 shocked the world, and the discussions about terrorism, nationalism, imperialism, religious institutions, and fundamentalism are getting heated again. However, some social scientists argue that these are not the core roots; the social networks are.

Cass Sunstein, a professor at Harvard University, discussed this right after the 9/11 attack. He recently reposted his article, titled “Why They Hate Us,” on LinkedIn. [Sunstein 2015] He argued that terrorism, driven by the ideology of a group, is made, not born, and the influence spreads through social networks. It is the collective ideology that induces polarization. Polarization is intentional. Terrorism is a by-product of freedom of speech.

The widespread use of social media like Twitter and Facebook has enhanced the polarization (of any political or religious ideology). Social media might also account for much of the political polarization happening in the States. Social networks are significant in spreading ideas, as discussed in my previous blog entry. [Ho 2015] [Centola 2015] [Granovetter 1973] There is certainly a lot of value in this viewpoint, although I am not capable of making a judgement.

I highly recommend reading Sunstein’s article.

Scientists can computationally analyze social networks using the Python package networkx. [Tsvetovat & Kouznetsov 2011]


Homology and Betti Numbers

We have been talking about the elements of topological data analysis. In my previous post, I introduced simplicial complexes, which concern the ways to connect points together. In topology, it is the shape and the geometry, not the distances, that matter (although distances do play a role in the construction).

With the simplicial complexes, we can go ahead and describe their topology. We will use techniques from algebraic topology without going into too much detail. The technique involves homology, but a full explanation of it requires the concepts of normal subgroups, kernels, images, and quotient groups from group theory. I will not talk about them here, although I admit that there is no easy way to talk about computational topology without touching them. I highly recommend that readers refer to Zomorodian’s textbook for more details. [Zomorodian 2009]

I will continue with the Python class SimplicialComplex that I wrote in the previous blog post. Suppose we have a k-simplex; then an n-face is any combination of n+1 of its vertices. A simplicial complex is such that a face of a face of the complex is also a face of the complex. With this, we can define the boundary operator by

\partial_k \sigma = \sum_i (-1)^i [v_0 v_1 \ldots \hat{v}_i \ldots v_k],

where \hat{v}_i indicates that the i-th vertex is removed. This operator gives all the boundary faces of a face \sigma. It maps k-faces to (k-1)-faces, so it can be seen as an (n_{k-1} \times n_k)-matrix, where n_k is the number of k-faces. It can be easily calculated with the following method:

class SimplicialComplex:
  ...
  def boundary_operator(self, i):
    # rows index the (i-1)-faces (targets); columns index the i-faces (sources)
    source_simplices = self.n_faces(i)
    target_simplices = self.n_faces(i-1)

    if len(target_simplices) == 0:
      S = dok_matrix((1, len(source_simplices)), dtype=np.float32)
      S[0, 0:len(source_simplices)] = 1
    else:
      source_simplices_dict = {source_simplices[j]: j for j in range(len(source_simplices))}
      target_simplices_dict = {target_simplices[k]: k for k in range(len(target_simplices))}

      S = dok_matrix((len(target_simplices), len(source_simplices)), dtype=np.float32)
      for source_simplex in source_simplices:
        for a in range(len(source_simplex)):
          # remove the a-th vertex to obtain a boundary face
          target_simplex = source_simplex[:a]+source_simplex[(a+1):]
          k = target_simplices_dict[target_simplex]
          j = source_simplices_dict[source_simplex]
          S[k, j] = -1 if a % 2 == 1 else 1   # the sign (-1)**a from the boundary formula
    return S
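As a quick check (using the SimplicialComplex constructor from the previous post), the 1-boundary operator of a single filled triangle maps its three edges to its three vertices, so each column carries exactly one +1 and one -1:

sc = SimplicialComplex([(1, 2, 3)])
print(sc.boundary_operator(1).toarray())
# a 3x3 matrix: each column (an edge) has +1 and -1 at its two endpoint vertices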

With the boundary operator, we can calculate the Betti numbers, which characterize the topology of the shape. (Strictly speaking, this involves the concept of homology groups, which we are going to omit.) To calculate the k-th Betti number, we compute:

\beta_k = \text{rank} (\text{ker} \partial_k) - \text{rank} (\text{Im} \partial_{k+1}).

By the rank-nullity theorem, [Jackson]

\text{rank} (\text{ker}\, \partial_k) + \text{rank} (\text{Im}\, \partial_k) = n_k,

where n_k, the number of k-faces, is the dimension of the domain of \partial_k (the number of columns of its matrix). The Betti number is then

\beta_k = n_k - \text{rank} (\text{Im}\, \partial_k) - \text{rank} (\text{Im}\, \partial_{k+1}),

where the rank of the image of an operator can be easily computed using the matrix_rank function available in numpy. The method for calculating the Betti numbers is then

class SimplicialComplex:
  ...
  def betti_number(self, i):
    boundop_i = self.boundary_operator(i)
    boundop_ip1 = self.boundary_operator(i+1)

    if i==0:
      boundop_i_rank = 0
    else:
      try:
        boundop_i_rank = np.linalg.matrix_rank(boundop_i.toarray())
      except np.linalg.LinAlgError:
        boundop_i_rank = boundop_i.shape[1]
    try:
      boundop_ip1_rank = np.linalg.matrix_rank(boundop_ip1.toarray())
    except np.linalg.LinAlgError:
      boundop_ip1_rank = boundop_ip1.shape[1]

    return ((boundop_i.shape[1]-boundop_i_rank)-boundop_ip1_rank)

If we draw a simplicial complex on a 2-dimensional plane, the Betti numbers we care about are \beta_0, \beta_1, and \beta_2: \beta_0 indicates the number of connected components, \beta_1 the number of independent tunnels (loops), and \beta_2 the number of voids.

Let’s look at some examples. Suppose we have a hollow triangle, i.e., one that is not filled in.

e1 = [(0, 1), (1, 2), (2, 0)]
sc1 = SimplicialComplex(e1)

Then the Betti numbers are:


In [5]: sc1.betti_number(0)
Out[5]: 1

In [6]: sc1.betti_number(1)
Out[6]: 1

In [7]: sc1.betti_number(2)
Out[7]: 0

Let’s try another example with multiple components.

e2 = [(1,2), (2,3), (3,1), (4,5,6), (6,7), (7,4)]
sc2 = SimplicialComplex(e2)

We can graphically represent it using networkx:

import networkx as nx
import matplotlib.pyplot as plt
n2 = nx.Graph()
n2.add_edges_from(sc2.n_faces(1))
nx.draw(n2)
plt.show()
Simplicial Complex of e2

And its Betti numbers are as follows:


In [13]: sc2.betti_number(0)
Out[13]: 2

In [14]: sc2.betti_number(1)
Out[14]: 2

In [15]: sc2.betti_number(2)
Out[15]: 0

A better illustration is the Wolfram Demonstration, titled “Simplicial Homology of the Alpha Complex”.

On top of the techniques in this post, we can describe the homology of discrete points using persistent homology, which I will describe in future posts. I will probably also spend a post on homotopy and how it compares with homology.


Constructing Connectivities

In my previous blog post, I introduced the newly emerged field of topological data analysis (TDA). Unlike most other data-analytic algorithms, TDA, concerning topology as its name tells, cares about the connectivity of points instead of the distances between them (according to a metric, be it Euclidean, Manhattan, Minkowski, or any other). What are the best tools to describe topology?

Physicists use homotopy a lot. But for the sake of computation, it is better to use a scheme suited for discrete computation. It turns out that there is a useful tool in algebraic topology: homology. But to understand homology, we need to understand what a simplicial complex is.

Gunnar Carlsson [Carlsson 2009] and Afra Zomorodian [Zomorodian 2011] wrote good reviews of these topics, although they take different paths in introducing the concepts. I first followed Zomorodian’s review [Zomorodian 2011], then his book [Zomorodian 2009], which filled in a lot of missing links in the review, up to a certain point. I recently started reading Carlsson’s review.

One must first understand what a simplicial complex is. Without giving too many technical details, a simplicial complex is basically a shape connecting points together. A line is a 1-simplex, connecting two points. A triangle is a 2-simplex. A tetrahedron is a 3-simplex. There are other, more complicated and unnamed simplices. Any subset of the vertices of a simplex spans one of its faces: for example, the sides of a triangle are its faces, and the triangular faces and the edges of a tetrahedron are its faces. (Refer to Wolfram MathWorld for more details. There are a lot of good tutorials online.)

Implementing Simplicial Complex

We can easily encode this in Python. I wrote a class SimplicialComplex to implement it. We first import the necessary libraries:

import numpy as np
from itertools import combinations
from scipy.sparse import dok_matrix
from operator import add

The first line imports the numpy library; the second, the iteration tools necessary for extracting the faces of a simplicial complex; the third, the sparse-matrix implementation in the scipy library (applied to something that I will not go over in this blog entry); and the fourth, an operator for reduce operations.

We want to describe the simplicial complexes in the order of some labels (which can be anything, such as integers or strings). If it is a point, then it can be represented as tuples, as below:

 (1,) 

Or if it is a line (a 1-simplex), then

 (1, 2) 

Or a 2-simplex as a triangle, then

 (1, 2, 3) 

I think you get the gist. The integers 1, 2, or 3 here are simply labels. We can easily store this in the class:

class SimplicialComplex:
  def __init__(self, simplices=[]):
    self.import_simplices(simplices=simplices)

  def import_simplices(self, simplices=[]):
    # sort the vertices within each simplex so that equal faces compare equal;
    # list(...) makes this work in Python 3, where map returns a one-shot iterator
    self.simplices = list(map(lambda simplex: tuple(sorted(simplex)), simplices))
    self.face_set = self.faces()

Note the last line of the code above: it calculates all the faces of the complex, and it is implemented in this way:

  def faces(self):
    faceset = set()
    for simplex in self.simplices:
      numnodes = len(simplex)
      for r in range(numnodes, 0, -1):
        for face in combinations(simplex, r):
          faceset.add(face)
    return faceset

The faces are intuitively the sides of a 2D shape (2-simplex), or the planes of a 3D shape (3-simplex). But the faces of a 3-simplex also include the faces of all its faces. All the faces are saved in a field called face_set. If the user wants to retrieve the faces of a particular dimension, they can call this method:

  def n_faces(self, n):
    # the n-faces are the faces with n+1 vertices
    return [face for face in self.face_set if len(face) == n+1]

There are other methods that I will not go over in this blog entry. Now let us demonstrate how to use the class by implementing a tetrahedron.

sc = SimplicialComplex([('a', 'b', 'c', 'd')])

If we want to extract the faces, then enter:

sc.faces()

which outputs:

{('a',),
 ('a', 'b'),
 ('a', 'b', 'c'),
 ('a', 'b', 'c', 'd'),
 ('a', 'b', 'd'),
 ('a', 'c'),
 ('a', 'c', 'd'),
 ('a', 'd'),
 ('b',),
 ('b', 'c'),
 ('b', 'c', 'd'),
 ('b', 'd'),
 ('c',),
 ('c', 'd'),
 ('d',)}

We have gone over the basics of simplicial complexes, which are the foundation of TDA. We appreciate that a simplicial complex deals only with the connectivity of points instead of the distances between them, and the homology groups will be calculated based on this. However, how do we obtain a simplicial complex from the discrete data we have? Zomorodian’s review [Zomorodian 2011] gives a number of examples, but I will go through only two of them. You will see that, to establish the connectivity between points, we still need to apply some sort of distance metric.

Alpha Complex

An alpha complex is the nerve of the cover of the restricted Voronoi regions. (For details, refer to Zomorodian’s review [Zomorodian 2011], this Wolfram MathWorld entry, or this Wolfram Demonstration.) We can extend the class SimplicialComplex to get a class AlphaComplex:

from scipy.spatial import Delaunay, distance
from operator import or_
from functools import partial, reduce   # reduce must be imported in Python 3

def facesiter(simplex):
  for i in range(len(simplex)):
    yield simplex[:i]+simplex[(i+1):]

def flattening_simplex(simplices):
  for simplex in simplices:
    for point in simplex:
      yield point

def get_allpoints(simplices):
  return set(flattening_simplex(simplices))

def contain_detachededges(simplex, distdict, epsilon):
  if len(simplex)==2:
    return (distdict[simplex[0], simplex[1]] > 2*epsilon)
  else:
    return reduce(or_, map(partial(contain_detachededges, distdict=distdict, epsilon=epsilon), facesiter(simplex)))

class AlphaComplex(SimplicialComplex):
  def __init__(self, points, epsilon, labels=None, distfcn=distance.euclidean):
    self.pts = points
    self.labels = list(range(len(self.pts))) if labels is None or len(labels) != len(self.pts) else labels
    self.epsilon = epsilon
    self.distfcn = distfcn
    self.import_simplices(self.construct_simplices(self.pts, self.labels, self.epsilon, self.distfcn))

  def calculate_distmatrix(self, points, labels, distfcn):
    distdict = {}
    for i in range(len(labels)):
      for j in range(len(labels)):
        distdict[(labels[i], labels[j])] = distfcn(points[i], points[j])
    return distdict

  def construct_simplices(self, points, labels, epsilon, distfcn):
    delaunay = Delaunay(points)
    delaunay_simplices = map(tuple, delaunay.simplices)
    distdict = self.calculate_distmatrix(points, labels, distfcn)

    simplices = []
    for simplex in delaunay_simplices:
      faces = list(facesiter(simplex))
      # list(...) because `detached` is traversed twice below
      detached = list(map(partial(contain_detachededges, distdict=distdict, epsilon=epsilon), faces))
      if reduce(or_, detached):
        # the simplex contains an edge longer than 2*epsilon: keep only its intact faces
        if len(simplex) > 2:
          for face, notkeep in zip(faces, detached):
            if not notkeep:
              simplices.append(face)
      else:
        simplices.append(simplex)
    simplices = map(lambda simplex: tuple(sorted(simplex)), simplices)
    simplices = list(set(simplices))

    # isolated points enter the complex as 0-simplices
    allpts = get_allpoints(simplices)
    for point in (set(labels)-allpts):
      simplices += [(point,)]

    return simplices

The scipy package already has a class to calculate the Delaunay triangulation (scipy.spatial.Delaunay). The function contain_detachededges is for constructing the restricted Voronoi regions from the calculated Delaunay triangulation.

This class demonstrates how an alpha complex is constructed, but it runs slowly once the number of points gets big!
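A usage sketch (the points and the value of epsilon are arbitrary choices for illustration):

import numpy as np

np.random.seed(0)
pts = np.random.rand(15, 2)   # fifteen random points in the unit square
ac = AlphaComplex(pts, 0.25)
print(sorted(ac.face_set, key=len))   # all faces, from vertices upward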

Vietoris-Rips (VR) Complex

Another commonly used complex is the Vietoris-Rips (VR) complex, which connects two points with an edge of a graph if they are close enough. (Refer to Zomorodian’s review [Zomorodian 2011] or this Wikipedia page for details.) To implement this, we import the famous networkx package, originally designed for network analysis.

import networkx as nx
from scipy.spatial import distance
from itertools import product

class VietorisRipsComplex(SimplicialComplex):
  def __init__(self, points, epsilon, labels=None, distfcn=distance.euclidean):
    self.pts = points
    self.labels = list(range(len(self.pts))) if labels is None or len(labels) != len(self.pts) else labels
    self.epsilon = epsilon
    self.distfcn = distfcn
    self.network = self.construct_network(self.pts, self.labels, self.epsilon, self.distfcn)
    # the maximal cliques of the proximity graph are the simplices of the VR complex
    self.import_simplices(map(tuple, nx.find_cliques(self.network)))

  def construct_network(self, points, labels, epsilon, distfcn):
    g = nx.Graph()
    g.add_nodes_from(labels)
    # list(...) so the product below can traverse the pairs twice
    zips = list(zip(points, labels))
    for (p1, l1), (p2, l2) in product(zips, zips):
      if l1 != l2:
        dist = distfcn(p1, p2)
        if dist < epsilon:
          g.add_edge(l1, l2)
    return g

Their intuitiveness and efficiency are the reasons VR complexes are widely used.
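A usage sketch, parallel to the one for the alpha complex above (again with arbitrary points and epsilon):

import numpy as np

np.random.seed(0)
pts = np.random.rand(10, 2)
vr = VietorisRipsComplex(pts, 0.4)
print(vr.n_faces(1))   # the edges of the proximity graph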

For more details about the Alpha Complexes, VR Complexes and the related Čech Complexes, refer to this page.

More…

There are other commonly used complexes, including the witness complex, the cubical complex, etc., which I will not introduce here. Upon building the complexes, we can analyze the topology by calculating their homology groups, Betti numbers, persistent homology, etc. I wish to write more about them soon.

Taken from Wolfram MathWorld


Spread of Ideas in Social Networks

Social networks have existed for millennia. In schools, fraternities, clubs, and associations form various networks within the campus. In job hunting, networking is essential. Ideas spread across various academic circles, while within a school of thought people share some common ideas. Various intelligence agencies study a terrorist organization extensively by understanding its network structure.

Recently, Damon Centola of the University of Pennsylvania studied how social networks form and what that means for the ideas that spread across them. [Centola 2015] It builds on a study by the social theorists Peter Blau and Joseph Schwartz in 1984, who argued that a society with eliminated group boundaries enjoys the greatest level of social integration. [Blau & Schwartz 1984] These group boundaries are due to differences in culture, race, religion, income, level of education, hobbies, political party, etc. Their study implied that a totally mixed society has the greatest level of social integration. Centola’s study, however, built on this idea and developed it further: he found that a society with completely eliminated boundaries ultimately has reduced social integration.

In a diverse society like America, we wish to achieve total social integration to allow the widest spread of complex ideas. However, Centola’s finding indicates that while total segregation is not desirable, moderate boundaries are needed for social integration. Associations based on culture, race, hobbies, etc. are actually essential to the development of societies and the spread of ideas.

There are many other similar studies from the past. In a celebrated paper by Mark Granovetter, the impact of “weak ties” on the diffusion of influence and information was shown to be strong. [Granovetter 1973] Thomas Schelling, a Nobel Laureate in Economics, also showed that complete tolerance does not mean social stability. [Schelling 1978]

Analytics researchers can study the social network computationally using the Python package networkx.

(Taken from Phys.org)


Useful Python Packages

(Taken from http://latticeqcd.org/pythonorg/static/images/antigravity.png, adapted from http://xkcd.com/353/)

Python is a basic programming language for anyone who wants to work on data nowadays. Its popularity comes from its intuitive syntax, its support of several programming paradigms, and the package numpy (Numerical Python). Yes, if you asked which package is a “must-have” outside the standard Python library, I would certainly name numpy.

Let me list some packages that I have found useful:

  1. numpy: Numerical Python. Its basic data type is ndarray, which acts like a vector with vectorized calculation support. It lets Python perform matrix calculations efficiently, like MATLAB and Octave. It supports a lot of commonly used linear-algebra algorithms, such as eigenvalue problems, SVD, etc. It is the basis of a lot of other Python packages that perform heavy numerical computation. It is such an important package that, in some operating systems, numpy comes with Python as well.
  2. scipy: Scientific Python. It needs numpy, but it also supports sparse matrices, special functions, statistics, numerical integration…
  3. matplotlib: Graph plotting.
  4. scikit-learn: machine learning library. It contains a number of supervised and unsupervised learning algorithms.
  5. nltk: natural language processing. It provides not only basic tools like stemmers and lemmatizers, but also some algorithms like maximum entropy, the tf-idf vectorizer, etc. It provides a few corpora, and supports the WordNet dictionary.
  6. gensim: another useful natural language processing package with an emphasis on topic modeling. It mainly supports Word2Vec, latent semantic indexing (LSI), and latent Dirichlet allocation (LDA). It is convenient to construct term-document matrices, and convert them to matrices in numpy or scipy.
  7. networkx: a package that supports both undirected and directed graphs. It provides basic algorithms used in graphs.
  8. sympy: Symbolic Python. I am not well-versed in this package, but I know mathics and SageMath are both based on it.
  9. pandas: it supports data frame handling like R. (I have not used this package as I am a heavy R user.)

Of course, if you are a numerical developer, to save yourself a lot of trouble, install Anaconda.

There are some other useful packages, such as PyCluster (clustering), xlrd (reading Excel files), PyGame (writing games)… But since I have not used them, I would rather just mention them in this last paragraph, neither to endorse nor to devalue them.

Don’t forget to type in your IPython Notebook:

import antigravity


Ranking Everything: an Overview of Link Analysis Using PageRank Algorithm

This is an age of quantification, meaning that we want to give everything, even the qualitative, a number. In schools, teachers measure how well their students have mastered mathematics by grading or scoring their homework. Funding agencies measure how good a scientist is by counting the number of publications, the citations, and the impact factors. We measure how successful a person is by their annual income. We can question all these approaches to measurement. Yet however good or bad the measures are, we look for a metric to measure.

Original PageRank Algorithm

We measure webpages too. In the early days of the Internet, people searched on sites such as Yahoo or AltaVista. The keywords they entered were the main information for the search engine to work with. However, a big problem was that a large number of low-quality or irrelevant webpages showed up in search results, some due to malicious manipulation with keyword tricks. This gave rise to a need to rank the webpages. Larry Page and Sergey Brin, the founders of Google, tackled this problem as a thesis topic at Stanford University. But it got commercialized, and Brin never received his Ph.D. They published their algorithm, called PageRank, named after Larry Page, at the Seventh International World Wide Web Conference (WWW7) in April 1998. [Brin & Page 1998] This algorithm is regarded as one of the top ten algorithms in data mining by a survey paper published at the IEEE International Conference on Data Mining (ICDM) in December 2006. [Wu et al. 2008]

Larry Page and Sergey Brin (source)

The idea of the PageRank algorithm is very simple. It regards each webpage as a node, and each link in a webpage as a directed edge from the source to the target webpage. This forms a network, or directed graph, of webpages connected by their links. A link is seen as a vote for the target webpage, and if the source webpage ranks high, it enhances the target webpage’s ranking as well. Mathematically, it amounts to solving the eigenvector problem of a large matrix, commonly by power iteration. (Technologies for handling such large matrices led to the MapReduce programming paradigm, another big data trend nowadays.)

Example (made with Python, using the packages networkx and matplotlib)

Let’s build an intuition through an example. In the network, we can easily see that “Big Data 1” has the highest rank because it has the most edges pointing to it. However, there are pages such as “Big Data Fake 1,” which looks like a big data page but in fact points to “Porn 1.” After running the PageRank algorithm, it does not get a high rank. A sample of the output is:

[('Big Data 1', 0.00038399273501500979),
('Artificial Intelligence', 0.00034612564364377323),
('Deep Learning 1', 0.00034221161094691966),
('Machine Learning 1', 0.00034177713235138173),
('Porn 1', 0.00033859136614724074),
('Big Data 2', 0.00033182629176238337),
('Spark', 0.0003305912073357307),
('Hadoop', 0.00032928389859040422),
('Dow-Jones 1', 0.00032368956852396916),
('Big Data 3', 0.00030969537721207128),
('Porn 2', 0.00030969537721207128),
('Big Data Fake 1', 0.00030735245262038724),
('Dow-Jones 2', 0.00030461420169420618),
('Machine Learning 2', 0.0003011838672138951),
('Deep Learning 2', 0.00029899313444392865),
('Econophysics', 0.00029810944592071552),
('Big Data Fake 2', 0.00029248837867043803),
('Wall Street', 0.00029248837867043803),
('Deep Learning 3', 0.00029248837867043803)]

You can see that the pornographic webpages pretending to be big data webpages do not rank as high as the authentic ones. PageRank fights against spam and irrelevant webpages. Google later improved the algorithm further to combat more advanced tricks of spam pages.

You can find more details in various sources and textbooks. [Rajaraman and Ullman 2011, Wu et al. 2008]
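If you want to experiment yourself, networkx ships a PageRank implementation; here is a minimal sketch, reusing some page names from the example above:

import networkx as nx

g = nx.DiGraph()
g.add_edges_from([('Spark', 'Big Data 1'), ('Hadoop', 'Big Data 1'),
                  ('Big Data 1', 'Machine Learning 1'),
                  ('Big Data Fake 1', 'Porn 1')])
# nx.pagerank returns a dict mapping each node to its score
ranks = nx.pagerank(g, alpha=0.85)
print(sorted(ranks.items(), key=lambda item: item[1], reverse=True))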

Use in Social Media and Forums

Mathematically, the PageRank algorithm deals with a directed graph. As one can imagine, any system that can be modeled as a directed graph leaves room for applying the PageRank algorithm. One extension of PageRank is ExpertiseRank.

Jun Zhang, Mark Ackerman and Lada Adamic published a conference paper at the International World Wide Web Conference (WWW 2007) in May 2007. [Zhang, Ackerman & Adamic 2007] They investigated a Java forum, connecting users to the posts and to anyone replying to them, as a directed graph. With an algorithm closely resembling PageRank, they found the experts and influential people in the forum.

Graphs in ExpertiseRank (taken from [Zhang, Ackerman & Adamic 2007])

There are other algorithms, like HITS (hyperlink-induced topic search), that do similar things. And social media such as Quora (and its Chinese counterpart, Zhihu) apply a link-analysis algorithm (a probabilistic topic network; see this) to build topic networks. Similar ideas have also been applied to identify high-quality content in Yahoo! Answers. [Agichtein, Castillo, Donato, Gionis & Mishne 2008]

Use in Finance and Econophysics

The PageRank algorithm is also applied outside information technology. Financial engineers and econophysicists applied an algorithm very similar to PageRank, called DebtRank, to determine the systemically important financial institutions in a financial network. This work was published in Nature Scientific Reports. [Battiston, Puliga, Kaushik, Tasca & Caldarelli 2012] In their study, each node represents a financial institution, and a directed edge represents the estimated potential impact of one institution on another. Using DebtRank, we are able to identify the centrally important institutions that would impact other institutions in the network once a financial crisis occurs.

DebtRank network (taken from [Battiston, Puliga, Kaushik, Tasca & Caldarelli 2012])

