In the previous post, the convolution of the graph Laplacian is defined in its graph Fourier space as outlined in the paper of Bruna et. al. (arXiv:1312.6203) However, the eigenmodes of the graph Laplacian are not ideal because it makes the bases to be graph-dependent. A lot of works were done in order to solve this problem, with the help of various special functions to express the filter functions. Examples include Chebyshev polynomials and Cayley transform.

## ChebNet

Kipf and Welling proposed the ChebNet (arXiv:1609.02907) to approximate the filter using Chebyshev polynomial, using the result proved by Hammond et. al. (arXiv:0912.3848) With a convolutional layer $g_{\theta} \circ x$, instead of computing it directly by $\Phi^T g_{\theta} \Phi x$, it approximates the filter in terms of Chebyshev polynomial:

$g_{\theta} (\Lambda) \approx \sum_{k=0}^K \theta_k T_k(\Lambda)$,

where $T_k(x)$ are the Chebyshev polynomials. It has been proved by Hammond et. al. that by well truncating this representation, Chebyshev polynomials are very good to approximate the function. Then the convolution is:

$g_{\theta} \circ x \approx \sum_{k=0}^K \theta_k T_k (\Lambda) x$.

Here, $\theta$’s are the parameters to be trained. This fixed the bases of the representation, and it speeds up computation. The disadvantage is that eigenvalues are clusters in a few values with large gaps.

## CayleyNet

The problem of ChebNet led to the work of Levie et. al., (arXiv:1705.07664) who proposed another approximation is used using Cayley transform. They made use of the Cayley function:

$C(\lambda) = \frac{\lambda - i}{\lambda + i}$,

which is a bijective function from $\mathbb{R}^+$ to complex unit half-circle. Instead of Chebyshev polynomials, it approximates the filter as:

$g(\lambda) = c_0 + \sum_{j=1}^r \left[ c_j C^j (h \lambda) + c_j^{\ast} C^{j \ast} (h \lambda) \right]$,

where $c_0$ is real and other $c_j$’s are generally complex, and $h$ is a zoom parameter, and $\lambda$’s are the eigenvalues of the graph Laplacian. Tuning $h$ makes one find the best zoom that spread the top eigenvalues. $c$‘s are computed by training. This solves the problem of unfavorable clusters in ChebNet.

## MotifNet

All previous works are undirected graph. How do we deal with directed graph with an asymmetric graph Laplacian? Benson, Gleich, Leskovec published an important work on Science in 2016 (arXiv:1612.08447) to address this problem. Their approach is reducing a directed graph to a higher order structure called network motifs. There are 13 network motifs. For each network motif, one can define an adjacency matrix for that motif by $\mathcal{M}_k$, with elements $\mathcal{M}_{k, ij}$ being the number of motifs in the graph the the edge $(i, j)$ that it belongs to.

Then one computes 13 graph Laplacians from these 13 adjacency matrices. These graph Laplacians are symmetric, like those of undirected graphs. Then any filters can be approximated by the following multivariate matrix polynomial, as suggested by Monti, Otness, and Bronstein in their MotifNet paper (arXiv:1802.01572):

$f_{\Theta} (\Delta_1, \dots, \Delta_k) = \sum_{j=0}^P \sum_{k_1, \dots, k_j \in {1, \ldots, K}} \theta_{\theta_1, \ldots, \theta_k} \Delta_{k_1}, \ldots, \Delta_{k_j}$.

Applications of image processing, citation networks etc.

Suggested by some friends, I have been reading graph convolutional neural network. This is not a recent interest, as I have been interested in some graph-related problems (as I developed the graphflow package) and topological data analysis (TDA, as I developed the mogutda package). Graph theory is not a new field as well, but it is interesting to see how it is being applied in deep learning, as a lot of real-world relational data can be expressed in terms of graphs. (Examples: word relations, organizational structure, genes, internet etc.) With the increasing attention to neo4j, we know graph is coming to be hot.

It would be helpful to review some basic graph theory:

• A graph is represented by $\mathcal{G} = (V, E)$, where $V$ and $E$ are the nodes (vertices) and edges respectively.
• A graph can be directed or undirected. In graph convolutional neural network, they are undirected usually.
• The adjacency matrix $A$ describes how nodes are connected: $A_{ij} = 1$ if there is an edge connecting from node $i$ to node $j$, and $0$ otherwise. $A$ is a symmetric matrix for an undirected graph.
• The incidence matrix $B$ is another way to describe how nodes are connected: $B_{ij} = 1$ if a node $i$ is connected with edge $j$. This is useful for undirected graph.
• The degree matrix $D$ is a diagonal matrix, with elements $D_{ii}$ denotes the number of neighbors for node $i$ in undirected matrix.
• The function acting on the nodes is called the filter.
• The graph Laplacian, or Kirchhoff matrix, is defined by $L = D - A$, and the normalized graph Laplacian is $\tilde{L} = I - D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$.

The graph Laplacian is the most important matrix in graph convolutional neural network. It is analogous to the Laplacian operator in Euclidean space, $\nabla^2$. The reader can easily verify this by constructing a graph of 2D lattice and compute the graph Laplacian matrix, and find that it is the same as the discretized Laplacian operator.

We can also get some insights from the Euclidean analogue. In physics, the solution to the Laplacian equation is harmonic: the basis of the solution can be described in the spectral/Fourier space, as:

$\nabla^2 \psi = - \omega^2 \psi$,

And $\psi \propto e^{\pm i \omega x}$. In graph convolutional neural network, as Bruna et. al. suggested in 2013, the graph is calculated in the graph Fourier space, instead of directly dealing with the Laplacian matrix in all layers of network.

On the other hand, we know that for the convolution

$(f \circ g) (x) = \int dy f(x-y) g(y)$,

its Fourier transform is given by

$\tilde{( f \circ g)} (\omega) = \tilde{f} (\omega) \tilde{g} (\omega)$.

In Fourier space, the convolution of two functions are just their products. Similarly, in graph convolutional neural network, convolutions can be computed in the Fourier space as the mere product of two filters in the Fourier space. More specifically, for finding the convolution of the filters $f$ and $g$, with $\Phi$ being the unitary eigenmatrix,

$f \circ g = \Phi ((\Phi^T g) \circ (\Phi^T f)) = \Phi \textbf{diag}(g_1, g_2, \ldots, g_n) f$.

However, such general description is basis-dependent, and it is still computationally expensive. More work has been proposed to smooth the representation, which will be covered in the upcoming blogs.

A side note, the readers can verify themselves that

$\sum_{ij} A_{ij} | f_i - g_j |^2 = f \tilde{L} g$

There are many tasks in natural language processing that are challenging. This blog entry is on text summarization, which briefly summarizes the survey article on this topic. (arXiv:1707.02268) The authors of the article defined the task to be

Automatic text summarization is the task of producing a concise and fluent summary while preserving key information content and overall meaning.

There are basically two approaches to this task:

• extractive summarization: identifying important sections of the text, and extracting them; and
• abstractive summarization: producing summary text in a new way.

Most algorithmic methods developed are of the extractive type, while most human writers summarize using abstractive approach. There are many methods in extractive approach, such as identifying given keywords, identifying sentences similar to the title, or wrangling the text at the beginning of the documents.

How do we instruct the machines to perform extractive summarization? The authors mentioned about two representations: topic and indicator. In topic representations, frequencies, tf-idf, latent semantic indexing (LSI), or topic models (such as latent Dirichlet allocation, LDA) are used. However, simply extracting these sentences out with these algorithms may not generate a readable summary. Employment of knowledge bases or considering contexts (from web search, e-mail conversation threads, scientific articles, author styles etc.) are useful.

In indicator representation, the authors mentioned the graph methods, inspired by PageRank. (see this) “Sentences form vertices of the graph and edges between the sentences indicate how similar the two sentences are.” And the key sentences are identified with ranking algorithms. Of course, machine learning methods can be used too.

Evaluation on the performance on text summarization is difficult. Human evaluation is unavoidable, but with manual approaches, some statistics can be calculated, such as ROUGE.

If you have been taking Andrew Ng’s deeplearning.ai course on Coursera, you must have learned in Course 1 about the graph operations, and the method of back propagation using derivatives in terms of graph. In fact, it is the basis of TensorFlow, a Python package commonly used in deep learning. Because it is based on the graph model of computation, we can see it as a “programming language.”

Google published a paper about the big picture of computational model in TensorFlow:

TensorFlow is a powerful, programmable system for machine learning. This paper aims to provide the basics of a conceptual framework for understanding the behavior of TensorFlow models during training and inference: it describes an operational semantics, of the kind common in the literature on programming languages. More broadly, the paper suggests that a programming-language perspective is fruitful in designing and in explaining systems such as TensorFlow.

Beware that this model is not limited to deep learning.

Topology has been around for centuries, but it did not catch the attention of many data analysts until recently. In an article published in Nature Scientific Reports, the authors demonstrated the power of topology in data analysis through examples including gene expression from breast rumors, voting data in the United States, and player performance data from the NBA. [Lum et. al. 2013]

As an introduction, they described topological methods “as a geometric approach to pattern or shape recognition within data.” It is true that in machine learning, we never care enough pattern recognition, but topology adds insights regarding the shapes of data that do not change with continuous deformation. For example, a circle and an ellipse have “the same topology.” The distances between data points are not as important as the shape. Traditional machine learning methods deal with feature vectors, distances, or classifications, but the topology of the data is usually discarded. Gunnar Carlsson demonstrated in a blog that a thin ellipse of data may be misrepresented as two straight parallel lines or one straight lines. [Carlsson 2015] Dimensionality reduction algorithms such as principal component analysis (PCA) often disregard the topology as well. (I heard that Kohenen’s self-organizing maps (SOM) [Kohonen 2000] retain the topology of higher dimensional data during the dimensionality reduction, but I am not confident enough to say that.)

Euler introduced the concept of topology in the 18th century. Topology has been a big subject in physics since 1950s. The string theory, as one of the many efforts in unifying gravity and other three fundamental forces, employs topological dimensions. In condensed matter physics, the fractional quantum Hall effect is a topological quantum effect. There are topological solitons [Rajaraman 1987] such as quantum vortices in superfluids, [Simula, Blakie 2006; Calzetta, Ho, Hu 2010] columns of topological solitons (believed to be Skyrmions) in helical magnets, [Mühlbauer et. al. 2009; Ho et. al. 2010; Ho 2012] hexagonal solitonic objects in smectic liquid crystals [Matsumoto et. al. 2009]… When a field becomes sophisticated, it becomes quantitative; when a quantitative field becomes sophisticated, it requires abstract mathematics such as topology for a general description. I believe analysis on any kinds of data is no exception.

There are some good reviews and readings about topological data analysis (TDA) out there, for example, the ones by Gunnar Carlsson [Carlsson 2009] and Afra Zomorodian [Zomorodian 2011]. While physicists talk about homotopy, data analysts talk about persistent homology as it is easier to compute. Data have to be described in a simplicial complex or a graph/network. Then the homology can be computed and represented in various ways such as barcodes. [Ghrist 2008] Then we extract insights about the data from it.

Topology has a steep learning curve. I am also a starter learning about this. This blog entry will not be the last talking about TDA. Therefore, I opened a new session called TDA for all of my blog entries about it. Let’s start the journey!

There is an R package called “TDA” that facilitates topological data analysis. [Fasy et. al. 2014] A taste of homology of a simplicial complex is also demonstrated in a Wolfram demo.

(Taken from TheGuardian)

This is an age of quantification, meaning that we want to give everything, even qualitative, a number. In schools, teachers measure how good their students master mathematics by grading, or scoring their homework. The funding agencies measure how good a scientist is by counting the number of his publications, the citations, and the impact factors. We measure how successful a person is by his annual income. We can question all these approaches of measurement. Yet however good or bad the measures are, we look for a metric to measure.

Original PageRank Algorithm

We measure webpages too. In the early ages of Internet, people performed searching on sites such as Yahoo or AltaVista. The keywords they entered are the main information for the browser to do the searching. However, a big problem was that a large number of low quality or irrelevant webpages showed up in search results. Some were due to malicious manipulation of keyword tricks. Therefore, it gave rise a need to rank the webpages. Larry Page and Sergey Brin, the founders of Google, tackled this problem as a thesis topic in Stanford University. But this got commercialized, and Brin never received his Ph.D. They published their algorithm, called PageRank, named after Larry Page, at the Seventh International World Wide Web Conference (WWW7) in April 1998. [Brin & Page 1998] This algorithm is regarded as one of the top ten algorithms in data mining by a survey paper published in the IEEE International Conference on Data Mining (ICDM) in December 2006. [Wu et. al. 2008]

Larry Page and Sergey Brin (source)

The idea of the PageRank algorithm is very simple. It regards each webpage as a node, and each link in the webpage as a directional edge from the source to the target webpage. This forms a network, or a directed graph, of webpages connected by their links. A link is seen as a vote to the target homepage, and if the source homepage ranks high, it enhances the target homepage’s ranking as well. Mathematically it involves solving a large matrix using Newton-Raphson’s method. (Technologies involving handling the large matrix led to the MapReduce programming paradigm, another big data trend nowadays.)

Example (made by Python with packages networkx and matplotlib)

Let’s have an intuition through an example. In the network, we can easily see that “Big Data 1” has the highest rank because it has the most edges pointing to it. However, there are pages such as “Big Data Fake 1,” which looks like a big data page, but in fact it points to “Porn 1.” After running the PageRank algorithm, it does not have a high rank. The sample of the output is:

[('Big Data 1', 0.00038399273501500979),
('Artificial Intelligence', 0.00034612564364377323),
('Deep Learning 1', 0.00034221161094691966),
('Machine Learning 1', 0.00034177713235138173),
('Porn 1', 0.00033859136614724074),
('Big Data 2', 0.00033182629176238337),
('Spark', 0.0003305912073357307),
('Dow-Jones 1', 0.00032368956852396916),
('Big Data 3', 0.00030969537721207128),
('Porn 2', 0.00030969537721207128),
('Big Data Fake 1', 0.00030735245262038724),
('Dow-Jones 2', 0.00030461420169420618),
('Machine Learning 2', 0.0003011838672138951),
('Deep Learning 2', 0.00029899313444392865),
('Econophysics', 0.00029810944592071552),
('Big Data Fake 2', 0.00029248837867043803),
('Wall Street', 0.00029248837867043803),
('Deep Learning 3', 0.00029248837867043803)]

You can see those pornographic webpages that pretend to be big data webpages do not have rank as high as those authentic ones. PageRank fights against spam and irrelevant webpages. Google later further improved the algorithm to combat more advanced tricks of spam pages.

You can refer other details in various sources and textbooks. [Rajaraman and Ullman 2011, Wu et. al. 2008]

Use in Social Media and Forums

Mathematically, the PageRank algorithm deals with a directional graph. As one can imagine, any systems that can be modeled as directional graph allow rooms for applying the PageRank algorithm. One extension of PageRank is ExpertiseRank.

Jun Zhang, Mark Ackerman and Lada Adamic published a conference paper in the International World Wide Web (WWW7) in May 2007. [Zhang, Ackerman & Adamic 2007] They investigated into a Java forum, by connecting users to posts and anyone replying to it as a directional graph. With an algorithm closely resembled PageRank, they found the experts and influential people in the forum.

Graphs in ExpertiseRank (take from [Zhang, Ackerman & Adamic 2007])

There are other algorithms like HITS (Hypertext induced topic selection) that does similar things. And social media such as Quora (and its Chinese counterpart, Zhihu) applied a link analysis algorithm (probabilistic topic network, see this.) to perform topic network building. Similar ideas are also applied to identify high-quality content in Yahoo! Answers. [Agichtein, Castillo, Donato, Gionis & Mishne 2008]

Use in Finance and Econophysics

PageRank algorithm is also applied outside information technology fields. Financial engineers and econophysicists applied an algorithm, called DebtRank, which is very similar to PageRank, to determine the systemically important financial institutions in a financial network. This work is published in Nature Scientific Reports. [Battiston, Puliga, Kaushik, Tasca & Caldarelli 2012] In their study, each node represents a financial institution, and a directional edge means the estimated potential impact of an institution to another one. Using DebtRank, we are able to identify the centrally important institutions that potentially impacted other institutions in the network once a financial crisis occurs.

D
ebtRank network, taken from [Battiston, Puliga, Kaushik, Tasca & Caldarelli 2012])