Constructing Connectivities

In my previous blog post, I introduced the newly emerged topological data analysis (TDA). Unlike most of the other data analytic algorithms, TDA, concerning the topology as its name tells, cares for the connectivity of points, instead of the distance (according to a metric, whether it is Euclidean, Manhattan, Minkowski or any other). What is the best tools to describe topology?

Physicists use a lot homotopy. But for the sake of computation, it is better to use a scheme that are suited for discrete computation. It turns out that there are useful tools in algebraic topology: homology. But to understand homology, we need to understand what a simplicial complex is.

Gunnar Carlsson [Carlsson 2009] and Afra Zomorodian [Zomorodian 2011] wrote good reviews about them, although from a different path in introducing the concept. I first followed Zomorodian’s review [Zomorodian 2011], then his book [Zomorodian 2009] that filled in a lot of missing links in his review, to a certain point. I recently started reading Carlsson’s review.

One must first understand what a simplicial complex is. Without giving too much technical details, simplicial complex is basically a shape connecting points together. A line is a 1-simplex, connecting two points. A triangle is a 2-simplex. A tetrahedron is a 3-complex. There are other more complicated and unnamed complexes. Any subsets of a simplicial complex are faces. For example, the sides of the triangle are faces. The faces and the sides are the faces of the tetrahedron. (Refer to Wolfram MathWorld for more details. There are a lot of good tutorials online.)

Implementing Simplicial Complex

We can easily encoded this into a python code. I wrote a class SimplicialComplex in Python to implement this. We first import necessary libraries:

import numpy as np
from itertools import combinations
from scipy.sparse import dok_matrix
from operator import add

The first line imports the numpy library, the second the iteration tools necessary for extracting the faces for simplicial complex, the third the sparse matrix implementation in the scipy library (applied on something that I will not go over in this blog entry), and the fourth for some reduce operation.

We want to describe the simplicial complexes in the order of some labels (which can be anything, such as integers or strings). If it is a point, then it can be represented as tuples, as below:

 (1,) 

Or if it is a line (a 1-simplex), then

 (1, 2) 

Or a 2-simplex as a triangle, then

 (1, 2, 3) 

I think you get the gist. The integers 1, 2, or 3 here are simply labels. We can easily store this in the class:

class SimplicialComplex:
  def __init__(self, simplices=[]):
    self.import_simplices(simplices=simplices)

  def import_simplices(self, simplices=[]):
    self.simplices = map(lambda simplex: tuple(sorted(simplex)), simplices)
    self.face_set = self.faces()

You might observe the last line of the codes above. And it is for calculating all the faces of this complex, and it is implemented in this way:

  def faces(self):
    faceset = set()
    for simplex in self.simplices:
      numnodes = len(simplex)
      for r in range(numnodes, 0, -1):
        for face in combinations(simplex, r):
          faceset.add(face)
    return faceset

The faces are intuitively sides of a 2D shape (2-simplex), or planes of a 3D shape (3-simplex). But the faces of a 3-simplex includes the faces of all its faces. All the faces are saved in a field called faceset. If the user wants to retrieve the faces in a particular dimension, they can call this method:

  def n_faces(self, n):
    return filter(lambda face: len(face)==n+1, self.face_set)

There are other methods that I am not going over in this blog entry. Now let us demonstrate how to use the class by implementing a tetrahedron.

sc = SimplicialComplex([('a', 'b', 'c', 'd')])

If we want to extract the faces, then enter:

sc.faces()

which outputs:

{('a',),
 ('a', 'b'),
 ('a', 'b', 'c'),
 ('a', 'b', 'c', 'd'),
 ('a', 'b', 'd'),
 ('a', 'c'),
 ('a', 'c', 'd'),
 ('a', 'd'),
 ('b',),
 ('b', 'c'),
 ('b', 'c', 'd'),
 ('b', 'd'),
 ('c',),
 ('c', 'd'),
 ('d',)}

We have gone over the basis of simplicial complex, which is the foundation of TDA. We appreciate that the simplicial complex deals only with the connectivity of points instead of the distances between the points. And the homology groups will be calculated based on this. However, how do we obtain the simplicial complex from the discrete data we have? Zomorodian’s review [Zomorodian 2011] gave a number of examples, but I will only go through two of them only. And from this, you can see that to establish the connectivity between points, we still need to apply some sort of distance metrics.

Alpha Complex

An alpha complex is the nerve of the cover of the restricted Voronoi regions. (Refer the details to Zomorodian’s review [Zomorodian 2011], this Wolfram MathWorld entry, or this Wolfram Demonstration.) We can extend the class SimplicialComplex to get a class AlphaComplex:

from scipy.spatial import Delaunay, distance
from operator import or_
from functools import partial

def facesiter(simplex):
  for i in range(len(simplex)):
    yield simplex[:i]+simplex[(i+1):]

def flattening_simplex(simplices):
  for simplex in simplices:
    for point in simplex:
      yield point

def get_allpoints(simplices):
  return set(flattening_simplex(simplices))

def contain_detachededges(simplex, distdict, epsilon):
  if len(simplex)==2:
    return (distdict[simplex[0], simplex[1]] > 2*epsilon)
  else:
    return reduce(or_, map(partial(contain_detachededges, distdict=distdict, epsilon=epsilon), facesiter(simplex)))

class AlphaComplex(SimplicialComplex):
  def __init__(self, points, epsilon, labels=None, distfcn=distance.euclidean):
    self.pts = points
    self.labels = range(len(self.pts)) if labels==None or len(labels)!=len(self.pts) else labels
    self.epsilon = epsilon
    self.distfcn = distfcn
    self.import_simplices(self.construct_simplices(self.pts, self.labels, self.epsilon, self.distfcn))

  def calculate_distmatrix(self, points, labels, distfcn):
    distdict = {}
    for i in range(len(labels)):
      for j in range(len(labels)):
        distdict[(labels[i], labels[j])] = distfcn(points[i], points[j])
    return distdict

  def construct_simplices(self, points, labels, epsilon, distfcn):
    delaunay = Delaunay(points)
    delaunay_simplices = map(tuple, delaunay.simplices)
    distdict = self.calculate_distmatrix(points, labels, distfcn)

    simplices = []
    for simplex in delaunay_simplices:
      faces = list(facesiter(simplex))
      detached = map(partial(contain_detachededges, distdict=distdict, epsilon=epsilon), faces)
      if reduce(or_, detached):
        if len(simplex)>2:
          for face, notkeep in zip(faces, detached):
            if not notkeep:
              simplices.append(face)
      else:
        simplices.append(simplex)
    simplices = map(lambda simplex: tuple(sorted(simplex)), simplices)
    simplices = list(set(simplices))

    allpts = get_allpoints(simplices)
    for point in (set(labels)-allpts):
      simplices += [(point,)]

    return simplices

The scipy package already has a package to calculate Delaunay region. The function contain_detachededges is for constructing the restricted Voronoi region from the calculated Delaunay region.

This class demonstrates how an Alpha Complex is constructed, but this runs slowly once the number of points gets big!

Vietoris-Rips (VR) Complex

Another commonly used complex is called the Vietoris-Rips (VR) Complex, which connects points as the edge of a graph if they are close enough. (Refer to Zomorodian’s review [Zomorodian 2011] or this Wikipedia page for details.) To implement this, import the famous networkx originally designed for network analysis.

import networkx as nx
from scipy.spatial import distance
from itertools import product

class VietorisRipsComplex(SimplicialComplex):
  def __init__(self, points, epsilon, labels=None, distfcn=distance.euclidean):
    self.pts = points
    self.labels = range(len(self.pts)) if labels==None or len(labels)!=len(self.pts) else labels
    self.epsilon = epsilon
    self.distfcn = distfcn
    self.network = self.construct_network(self.pts, self.labels, self.epsilon, self.distfcn)
    self.import_simplices(map(tuple, list(nx.find_cliques(self.network))))

  def construct_network(self, points, labels, epsilon, distfcn):
    g = nx.Graph()
    g.add_nodes_from(labels)
    zips = zip(points, labels)
    for pair in product(zips, zips):
      if pair[0][1]!=pair[1][1]:
        dist = distfcn(pair[0][0], pair[1][0])
        if dist<epsilon:
          g.add_edge(pair[0][1], pair[1][1])
    return g

The intuitiveness and efficiencies are the reasons that VR complexes are widely used.

For more details about the Alpha Complexes, VR Complexes and the related Čech Complexes, refer to this page.

More…

There are other commonly used complexes used, including Witness Complex, Cubical Complex etc., which I leave no introductions. Upon building the complexes, we can analyze the topology by calculating their homology groups, Betti numbers, the persistent homology etc. I wish to write more about it soon.

Taken from Wolfram Mathworld
Taken from Wolfram Mathworld

Continue reading “Constructing Connectivities”

Stochastics and Sentiment Analysis in Wall Street

Wall Street is not only a place of facilitating the money flow, but also a playground for scientists.

When I was young, I saw one of my uncles plotting prices for stocks to perform technical analysis. When I was in college, my friends often talked about investing in a few financial futures and options. When I was doing my graduate degree in physics, we studied John Hull’s famous textbook [Hull 2011] on quantitative finance to learn about financial modeling. A few of my classmates went to Wall Street to become quantitative analysts or financial software developers. There are ups and downs in the financial markets. But as long as we are in a capitalist society, finance is a subject we never ignore. However, scientists have not come up with a consensus about the nature of a financial market.

Agent-Based Models

Economists believe that individuals in a market are rational being who always aim at maximizing their profits. They often apply agent-based models, which employs complex system theories or game theory.

Random Processes and Statistical Physics

However, a lot of mathematicians in Wall Street (including quantitative analysts and econophysicists) see the stock prices as undergoing Brownian motion. [Hull 2011, Baaquie 2007] They employ tools in statistical physics and stochastic processes to study the pricings of various financial derivatives. Therefore, the random-process and econophysical approaches have nothing much about stock price prediction (despite the fact that they do need a “return rate” in their model.) Random processes are unpredictable.

However, some sort of predictions carry great values. For example, when there is overhypes or bubbles in the market, we want to know when it will burst. There are models that predict defaults and bubble burst in a market using the log-periodic power law (LPPL). [Wosnitza, Denz 2013] In addition, there has been research showing the leverage effect in stock markets in developed countries such as Germany (c.f. fluctuation-dissipation theorem in statistical physics), and anti-leverage effect in China (Shanghai and Shenzhen). [Qiu, Zhen, Ren, Trimper 2006]

Reconciling Intelligence and Randomness

There are some values to both views. It is hard to believe that stock prices are completely random, as the economic environment and the public opinions must affect the stock prices. People can neither be completely rational nor completely random.

There has been some study in reconciling game theory and random processes, in an attempt to bring economists and mathematicians together. In this theoretical framework, financial systems still sought to attain the maximum entropy (randomness), but the “particles” in the system behaves intelligently. [Venkatasubramanian, Luo, Sethuraman 2015] (See my another blog entry: MathAnalytics (1) – Beautiful Mind, Physical Nature and Economic Inequality) We are not sure how successful this attempt will be at this point.

Sentiment Analysis

As people are talking about big data in recent years, there have been attempts to apply machine learning algorithms in finance. However, scientists tend not to price using machine learning algorithms because these algorithms mostly perform classification. However, there are attempts, with natural language processing (NLP) techniques, to predict the stock prices by detecting the public emotions (or sentiments) in social media such as Twitter. [Bollen, Mao, Zeng 2010] It has been found that measuring the public mood in a few dimensions (including Calm, Alert, Sure, Vital, Kind, and Happy) allows scientists to accurately predict the trend of Dow Jones Industrial Average (DJIA). However, some hackers take advantage on the sentiment analysis on Twitter. In 2013, there was a rumor on Twitter saying the White House being bombed, The computers responded instantly and automatically by performing trading, causing the stock market to fall immediately. But the market restored quickly after it was discovered that the news was fake. (Fig. 1)

Fig. 1: DJIA fell because of a rumor of the White House being bombed, but recovered when discovered the news was fake (taken from http://www.rt.com/news/syrian-electronic-army-ap-twitter-349/)

P.S.: While I was writing this, I saw an interesting statement in the paper about leverage effect. [Qiu, Zhen, Ren, Trimper 2006] The authors said that:

Why do the German and Chinese markets exhibit different return-volatility correlations? Germany is a developed country. To some extent, people show risk aversion, and therefore, may be nervous in trading as the stock price is falling. This induces a higher volatility. When the price is rising, people feel safe and are inactive in trading. Thus, the stock price tends to be stable. This should be the social origin of the leverage effect. However, China just experiences the first stage of capitalism, and people are somewhat excessive speculative in the financial markets. Therefore, people rush for trading as the stock price increases. When the price drops, people stay inactive in trading and wait for rising up of the stock price. That explains the antileverage effect.

Does this paragraph written in 2006 give a hint of what happened in China in 2015 now? (Fig. 2)

Fig. 2: The fall of Chinese stock market in 2015 (taken from http://www.economicpolicyjournal.com/2015/06/breaking-biggest-chinese-stock-market.html)

Continue reading “Stochastics and Sentiment Analysis in Wall Street”

Lyrics Generation

Eminem (taken from web)

When I saw the “standardized” style of writing written in the classic book “Elements of Style” written by William Strunk, I have been wondering if the style of writing can be programmable. And now, with artificial intelligence, people can write automated codes that generate lyrics. In a paper written by Eric Malmi and his collaborators [Malmi, Takala, Toivonen, Raiko, Gionis 2015], the system DopeLearning, which generates rap lyrics to certain complexities, was introduced. It applies two machine learning techniques, namely the RankSVM and the deep neural network. It is fascinating that automated codes can be creative to produce complex artistic things, as the abstract says:

Writing rap lyrics requires both creativity, to construct a meaningful and an interesting story, and lyrical skills, to produce complex rhyme patterns, which are the cornerstone of a good flow.

What does DopeLearning produce? See the example the paper gives:

For a chance at romance I would love to enhance (Big Daddy Kane – The Day You’re Mine)
But everything I love has turned to a tedious task (Jedi Mind Tricks – Black Winter Day)
One day we gonna have to leave our love in the past (Lil Wayne – Marvin’s Room)
I love my fans but no one ever puts a grasp (Eminem – Say Goodbye Hollywood)
I love you momma I love my momma – I love you momma (Snoop Dogg – I Love My Momma)
And I would love to have a thing like you on my team you take care (Ghostface Killah – Paragraphs Of Love)
I love it when it’s sunny Sonny girl you could be my Cher (Common – Make My Day)
I’m in a love affair I can’t share it ain’t fair (Snoop Dogg – Show Me Love)
Haha I’m just playin’ ladies you know I love you. (Eminem – Kill You)
I know my love is true and I know you love me too (Everlast – On The Edge)
Girl I’m down for whatever cause my love is true (Lil Wayne – Sean Kingston I’m At War)
This one goes to my man old dirty one love we be swigging brew (Big Daddy Kane – Entaprizin)
My brother I love you Be encouraged man And just know (Tech N9ne – Need More Angels)
When you done let me know cause my love make you be like WHOA (Missy Elliot – Dog In Heat)
If I can’t do it for the love then do it I won’t (KRS One – Take It To God)
All I know is I love you too much to walk away though (Eminem – Love The Way You Lie)

There are similar work for Chinese Mandopop, using RNN. Chinese readers can refer to this blog post: http://phunters.lofter.com/post/86d56_732209b.

Continue reading “Lyrics Generation”

Partitioning Antibodies: HTJoinSolver

HTJoinSolver-YIcon of HTJoinSolver

Big data is also impacting biomedical research and clinical processes. In 1987,  Susumu Tonegawa was awarded the Nobel Prize in Physiology or Medicine “for his discovery of the genetic principle for generation of antibody diversity.” The mechanism that forms the diverse antibodies in our bodies is called V(D)J recombination, or less commonly known as somatic recombination. [Tonegawa 1983] Since then, there are a lot of bioinformatician work on this.

Antibody_IgG2
Antibody (taken from Wikipedia)

Antibodies are often referred to as immunoglobulin (Ig). It is a combination of three gene segments: the Variety (V), the Diversity (D), and the Joining (J) gene segments. IMGT standardized the types of different segments of Ig.

There are a number of computational tools that partition the different segments of Ig. One of the most promising tools is the JoinSolver, developed by the collaboration between Center for Information Technology (CIT) and National Institute of Allergic and Infectious Disease (NIAID) of National Institutes of Health (NIH). [Souto-Carneiro, Longo, Russ, Sun & Lipsky 2004] However, as JoinSolver was not designed to handle insertions and deletions on gene sequences, a further improvement was needed. And the vast volume of sequence data urged a tool that can handle them with a more efficient algorithm, and run on multi-processing systems. It is how the idea of HTJoinSolver formed, where HT stands for “high-throughput.” Russ, Ho and Longo published their work in BMC Bioinformatics [Russ, Ho & Longo 2015], and the tool is available on their website.

HTJoinSolver is a collaboration of Division of Computational Biosciences (DCB) of CIT, and NIAID in NIH. HTJoinSolver partitions an Ig using an efficient dynamic programming (DP) algorithm that employs prior biological information of Ig. Usual DP algorithms, such as Smith-Waterman algorithm, compare two sequences with a full matrix of size mxn, where m and n are the lengths of the two sequences. (Refer to [Durbin, Eddy, Krogh & Mitchison 1998] for more details.) However, with a known motif in the V segment of Ig, TATTAGTGT, HTJoinSolver speeds up the comparison by filling the diagonals only, unless there are some variations that require full computation in some small regions of the matrix, as shown below:

s12859-015-0589-x-2-lApproximate DP algorithm in HTJoinSolver, taken from [Russ, Ho & Longo 2015]

After partitioning, the tool further analyzes the sequences, such as CDR3, excision, and mutation rate. This tool identifies various segments of Ig with extremely high accuracies even if the mutation probabilities of the Ig’s are as high as 30%. It speeds up the research and clinical process of immunologists and clinicians.

This is a very good example of big data in biomedical applications.

Continue reading “Partitioning Antibodies: HTJoinSolver”

Ranking Everything: an Overview of Link Analysis Using PageRank Algorithm

This is an age of quantification, meaning that we want to give everything, even qualitative, a number. In schools, teachers measure how good their students master mathematics by grading, or scoring their homework. The funding agencies measure how good a scientist is by counting the number of his publications, the citations, and the impact factors. We measure how successful a person is by his annual income. We can question all these approaches of measurement. Yet however good or bad the measures are, we look for a metric to measure.

Original PageRank Algorithm

We measure webpages too. In the early ages of Internet, people performed searching on sites such as Yahoo or AltaVista. The keywords they entered are the main information for the browser to do the searching. However, a big problem was that a large number of low quality or irrelevant webpages showed up in search results. Some were due to malicious manipulation of keyword tricks. Therefore, it gave rise a need to rank the webpages. Larry Page and Sergey Brin, the founders of Google, tackled this problem as a thesis topic in Stanford University. But this got commercialized, and Brin never received his Ph.D. They published their algorithm, called PageRank, named after Larry Page, at the Seventh International World Wide Web Conference (WWW7) in April 1998. [Brin & Page 1998] This algorithm is regarded as one of the top ten algorithms in data mining by a survey paper published in the IEEE International Conference on Data Mining (ICDM) in December 2006. [Wu et. al. 2008]

Google-s-Larry-Page-and-Sergey-Brin-Are-3-2-19-Billion-Richer-in-One-Day-392729-2
Larry Page and Sergey Brin (source)

The idea of the PageRank algorithm is very simple. It regards each webpage as a node, and each link in the webpage as a directional edge from the source to the target webpage. This forms a network, or a directed graph, of webpages connected by their links. A link is seen as a vote to the target homepage, and if the source homepage ranks high, it enhances the target homepage’s ranking as well. Mathematically it involves solving a large matrix using Newton-Raphson’s method. (Technologies involving handling the large matrix led to the MapReduce programming paradigm, another big data trend nowadays.)

figure_1_webnet
Example (made by Python with packages networkx and matplotlib)

Let’s have an intuition through an example. In the network, we can easily see that “Big Data 1” has the highest rank because it has the most edges pointing to it. However, there are pages such as “Big Data Fake 1,” which looks like a big data page, but in fact it points to “Porn 1.” After running the PageRank algorithm, it does not have a high rank. The sample of the output is:

[('Big Data 1', 0.00038399273501500979),
('Artificial Intelligence', 0.00034612564364377323),
('Deep Learning 1', 0.00034221161094691966),
('Machine Learning 1', 0.00034177713235138173),
('Porn 1', 0.00033859136614724074),
('Big Data 2', 0.00033182629176238337),
('Spark', 0.0003305912073357307),
('Hadoop', 0.00032928389859040422),
('Dow-Jones 1', 0.00032368956852396916),
('Big Data 3', 0.00030969537721207128),
('Porn 2', 0.00030969537721207128),
('Big Data Fake 1', 0.00030735245262038724),
('Dow-Jones 2', 0.00030461420169420618),
('Machine Learning 2', 0.0003011838672138951),
('Deep Learning 2', 0.00029899313444392865),
('Econophysics', 0.00029810944592071552),
('Big Data Fake 2', 0.00029248837867043803),
('Wall Street', 0.00029248837867043803),
('Deep Learning 3', 0.00029248837867043803)]

You can see those pornographic webpages that pretend to be big data webpages do not have rank as high as those authentic ones. PageRank fights against spam and irrelevant webpages. Google later further improved the algorithm to combat more advanced tricks of spam pages.

You can refer other details in various sources and textbooks. [Rajaraman and Ullman 2011, Wu et. al. 2008]

Use in Social Media and Forums

Mathematically, the PageRank algorithm deals with a directional graph. As one can imagine, any systems that can be modeled as directional graph allow rooms for applying the PageRank algorithm. One extension of PageRank is ExpertiseRank.

Jun Zhang, Mark Ackerman and Lada Adamic published a conference paper in the International World Wide Web (WWW7) in May 2007. [Zhang, Ackerman & Adamic 2007] They investigated into a Java forum, by connecting users to posts and anyone replying to it as a directional graph. With an algorithm closely resembled PageRank, they found the experts and influential people in the forum.

expertiserank
Graphs in ExpertiseRank (take from [Zhang, Ackerman & Adamic 2007])

There are other algorithms like HITS (Hypertext induced topic selection) that does similar things. And social media such as Quora (and its Chinese counterpart, Zhihu) applied a link analysis algorithm (probabilistic topic network, see this.) to perform topic network building. Similar ideas are also applied to identify high-quality content in Yahoo! Answers. [Agichtein, Castillo, Donato, Gionis & Mishne 2008]

Use in Finance and Econophysics

PageRank algorithm is also applied outside information technology fields. Financial engineers and econophysicists applied an algorithm, called DebtRank, which is very similar to PageRank, to determine the systemically important financial institutions in a financial network. This work is published in Nature Scientific Reports. [Battiston, Puliga, Kaushik, Tasca & Caldarelli 2012] In their study, each node represents a financial institution, and a directional edge means the estimated potential impact of an institution to another one. Using DebtRank, we are able to identify the centrally important institutions that potentially impacted other institutions in the network once a financial crisis occurs.

debtrank
D
ebtRank network, taken from [Battiston, Puliga, Kaushik, Tasca & Caldarelli 2012])

Continue reading “Ranking Everything: an Overview of Link Analysis Using PageRank Algorithm”

Blog at WordPress.com.

Up ↑