Constructing Connectivities

In my previous blog post, I introduced the newly emerged topological data analysis (TDA). Unlike most of the other data analytic algorithms, TDA, concerning the topology as its name tells, cares for the connectivity of points, instead of the distance (according to a metric, whether it is Euclidean, Manhattan, Minkowski or any other). What is the best tools to describe topology?

Physicists use a lot homotopy. But for the sake of computation, it is better to use a scheme that are suited for discrete computation. It turns out that there are useful tools in algebraic topology: homology. But to understand homology, we need to understand what a simplicial complex is.

Gunnar Carlsson [Carlsson 2009] and Afra Zomorodian [Zomorodian 2011] wrote good reviews about them, although from a different path in introducing the concept. I first followed Zomorodian’s review [Zomorodian 2011], then his book [Zomorodian 2009] that filled in a lot of missing links in his review, to a certain point. I recently started reading Carlsson’s review.

One must first understand what a simplicial complex is. Without giving too much technical details, simplicial complex is basically a shape connecting points together. A line is a 1-simplex, connecting two points. A triangle is a 2-simplex. A tetrahedron is a 3-complex. There are other more complicated and unnamed complexes. Any subsets of a simplicial complex are faces. For example, the sides of the triangle are faces. The faces and the sides are the faces of the tetrahedron. (Refer to Wolfram MathWorld for more details. There are a lot of good tutorials online.)

Implementing Simplicial Complex

We can easily encoded this into a python code. I wrote a class SimplicialComplex in Python to implement this. We first import necessary libraries:

import numpy as np
from itertools import combinations
from scipy.sparse import dok_matrix
from operator import add

The first line imports the numpy library, the second the iteration tools necessary for extracting the faces for simplicial complex, the third the sparse matrix implementation in the scipy library (applied on something that I will not go over in this blog entry), and the fourth for some reduce operation.

We want to describe the simplicial complexes in the order of some labels (which can be anything, such as integers or strings). If it is a point, then it can be represented as tuples, as below:

 (1,)

Or if it is a line (a 1-simplex), then

 (1, 2)

Or a 2-simplex as a triangle, then

 (1, 2, 3)

I think you get the gist. The integers 1, 2, or 3 here are simply labels. We can easily store this in the class:

class SimplicialComplex:
  def __init__(self, simplices=[]):
    self.import_simplices(simplices=simplices)

  def import_simplices(self, simplices=[]):
    self.simplices = map(lambda simplex: tuple(sorted(simplex)), simplices)
    self.face_set = self.faces()

You might observe the last line of the codes above. And it is for calculating all the faces of this complex, and it is implemented in this way:

  def faces(self):
    faceset = set()
    for simplex in self.simplices:
      numnodes = len(simplex)
      for r in range(numnodes, 0, -1):
        for face in combinations(simplex, r):
          faceset.add(face)
    return faceset

The faces are intuitively sides of a 2D shape (2-simplex), or planes of a 3D shape (3-simplex). But the faces of a 3-simplex includes the faces of all its faces. All the faces are saved in a field called faceset. If the user wants to retrieve the faces in a particular dimension, they can call this method:

  def n_faces(self, n):
    return filter(lambda face: len(face)==n+1, self.face_set)

There are other methods that I am not going over in this blog entry. Now let us demonstrate how to use the class by implementing a tetrahedron.

sc = SimplicialComplex([('a', 'b', 'c', 'd')])

If we want to extract the faces, then enter:

sc.faces()

which outputs:

{('a',),
 ('a', 'b'),
 ('a', 'b', 'c'),
 ('a', 'b', 'c', 'd'),
 ('a', 'b', 'd'),
 ('a', 'c'),
 ('a', 'c', 'd'),
 ('a', 'd'),
 ('b',),
 ('b', 'c'),
 ('b', 'c', 'd'),
 ('b', 'd'),
 ('c',),
 ('c', 'd'),
 ('d',)}

We have gone over the basis of simplicial complex, which is the foundation of TDA. We appreciate that the simplicial complex deals only with the connectivity of points instead of the distances between the points. And the homology groups will be calculated based on this. However, how do we obtain the simplicial complex from the discrete data we have? Zomorodian’s review [Zomorodian 2011] gave a number of examples, but I will only go through two of them only. And from this, you can see that to establish the connectivity between points, we still need to apply some sort of distance metrics.

Alpha Complex

An alpha complex is the nerve of the cover of the restricted Voronoi regions. (Refer the details to Zomorodian’s review [Zomorodian 2011], this Wolfram MathWorld entry, or this Wolfram Demonstration.) We can extend the class SimplicialComplex to get a class AlphaComplex:

from scipy.spatial import Delaunay, distance
from operator import or_
from functools import partial

def facesiter(simplex):
  for i in range(len(simplex)):
    yield simplex[:i]+simplex[(i+1):]

def flattening_simplex(simplices):
  for simplex in simplices:
    for point in simplex:
      yield point

def get_allpoints(simplices):
  return set(flattening_simplex(simplices))

def contain_detachededges(simplex, distdict, epsilon):
  if len(simplex)==2:
    return (distdict[simplex[0], simplex[1]] &gt; 2*epsilon)
  else:
    return reduce(or_, map(partial(contain_detachededges, distdict=distdict, epsilon=epsilon), facesiter(simplex)))

class AlphaComplex(SimplicialComplex):
  def __init__(self, points, epsilon, labels=None, distfcn=distance.euclidean):
    self.pts = points
    self.labels = range(len(self.pts)) if labels==None or len(labels)!=len(self.pts) else labels
    self.epsilon = epsilon
    self.distfcn = distfcn
    self.import_simplices(self.construct_simplices(self.pts, self.labels, self.epsilon, self.distfcn))

  def calculate_distmatrix(self, points, labels, distfcn):
    distdict = {}
    for i in range(len(labels)):
      for j in range(len(labels)):
        distdict[(labels[i], labels[j])] = distfcn(points[i], points[j])
    return distdict

  def construct_simplices(self, points, labels, epsilon, distfcn):
    delaunay = Delaunay(points)
    delaunay_simplices = map(tuple, delaunay.simplices)
    distdict = self.calculate_distmatrix(points, labels, distfcn)

    simplices = []
    for simplex in delaunay_simplices:
      faces = list(facesiter(simplex))
      detached = map(partial(contain_detachededges, distdict=distdict, epsilon=epsilon), faces)
      if reduce(or_, detached):
        if len(simplex)&gt;2:
          for face, notkeep in zip(faces, detached):
            if not notkeep:
              simplices.append(face)
      else:
        simplices.append(simplex)
    simplices = map(lambda simplex: tuple(sorted(simplex)), simplices)
    simplices = list(set(simplices))

    allpts = get_allpoints(simplices)
    for point in (set(labels)-allpts):
      simplices += [(point,)]

    return simplices

The scipy package already has a package to calculate Delaunay region. The function contain_detachededges is for constructing the restricted Voronoi region from the calculated Delaunay region.

This class demonstrates how an Alpha Complex is constructed, but this runs slowly once the number of points gets big!

Vietoris-Rips (VR) Complex

Another commonly used complex is called the Vietoris-Rips (VR) Complex, which connects points as the edge of a graph if they are close enough. (Refer to Zomorodian’s review [Zomorodian 2011] or this Wikipedia page for details.) To implement this, import the famous networkx originally designed for network analysis.

import networkx as nx
from scipy.spatial import distance
from itertools import product

class VietorisRipsComplex(SimplicialComplex):
  def __init__(self, points, epsilon, labels=None, distfcn=distance.euclidean):
    self.pts = points
    self.labels = range(len(self.pts)) if labels==None or len(labels)!=len(self.pts) else labels
    self.epsilon = epsilon
    self.distfcn = distfcn
    self.network = self.construct_network(self.pts, self.labels, self.epsilon, self.distfcn)
    self.import_simplices(map(tuple, list(nx.find_cliques(self.network))))

  def construct_network(self, points, labels, epsilon, distfcn):
    g = nx.Graph()
    g.add_nodes_from(labels)
    zips = zip(points, labels)
    for pair in product(zips, zips):
      if pair[0][1]!=pair[1][1]:
        dist = distfcn(pair[0][0], pair[1][0])
        if dist&lt;epsilon:
          g.add_edge(pair[0][1], pair[1][1])
    return g

The intuitiveness and efficiencies are the reasons that VR complexes are widely used.

For more details about the Alpha Complexes, VR Complexes and the related Čech Complexes, refer to this page.

More…

There are other commonly used complexes used, including Witness Complex, Cubical Complex etc., which I leave no introductions. Upon building the complexes, we can analyze the topology by calculating their homology groups, Betti numbers, the persistent homology etc. I wish to write more about it soon.

Continue reading “Constructing Connectivities” →

September 14, 2015 5

Stochastics and Sentiment Analysis in Wall Street

Wall Street is not only a place of facilitating the money flow, but also a playground for scientists.

When I was young, I saw one of my uncles plotting prices for stocks to perform technical analysis. When I was in college, my friends often talked about investing in a few financial futures and options. When I was doing my graduate degree in physics, we studied John Hull’s famous textbook [Hull 2011] on quantitative finance to learn about financial modeling. A few of my classmates went to Wall Street to become quantitative analysts or financial software developers. There are ups and downs in the financial markets. But as long as we are in a capitalist society, finance is a subject we never ignore. However, scientists have not come up with a consensus about the nature of a financial market.

Agent-Based Models

Economists believe that individuals in a market are rational being who always aim at maximizing their profits. They often apply agent-based models, which employs complex system theories or game theory.

Random Processes and Statistical Physics

However, a lot of mathematicians in Wall Street (including quantitative analysts and econophysicists) see the stock prices as undergoing Brownian motion. [Hull 2011, Baaquie 2007] They employ tools in statistical physics and stochastic processes to study the pricings of various financial derivatives. Therefore, the random-process and econophysical approaches have nothing much about stock price prediction (despite the fact that they do need a “return rate” in their model.) Random processes are unpredictable.

However, some sort of predictions carry great values. For example, when there is overhypes or bubbles in the market, we want to know when it will burst. There are models that predict defaults and bubble burst in a market using the log-periodic power law (LPPL). [Wosnitza, Denz 2013] In addition, there has been research showing the leverage effect in stock markets in developed countries such as Germany (c.f. fluctuation-dissipation theorem in statistical physics), and anti-leverage effect in China (Shanghai and Shenzhen). [Qiu, Zhen, Ren, Trimper 2006]

Reconciling Intelligence and Randomness

There are some values to both views. It is hard to believe that stock prices are completely random, as the economic environment and the public opinions must affect the stock prices. People can neither be completely rational nor completely random.

There has been some study in reconciling game theory and random processes, in an attempt to bring economists and mathematicians together. In this theoretical framework, financial systems still sought to attain the maximum entropy (randomness), but the “particles” in the system behaves intelligently. [Venkatasubramanian, Luo, Sethuraman 2015] (See my another blog entry: MathAnalytics (1) – Beautiful Mind, Physical Nature and Economic Inequality) We are not sure how successful this attempt will be at this point.

Sentiment Analysis

As people are talking about big data in recent years, there have been attempts to apply machine learning algorithms in finance. However, scientists tend not to price using machine learning algorithms because these algorithms mostly perform classification. However, there are attempts, with natural language processing (NLP) techniques, to predict the stock prices by detecting the public emotions (or sentiments) in social media such as Twitter. [Bollen, Mao, Zeng 2010] It has been found that measuring the public mood in a few dimensions (including Calm, Alert, Sure, Vital, Kind, and Happy) allows scientists to accurately predict the trend of Dow Jones Industrial Average (DJIA). However, some hackers take advantage on the sentiment analysis on Twitter. In 2013, there was a rumor on Twitter saying the White House being bombed, The computers responded instantly and automatically by performing trading, causing the stock market to fall immediately. But the market restored quickly after it was discovered that the news was fake. (Fig. 1)

Fig. 1: DJIA fell because of a rumor of the White House being bombed, but recovered when discovered the news was fake (taken from http://www.rt.com/news/syrian-electronic-army-ap-twitter-349/)

P.S.: While I was writing this, I saw an interesting statement in the paper about leverage effect. [Qiu, Zhen, Ren, Trimper 2006] The authors said that:

Why do the German and Chinese markets exhibit different return-volatility correlations? Germany is a developed country. To some extent, people show risk aversion, and therefore, may be nervous in trading as the stock price is falling. This induces a higher volatility. When the price is rising, people feel safe and are inactive in trading. Thus, the stock price tends to be stable. This should be the social origin of the leverage effect. However, China just experiences the first stage of capitalism, and people are somewhat excessive speculative in the financial markets. Therefore, people rush for trading as the stock price increases. When the price drops, people stay inactive in trading and wait for rising up of the stock price. That explains the antileverage effect.

Does this paragraph written in 2006 give a hint of what happened in China in 2015 now? (Fig. 2)

Fig. 2: The fall of Chinese stock market in 2015 (taken from http://www.economicpolicyjournal.com/2015/06/breaking-biggest-chinese-stock-market.html)

Continue reading “Stochastics and Sentiment Analysis in Wall Street” →

July 17, 2015 0

Learning by Zooming Out

Deep learning, a collection of related neural network algorithms, has been proved successful in certain types of machine learning tasks in computer vision, speech recognition, data cleaning, and natural language processing (NLP). [Mikolov et. al. 2013] However, it was unclear how deep learning can be so successful. It looks like a black box with messy inputs and excellent outputs. So why is it so successful?

A friend of mine showed me this article in the preprint (arXiv:1410.3831) [Mehta & Schwab 2014] last year, which mathematically shows the equivalence of deep learning and renormalization group (RG). RG is a concept in theoretical physics that has been widely applied in different problems, including critical phenomena, self-organized criticality, particle physics, polymer physics, and strongly correlated electronic systems. And now, Mehta and Schwab showed that an explanation to the performance of deep learning is available through RG.

[Taken from http://www.inspiredeconomies.com/intelligibleecosystems/images/fractals/GasketMag.gif]

So what is RG? Before RG, Leo Kadanoff, a physics professor in University of Chicago, proposed an idea of coarse-graining in studying many-body problems in 1966. [Kadanoff 1966] In 1972, Kenneth Wilson and Michael Fisher succeeded in applying ɛ-expansion in perturbative RG to explain the critical exponents in Landau-Ginzburg-Wilson (LGW) Hamiltonian. [Wilson & Fisher 1972] This work has been the standard material of graduate physics courses. In 1974, Kenneth Wilson applied RG to explain the Kondo problem, which led to his Nobel Prize in Physics in 1982. [Wilson 1983]

RG assumes a system of scale invariance, which means the system are similar in whatever scale you are seeing. One example is the chaotic system as in Fig. 1. The system looks the same when you zoom in. We call this scale-invariant system self-similar. And physical systems closed to phase transition are self-similar. And if it is self-similar, Kadanoff’s idea of coarse-graining is then applicable, as in Fig. 2. Four spins can be viewed as one spin that “summarizes” the four spins in that block without changing the description of the physical system. This is somewhat like we “zoom out” the picture on Photoshop or Web Browser.

[Taken from [Singh 2014]]

So what’s the point of zooming out? Physicists care about the Helmholtz free energies of physical systems, which are similar to cost functions to the computer scientists and machine learning specialists. Both are to be minimized. However, whatever scale we are viewing at, the energy of the system should be scale-invariant. Therefore, as we zoom out, the system “changes” yet “looks the same” due to self-similarity, but the energy stays the same. The form of the model is unchanged, but the parameters change as the scale changes.

This is important, because this process tells us which parameters are relevant, and which others are irrelevant. Why? Think of it this way: we have an awesome computer to simulate a glass of water that contains 10²³ water molecules. To describe the systems, you have all parameters, including the position of molecules, strength of Van der Waals force, orbital angular momentum of each atom, strength of the covalent bonds, velocities of the molecules… You might have 10²⁵ parameters. However, this awesome computer cannot handle such a system with so many parameters. Then you try to coarse-grain the system, and you discard some parameters in each step of coarse-graining. After numerous steps, it turns out that the temperature and the pressure are the only relevant parameters.

RG helps you identify the relevant parameters.

And it is exactly what happened in deep learning. In each convolutional cycle, features that are not important are gradually discarded, and those that are important are kept and enhanced. Indeed, in computer vision and NLP, the data are so noisy that there are a lot of unnecessary information. Deep learning gradually discards these information. As Mehta and Schwab stated, [Mehta & Schwab 2014]

Our results suggests that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.

So what is the point of understanding this? Unlike other machine algorithms, we did not know how it works, which sometimes makes model building very difficult because we have no idea how to adjust parameters. I believe understanding its equivalence to RG helps guide us to build a model that works.

Charles Martin also wrote a blog entry with more demonstration about the equivalence of deep learning and RG. [Martin 2015]

Continue reading “Learning by Zooming Out” →

July 5, 2015 6

Partitioning Antibodies: HTJoinSolver

Icon of HTJoinSolver

Big data is also impacting biomedical research and clinical processes. In 1987, Susumu Tonegawa was awarded the Nobel Prize in Physiology or Medicine “for his discovery of the genetic principle for generation of antibody diversity.” The mechanism that forms the diverse antibodies in our bodies is called V(D)J recombination, or less commonly known as somatic recombination. [Tonegawa 1983] Since then, there are a lot of bioinformatician work on this.

Antibody (taken from Wikipedia)

Antibodies are often referred to as immunoglobulin (Ig). It is a combination of three gene segments: the Variety (V), the Diversity (D), and the Joining (J) gene segments. IMGT standardized the types of different segments of Ig.

There are a number of computational tools that partition the different segments of Ig. One of the most promising tools is the JoinSolver, developed by the collaboration between Center for Information Technology (CIT) and National Institute of Allergic and Infectious Disease (NIAID) of National Institutes of Health (NIH). [Souto-Carneiro, Longo, Russ, Sun & Lipsky 2004] However, as JoinSolver was not designed to handle insertions and deletions on gene sequences, a further improvement was needed. And the vast volume of sequence data urged a tool that can handle them with a more efficient algorithm, and run on multi-processing systems. It is how the idea of HTJoinSolver formed, where HT stands for “high-throughput.” Russ, Ho and Longo published their work in BMC Bioinformatics [Russ, Ho & Longo 2015], and the tool is available on their website.

HTJoinSolver is a collaboration of Division of Computational Biosciences (DCB) of CIT, and NIAID in NIH. HTJoinSolver partitions an Ig using an efficient dynamic programming (DP) algorithm that employs prior biological information of Ig. Usual DP algorithms, such as Smith-Waterman algorithm, compare two sequences with a full matrix of size mxn, where m and n are the lengths of the two sequences. (Refer to [Durbin, Eddy, Krogh & Mitchison 1998] for more details.) However, with a known motif in the V segment of Ig, TATTAGTGT, HTJoinSolver speeds up the comparison by filling the diagonals only, unless there are some variations that require full computation in some small regions of the matrix, as shown below:

Approximate DP algorithm in HTJoinSolver, taken from [Russ, Ho & Longo 2015]

After partitioning, the tool further analyzes the sequences, such as CDR3, excision, and mutation rate. This tool identifies various segments of Ig with extremely high accuracies even if the mutation probabilities of the Ig’s are as high as 30%. It speeds up the research and clinical process of immunologists and clinicians.

This is a very good example of big data in biomedical applications.

Continue reading “Partitioning Antibodies: HTJoinSolver” →

May 30, 2015 0

The Sexiest Job: About What?

(taken from Analyzing and Analyzers)

D. J. Patil, the Chief Data Scientist of the United States at the moment, coined the term “data scientist,” and called it “the sexiest job in the 21st century.” Therefore, we now have a job title called “data scientist,” which I have difficulties to categorize it into the Standard Occupational Classification (SOC) codes. While I respect D. J. Patil a lot (I love his speech in my commencement ceremony in University of Maryland), this is the job title that is the least defined job title ever seen in my life.

DJ Patil, the U. S. Chief Data Scientist (from his LinkedIn)

So what does a data scientist do? I have seen many articles about it. And various employers have different expectations about the data scientists they hired. Sometimes their expectation is so unreasonable in a way that they want a god. And a lot of people call themselves a data scientist in LinkedIn, despite the fact that their official titles are software engineers, software developers, data analysts, quantitative analysts, research scientists, researchers,… With a Ph.D. in theoretical physics, I want to call myself a data scientist too because of the word “scientist.” I found it cool and sexy. But I realize the risk of calling myself one: people expect something different from what I really am. I rather call myself an “applied quantitative researcher,” as shown in my LinkedIn.

Of course, it provides room for opportunists to make money by distorting their image and branding themselves in various ways from time to time.

Regarding the skills we need, I love the chart above. (Read that book, which is a good description.) Despite my complicated feelings toward the term “data scientist,” I believe as the R & D people in the big data era, we should know:

Statistics, Machine Learning, Natural Language Processing (NLP) and Information Retrieval (IR): the mathematical modeling part.
Domain Knowledge, or Business Knowledge: the knowledge about the industry, the world, the people, the company, …
Software Development: the skills of development cycle, such as object-oriented (OO) programming, functional programming, unit tests, …, and some recent technologies about distributed computing such as Hadoop and Spark.

Employers hired data scientists from diverse backgrounds. Statisticians, research scientists in machine learning, physicists, chemists, or mathematicians might know the mathematics and research methodologies very well, but they do not know how to write maintainable codes. This article described it well. On the other hand, some people are trained as a software developer. However, they do not have enough mathematical background to handle the analytics well.

The word “data” attracts the eyeballs, but we really need to define what these terms like “big data,” “data scientists,” or “data products” are. Yes, by the way, despite the vaguely-defined term “data products”, this article does describe the trend very well. But no matter what, there can only be more accessible data in this age of information explosion, any skills that tackle with data keep on being in high demand.

Continue reading “The Sexiest Job: About What?” →

May 23, 2015 0

Ranking Everything: an Overview of Link Analysis Using PageRank Algorithm

This is an age of quantification, meaning that we want to give everything, even qualitative, a number. In schools, teachers measure how good their students master mathematics by grading, or scoring their homework. The funding agencies measure how good a scientist is by counting the number of his publications, the citations, and the impact factors. We measure how successful a person is by his annual income. We can question all these approaches of measurement. Yet however good or bad the measures are, we look for a metric to measure.

Original PageRank Algorithm

We measure webpages too. In the early ages of Internet, people performed searching on sites such as Yahoo or AltaVista. The keywords they entered are the main information for the browser to do the searching. However, a big problem was that a large number of low quality or irrelevant webpages showed up in search results. Some were due to malicious manipulation of keyword tricks. Therefore, it gave rise a need to rank the webpages. Larry Page and Sergey Brin, the founders of Google, tackled this problem as a thesis topic in Stanford University. But this got commercialized, and Brin never received his Ph.D. They published their algorithm, called PageRank, named after Larry Page, at the Seventh International World Wide Web Conference (WWW7) in April 1998. [Brin & Page 1998] This algorithm is regarded as one of the top ten algorithms in data mining by a survey paper published in the IEEE International Conference on Data Mining (ICDM) in December 2006. [Wu et. al. 2008]

Larry Page and Sergey Brin (source)

The idea of the PageRank algorithm is very simple. It regards each webpage as a node, and each link in the webpage as a directional edge from the source to the target webpage. This forms a network, or a directed graph, of webpages connected by their links. A link is seen as a vote to the target homepage, and if the source homepage ranks high, it enhances the target homepage’s ranking as well. Mathematically it involves solving a large matrix using Newton-Raphson’s method. (Technologies involving handling the large matrix led to the MapReduce programming paradigm, another big data trend nowadays.)

Example (made by Python with packages networkx and matplotlib)

Let’s have an intuition through an example. In the network, we can easily see that “Big Data 1” has the highest rank because it has the most edges pointing to it. However, there are pages such as “Big Data Fake 1,” which looks like a big data page, but in fact it points to “Porn 1.” After running the PageRank algorithm, it does not have a high rank. The sample of the output is:

[('Big Data 1', 0.00038399273501500979),
('Artificial Intelligence', 0.00034612564364377323),
('Deep Learning 1', 0.00034221161094691966),
('Machine Learning 1', 0.00034177713235138173),
('Porn 1', 0.00033859136614724074),
('Big Data 2', 0.00033182629176238337),
('Spark', 0.0003305912073357307),
('Hadoop', 0.00032928389859040422),
('Dow-Jones 1', 0.00032368956852396916),
('Big Data 3', 0.00030969537721207128),
('Porn 2', 0.00030969537721207128),
('Big Data Fake 1', 0.00030735245262038724),
('Dow-Jones 2', 0.00030461420169420618),
('Machine Learning 2', 0.0003011838672138951),
('Deep Learning 2', 0.00029899313444392865),
('Econophysics', 0.00029810944592071552),
('Big Data Fake 2', 0.00029248837867043803),
('Wall Street', 0.00029248837867043803),
('Deep Learning 3', 0.00029248837867043803)]

You can see those pornographic webpages that pretend to be big data webpages do not have rank as high as those authentic ones. PageRank fights against spam and irrelevant webpages. Google later further improved the algorithm to combat more advanced tricks of spam pages.

You can refer other details in various sources and textbooks. [Rajaraman and Ullman 2011, Wu et. al. 2008]

Use in Social Media and Forums

Mathematically, the PageRank algorithm deals with a directional graph. As one can imagine, any systems that can be modeled as directional graph allow rooms for applying the PageRank algorithm. One extension of PageRank is ExpertiseRank.

Jun Zhang, Mark Ackerman and Lada Adamic published a conference paper in the International World Wide Web (WWW7) in May 2007. [Zhang, Ackerman & Adamic 2007] They investigated into a Java forum, by connecting users to posts and anyone replying to it as a directional graph. With an algorithm closely resembled PageRank, they found the experts and influential people in the forum.

Graphs in ExpertiseRank (take from [Zhang, Ackerman & Adamic 2007])

There are other algorithms like HITS (Hypertext induced topic selection) that does similar things. And social media such as Quora (and its Chinese counterpart, Zhihu) applied a link analysis algorithm (probabilistic topic network, see this.) to perform topic network building. Similar ideas are also applied to identify high-quality content in Yahoo! Answers. [Agichtein, Castillo, Donato, Gionis & Mishne 2008]

Use in Finance and Econophysics

PageRank algorithm is also applied outside information technology fields. Financial engineers and econophysicists applied an algorithm, called DebtRank, which is very similar to PageRank, to determine the systemically important financial institutions in a financial network. This work is published in Nature Scientific Reports. [Battiston, Puliga, Kaushik, Tasca & Caldarelli 2012] In their study, each node represents a financial institution, and a directional edge means the estimated potential impact of an institution to another one. Using DebtRank, we are able to identify the centrally important institutions that potentially impacted other institutions in the network once a financial crisis occurs.

DebtRank network, taken from [Battiston, Puliga, Kaushik, Tasca & Caldarelli 2012])

Continue reading “Ranking Everything: an Overview of Link Analysis Using PageRank Algorithm” →

May 20, 2015 1

Scala as the Next Influential Programming Language

I have been learning Scala. Some time ago, I doubted if it’s worth it as the learning curve is quite steep. But today I read the first chapter of my newly ordered book, titled Advanced Analytics Using Spark, a tool written in Scala for handling big data analytics, I reassured that I bet on the right thing.

I believe it will be the most common programming language the coming generation in this big data era because:

It runs on JVM: a lot of libraries have been maintained as Java packages. Why do we discard Java if everything is getting more perfect from time to time? It is the same reason why we do not discard our old Fortran codes in scientific computing, but to wrap them in MATLAB or Python.
It is an object-oriented: we learned about modularization and design patterns all the time. It keeps the strength of Java.
It is functional: analytics involve functions. We want to handle functions flexibly. It shortens our codes, and makes our codes more readable (provided that we write appropriately). Mathematical manipulation is easier when we can handle operations with fewer codes. Lambda expressions are available.
Interactive programming is available: what makes R and Python great is its availability to program interactively, especially handling data and mathematical models. And yes, this is also available in Scala.
Parallel computing comes naturally: with actors or additional packages like Spark, Scala is well suited for scalable huge data computing. This is something that R and Python lack.

Continue reading “Scala as the Next Influential Programming Language” →

May 17, 2015 1

EMBERS: predicting civil unrest real-time

I heard about this project, EMBERS (acronym to Early Model Based Event Recognition using Surrogates), in a DC Data Science meetup. The speaker was Naren Ramakrishnan from VirginiaTech.

To me, it is a real big data project. It is a software that forecasts massive atrocities, particularly on civil unrest (mainly in Latin America and Middle East). They make use of open-source indicators, such as tweets, Facebook events, news, blog posts, open economic figures etc. to predict the outbreak of big events with advanced mathematical models. It is collaborative project involves nine universities and private corporations.

EMBERS ingests a large amount of unstructured data 24/7. Evidently, techniques in natural language processing (NLP) are involved. Besides English, at least Spanish and Arabic are incorporated into the system. And this real-time prediction process is very challenging.

System architecture of EMBERS

Output screenshot of EMBERS

The system performance is quite good. For a 24-month period, it has a recall of 0.65 and a precision of 0.94.

Who need EMBERS? Government must be a big customer. And not surprisingly, some travelers, social scientists and corporate firms find it useful because safety in, information about and business environment in various countries are their main concerns. Of course it is not a free software. It is undeniably a lucrative project.

One of the many protests against the 2014 World Cup in Sao Paulo, May 15, 2014. NACHO DOCE/REUTERS

Continue reading “EMBERS: predicting civil unrest real-time” →

May 10, 2015 0

Hello world!

Welcome to this blog! I started this blog to share about ideas and projects in analytics and data science with colleagues and the general public!

I am a data scientist, an applied quantitative researcher. I specialize in data mining, natural language processing and machine learning. I held a Ph.D. in theoretical physics.

My blog posts may be categorized as:

BirdView: a project involving big data techniques without technical details,
CodieNerd: demonstration of ideas or algorithms with reasonable amount of codes,
MathAnalytics: a formal introduction of some algorithms with a fair amount of equations and proofs, and
DataCritics: comments on trends about the industry.

May 4, 2015 0