There are many tasks that involve coding, for example, putting kids into groups according to their age, labeling the webpages about their kinds, or putting students in Hogwarts into four colleges… And researchers or lawyers need to code people, according to their filled-in information, into occupations. Melissa Friesen, an investigator in Division of Cancer Epidemiology and Genetics (DCEG), National Cancer Institute (NCI), National Institutes of Health (NIH), saw the need of large-scale coding. Many researchers are dealing with big data concerning epidemiology. She led a research project, in collaboration with Office of Intramural Research (OIR), Center for Information Technology (CIT), National Institutes of Health (NIH), to develop an artificial intelligence system to cope with the problem. This leads to a publicly available tool called SOCcer, an acronym for “Standardized Occupation Coding for Computer-assisted Epidemiological Research.” (URL: http://soccer.nci.nih.gov/soccer/)

The system was initially developed in an attempt to find the correlation between the onset of cancers and other diseases and the occupation. “The application is not intended to replace expert coders, but rather to prioritize which job descriptions would benefit most from expert review,” said Friesen in an interview. She mainly works with Daniel Russ in CIT.

SOCcer takes job title, industry codes (in terms of SIC, Standard Industrial Classification), and job duties, and gives an occupational code called SOC 2010 (Standard Occupational Classification), used by U. S. federal government agencies. The data involves short text, often messy. There are 840 codes in SOC 2010 systems. Conventional natural language processing (NLP) methods may not apply. Friesen, Russ, and Kwan-Yuet (Stephen) Ho (also in OIR, CIT; a CSRA staff) use fuzzy logic, and maximum entropy (maxent) methods, with some feature engineering, to build various classifiers. These classifiers are aggregated together, as in stacked generalization (see my previous entry), using logistic regression, to give a final score.

SOCcer has a companion software, called SOCAssign, for expert coders to prioritize the codings. It was awarded with DCEG Informatics Tool Challenge 2015. SOCcer itself was awarded in 2016. And the SOCcer team was awarded for Scientific Award of Merit by CIT/OCIO in 2016 as well (see this). Their work was published in Occup. Environ. Med.

The topic of word embedding algorithms has been one of the interests of this blog, as in this entry, with Word2Vec [Mikilov et. al. 2013] as one of the main examples. It is a great tool for text mining, (for example, see [Czerny 2015],) as it reduces the dimensions needed (compared to bag-of-words model). As an algorithm borrowed from computer vision, a lot of these algorithms use deep learning methods to train the model, while it was not exactly sure why it works. Despite that, there are many articles talking about how to train the model. [Goldberg & Levy 2014, Rong 2014 etc.] Addition and subtraction of the word vectors show amazing relationships that carry semantic values, as I have shown in my previous blog entry. [Ho 2015]

However, Tomas Mikolov is no longer working in Google, making the development of this algorithm discontinued. As a follow-up of their work, Stanford NLP group later proposed a model, called GloVe (Global Vectors), that embeds word vectors using probabilistic methods. [Pennington, Socher & Manning 2014] It can be implemented in the package glove-python in python, and text2vec in R (or see their CRAN post). Their paper is neatly written, and a highly recommended read.

To explain the theory of GloVe, we start with some basic probabilistic picture in basic natural language processing (NLP). We suppose the relation between the words occur in certain text windows within a corpus, but the details are not important here. Assume that , , and are three words, and the conditional probability is defined as

,

where ‘s are the counts, and similarly for . And we are interested in the following ratio:

.

The tilde means “context,” but we will later assume it is also a word. Citing the example from their paper, take as ice, and as steam. if is solid, then the ratio is expected to be large; or if is gas, then it is expected to be low. But if is water, which are related to both, or fashion, which is related to none, then the ratio is expected to be approximately 1.

And the addition and subtraction in Word2Vec is similar to this. We want the ratio to be like the subtraction as in Word2Vec (and multiplication as in addition), then we should modify the function such that,

.

On the other hand, the input arguments of are vectors, but the output is a scalar. We avoid the issue by making the input argument as a dot product,

.

In NLP, the word-word co-occurrence matrices are symmetric, and our function should also be invariant under switching the labeling. We first require is be a homomorphism,

,

where we define,

.

It is clear that is an exponential function, but to ensure symmetry, we further define:

.

As a result of this equation, the authors defined the following cost function to optimize for GloVe model:

,

where , , , and are parameters to learn. is a weighting function. (Refer the details to the paper.) [Pennington, Socher & Manning 2014]

As Radim Řehůřek said in his blog entry, [Řehůřek 2014] it is a neat paper, but their evaluation is crappy.

This theory explained why certain similar relations can be achieved, such as Paris – France is roughly equal to Beijing – China, as both can be transformed to the ratio in the definition of above.

It is a neat paper, as it employs optimization theory and probability theory, without any dark box deep learning.

Previously, I have went through heuristically the description of topology using homology groups in this entry. [Ho 2015] This is the essence of algebraic topology. We describe the topology using Betti numbers, the rank of the homolog groups. What they mean can be summarized as: [Bubenik 2015]

“… homology in degree 0 describes the connectedness of the data; homology in degree 1 detects holes and tunnels; homology in degree 2 captures voids; and so on.

Concept of Persistence

However, in computational problems, it is the discrete points that we are dealing with. We formulate their connectedness through constructing complexes, as described by my another blog entry. [Ho 2015] From the Wolfram Demonstration that I quoted previously, connectedness depends on some parameters, such as the radii of points that are considered connected. Whether it is Čech Complex, RP complex, or Alpha complex, the idea is similar. With discrete data, therefore, there is no definite answer how the connectedness among the points are, as it depends on the parameters.

Therefore, the concept of persistence has been developed to tackle this problem. This is the core concept for computational topology. There are a lot of papers about persistence, but the most famous work is done by Zomorodian and Carlsson, who algebraically studied it. [Zomorodian & Carlsson 2005] The idea is that as one increases the radii of points, the complexes change, and so do the homology groups. By varying the radii, we can observe which topology persists.

From the diagram above, we can see that as the radii ε increase, the diagram becomes more connected. To understand the changes of homologies, there are a few ways. In the diagram above, barcodes represent the “life span” of a connected component as ε increases. The Betti numbers of a certain degree (0, 1, or 2 in this example) at a certain value of ε is the number of barcodes at that degree. For example, look at the left most vertical dashed line, , as there are 10 barcodes existing for . Note there are indeed 10 separate connected components. For the second leftmost vertical dashed line, (6 connected components), and (2 holes).

Another way is using the persistence diagram, basically plotting the “birth” and “death” times of all the barcodes above. For an explanation of persistence diagram, please refer to this blog entry by Sebastien Bubeck, [Bubeck 2013] or the paper by Fasy et. al. [Fasy et. al. 2014] Another way to describe persistent topology is the persistence landscape. [Bubenik 2015]

TDA Package in R

There are a lot of tools to perform topological data analysis. Ayasdi Core is a famous one. There are open sources C++ libraries such as Dionysus, or PHAT. There is a Python binding for Dionysus too.

There is a package in R that wraps Dionysus and PHAT, called TDA. To install it, simply open an R session, and enter

install.package('TDA')

To load it, simply enter

library(TDA)

We know that for a circle, , as it has on connected components, and a hole. Prepare the circle and store it in X by the function circleUnif:

X<- circleUnif(n=1000, r=1)
plot(X)

Then we can see a 2-dimensional circle like this:

To calculate the persistent homology, use the function gridDiag:

On October 14, 2015, I attended the regular meeting of the DCNLP meetup group, a group on natural language processing (NLP) in Washington, DC area. The talk was titled “Deep Learning for Question Answering“, spoken by Mr. Mohit Iyyer, a Ph.D. student in Department of Computer Science, University of Maryland (my alma mater!). He is a very good speaker.

I have no experience on deep learning at all although I did write a blog post remotely related. I even didn’t start training my first neural network until the next day after the talk. However, Mr. Iyyer explained what recurrent neural network (RNN), recursive neural network, and deep averaging network (DAN) are. This helped me a lot in order to understanding more about the principles of the famous word2vec model (which is something I am going to write about soon!). You can refer to his slides for more details. There are really a lot of talents in College Park, like another expert, Joe Yue Hei Ng, who is exploiting deep learning a lot as well.

The applications are awesome: with external knowledge to factual question answering, reasoning-based question answering, and visual question answering, with increasing order of challenging levels.

Mr. Iyyer and the participants discussed a lot about different packages. Mr. Iyyer uses Theano, a Python package for deep learning, which is good for model building and other analytical work. Some prefer Caffe. Some people, who are Java developers, also use deeplearning4j.

This meetup was a sacred one too, because it is the last time it was held in Stetsons Famous Bar & Grill at U Street, which is going to permanently close on Halloween this year. The group is eagerly looking for a new venue for the upcoming meetup. This meeting was a crowded one. I sincerely thank the organizers, Charlie Greenbacker and Liz Merkhofer, for hosting all these meetings, and Chris Phipps (a linguist from IBM Watson) for recording.

In my previous blog post, I introduced the newly emerged topological data analysis (TDA). Unlike most of the other data analytic algorithms, TDA, concerning the topology as its name tells, cares for the connectivity of points, instead of the distance (according to a metric, whether it is Euclidean, Manhattan, Minkowski or any other). What is the best tools to describe topology?

Physicists use a lot homotopy. But for the sake of computation, it is better to use a scheme that are suited for discrete computation. It turns out that there are useful tools in algebraic topology: homology. But to understand homology, we need to understand what a simplicial complex is.

Gunnar Carlsson [Carlsson 2009] and Afra Zomorodian [Zomorodian 2011] wrote good reviews about them, although from a different path in introducing the concept. I first followed Zomorodian’s review [Zomorodian 2011], then his book [Zomorodian 2009] that filled in a lot of missing links in his review, to a certain point. I recently started reading Carlsson’s review.

One must first understand what a simplicial complex is. Without giving too much technical details, simplicial complex is basically a shape connecting points together. A line is a 1-simplex, connecting two points. A triangle is a 2-simplex. A tetrahedron is a 3-complex. There are other more complicated and unnamed complexes. Any subsets of a simplicial complex are faces. For example, the sides of the triangle are faces. The faces and the sides are the faces of the tetrahedron. (Refer to Wolfram MathWorld for more details. There are a lot of good tutorials online.)

Implementing Simplicial Complex

We can easily encoded this into a python code. I wrote a class SimplicialComplex in Python to implement this. We first import necessary libraries:

import numpy as np
from itertools import combinations
from scipy.sparse import dok_matrix
from operator import add

The first line imports the numpy library, the second the iteration tools necessary for extracting the faces for simplicial complex, the third the sparse matrix implementation in the scipy library (applied on something that I will not go over in this blog entry), and the fourth for some reduce operation.

We want to describe the simplicial complexes in the order of some labels (which can be anything, such as integers or strings). If it is a point, then it can be represented as tuples, as below:

(1,)

Or if it is a line (a 1-simplex), then

(1, 2)

Or a 2-simplex as a triangle, then

(1, 2, 3)

I think you get the gist. The integers 1, 2, or 3 here are simply labels. We can easily store this in the class:

You might observe the last line of the codes above. And it is for calculating all the faces of this complex, and it is implemented in this way:

def faces(self):
faceset = set()
for simplex in self.simplices:
numnodes = len(simplex)
for r in range(numnodes, 0, -1):
for face in combinations(simplex, r):
faceset.add(face)
return faceset

The faces are intuitively sides of a 2D shape (2-simplex), or planes of a 3D shape (3-simplex). But the faces of a 3-simplex includes the faces of all its faces. All the faces are saved in a field called faceset. If the user wants to retrieve the faces in a particular dimension, they can call this method:

We have gone over the basis of simplicial complex, which is the foundation of TDA. We appreciate that the simplicial complex deals only with the connectivity of points instead of the distances between the points. And the homology groups will be calculated based on this. However, how do we obtain the simplicial complex from the discrete data we have? Zomorodian’s review [Zomorodian 2011] gave a number of examples, but I will only go through two of them only. And from this, you can see that to establish the connectivity between points, we still need to apply some sort of distance metrics.

Alpha Complex

An alpha complex is the nerve of the cover of the restricted Voronoi regions. (Refer the details to Zomorodian’s review [Zomorodian 2011], this Wolfram MathWorld entry, or this Wolfram Demonstration.) We can extend the class SimplicialComplex to get a class AlphaComplex:

from scipy.spatial import Delaunay, distance
from operator import or_
from functools import partial
def facesiter(simplex):
for i in range(len(simplex)):
yield simplex[:i]+simplex[(i+1):]
def flattening_simplex(simplices):
for simplex in simplices:
for point in simplex:
yield point
def get_allpoints(simplices):
return set(flattening_simplex(simplices))
def contain_detachededges(simplex, distdict, epsilon):
if len(simplex)==2:
return (distdict[simplex[0], simplex[1]] > 2*epsilon)
else:
return reduce(or_, map(partial(contain_detachededges, distdict=distdict, epsilon=epsilon), facesiter(simplex)))
class AlphaComplex(SimplicialComplex):
def __init__(self, points, epsilon, labels=None, distfcn=distance.euclidean):
self.pts = points
self.labels = range(len(self.pts)) if labels==None or len(labels)!=len(self.pts) else labels
self.epsilon = epsilon
self.distfcn = distfcn
self.import_simplices(self.construct_simplices(self.pts, self.labels, self.epsilon, self.distfcn))
def calculate_distmatrix(self, points, labels, distfcn):
distdict = {}
for i in range(len(labels)):
for j in range(len(labels)):
distdict[(labels[i], labels[j])] = distfcn(points[i], points[j])
return distdict
def construct_simplices(self, points, labels, epsilon, distfcn):
delaunay = Delaunay(points)
delaunay_simplices = map(tuple, delaunay.simplices)
distdict = self.calculate_distmatrix(points, labels, distfcn)
simplices = []
for simplex in delaunay_simplices:
faces = list(facesiter(simplex))
detached = map(partial(contain_detachededges, distdict=distdict, epsilon=epsilon), faces)
if reduce(or_, detached):
if len(simplex)>2:
for face, notkeep in zip(faces, detached):
if not notkeep:
simplices.append(face)
else:
simplices.append(simplex)
simplices = map(lambda simplex: tuple(sorted(simplex)), simplices)
simplices = list(set(simplices))
allpts = get_allpoints(simplices)
for point in (set(labels)-allpts):
simplices += [(point,)]
return simplices

The scipy package already has a package to calculate Delaunay region. The function contain_detachededges is for constructing the restricted Voronoi region from the calculated Delaunay region.

This class demonstrates how an Alpha Complex is constructed, but this runs slowly once the number of points gets big!

Vietoris-Rips (VR) Complex

Another commonly used complex is called the Vietoris-Rips (VR) Complex, which connects points as the edge of a graph if they are close enough. (Refer to Zomorodian’s review [Zomorodian 2011] or this Wikipedia page for details.) To implement this, import the famous networkx originally designed for network analysis.

import networkx as nx
from scipy.spatial import distance
from itertools import product
class VietorisRipsComplex(SimplicialComplex):
def __init__(self, points, epsilon, labels=None, distfcn=distance.euclidean):
self.pts = points
self.labels = range(len(self.pts)) if labels==None or len(labels)!=len(self.pts) else labels
self.epsilon = epsilon
self.distfcn = distfcn
self.network = self.construct_network(self.pts, self.labels, self.epsilon, self.distfcn)
self.import_simplices(map(tuple, list(nx.find_cliques(self.network))))
def construct_network(self, points, labels, epsilon, distfcn):
g = nx.Graph()
g.add_nodes_from(labels)
zips = zip(points, labels)
for pair in product(zips, zips):
if pair[0][1]!=pair[1][1]:
dist = distfcn(pair[0][0], pair[1][0])
if dist<epsilon:
g.add_edge(pair[0][1], pair[1][1])
return g

The intuitiveness and efficiencies are the reasons that VR complexes are widely used.

For more details about the Alpha Complexes, VR Complexes and the related Čech Complexes, refer to this page.

More…

There are other commonly used complexes used, including Witness Complex, Cubical Complex etc., which I leave no introductions. Upon building the complexes, we can analyze the topology by calculating their homology groups, Betti numbers, the persistent homology etc. I wish to write more about it soon.

I have seen more than enough debates about R or Python. While I do have a preference towards Python, I am happy with using R as well. I am not agnostic about languages, but we choose tools according to needs. The needs may be about effectiveness, efficiency, availability of tools, nature of problems, collaborations, etc. Yes, in a nutshell, it depends.

When dealing with text mining, although I still prefer Python, I have to fairly say that both languages have their own strengths and weaknesses. What do you do in text mining? Let me casually list the usual steps:

Removing special characters,

Removing numerals,

Converting all alphabets to lower cases,

Removing stop words, and

Stemming the words (using Porter stemmer).

They are standard steps. But of course, sometimes we perform lemmatization instead of stemming. Sometimes we keep numerals. Or whatever. It is okay.

How do u do that in Python? Suppose you have a list of text documents stored in the variable texts, which is defined by

texts = ['I love Python.',
'R is good for analytics.',
'Mathematics is fun.']

. Then

# import all necessary libraries
from nltk.stem import PorterStemmer
from nltk.tokenize import SpaceTokenizer
from nltk.corpus import stopwords
from functools import partial
from gensim import corpora
from gensim.models import TfidfModel
import re
# initialize the instances for various NLP tools
tokenizer = SpaceTokenizer()
stemmer = PorterStemmer()
# define each steps
pipeline = [lambda s: re.sub('[^\w\s]', '', s),
lambda s: re.sub('[\d]', '', s),
lambda s: s.lower(),
lambda s: ' '.join(filter(lambda s: not (s in stopwords.words()), tokenizer.tokenize(s))),
lambda s: ' '.join(map(lambda t: stemmer.stem(t), tokenizer.tokenize(s)))
]
# function that carries out the pipeline step-by-step
def preprocess_text(text, pipeline):
if len(pipeline)==0:
return text
else:
return preprocess_text(pipeline[0](text), pipeline[1:])
# preprocessing
preprocessed_texts = map(partial(preprocess_text, pipeline=pipeline), texts)
# converting to feature vectors
documents = map(lambda s: tokenizer.tokenize(s), texts)
corpus = [dictionary.doc2bow(document) for document in documents]
tfidfmodel = TfidfModel(corpus)

We can train a classifier with the feature vectors output by tfidfmodel. To do the prediction, we can get the feature vector for a new text by calling:

Actually, from this illustration, a strength for R stands out: brevity. However, very often we want to preprocess in other ways, Python allows more flexibility without making it complicated. And Python syntax itself is intuitive enough.

And there are more natural language processing libraries in Python available, such as nltk and gensim, that are associated with its other libraries perfectly such as numpy, scipy and scikit-learn. But R is not far away in terms of this actually, as it has libraries such as tm and RTextTools, while R does not have numpy-like libraries because R itself is designed to perform calculations like this.

Python can be used to develop larger software projects by making the codes reusable, and it is obviously a weakness for R.

However, do perform analysis, R makes the task very efficient if we do not require something unconventional.

In the area of text mining, R or Python? My answer is: it depends.

Deep learning, a collection of related neural network algorithms, has been proved successful in certain types of machine learning tasks in computer vision, speech recognition, data cleaning, and natural language processing (NLP). [Mikolov et. al. 2013] However, it was unclear how deep learning can be so successful. It looks like a black box with messy inputs and excellent outputs. So why is it so successful?

A friend of mine showed me this article in the preprint (arXiv:1410.3831) [Mehta & Schwab 2014] last year, which mathematically shows the equivalence of deep learning and renormalization group (RG). RG is a concept in theoretical physics that has been widely applied in different problems, including critical phenomena, self-organized criticality, particle physics, polymer physics, and strongly correlated electronic systems. And now, Mehta and Schwab showed that an explanation to the performance of deep learning is available through RG.

So what is RG? Before RG, Leo Kadanoff, a physics professor in University of Chicago, proposed an idea of coarse-graining in studying many-body problems in 1966. [Kadanoff 1966] In 1972, Kenneth Wilson and Michael Fisher succeeded in applying ɛ-expansion in perturbative RG to explain the critical exponents in Landau-Ginzburg-Wilson (LGW) Hamiltonian. [Wilson & Fisher 1972] This work has been the standard material of graduate physics courses. In 1974, Kenneth Wilson applied RG to explain the Kondo problem, which led to his Nobel Prize in Physics in 1982. [Wilson 1983]

RG assumes a system of scale invariance, which means the system are similar in whatever scale you are seeing. One example is the chaotic system as in Fig. 1. The system looks the same when you zoom in. We call this scale-invariant system self-similar. And physical systems closed to phase transition are self-similar. And if it is self-similar, Kadanoff’s idea of coarse-graining is then applicable, as in Fig. 2. Four spins can be viewed as one spin that “summarizes” the four spins in that block without changing the description of the physical system. This is somewhat like we “zoom out” the picture on Photoshop or Web Browser.

[Taken from [Singh 2014]]

So what’s the point of zooming out? Physicists care about the Helmholtz free energies of physical systems, which are similar to cost functions to the computer scientists and machine learning specialists. Both are to be minimized. However, whatever scale we are viewing at, the energy of the system should be scale-invariant. Therefore, as we zoom out, the system “changes” yet “looks the same” due to self-similarity, but the energy stays the same. The form of the model is unchanged, but the parameters change as the scale changes.

This is important, because this process tells us which parameters are relevant, and which others are irrelevant. Why? Think of it this way: we have an awesome computer to simulate a glass of water that contains 10^{23} water molecules. To describe the systems, you have all parameters, including the position of molecules, strength of Van der Waals force, orbital angular momentum of each atom, strength of the covalent bonds, velocities of the molecules… You might have 10^{25} parameters. However, this awesome computer cannot handle such a system with so many parameters. Then you try to coarse-grain the system, and you discard some parameters in each step of coarse-graining. After numerous steps, it turns out that the temperature and the pressure are the only relevant parameters.

RG helps you identify the relevant parameters.

And it is exactly what happened in deep learning. In each convolutional cycle, features that are not important are gradually discarded, and those that are important are kept and enhanced. Indeed, in computer vision and NLP, the data are so noisy that there are a lot of unnecessary information. Deep learning gradually discards these information. As Mehta and Schwab stated, [Mehta & Schwab 2014]

Our results suggests that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.

So what is the point of understanding this? Unlike other machine algorithms, we did not know how it works, which sometimes makes model building very difficult because we have no idea how to adjust parameters. I believe understanding its equivalence to RG helps guide us to build a model that works.

Charles Martin also wrote a blog entry with more demonstration about the equivalence of deep learning and RG. [Martin 2015]

Python is the basic programming languages if one wants to work on data nowadays. Its popularity comes with its intuitive syntax, its support of several programming paradigms, and the package numpy (Numerical Python). Yes, if you asked which package is a “must-have” outside the standard Python packages, I would certainly name numpy.

Let me list some useful packages that I have found useful:

numpy: Numerical Python. Its basic data type is ndarray, which acts like a vector with vectorized calculation support. It makes Python to perform matrix calculation efficiently like MATLAB and Octave. It supports a lot of commonly used linear algebraic algorithms, such as eigenvalue problems, SVD etc. It is the basic of a lot of other Python packages that perform heavy numerical computation. It is such an important package that, in some operating systems, numpy comes with Python as well.

scipy: Scientific Python. It needs numpy, but it supports also sparse matrices, special functions, statistics, numerical integration…

scikit-learn: machine learning library. It contains a number of supervised and unsupervised learning algorithms.

nltk: natural language processing. It provides not only basic tools like stemmers, lemmatizers, but also some algorithms like maximum entropy, tf-idf vectorizer etc. It provides a few corpuses, and supports WordNet dictionary.

gensim: another useful natural language processing package with an emphasis on topic modeling. It mainly supports Word2Vec, latent semantic indexing (LSI), and latent Dirichlet allocation (LDA). It is convenient to construct term-document matrices, and convert them to matrices in numpy or scipy.

networkx: a package that supports both undirected and directed graphs. It provides basic algorithms used in graphs.

sympy: Symbolic Python. I am not good at this package, but I know mathics and SageMath are both based on it.

pandas: it supports data frame handling like R. (I have not used this package as I am a heavy R user.)

Of course, if you are a numerical developer, to save you a good life, install Anaconda.

There are some other packages that are useful, such as PyCluster (clustering), xlrd (Excel files read/write), PyGame (writing games)… But since I have not used them, I would rather mention it in this last paragraph, not to endorse but avoid devaluing it.

D. J. Patil, the Chief Data Scientist of the United States at the moment, coined the term “data scientist,” and called it “the sexiest job in the 21st century.” Therefore, we now have a job title called “data scientist,” which I have difficulties to categorize it into the Standard Occupational Classification (SOC) codes. While I respect D. J. Patil a lot (I love his speech in my commencement ceremony in University of Maryland), this is the job title that is the least defined job title ever seen in my life.

DJ Patil, the U. S. Chief Data Scientist (from his LinkedIn)

So what does a data scientist do? I have seen many articles about it. And various employers have different expectations about the data scientists they hired. Sometimes their expectation is so unreasonable in a way that they want a god. And a lot of people call themselves a data scientist in LinkedIn, despite the fact that their official titles are software engineers, software developers, data analysts, quantitative analysts, research scientists, researchers,… With a Ph.D. in theoretical physics, I want to call myself a data scientist too because of the word “scientist.” I found it cool and sexy. But I realize the risk of calling myself one: people expect something different from what I really am. I rather call myself an “applied quantitative researcher,” as shown in my LinkedIn.

Of course, it provides room for opportunists to make money by distorting their image and branding themselves in various ways from time to time.

Regarding the skills we need, I love the chart above. (Read that book, which is a good description.) Despite my complicated feelings toward the term “data scientist,” I believe as the R & D people in the big data era, we should know:

Statistics, Machine Learning, Natural Language Processing (NLP) and Information Retrieval (IR): the mathematical modeling part.

Domain Knowledge, or Business Knowledge: the knowledge about the industry, the world, the people, the company, …

Software Development: the skills of development cycle, such as object-oriented (OO) programming, functional programming, unit tests, …, and some recent technologies about distributed computing such as Hadoop and Spark.

Employers hired data scientists from diverse backgrounds. Statisticians, research scientists in machine learning, physicists, chemists, or mathematicians might know the mathematics and research methodologies very well, but they do not know how to write maintainable codes. This article described it well. On the other hand, some people are trained as a software developer. However, they do not have enough mathematical background to handle the analytics well.

The word “data” attracts the eyeballs, but we really need to define what these terms like “big data,” “data scientists,” or “data products” are. Yes, by the way, despite the vaguely-defined term “data products”, this article does describe the trend very well. But no matter what, there can only be more accessible data in this age of information explosion, any skills that tackle with data keep on being in high demand.

This is an age of quantification, meaning that we want to give everything, even qualitative, a number. In schools, teachers measure how good their students master mathematics by grading, or scoring their homework. The funding agencies measure how good a scientist is by counting the number of his publications, the citations, and the impact factors. We measure how successful a person is by his annual income. We can question all these approaches of measurement. Yet however good or bad the measures are, we look for a metric to measure.

Original PageRank Algorithm

We measure webpages too. In the early ages of Internet, people performed searching on sites such as Yahoo or AltaVista. The keywords they entered are the main information for the browser to do the searching. However, a big problem was that a large number of low quality or irrelevant webpages showed up in search results. Some were due to malicious manipulation of keyword tricks. Therefore, it gave rise a need to rank the webpages. Larry Page and Sergey Brin, the founders of Google, tackled this problem as a thesis topic in Stanford University. But this got commercialized, and Brin never received his Ph.D. They published their algorithm, called PageRank, named after Larry Page, at the Seventh International World Wide Web Conference (WWW7) in April 1998. [Brin & Page 1998] This algorithm is regarded as one of the top ten algorithms in data mining by a survey paper published in the IEEE International Conference on Data Mining (ICDM) in December 2006. [Wu et. al. 2008]

The idea of the PageRank algorithm is very simple. It regards each webpage as a node, and each link in the webpage as a directional edge from the source to the target webpage. This forms a network, or a directed graph, of webpages connected by their links. A link is seen as a vote to the target homepage, and if the source homepage ranks high, it enhances the target homepage’s ranking as well. Mathematically it involves solving a large matrix using Newton-Raphson’s method. (Technologies involving handling the large matrix led to the MapReduce programming paradigm, another big data trend nowadays.)

Let’s have an intuition through an example. In the network, we can easily see that “Big Data 1” has the highest rank because it has the most edges pointing to it. However, there are pages such as “Big Data Fake 1,” which looks like a big data page, but in fact it points to “Porn 1.” After running the PageRank algorithm, it does not have a high rank. The sample of the output is:

You can see those pornographic webpages that pretend to be big data webpages do not have rank as high as those authentic ones. PageRank fights against spam and irrelevant webpages. Google later further improved the algorithm to combat more advanced tricks of spam pages.

You can refer other details in various sources and textbooks. [Rajaraman and Ullman 2011, Wu et. al. 2008]

Use in Social Media and Forums

Mathematically, the PageRank algorithm deals with a directional graph. As one can imagine, any systems that can be modeled as directional graph allow rooms for applying the PageRank algorithm. One extension of PageRank is ExpertiseRank.

Jun Zhang, Mark Ackerman and Lada Adamic published a conference paper in the International World Wide Web (WWW7) in May 2007. [Zhang, Ackerman & Adamic 2007] They investigated into a Java forum, by connecting users to posts and anyone replying to it as a directional graph. With an algorithm closely resembled PageRank, they found the experts and influential people in the forum.

Graphs in ExpertiseRank (take from [Zhang, Ackerman & Adamic 2007])

There are other algorithms like HITS (Hypertext induced topic selection) that does similar things. And social media such as Quora (and its Chinese counterpart, Zhihu) applied a link analysis algorithm (probabilistic topic network, see this.) to perform topic network building. Similar ideas are also applied to identify high-quality content in Yahoo! Answers. [Agichtein, Castillo, Donato, Gionis & Mishne 2008]

Use in Finance and Econophysics

PageRank algorithm is also applied outside information technology fields. Financial engineers and econophysicists applied an algorithm, called DebtRank, which is very similar to PageRank, to determine the systemically important financial institutions in a financial network. This work is published in Nature Scientific Reports. [Battiston, Puliga, Kaushik, Tasca & Caldarelli 2012] In their study, each node represents a financial institution, and a directional edge means the estimated potential impact of an institution to another one. Using DebtRank, we are able to identify the centrally important institutions that potentially impacted other institutions in the network once a financial crisis occurs.

DebtRank network, taken from [Battiston, Puliga, Kaushik, Tasca & Caldarelli 2012])