Partitioning Antibodies: HTJoinSolver

HTJoinSolver-YIcon of HTJoinSolver

Big data is also impacting biomedical research and clinical processes. In 1987,  Susumu Tonegawa was awarded the Nobel Prize in Physiology or Medicine “for his discovery of the genetic principle for generation of antibody diversity.” The mechanism that forms the diverse antibodies in our bodies is called V(D)J recombination, or less commonly known as somatic recombination. [Tonegawa 1983] Since then, there are a lot of bioinformatician work on this.

Antibody (taken from Wikipedia)

Antibodies are often referred to as immunoglobulin (Ig). It is a combination of three gene segments: the Variety (V), the Diversity (D), and the Joining (J) gene segments. IMGT standardized the types of different segments of Ig.

There are a number of computational tools that partition the different segments of Ig. One of the most promising tools is the JoinSolver, developed by the collaboration between Center for Information Technology (CIT) and National Institute of Allergic and Infectious Disease (NIAID) of National Institutes of Health (NIH). [Souto-Carneiro, Longo, Russ, Sun & Lipsky 2004] However, as JoinSolver was not designed to handle insertions and deletions on gene sequences, a further improvement was needed. And the vast volume of sequence data urged a tool that can handle them with a more efficient algorithm, and run on multi-processing systems. It is how the idea of HTJoinSolver formed, where HT stands for “high-throughput.” Russ, Ho and Longo published their work in BMC Bioinformatics [Russ, Ho & Longo 2015], and the tool is available on their website.

HTJoinSolver is a collaboration of Division of Computational Biosciences (DCB) of CIT, and NIAID in NIH. HTJoinSolver partitions an Ig using an efficient dynamic programming (DP) algorithm that employs prior biological information of Ig. Usual DP algorithms, such as Smith-Waterman algorithm, compare two sequences with a full matrix of size mxn, where m and n are the lengths of the two sequences. (Refer to [Durbin, Eddy, Krogh & Mitchison 1998] for more details.) However, with a known motif in the V segment of Ig, TATTAGTGT, HTJoinSolver speeds up the comparison by filling the diagonals only, unless there are some variations that require full computation in some small regions of the matrix, as shown below:

s12859-015-0589-x-2-lApproximate DP algorithm in HTJoinSolver, taken from [Russ, Ho & Longo 2015]

After partitioning, the tool further analyzes the sequences, such as CDR3, excision, and mutation rate. This tool identifies various segments of Ig with extremely high accuracies even if the mutation probabilities of the Ig’s are as high as 30%. It speeds up the research and clinical process of immunologists and clinicians.

This is a very good example of big data in biomedical applications.

Continue reading “Partitioning Antibodies: HTJoinSolver”


The Sexiest Job: About What?

(taken from Analyzing and Analyzers)

D. J. Patil, the Chief Data Scientist of the United States at the moment, coined the term “data scientist,” and called it “the sexiest job in the 21st century.” Therefore, we now have a job title called “data scientist,” which I have difficulties to categorize it into the Standard Occupational Classification (SOC) codes. While I respect D. J. Patil a lot (I love his speech in my commencement ceremony in University of Maryland), this is the job title that is the least defined job title ever seen in my life.

DJ Patil, the U. S. Chief Data Scientist (from his LinkedIn)

So what does a data scientist do? I have seen many articles about it. And various employers have different expectations about the data scientists they hired. Sometimes their expectation is so unreasonable in a way that they want a god. And a lot of people call themselves a data scientist in LinkedIn, despite the fact that their official titles are software engineers, software developers, data analysts, quantitative analysts, research scientists, researchers,… With a Ph.D. in theoretical physics, I want to call myself a data scientist too because of the word “scientist.” I found it cool and sexy. But I realize the risk of calling myself one: people expect something different from what I really am. I rather call myself an “applied quantitative researcher,” as shown in my LinkedIn.

Of course, it provides room for opportunists to make money by distorting their image and branding themselves in various ways from time to time.

Regarding the skills we need, I love the chart above. (Read that book, which is a good description.) Despite my complicated feelings toward the term “data scientist,” I believe as the R & D people in the big data era, we should know:

  1. Statistics, Machine Learning, Natural Language Processing (NLP) and Information Retrieval (IR): the mathematical modeling part.
  2. Domain Knowledge, or Business Knowledge: the knowledge about the industry, the world, the people, the company, …
  3. Software Development: the skills of development cycle, such as object-oriented (OO) programming, functional programming, unit tests, …, and some recent technologies about distributed computing such as Hadoop and Spark.

Employers hired data scientists from diverse backgrounds. Statisticians, research scientists in machine learning, physicists, chemists, or mathematicians might know the mathematics and research methodologies very well, but they do not know how to write maintainable codes. This article described it well. On the other hand, some people are trained as a software developer. However, they do not have enough mathematical background to handle the analytics well.

The word “data” attracts the eyeballs, but we really need to define what these terms like “big data,” “data scientists,” or “data products” are. Yes, by the way, despite the vaguely-defined term “data products”, this article does describe the trend very well. But no matter what, there can only be more accessible data in this age of information explosion, any skills that tackle with data keep on being in high demand.

Continue reading “The Sexiest Job: About What?”

Ranking Everything: an Overview of Link Analysis Using PageRank Algorithm

This is an age of quantification, meaning that we want to give everything, even qualitative, a number. In schools, teachers measure how good their students master mathematics by grading, or scoring their homework. The funding agencies measure how good a scientist is by counting the number of his publications, the citations, and the impact factors. We measure how successful a person is by his annual income. We can question all these approaches of measurement. Yet however good or bad the measures are, we look for a metric to measure.

Original PageRank Algorithm

We measure webpages too. In the early ages of Internet, people performed searching on sites such as Yahoo or AltaVista. The keywords they entered are the main information for the browser to do the searching. However, a big problem was that a large number of low quality or irrelevant webpages showed up in search results. Some were due to malicious manipulation of keyword tricks. Therefore, it gave rise a need to rank the webpages. Larry Page and Sergey Brin, the founders of Google, tackled this problem as a thesis topic in Stanford University. But this got commercialized, and Brin never received his Ph.D. They published their algorithm, called PageRank, named after Larry Page, at the Seventh International World Wide Web Conference (WWW7) in April 1998. [Brin & Page 1998] This algorithm is regarded as one of the top ten algorithms in data mining by a survey paper published in the IEEE International Conference on Data Mining (ICDM) in December 2006. [Wu et. al. 2008]

Larry Page and Sergey Brin (source)

The idea of the PageRank algorithm is very simple. It regards each webpage as a node, and each link in the webpage as a directional edge from the source to the target webpage. This forms a network, or a directed graph, of webpages connected by their links. A link is seen as a vote to the target homepage, and if the source homepage ranks high, it enhances the target homepage’s ranking as well. Mathematically it involves solving a large matrix using Newton-Raphson’s method. (Technologies involving handling the large matrix led to the MapReduce programming paradigm, another big data trend nowadays.)

Example (made by Python with packages networkx and matplotlib)

Let’s have an intuition through an example. In the network, we can easily see that “Big Data 1” has the highest rank because it has the most edges pointing to it. However, there are pages such as “Big Data Fake 1,” which looks like a big data page, but in fact it points to “Porn 1.” After running the PageRank algorithm, it does not have a high rank. The sample of the output is:

[('Big Data 1', 0.00038399273501500979),
('Artificial Intelligence', 0.00034612564364377323),
('Deep Learning 1', 0.00034221161094691966),
('Machine Learning 1', 0.00034177713235138173),
('Porn 1', 0.00033859136614724074),
('Big Data 2', 0.00033182629176238337),
('Spark', 0.0003305912073357307),
('Hadoop', 0.00032928389859040422),
('Dow-Jones 1', 0.00032368956852396916),
('Big Data 3', 0.00030969537721207128),
('Porn 2', 0.00030969537721207128),
('Big Data Fake 1', 0.00030735245262038724),
('Dow-Jones 2', 0.00030461420169420618),
('Machine Learning 2', 0.0003011838672138951),
('Deep Learning 2', 0.00029899313444392865),
('Econophysics', 0.00029810944592071552),
('Big Data Fake 2', 0.00029248837867043803),
('Wall Street', 0.00029248837867043803),
('Deep Learning 3', 0.00029248837867043803)]

You can see those pornographic webpages that pretend to be big data webpages do not have rank as high as those authentic ones. PageRank fights against spam and irrelevant webpages. Google later further improved the algorithm to combat more advanced tricks of spam pages.

You can refer other details in various sources and textbooks. [Rajaraman and Ullman 2011, Wu et. al. 2008]

Use in Social Media and Forums

Mathematically, the PageRank algorithm deals with a directional graph. As one can imagine, any systems that can be modeled as directional graph allow rooms for applying the PageRank algorithm. One extension of PageRank is ExpertiseRank.

Jun Zhang, Mark Ackerman and Lada Adamic published a conference paper in the International World Wide Web (WWW7) in May 2007. [Zhang, Ackerman & Adamic 2007] They investigated into a Java forum, by connecting users to posts and anyone replying to it as a directional graph. With an algorithm closely resembled PageRank, they found the experts and influential people in the forum.

Graphs in ExpertiseRank (take from [Zhang, Ackerman & Adamic 2007])

There are other algorithms like HITS (Hypertext induced topic selection) that does similar things. And social media such as Quora (and its Chinese counterpart, Zhihu) applied a link analysis algorithm (probabilistic topic network, see this.) to perform topic network building. Similar ideas are also applied to identify high-quality content in Yahoo! Answers. [Agichtein, Castillo, Donato, Gionis & Mishne 2008]

Use in Finance and Econophysics

PageRank algorithm is also applied outside information technology fields. Financial engineers and econophysicists applied an algorithm, called DebtRank, which is very similar to PageRank, to determine the systemically important financial institutions in a financial network. This work is published in Nature Scientific Reports. [Battiston, Puliga, Kaushik, Tasca & Caldarelli 2012] In their study, each node represents a financial institution, and a directional edge means the estimated potential impact of an institution to another one. Using DebtRank, we are able to identify the centrally important institutions that potentially impacted other institutions in the network once a financial crisis occurs.

ebtRank network, taken from [Battiston, Puliga, Kaushik, Tasca & Caldarelli 2012])

Continue reading “Ranking Everything: an Overview of Link Analysis Using PageRank Algorithm”

Scala as the Next Influential Programming Language

I have been learning Scala. Some time ago, I doubted if it’s worth it as the learning curve is quite steep. But today I read the first chapter of my newly ordered book, titled Advanced Analytics Using Spark, a tool written in Scala for handling big data analytics, I reassured that I bet on the right thing.

I believe it will be the most common programming language the coming generation in this big data era because:

  1. It runs on JVM: a lot of libraries have been maintained as Java packages. Why do we discard Java if everything is getting more perfect from time to time? It is the same reason why we do not discard our old Fortran codes in scientific computing, but to wrap them in MATLAB or Python.
  2. It is an object-oriented: we learned about modularization and design patterns all the time. It keeps the strength of Java.
  3. It is functional: analytics involve functions. We want to handle functions flexibly. It shortens our codes, and makes our codes more readable (provided that we write appropriately). Mathematical manipulation is easier when we can handle operations with fewer codes. Lambda expressions are available.
  4. Interactive programming is available: what makes R and Python great is its availability to program interactively, especially handling data and mathematical models. And yes, this is also available in Scala.
  5. Parallel computing comes naturally: with actors or additional packages like Spark, Scala is well suited for scalable huge data computing. This is something that R and Python lack.


Continue reading “Scala as the Next Influential Programming Language”

EMBERS: predicting civil unrest real-time

I heard about this project, EMBERS (acronym to Early Model Based Event Recognition using Surrogates), in a DC Data Science meetup. The speaker was Naren Ramakrishnan from VirginiaTech.

To me, it is a real big data project. It is a software that forecasts massive atrocities, particularly on civil unrest (mainly in Latin America and Middle East). They make use of open-source indicators, such as tweets, Facebook events, news, blog posts, open economic figures etc. to predict the outbreak of big events with advanced mathematical models. It is collaborative project involves nine universities and private corporations.

EMBERS ingests a large amount of unstructured data 24/7. Evidently, techniques in natural language processing (NLP) are involved. Besides English, at least Spanish and Arabic are incorporated into the system. And this real-time prediction process is very challenging.

EMBERSarchitectureSystem architecture of EMBERS

EMBERSoutputOutput screenshot of EMBERS

The system performance is quite good. For a 24-month period, it has a recall of 0.65 and a precision of 0.94.

Who need EMBERS? Government must be a big customer. And not surprisingly, some travelers, social scientists and corporate firms find it useful because safety in, information about and business environment in various countries are their main concerns. Of course it is not a free software. It is undeniably a lucrative project.

One of the many protests against the 2014 World Cup in Sao Paulo, May 15, 2014. NACHO DOCE/REUTERS

Continue reading “EMBERS: predicting civil unrest real-time”

Hello world!

Welcome to this blog! I started this blog to share about ideas and projects in analytics and data science with colleagues and the general public!

I am a data scientist, an applied quantitative researcher. I specialize in data mining, natural language processing and machine learning. I held a Ph.D. in theoretical physics.

My blog posts may be categorized as:

  1. BirdView: a project involving big data techniques without technical details,
  2. CodieNerd: demonstration of ideas or algorithms with reasonable amount of codes,
  3. MathAnalytics: a formal introduction of some algorithms with a fair amount of equations and proofs, and
  4. DataCritics: comments on trends about the industry.

Blog at

Up ↑