Effective 12/23/2015, the titles of the entries in this blog will no longer carry labels like “MathAnalytics,” “CodieNerd,” or “DataCritics,” because many entries actually fit more than one label. I also want the headlines to look cleaner.
To find all the posts under, for example, “DataCritics,” go to the top right corner of this blog and click on “DataCritics” to retrieve all the posts in that category.
Biomedical research and clinical data are often collected on the same sample at different points in time. Such data are called “longitudinal data.” (See the definition by BLS.) When performing supervised learning (e.g., SVM) on data of this kind, the impact of the time-varying correlation of the features on the outcomes / predictions may be blurred. To smooth out the temporal effect of the features, changes to the original learning algorithms are necessary.
In a study conducted by the Center for Information Technology (CIT) and the National Institute on Aging (NIA) at the National Institutes of Health (NIH), with some clinical data as the training data, a longitudinal support vector regression (LSVR) algorithm was presented and shown to outperform other machine learning methods. [Du et al. 2015] Their results were published at the IEEE BIBM (Bioinformatics and Biomedicine) conference. Their work is adapted from an earlier work by Chen and Bowman. [Chen & Bowman 2011] The dataset is longitudinal because it contains N patients with p features, taken at T points in time.
Traditional support vector regression (SVR) solves the following optimization problem:

$\min_{w, b, \xi, \xi^*} \frac{1}{2} \| w \|^2 + C \sum_{i=1}^{N} (\xi_i + \xi_i^*)$
where $f(x) = \langle w, x \rangle + b$ is a hyperplane, under the constraints:

$y_i - \langle w, x_i \rangle - b \le \epsilon + \xi_i$, $\quad \langle w, x_i \rangle + b - y_i \le \epsilon + \xi_i^*$, $\quad \xi_i, \xi_i^* \ge 0$.
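To make the formulation concrete, here is a minimal sketch of linear ε-insensitive SVR, fitted by subgradient descent on the primal objective above. This is a toy illustration only, not the LSVR algorithm discussed below; the data, hyperparameters, and function name are my own invention.

```python
# Toy linear epsilon-insensitive SVR via subgradient descent on the primal:
#   minimize 0.5*w^2 + (C/n) * sum_i max(0, |y_i - (w*x_i + b)| - eps)
def svr_fit(xs, ys, C=10.0, eps=0.1, iters=4000):
    w, b = 0.0, 0.0
    n = len(xs)
    for t in range(iters):
        lr = 0.05 / (1.0 + 0.005 * t)   # diminishing step size
        gw, gb = w, 0.0                  # gradient of the 0.5*w^2 term
        for x, y in zip(xs, ys):
            r = y - (w * x + b)          # residual
            if r > eps:                  # prediction below the eps-tube
                gw -= C * x / n
                gb -= C / n
            elif r < -eps:               # prediction above the eps-tube
                gw += C * x / n
                gb += C / n
        w -= lr * gw
        b -= lr * gb
    return w, b

# Fit on points lying exactly on y = 2x + 1; the learned (w, b)
# should end up close to (2, 1), within the eps-tube slack.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [2.0 * x + 1.0 for x in xs]
w, b = svr_fit(xs, ys)
```

Because of the regularizer, the fitted slope settles slightly below the true slope, just enough that the farthest points sit on the boundary of the ε-tube.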
However, in LSVR, the data points are more complicated. For each patient $s$, the features at time $t$ are given by a vector $x_s^{(t)}$. The first goal of LSVR is to assign each patient a $T \times p$ matrix $X_s$ (whose rows are the $x_s^{(t)}$'s) and a $T \times 1$ vector $\beta$ of temporal weights, with an unknown parameter vector $w$, such that the constraints become:

$y_s - \beta^T X_s w - b \le \epsilon + \xi_s$, $\quad \beta^T X_s w + b - y_s \le \epsilon + \xi_s^*$,
where the $\beta$'s are additional regularization parameters. The parameters can be found by iterative quadratic optimization, and the constraints are handled with Lagrange multipliers.
For details, please refer to [Du et al. 2015]. This decouples, or smooths out, the temporal covariation within the patients, so that a better prediction can be made.
Previously, I went through heuristically the description of topology using homology groups in this entry. [Ho 2015] This is the essence of algebraic topology. We describe the topology using Betti numbers, the ranks of the homology groups. What they mean can be summarized as follows: [Bubenik 2015]
“… homology in degree 0 describes the connectedness of the data; homology in degree 1 detects holes and tunnels; homology in degree 2 captures voids; and so on.”
Concept of Persistence
However, in computational problems, it is discrete points that we are dealing with. We formulate their connectedness by constructing complexes, as described in another blog entry of mine. [Ho 2015] From the Wolfram Demonstration that I quoted previously, connectedness depends on some parameters, such as the radii within which points are considered connected. Whether it is the Čech complex, the Vietoris–Rips complex, or the alpha complex, the idea is similar. With discrete data, therefore, there is no definite answer to how the points are connected, as it depends on the parameters.
Therefore, the concept of persistence has been developed to tackle this problem. This is the core concept of computational topology. There are a lot of papers about persistence, but the most famous is the work by Zomorodian and Carlsson, who studied it algebraically. [Zomorodian & Carlsson 2005] The idea is that as one increases the radii of the points, the complexes change, and so do the homology groups. By varying the radii, we can observe which topological features persist.
From the diagram above, we can see that as the radius ε increases, the diagram becomes more connected. There are a few ways to understand the changes of the homologies. In the diagram above, barcodes represent the “life span” of a connected component as ε increases. The Betti number of a certain degree (0, 1, or 2 in this example) at a certain value of ε is the number of barcodes of that degree alive at that ε. For example, look at the leftmost vertical dashed line: $\beta_0 = 10$, as there are 10 barcodes of degree 0 at that value of ε. Note that there are indeed 10 separate connected components. For the second leftmost vertical dashed line, $\beta_0 = 6$ (6 connected components) and $\beta_1 = 2$ (2 holes).
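The degree-0 part of this barcode construction can be sketched with a Kruskal-style union-find: merge components in increasing order of pairwise distance, and record the scale at which each component dies. This is a toy sketch under my own conventions (two points are connected once their distance is at most ε); real TDA libraries compute homology in higher degrees as well.

```python
# Toy degree-0 persistence: each point is a component born at scale 0;
# a component dies at the distance where it merges into another one.
from itertools import combinations

def barcodes0(points, dist):
    n = len(points)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    edges = sorted((dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(n), 2))
    deaths = []
    for d, i, j in edges:               # Kruskal-style sweep over scales
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(d)            # one component dies at scale d
    # every component is born at 0; one component never dies
    return [(0.0, d) for d in deaths] + [(0.0, float('inf'))]

def betti0(bars, eps):
    # Betti number at scale eps = number of bars alive at eps
    return sum(1 for birth, death in bars if birth <= eps < death)
```

For example, for the six points 0, 1, 2, 10, 11, 20 on a line, there are six components at very small ε, three clusters ({0, 1, 2}, {10, 11}, {20}) at ε = 2, and a single component at large ε.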
Another way is the persistence diagram, which basically plots the “birth” and “death” times of all the barcodes above. For an explanation of the persistence diagram, please refer to this blog entry by Sebastien Bubeck, [Bubeck 2013] or the paper by Fasy et al. [Fasy et al. 2014] Yet another way to describe persistent topology is the persistence landscape. [Bubenik 2015]
TDA Package in R
There are a lot of tools to perform topological data analysis. Ayasdi Core is a famous one. There are open-source C++ libraries such as Dionysus and PHAT, and there is a Python binding for Dionysus too.
There is a package in R that wraps Dionysus and PHAT, called TDA. To install it, simply open an R session and enter

install.packages("TDA")
To load it, simply enter

library(TDA)
We know that for a circle, $\beta_0 = \beta_1 = 1$, as it has one connected component and one hole. Prepare the circle and store it in X with the function circleUnif:
X <- circleUnif(n=1000, r=1)
Plotting X (e.g., with plot(X)) then shows a 2-dimensional circle like this:
To calculate the persistent homology, use the function gridDiag: