Icon of HTJoinSolver
Big data is also impacting biomedical research and clinical processes. In 1987, Susumu Tonegawa was awarded the Nobel Prize in Physiology or Medicine “for his discovery of the genetic principle for generation of antibody diversity.” The mechanism that generates the diverse antibodies in our bodies is called V(D)J recombination, less commonly known as somatic recombination. [Tonegawa 1983] Since then, many bioinformaticians have worked on this problem.
Antibodies are often referred to as immunoglobulins (Ig). The variable region of an Ig heavy chain is encoded by a combination of three gene segments: the Variable (V), the Diversity (D), and the Joining (J) gene segments. IMGT has standardized the nomenclature for the different Ig segments.
A number of computational tools partition an Ig sequence into its segments. One of the most promising is JoinSolver, developed through a collaboration between the Center for Information Technology (CIT) and the National Institute of Allergy and Infectious Diseases (NIAID) at the National Institutes of Health (NIH). [Souto-Carneiro, Longo, Russ, Sun & Lipsky 2004] However, JoinSolver was not designed to handle insertions and deletions in gene sequences, so further improvement was needed. The sheer volume of sequence data also called for a tool with a more efficient algorithm that can run on multiprocessing systems. This is how the idea of HTJoinSolver formed, where HT stands for “high-throughput.” Russ, Ho and Longo published their work in BMC Bioinformatics [Russ, Ho & Longo 2015], and the tool is available on their website.
HTJoinSolver is a collaboration between the Division of Computational Biosciences (DCB) of CIT and NIAID at NIH. It partitions an Ig sequence using an efficient dynamic programming (DP) algorithm that exploits prior biological knowledge about Ig. Standard DP algorithms, such as the Smith-Waterman algorithm, compare two sequences by filling a full matrix of size m × n, where m and n are the lengths of the two sequences. (Refer to [Durbin, Eddy, Krogh & Mitchison 1998] for more details.) However, given a known conserved motif in the V segment of Ig, TATTACTGT, HTJoinSolver speeds up the comparison by filling only the diagonals of the matrix, falling back to full computation in small regions where variations require it, as shown below:
Approximate DP algorithm in HTJoinSolver, taken from [Russ, Ho & Longo 2015]
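To make the contrast between the full matrix and the diagonal-only fill concrete, here is a minimal Python sketch. It is not the published HTJoinSolver implementation: the function names, the scoring values (match 2, mismatch -1, gap -2), the `band` parameter, and the toy sequences are all assumptions chosen for illustration. The idea is that a motif shared by the read and the germline sequence fixes one diagonal of the matrix, and only cells within `band` of that diagonal are filled.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Full local alignment: fills all m x n cells.
    Returns (best score, number of cells filled)."""
    m, n = len(a), len(b)
    prev = [0] * (n + 1)
    best = 0
    for i in range(1, m + 1):
        curr = [0] * (n + 1)
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(0, prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best, m * n


def motif_banded(a, b, motif, band=0, match=2, mismatch=-1, gap=-2):
    """Hypothetical simplification: fill only cells within `band` of the
    diagonal on which `motif` lines up in both sequences.
    band=0 fills a single diagonal (no insertions or deletions allowed)."""
    ia, ib = a.find(motif), b.find(motif)
    if ia < 0 or ib < 0:
        # No anchor available: fall back to the full matrix.
        return smith_waterman(a, b, match, mismatch, gap)
    offset = ia - ib  # the anchored diagonal satisfies i - j == offset
    n = len(b)
    missing = float("-inf")  # cells outside the band are never used
    prev, best, cells = {}, 0, 0
    for i in range(1, len(a) + 1):
        curr = {}
        lo = max(1, i - offset - band)
        hi = min(n, i - offset + band)
        for j in range(lo, hi + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,
                prev.get(j - 1, 0) + s,          # diagonal move
                prev.get(j, missing) + gap,      # gap in b (extra base in a)
                curr.get(j - 1, missing) + gap,  # gap in a (extra base in b)
            )
            best = max(best, curr[j])
            cells += 1
        prev = curr
    return best, cells


if __name__ == "__main__":
    motif = "TATTACTGT"  # conserved V-segment motif
    germline = "GAGGACACGGCCGTATATTACTGTGCGAAAGA"  # toy sequence, not a real allele
    # Toy read: two extra leading bases and one substitution, no indels.
    read = "TT" + germline[:3] + "A" + germline[4:]
    print("full matrix :", smith_waterman(read, germline))
    print("one diagonal:", motif_banded(read, germline, motif, band=0))
```

For a read that differs from the germline only by substitutions, the single anchored diagonal gives the same score as the full matrix while filling far fewer cells. When the read contains an insertion or deletion, the single diagonal falls short, and widening `band` stands in, very roughly, for the local full computation described above.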
After partitioning, the tool further analyzes features of the sequences, such as the CDR3, excisions, and the mutation rate. It identifies the various segments of an Ig with very high accuracy, even when the mutation probability of the Ig is as high as 30%. This speeds up the research and clinical work of immunologists and clinicians.
This is a very good example of big data in biomedical applications.
- The Nobel Prize in Physiology or Medicine 1987 [http://www.nobelprize.org/nobel_prizes/medicine/laureates/1987/]
- S. Tonegawa, “Somatic generation of antibody diversity”, Nature 302, pp. 575-581 (1983).
- The International Immunogenetics information system (IMGT).
- JoinSolver, developed by the Center for Information Technology (CIT) and the National Institute of Allergy and Infectious Diseases (NIAID) of the National Institutes of Health (NIH).
- M. M. Souto-Carneiro, N. S. Longo, D. E. Russ, H. W. Sun, P. E. Lipsky, “Characterization of the human Ig heavy chain antigen binding complementarity determining region 3 using a newly developed software algorithm, JOINSOLVER”, J Immunol. 172 (11), pp. 6790-6802 (2004).
- R. Durbin, S. R. Eddy, A. Krogh, G. Mitchison, “Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids”, Cambridge University Press (1998).
- D. E. Russ, K.-Y. Ho, N. S. Longo, “HTJoinSolver: Human immunoglobulin VDJ partitioning using approximate dynamic programming constrained by conserved motifs”, BMC Bioinformatics 16, 170 (2015).