There are many methods for natural language processing and text mining. However, in tweets, survey responses, Facebook posts, and much other online data, the texts are short, providing too little signal on their own. The traditional bag-of-words (BOW) model gives a sparse vector representation.
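The sparsity is easy to see with scikit-learn's `CountVectorizer` (a generic illustration, not part of any particular package):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three short texts, of the kind found in tweets or survey responses.
docs = [
    "I drive to work",
    "she drives a car",
    "the chauffeur drove them home",
]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)  # a scipy sparse matrix

# Each short text activates only a handful of vocabulary dimensions,
# so most entries of the BOW matrix are zero.
print(bow.shape, bow.nnz)
```

With the default tokenizer, each document fills only a few of the vocabulary columns; all other entries are zero, and no two documents share a nonzero column even though they are all about driving.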

Semantic relations between words are important, because with short texts we usually do not have enough data to learn word similarity from scratch. We do not want “drive” and “drives,” or “driver” and “chauffeur,” to be treated as completely different.
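Stemming handles the first kind of variation (“drive” vs. “drives”) but not the second (“driver” vs. “chauffeur”), which is one reason word embeddings are needed. A quick check with NLTK's Porter stemmer:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Inflectional variants collapse to the same stem...
print(stemmer.stem("drive"), stemmer.stem("drives"))

# ...but synonyms do not: no surface-form rule can equate
# "driver" and "chauffeur"; that requires semantic knowledge,
# e.g., from word embeddings.
print(stemmer.stem("driver"), stemmer.stem("chauffeur"))
```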

The relations between words, and their order, become important as well. We may also want to capture concepts that are correlated in our training dataset.

We therefore have to represent these texts with richer features before performing supervised learning, whether with traditional machine learning algorithms or with deep learning algorithms.

The package `shorttext` was designed to tackle all these problems. It is not a completely new invention, but it puts together much of what is already known. It contains the following features:

• example data provided (including subject keywords and NIH RePORT);
• text preprocessing;
• pre-trained word-embedding support;
• gensim topic models (LDA, LSI, Random Projections) and autoencoder;
• topic model representation supported for supervised learning using scikit-learn;
• cosine distance classification; and
• neural network classification (including ConvNet, and C-LSTM).
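As a rough sketch of the topic-model-plus-classifier idea (using plain scikit-learn components, not the `shorttext` API itself): each short text is represented by its topic proportions, and those dense vectors are fed to a classifier.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data; real applications would have far more texts.
texts = [
    "stocks fell sharply today",
    "the team won the game",
    "markets rallied on earnings",
    "the striker scored twice",
]
labels = ["finance", "sports", "finance", "sports"]

# BOW counts -> topic proportions -> classifier on the dense topic vectors.
model = make_pipeline(
    CountVectorizer(),
    LatentDirichletAllocation(n_components=2, random_state=0),
    LogisticRegression(),
)
model.fit(texts, labels)
print(model.predict(["the goalkeeper saved the shot"]))
```

The topic layer turns the sparse BOW vectors into a low-dimensional dense representation, which is exactly what makes supervised learning on short texts tractable.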

Readers can refer to the documentation.

There are many tasks that involve coding: putting kids into groups according to their ages, labeling webpages by their kinds, or sorting Hogwarts students into four houses… And researchers or lawyers need to code people into occupations according to their filled-in information. Melissa Friesen, an investigator in the Division of Cancer Epidemiology and Genetics (DCEG), National Cancer Institute (NCI), National Institutes of Health (NIH), saw the need for large-scale coding, as many researchers are dealing with big data in epidemiology. She led a research project, in collaboration with the Office of Intramural Research (OIR), Center for Information Technology (CIT), NIH, to develop an artificial intelligence system to cope with the problem. This led to a publicly available tool called SOCcer, an acronym for “Standardized Occupation Coding for Computer-assisted Epidemiological Research.” (URL: http://soccer.nci.nih.gov/soccer/)

The system was initially developed in an attempt to find the correlation between the onset of cancers and other diseases and the occupation. “The application is not intended to replace expert coders, but rather to prioritize which job descriptions would benefit most from expert review,” said Friesen in an interview. She mainly works with Daniel Russ in CIT.

SOCcer takes a job title, industry codes (in terms of SIC, the Standard Industrial Classification), and job duties, and returns an occupational code in SOC 2010 (the Standard Occupational Classification), which is used by U.S. federal government agencies. The data involve short, often messy text, and there are 840 codes in the SOC 2010 system, so conventional natural language processing (NLP) methods may not apply directly. Friesen, Russ, and Kwan-Yuet (Stephen) Ho (also in OIR, CIT; a CSRA staff member) used fuzzy logic and maximum entropy (maxent) methods, with some feature engineering, to build various classifiers. These classifiers are aggregated, as in stacked generalization (see my previous entry), using logistic regression to give a final score.
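A toy sketch of stacked generalization with scikit-learn (the data and base classifiers here are placeholders, not SOCcer's actual components): several base classifiers are trained, and their outputs are combined by a logistic regression meta-learner.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the engineered text features.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Base classifiers produce scores; logistic regression aggregates them.
stack = StackingClassifier(
    estimators=[
        ("nb", GaussianNB()),
        ("tree", DecisionTreeClassifier(random_state=0)),
    ],
    final_estimator=LogisticRegression(),
)
stack.fit(X, y)
print(stack.score(X, y))
```

Internally, `StackingClassifier` trains the meta-learner on cross-validated predictions of the base classifiers, which prevents the final logistic regression from simply memorizing base-model overfitting.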

SOCcer has a companion software tool, called SOCAssign, for expert coders to prioritize the codings. SOCAssign won a DCEG Informatics Tool Challenge award in 2015, and SOCcer itself was awarded in 2016. The SOCcer team also received the Scientific Award of Merit from CIT/OCIO in 2016 (see this). Their work was published in Occup. Environ. Med.

Biomedical research and clinical data are often collected from the same sample of subjects at different points in time. Such data are called “longitudinal data.” (See the definition by BLS.) When performing supervised learning (e.g., SVM) on data of this kind, the impact of the time-varying correlations of the features on the outcomes / predictions may be blurred. In order to smooth out the temporal effect of the features, changes to the original learning algorithms are necessary.

In a study conducted by the Center for Information Technology (CIT) and the National Institute on Aging (NIA) at the National Institutes of Health (NIH), with some clinical data as the training data, a longitudinal support vector regression (LSVR) algorithm was presented and shown to outperform other machine learning methods. [Du et. al. 2015] Their results were published at the IEEE BIBM (Bioinformatics and Biomedicine) conference. Their work is adapted from an earlier work by Chen and Bowman. [Chen & Bowman 2011] The dataset is longitudinal because it contains N patients with p features, taken at T points in time.

Traditional support vector regression (SVR) solves the following optimization problem:

$\text{min} \ \frac{1}{2} \| \mathbf{w} \|^2 + C \sum_i \left( \xi_i + \xi_i^{*} \right)$,

where $\mathbf{w}$ is the weight vector of the regression hyperplane and $\xi_i$, $\xi_i^{*}$ are non-negative slack variables, under the constraints:

$y_i - \mathbf{w}^T \Phi (\mathbf{x}_i) - b \leq \epsilon + \xi_i$,
$\mathbf{w}^T \Phi (\mathbf{x}_i) + b - y_i \leq \epsilon + \xi_i^{*}$.
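This $\epsilon$-insensitive formulation is available, for example, as scikit-learn's `SVR`; a minimal example on synthetic data:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(100)

# epsilon is the half-width of the tube in which errors incur no loss;
# C trades off flatness of w against the slack penalties.
svr = SVR(kernel="rbf", epsilon=0.1, C=1.0)
svr.fit(X, y)
pred = svr.predict([[0.5]])
```

Points inside the $\epsilon$-tube contribute nothing to the loss, so only the samples on or outside the tube boundary become support vectors.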

However, in LSVR, the data points are more complicated. For each patient $s$, the features at time $t$ are given by a vector $\mathbf{x}_s^{(t)}$. The first step of LSVR is to assign each patient a T-by-p matrix $\tilde{\mathbf{x}}_s = \left[ \mathbf{x}_s^{(1)}, \mathbf{x}_s^{(2)}, \ldots, \mathbf{x}_s^{(T)} \right]^T$ and a T-by-1 vector $\tilde{y}_s = \left[ y_s^{(1)}, y_s^{(2)}, \ldots, y_s^{(T)} \right]^T$, with an unknown parameter vector $\mathbf{\beta} = \left[ 1, \beta_1, \beta_2, \ldots, \beta_{T-1} \right]^T$, such that the constraints become:

$\tilde{y}_i^T \beta - \langle \mathbf{w^T}, \mathbf{\tilde{x}_i}^T \beta \rangle - b \leq \epsilon + \xi_i$,
$\langle \mathbf{w^T}, \mathbf{\tilde{x}_i}^T \beta \rangle+ b -\tilde{y}_i^T \beta \leq \epsilon + \xi_i^{*}$,

where the $\xi_i$'s are slack variables. The parameters $\beta$ can be found by iterative quadratic optimization, and the constraints are handled with Lagrange multipliers.

For details, please refer to [Du et. al. 2015]. This approach decouples, or smooths out, the temporal covariation within the patients, so that a better prediction can be made.
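A minimal numpy sketch of the projection step (my own illustration, not the authors' code): with $\beta$ held fixed, each patient's T-by-p feature matrix and T-by-1 outcome vector collapse to a single p-vector and scalar, to which an ordinary SVR can be fitted; alternating this fit with an update of $\beta$ gives the iterative scheme described above.

```python
import numpy as np
from sklearn.svm import SVR

N, T, p = 50, 4, 3                   # patients, time points, features
rng = np.random.default_rng(0)
X = rng.standard_normal((N, T, p))   # x_s^{(t)} stacked per patient
Y = rng.standard_normal((N, T))      # y_s^{(t)} per patient

# beta = [1, beta_1, ..., beta_{T-1}]; all ones as an initial guess.
beta = np.ones(T)

# Collapse the temporal dimension: x_tilde_s^T beta and y_tilde_s^T beta.
X_proj = np.einsum("ntp,t->np", X, beta)   # shape (N, p)
Y_proj = Y @ beta                          # shape (N,)

# The (w, b)-step with beta held fixed is a standard SVR fit.
svr = SVR(kernel="linear", epsilon=0.1)
svr.fit(X_proj, Y_proj)
```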

What should a data scientist know? What are the core skills of a data scientist? I have not seen another job title so vague and ambiguous, one that arouses so many debates and discussions. The BD2K (Big Data to Knowledge) Centers of the NIH (National Institutes of Health) [Ohno-Machado 2014] have issued funding to a few tertiary colleges in the United States to develop data science curricula, which carries these discussions forward.

This is an interdisciplinary field. Around 15 years ago, when I was still a matriculation student in Hong Kong, the University of Hong Kong (HKU) started a major called bioinformatics. People were puzzled about what it really was, because it looked like a melting pot of several unrelated disciplines (about which a lot of freshmen actually complained, as they did not understand the purpose of the undergraduate program). But we now understand how important it is.

So what should the students learn? It was suggested in the following figure:

You can see that the core competencies include statistics, machine learning, software engineering, reproducible research, and data visualization. Some of them belong to math and computing, some to the sciences, and some to the arts. And of course, individual data scientist jobs require the corresponding business knowledge.

Honestly, I do not excel in all of them. I have a physics background, which makes it easy for me to pick up machine learning and research. Software engineering is not hard to learn. But statistics still feels alien to me, and visualization requires an artistic sense that I don’t possess.

Anyway, a lot to learn. Stay humble.

Icon of HTJoinSolver

Big data is also impacting biomedical research and clinical processes. In 1987, Susumu Tonegawa was awarded the Nobel Prize in Physiology or Medicine “for his discovery of the genetic principle for generation of antibody diversity.” The mechanism that forms the diverse antibodies in our bodies is called V(D)J recombination, or, less commonly, somatic recombination. [Tonegawa 1983] Since then, many bioinformaticians have worked on it.

Antibody (taken from Wikipedia)

Antibodies are often referred to as immunoglobulins (Ig). An Ig is a combination of three gene segments: the Variable (V), the Diversity (D), and the Joining (J) gene segments. IMGT standardizes the nomenclature for the different segments of Ig.

There are a number of computational tools that partition the different segments of an Ig. One of the most promising is JoinSolver, developed in a collaboration between the Center for Information Technology (CIT) and the National Institute of Allergy and Infectious Diseases (NIAID) of the National Institutes of Health (NIH). [Souto-Carneiro, Longo, Russ, Sun & Lipsky 2004] However, JoinSolver was not designed to handle insertions and deletions in gene sequences, so a further improvement was needed, and the vast volume of sequence data called for a tool with a more efficient algorithm that could run on multiprocessing systems. That is how the idea of HTJoinSolver, where HT stands for “high-throughput,” was formed. Russ, Ho and Longo published their work in BMC Bioinformatics [Russ, Ho & Longo 2015], and the tool is available on their website.

HTJoinSolver is a collaboration between the Division of Computational Biosciences (DCB) of CIT and NIAID in NIH. It partitions an Ig using an efficient dynamic programming (DP) algorithm that exploits prior biological information about Ig. Usual DP algorithms, such as the Smith-Waterman algorithm, compare two sequences with a full matrix of size $m \times n$, where $m$ and $n$ are the lengths of the two sequences. (Refer to [Durbin, Eddy, Krogh & Mitchison 1998] for more details.) However, with a known motif in the V segment of Ig, TATTAGTGT, HTJoinSolver speeds up the comparison by filling only the diagonals, unless some variations require full computation in small regions of the matrix, as shown below:
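The banded idea can be sketched in a few lines (a toy global-alignment scorer of my own, not HTJoinSolver's implementation): only cells within a fixed band around the diagonal are computed, which is valid when the two sequences are expected to align nearly end-to-end, as they do once the motif anchors the alignment.

```python
def banded_alignment_score(a, b, band=2, match=1, mismatch=-1, gap=-2):
    """Global alignment score filling only a diagonal band of the DP matrix.

    Assumes len(a) and len(b) differ by at most `band`; cells outside
    the band are treated as unreachable (-infinity).
    """
    m, n = len(a), len(b)
    NEG = float("-inf")
    score = [[NEG] * (n + 1) for _ in range(m + 1)]
    score[0][0] = 0
    for i in range(m + 1):
        # Only columns within `band` of the diagonal are filled.
        for j in range(max(0, i - band), min(n, i + band) + 1):
            if i == 0 and j == 0:
                continue
            best = NEG
            if i > 0 and j > 0:
                s = match if a[i - 1] == b[j - 1] else mismatch
                best = max(best, score[i - 1][j - 1] + s)
            if i > 0:
                best = max(best, score[i - 1][j] + gap)   # gap in b
            if j > 0:
                best = max(best, score[i][j - 1] + gap)   # gap in a
            score[i][j] = best
    return score[m][n]

# The V-segment motif against a variant with one point mutation.
print(banded_alignment_score("TATTAGTGT", "TATTACTGT"))
```

Filling only the band reduces the work from $O(mn)$ to roughly $O(\min(m, n) \cdot \text{band})$, which is where the speed-up comes from.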

Approximate DP algorithm in HTJoinSolver, taken from [Russ, Ho & Longo 2015]

After partitioning, the tool further analyzes the sequences, reporting, for example, the CDR3 region, excisions, and mutation rates. It identifies the various segments of Ig with extremely high accuracy even when the mutation probabilities of the Ig’s are as high as 30%. It speeds up the research and clinical work of immunologists and clinicians.

This is a very good example of big data in biomedical applications.