Combining the Best of All Worlds

Many learning algorithms perform classification tasks. Very often, however, one classifier does better on some data points while another does better on others. It would be nice to have ways of combining the best of all the available classifiers.

Voting

The simplest way of combining classifiers to improve the classification is democracy: voting. When n classifiers output labels from the same set of classes, the final result can be decided by a simple majority vote. This method works quite well in many problems. Sometimes we may need to give different weights to different classifiers to improve the performance.
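As a minimal sketch of the idea in base R, assume three hypothetical classifiers (clf_a, clf_b, clf_c) whose predictions on five data points are made up for illustration. The helper weighted_vote is not from any package; it is just one way to tally votes.

```r
# Hard labels from three hypothetical classifiers on five data points
# (the predictions are invented for illustration).
predictions <- cbind(
  clf_a = c("spam", "ham",  "spam", "ham", "spam"),
  clf_b = c("spam", "spam", "ham",  "ham", "spam"),
  clf_c = c("ham",  "ham",  "spam", "ham", "ham")
)

# Sum the weights of the classifiers behind each class and return the
# class with the largest total. Ties go to the class that comes first
# in alphabetical order.
weighted_vote <- function(votes, weights = rep(1, length(votes))) {
  totals <- tapply(weights, votes, sum)
  names(totals)[which.max(totals)]
}

# Plain majority vote: every classifier has weight 1.
majority <- apply(predictions, 1, weighted_vote)

# Weighted vote: clf_c counts for more than the other two combined.
weighted <- apply(predictions, 1, weighted_vote, weights = c(1, 1, 3))

print(majority)
print(weighted)
```

With equal weights the two agreeing classifiers win each row; with the weights above, clf_c alone decides every row, which shows how strongly the choice of weights can change the outcome.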

Bagging and Boosting

Sometimes we can generate many classifiers from the small amount of data available by bagging or boosting. With both methods, different classifiers are built with the same learning algorithm but with different training sets. “Bagging builds different versions of the training set by sampling with replacement,” and “boosting obtains the different training sets by focusing on the instances that are misclassified by the previously trained classifiers.” [Sesmero et al. 2015]
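The two ways of building training sets can be sketched in base R as follows. This only illustrates the data side, not a full ensemble. The boosting update shown is the AdaBoost-style reweighting, which is an assumption here since the quoted description does not name a specific algorithm, and the misclassified rows are hypothetical.

```r
set.seed(42)  # fixed seed so the sketch is reproducible

n <- 10        # size of the original training set
n_models <- 3  # number of classifiers to train

# Bagging: each classifier gets n row indices drawn WITH replacement,
# so some rows repeat and others are left out of a given sample.
bootstrap_indices <- lapply(seq_len(n_models), function(i) {
  sample(n, size = n, replace = TRUE)
})

# Boosting (AdaBoost-style, an assumption): raise the weights of the rows
# the previous classifier got wrong and lower the rest.
# Requires a weighted error strictly between 0 and 1.
reweight <- function(weights, misclassified) {
  err <- sum(weights[misclassified])        # weighted error rate
  alpha <- 0.5 * log((1 - err) / err)       # importance of the classifier
  weights <- weights * exp(ifelse(misclassified, alpha, -alpha))
  weights / sum(weights)                    # renormalise to sum to 1
}

weights <- rep(1 / n, n)                    # start from uniform weights
# Hypothetical outcome: the first classifier misclassified rows 2 and 7.
misclassified <- seq_len(n) %in% c(2, 7)
weights <- reweight(weights, misclassified)

print(bootstrap_indices)
print(round(weights, 4))
```

After the update, the two misclassified rows carry half of the total weight, so the next classifier is pushed to concentrate on them.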

Fusion

The performance of classifiers depends not only on the learning algorithms and the data, but also on the set of features used. While feature generation itself is a bigger and more important problem (not discussed here), we do have various ways to combine different features. Sometimes we feed separate groups of features into different classifiers and then combine their answers; sometimes we combine all the features into one classifier. The former is called late fusion, and the latter early fusion.
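The difference between the two can be sketched on the built-in iris data, treating the sepal and petal measurements as two feature groups. The base learner here is a least-squares linear score thresholded at 0.5, chosen only because it needs nothing beyond base R; any classifier could take its place, and averaging the scores is just one possible way to combine the answers.

```r
# Two-class subset of the built-in iris data.
data <- droplevels(subset(iris, Species != "setosa"))
data$y <- as.numeric(data$Species == "virginica")  # 1 = virginica, 0 = versicolor

# Early fusion: put every feature into a single classifier.
early_model <- lm(y ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                  data = data)
early_score <- fitted(early_model)

# Late fusion: one classifier per feature group, combined afterwards
# (here by averaging the two scores).
sepal_model <- lm(y ~ Sepal.Length + Sepal.Width, data = data)
petal_model <- lm(y ~ Petal.Length + Petal.Width, data = data)
late_score <- (fitted(sepal_model) + fitted(petal_model)) / 2

# Training-set accuracy of each scheme (no held-out data in this sketch).
accuracy <- c(
  early = mean((early_score > 0.5) == (data$y == 1)),
  late  = mean((late_score > 0.5) == (data$y == 1))
)
print(accuracy)
```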

Stacking

We can also treat the prediction results of various classifiers as the features of another classifier. This is called stacking. [Wolpert 1992] “Stacking generates the members of the Stacking ensemble using several learning algorithms and subsequently uses another algorithm to learn how to combine their outputs.” [Sesmero et al. 2015] Some recent implementations in computational epidemiology employ stacking as well. [Russ et al. 2016]
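A minimal stacking sketch in base R might look like the following. The two base learners (a least-squares linear score and a nearest-centroid rule) and the linear meta-learner are illustrative choices, not the ones used in the cited papers, and the helper names are hypothetical. The base predictions are made out-of-fold so that the meta-learner never sees a prediction for a row its base learner was trained on.

```r
set.seed(1)
data <- droplevels(subset(iris, Species != "setosa"))
data$y <- as.numeric(data$Species == "virginica")
features <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

# Base learner 1: least-squares linear score.
fit_linear <- function(train) {
  model <- lm(y ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
              data = train)
  function(new) as.numeric(predict(model, newdata = new))
}

# Base learner 2: nearest class centroid (1 if closer to the class-1 mean).
fit_centroid <- function(train) {
  mu0 <- colMeans(train[train$y == 0, features])
  mu1 <- colMeans(train[train$y == 1, features])
  function(new) {
    x <- as.matrix(new[, features])
    d0 <- rowSums(sweep(x, 2, mu0)^2)
    d1 <- rowSums(sweep(x, 2, mu1)^2)
    as.numeric(d1 < d0)
  }
}

learners <- list(linear = fit_linear, centroid = fit_centroid)

# Out-of-fold predictions: every row is predicted by base learners that
# were trained without it.
k <- 5
fold <- sample(rep(seq_len(k), length.out = nrow(data)))
meta <- matrix(NA_real_, nrow(data), length(learners),
               dimnames = list(NULL, names(learners)))
for (f in seq_len(k)) {
  held_out <- fold == f
  for (name in names(learners)) {
    predict_fn <- learners[[name]](data[!held_out, ])
    meta[held_out, name] <- predict_fn(data[held_out, ])
  }
}

# Meta-learner: another model that learns how to combine the base outputs.
meta_data <- data.frame(meta, y = data$y)
stacker <- lm(y ~ linear + centroid, data = meta_data)
stacked_accuracy <- mean((fitted(stacker) > 0.5) == (data$y == 1))
print(stacked_accuracy)
```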

Hidden Topics and Embedding

There is also a special type of feature generation for a classifier that uses hidden topics or embeddings as latent vectors. We can generate a set of latent topics from the available data using latent Dirichlet allocation (LDA) or correlated topic models (CTM), and describe each data item in terms of these topics as the input to another classifier. [Phan et al. 2011] Another way is to represent the data using embedding vectors (such as time-series embedding, Word2Vec, or LDA2Vec) as the input to another classifier. [Czerny 2015]
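The general pattern, raw features in and a short latent vector per item out, can be sketched in base R. Note that this sketch swaps in a truncated singular value decomposition (latent semantic analysis) purely as a stand-in, because LDA, CTM, and Word2Vec all need extra packages; the toy counts are made up.

```r
# Toy document-term counts (invented): rows are documents, columns are terms.
dtm <- rbind(
  doc1 = c(python = 2, code = 1, model = 0, theorem = 0, proof = 0),
  doc2 = c(python = 1, code = 2, model = 1, theorem = 0, proof = 0),
  doc3 = c(python = 0, code = 0, model = 0, theorem = 2, proof = 1),
  doc4 = c(python = 0, code = 0, model = 1, theorem = 1, proof = 2)
)

# Truncated SVD (a stand-in for LDA or an embedding model): keep the
# k strongest latent directions and describe each document by them.
k <- 2
decomp <- svd(dtm)
doc_vectors <- decomp$u[, 1:k] %*% diag(decomp$d[1:k])
rownames(doc_vectors) <- rownames(dtm)
colnames(doc_vectors) <- paste0("latent", 1:k)

# doc_vectors (4 documents x 2 latent dimensions) would now replace the
# raw term counts as the input features of a downstream classifier.
print(doc_vectors)
```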

textmineR: a New Text Mining Package for R

Previously, I wrote an entry on text mining in R and Python, and compared the two. The text mining package used for R there was tm, but it has some problems:

  1. The syntax is not natural to experienced R users.
  2. tm uses simple_triplet_matrix from the slam library for the document-term matrix (DTM) and the term co-occurrence matrix (TCM), which is not as widely used as dgCMatrix from the Matrix library.

Tommy Jones, a Ph.D. student at George Mason University and a data scientist at Impact Research, developed an alternative text mining package called textmineR. He presented it at a Stat Prog DC Meetup on April 27, 2016. It employs a better syntax and dgCMatrix. All in all, it is a wrapper around many existing R packages that facilitates the text mining process, such as creating DTMs with stopwords or appropriate stemming/lemmatizing functions. Here is sample code that creates a DTM from the example in the previous entry:

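library(tm)         # supplies the stopword lists used below
library(textmineR)  # supplies CreateDtm

# Three toy documents, the same example as in the previous entry
texts <- c('I love Python.',
           'R is good for analytics.',
           'Mathematics is fun.')

# Build a unigram document-term matrix: lowercase the text and drop
# stopwords, punctuation, and numbers
dtm <- CreateDtm(texts,
                 doc_names = seq_along(texts),
                 ngram_window = c(1, 1),
                 stopword_vec = c(tm::stopwords('english'), tm::stopwords('SMART')),
                 lower = TRUE,
                 remove_punctuation = TRUE,
                 remove_numbers = TRUE)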

The DTM is a sparse matrix:

3 x 6 sparse Matrix of class "dgCMatrix"
  analytics fun mathematics good python love
1         .   .           .    .      1    1
2         1   .           .    1      .    .
3         .   1           1    .      .    .
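To make the printed structure concrete, the same 3 x 6 matrix can be rebuilt by hand with sparseMatrix from the Matrix package. This is only a sketch of what the object holds, not how CreateDtm builds it internally; the variable names are assumptions. Each dot in the printout is a zero that is not stored.

```r
library(Matrix)

terms <- c("analytics", "fun", "mathematics", "good", "python", "love")

# Only the six non-zero cells are stored: (row, column, value) triples.
dtm <- sparseMatrix(
  i = c(1, 1, 2, 2, 3, 3),   # document index of each non-zero cell
  j = c(5, 6, 1, 4, 2, 3),   # term index of each non-zero cell
  x = rep(1, 6),             # the counts themselves
  dims = c(3, 6),
  dimnames = list(c("1", "2", "3"), terms)
)

term_freq <- colSums(dtm)  # how often each term occurs in the corpus
doc_len <- rowSums(dtm)    # how many kept tokens each document has

print(class(dtm))
print(term_freq)
print(doc_len)
```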

Beyond building DTMs, textmineR wraps text2vec, an R package that implements the word-embedding algorithm GloVe. It also wraps a number of topic modeling algorithms, such as latent Dirichlet allocation (LDA) and correlated topic models (CTM).

In addition, it contains a parallel looping function called TmParallelApply, analogous to mclapply from R's parallel package, but TmParallelApply works on Windows as well.

textmineR is an open-source project. Its source code is available on GitHub, along with his example code.
