textmineR: a New Text Mining Package for R

Previously, I wrote an entry comparing text mining in R and Python. The R package used there was tm, which has some problems:

  1. The syntax is not natural for an experienced R user.
  2. tm uses simple_triplet_matrix from the slam library for its document-term matrix (DTM) and term co-occurrence matrix (TCM), a class that is not as widely used as dgCMatrix from the Matrix library.
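To make the difference between the two matrix classes concrete, here is a minimal sketch of converting a simple_triplet_matrix into a dgCMatrix. The toy matrix and its dimension names are made up for illustration; the sketch only assumes that the slam and Matrix packages are installed.

```r
library(slam)
library(Matrix)

# A toy 2 x 3 matrix in slam's triplet format (hypothetical data):
# row indices i, column indices j, and the nonzero values v.
stm <- simple_triplet_matrix(i = c(1, 1, 2),
                             j = c(1, 2, 3),
                             v = c(1, 1, 2),
                             nrow = 2, ncol = 3,
                             dimnames = list(c("d1", "d2"), c("a", "b", "c")))

# The same triplets rebuilt as a compressed sparse column matrix (dgCMatrix).
dgc <- sparseMatrix(i = stm$i, j = stm$j, x = stm$v,
                    dims = c(stm$nrow, stm$ncol),
                    dimnames = stm$dimnames)
class(dgc)
```

Both objects store only the nonzero entries; the practical difference is that many other packages accept dgCMatrix directly.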

Tommy Jones, a Ph.D. student at George Mason University and a data scientist at Impact Research, developed an alternative text mining package called textmineR. He presented it at a Stat Prog DC Meetup on April 27, 2016. It offers a more natural syntax and uses dgCMatrix. All in all, it is a wrapper around many existing R packages that eases the text mining process, such as creating DTMs with stopword removal or appropriate stemming/lemmatizing functions. Here is sample code that creates a DTM from the example in the previous entry:

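library(tm)          # supplies the stopword lists used below
library(textmineR)

# three toy documents from the previous entry
texts <- c('I love Python.',
           'R is good for analytics.',
           'Mathematics is fun.')

# build a unigram document-term matrix, dropping stopwords,
# punctuation and numbers, and lowercasing all tokens
dtm <- CreateDtm(texts,
                 doc_names = seq_along(texts),
                 ngram_window = c(1, 1),
                 stopword_vec = c(tm::stopwords('english'), tm::stopwords('SMART')),
                 lower = TRUE,
                 remove_punctuation = TRUE,
                 remove_numbers = TRUE
                 )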

The DTM is a sparse matrix:

3 x 6 sparse Matrix of class "dgCMatrix"
  analytics fun mathematics good python love
1         .   .           .    .      1    1
2         1   .           .    1      .    .
3         .   1           1    .      .    .
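Because the result is an ordinary dgCMatrix, the usual Matrix operations apply to it. The sketch below rebuilds the DTM shown above by hand (so that it runs without textmineR) and computes a few simple summaries; the variable names term_freq and doc_len are my own.

```r
library(Matrix)

# Rebuild the 3 x 6 DTM printed above from its nonzero positions.
dtm <- sparseMatrix(
  i = c(1, 1, 2, 2, 3, 3),
  j = c(5, 6, 1, 4, 2, 3),
  x = rep(1, 6),
  dims = c(3, 6),
  dimnames = list(as.character(1:3),
                  c("analytics", "fun", "mathematics", "good", "python", "love"))
)

term_freq <- colSums(dtm)   # total count of each term across documents
doc_len   <- rowSums(dtm)   # number of retained tokens in each document
dtm[, "python"]             # counts of a single term, as a dense vector
```

Each document keeps two tokens after stopword removal, and each term appears exactly once in this tiny corpus.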

textmineR also wraps text2vec, an R package that provides the word-embedding algorithm GloVe, as well as a number of topic modeling algorithms, such as latent Dirichlet allocation (LDA) and correlated topic models (CTM).

In addition, it contains a parallel loop function called TmParallelApply, analogous to mclapply from R's parallel package. mclapply relies on forking, which is not available on Windows, whereas TmParallelApply works on Windows as well.
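As a small illustration of TmParallelApply, the sketch below counts whitespace-separated tokens in each document, one document per task. It assumes the X, FUN and cpus arguments as documented for the package; the choice of two CPUs and the token-counting function are arbitrary.

```r
library(textmineR)

texts <- c('I love Python.',
           'R is good for analytics.',
           'Mathematics is fun.')

# Apply the function to each document in parallel; the result is a list
# with one element per document.
n_tokens <- TmParallelApply(X = texts,
                            FUN = function(doc) length(strsplit(doc, " ")[[1]]),
                            cpus = 2)
unlist(n_tokens)
```

For a task this small the overhead of starting workers outweighs any gain; the pattern pays off when FUN is expensive, such as fitting one model per subset of a corpus.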

textmineR is an open-source project. Its source code is available on GitHub, along with the author's example code.

  • Kwan-Yuet Ho, “R or Python on Text Mining,” WordPress (2015). [WordPress]
  • textmineR: Functions for Text Mining and Topic Modeling. [CRAN]
  • GitHub: textmineR. [GitHub] (Example code: here)
  • Tommy Jones.
  • Tommy Jones, “textmineR with R: NLP with R,” Stat Prog DC Meetup. [Meetup] (Its Keynote presentation: here)