- The syntax is not natural for experienced R users.
- tm uses simple_triplet_matrix from the slam package for the document-term matrix (DTM) and the term co-occurrence matrix (TCM), which is not as widely used as dgCMatrix from the Matrix package.
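To make the difference between the two formats concrete, here is a minimal sketch of moving triplet-style data into a dgCMatrix. It assumes only the Matrix package. The object `stm` is a plain list standing in for a slam simple_triplet_matrix, which stores its entries in the same `i`, `j`, `v`, `nrow`, `ncol` and `dimnames` fields; the document and term names are made up for illustration.

```r
library(Matrix)

# Stand-in for a slam simple_triplet_matrix (hypothetical data):
# entry k sits at row i[k], column j[k] and has value v[k].
stm <- list(
  i = c(1L, 1L, 2L),
  j = c(1L, 2L, 3L),
  v = c(2, 1, 1),
  nrow = 2L,
  ncol = 3L,
  dimnames = list(c("doc1", "doc2"), c("love", "python", "good"))
)

# Build a compressed sparse column matrix (dgCMatrix) from the triplets.
dgc <- sparseMatrix(
  i = stm$i,
  j = stm$j,
  x = as.numeric(stm$v),
  dims = c(stm$nrow, stm$ncol),
  dimnames = stm$dimnames
)

class(dgc)
```

With a real slam object the same field names should apply, so the `sparseMatrix` call would stay unchanged.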
Tommy Jones, a Ph.D. student at George Mason University and a data scientist at Impact Research, developed an alternative text mining package called textmineR. He presented it at a Stat Prog DC Meetup on April 27, 2016. It uses a friendlier syntax and stores matrices as dgCMatrix. All in all, it is a wrapper around many existing R packages that eases the text mining process, such as creating DTMs with stopword removal and an appropriate stemming or lemmatizing function. Here is sample code that creates a DTM from the example in the previous entry:
library(tm)
library(textmineR)

texts <- c('I love Python.', 'R is good for analytics.', 'Mathematics is fun.')

dtm <- CreateDtm(texts,
                 doc_names = c(1:length(texts)),
                 ngram_window = c(1, 1),
                 stopword_vec = c(tm::stopwords('english'), tm::stopwords('SMART')),
                 lower = TRUE,
                 remove_punctuation = TRUE,
                 remove_numbers = TRUE)
The DTM is a sparse matrix:
3 x 6 sparse Matrix of class "dgCMatrix"
  analytics fun mathematics good python love
1         .   .           .    .      1    1
2         1   .           .    1      .    .
3         .   1           1    .      .    .
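Because the DTM is an ordinary dgCMatrix, the usual Matrix tools apply to it. The sketch below rebuilds the matrix printed above by hand, so that it runs with only the Matrix package installed, and then computes term frequencies and document lengths. The variable names `term_freq` and `doc_length` are illustrative only.

```r
library(Matrix)

# Rebuild the 3 x 6 DTM shown above (rows are documents, columns are terms).
dtm <- sparseMatrix(
  i = c(1, 1, 2, 2, 3, 3),
  j = c(5, 6, 1, 4, 2, 3),
  x = rep(1, 6),
  dims = c(3, 6),
  dimnames = list(
    c("1", "2", "3"),
    c("analytics", "fun", "mathematics", "good", "python", "love")
  )
)

# Total count of each term across all documents.
term_freq <- colSums(dtm)

# Number of retained tokens in each document.
doc_length <- rowSums(dtm)

# Index by document and term names, as with a dense matrix.
dtm["1", c("python", "love")]
```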
Beyond DTM creation, it wraps text2vec, an R package that provides the GloVe word-embedding algorithm. It also wraps a number of topic modeling algorithms, such as latent Dirichlet allocation (LDA) and correlated topic models (CTM).
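As a rough illustration of the topic modeling side, the sketch below fits a two-topic LDA model to the same three sentences. It assumes a textmineR version in which `FitLdaModel` accepts `dtm`, `k` and `iterations` and returns a list containing `theta` (documents by topics) and `phi` (topics by terms); argument names and defaults may differ between releases, and a corpus this small will not yield meaningful topics.

```r
library(textmineR)

texts <- c('I love Python.', 'R is good for analytics.', 'Mathematics is fun.')

# Build the DTM with the package's default stopword list.
dtm <- CreateDtm(texts,
                 doc_names = seq_along(texts),
                 ngram_window = c(1, 1),
                 verbose = FALSE)

# Fit LDA with k = 2 topics; the seed only makes the sampler repeatable.
set.seed(1)
model <- FitLdaModel(dtm = dtm, k = 2, iterations = 200)

dim(model$theta)  # documents x topics
dim(model$phi)    # topics x terms
```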
In addition, it contains a parallel loop function called TmParallelApply, analogous to mclapply from R's parallel package. Because mclapply relies on forking, it cannot run in parallel on Windows, whereas TmParallelApply works there as well.
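The idea of a parallel apply that runs on every operating system can be sketched with base R alone. The helper `portable_parallel_apply` below is hypothetical and is not textmineR's code; it only illustrates one way to get the behavior described, using mclapply where forking is available and a socket cluster otherwise.

```r
library(parallel)

# Hypothetical helper (not part of textmineR): apply FUN to each element
# of X in parallel on any operating system.
portable_parallel_apply <- function(X, FUN, cpus = 2) {
  if (.Platform$OS.type == "unix") {
    # Forking is available, so mclapply can be used directly.
    mclapply(X, FUN, mc.cores = cpus)
  } else {
    # Windows cannot fork; start a socket cluster instead.
    cl <- makeCluster(cpus)
    on.exit(stopCluster(cl), add = TRUE)
    parLapply(cl, X, FUN)
  }
}

texts <- c('I love Python.', 'R is good for analytics.', 'Mathematics is fun.')

# Count whitespace-separated tokens in each sentence.
word_counts <- portable_parallel_apply(texts, function(x) {
  length(strsplit(x, " ", fixed = TRUE)[[1]])
})

unlist(word_counts)
```

For three short strings the overhead of starting workers outweighs any gain; the pattern pays off only when each call to `FUN` is expensive.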