Document-Term Matrix: Text Mining in R and Python

In text mining, it is important to create the document-term matrix (DTM) of the corpus we are interested in. A DTM is basically a matrix, with documents designated by rows and words by columns, that the elements are the counts or the weights (usually by tf-idf). Subsequent analysis is usually based creatively on DTM.

Exploring with DTM therefore becomes an important issues with a good text-mining tool. How do we perform exploratory data analysis on DTM using R and Python? We will demonstrate it using the data set of U. S. Presidents’ Inaugural Address, preprocessed, and can be downloaded here.

R: textmineR

In R, we can use the package textmineR, which has been in introduced in a previous post. Together with other packages such as dplyr (for tidy data analysis) and snowBall (for stemming), load all of them at the beginning:


Load the datasets:

usprez.df<- read.csv('inaugural.csv', stringsAsFactors = FALSE)

Then we create the DTM, while we remove all digits and punctuations, make all letters lowercase, and stem all words using Porter stemmer.

dtm<- CreateDtm(usprez.df$speech,
                doc_names = usprez.df$yrprez,
                ngram_window = c(1, 1),
                lower = TRUE,
                remove_punctuation = TRUE,
                remove_numbers = TRUE,
                stem_lemma_function = wordStem)

Then defining a set of functions:

get.doc.tokens<- function(dtm, docid)
  dtm[docid, ] %>% %>% rename(count=".") %>%
  mutate(token=row.names(.)) %>% arrange(-count)

get.token.occurrences<- function(dtm, token)
  dtm[, token] %>% %>% rename(count=".") %>%
  mutate(token=row.names(.)) %>% arrange(-count)<- function(dtm, token) dtm[, token] %>% sum

get.doc.freq<- function(dtm, token)
  dtm[, token] %>% %>% rename(count=".") %>%
  filter(count>0) %>% pull(count) %>% length

Then we can happily extract information. For example, if we want to get the top-most common words in 2009’s Obama’s speech, enter:

dtm %>% get.doc.tokens('2009-Obama') %>% head(10)

Or which speeches have the word “change”: (but need to stem the word before extraction)

dtm %>% get.token.occurrences(wordStem('change')) %>% head(10)

You can also get the total number of occurrence of the words by:

dtm %>% get.doc.freq(wordStem('change'))   # gives 28

Python: shorttext

In Python, similar things can be done using the package shorttext, described in a previous post. It uses other packages such as pandas and stemming. Load all packages first:

import shorttext
import numpy as np
import pandas as pd
from stemming.porter import stem

import re

And define the preprocessing pipelines:

pipeline = [lambda s: re.sub('[^\w\s]', '', s),
            lambda s: re.sub('[\d]', '', s),
            lambda s: s.lower(),
            lambda s: ' '.join(map(stem, shorttext.utils.tokenize(s)))
txtpreproceesor = shorttext.utils.text_preprocessor(pipeline)

The function <code>txtpreprocessor</code> above perform the functions we talked about in R.

Load the dataset:

usprezdf = pd.read_csv('inaugural.csv')

The corpus needs to be preprocessed before putting into the DTM:

docids = list(usprezdf['yrprez'])    # defining document IDs
corpus = [txtpreproceesor(speech).split(' ') for speech in usprezdf['speech']]

Then create the DTM:

dtm = shorttext.utils.DocumentTermMatrix(corpus, docids=docids, tfidf=False)

Then we do the same thing as we have done above. To get the top-most common words in 2009’s Obama’s speech, enter:


Or we look up which speeches have the word “change”:


Or to get the document frequency of the word:


They Python and R codes give different document frequencies probably because the two stemmers work slightly differently.

Continue reading “Document-Term Matrix: Text Mining in R and Python”

rJava: Running Java from R, and Building R Packages Wrapping a .jar

While performing exploratory analysis, R is a good tool, although we sometimes want to invoke some stable Java tools. It is what the R Package rJava is for. To install it, simply enter on the R Console:


And to load it, enter:


As a simple demonstration, we find the length of a strength. Start the JVM, enter:


Then we create an instance of a Java string, and find its length as follow:

s <- .jnew('java/lang/String', 'Hello World!')
.jcall(s, 'I', 'length')

The first line, with the function .jnew, create a Java string instance. It is safe to put the full package path of the class. The second line, with the function .jcall, call the method length() for String. The second parameter, ‘I’, indicates it returns an integer. The type has to follow the JNI notation for native types. If it is an integer double array, type ‘I[[‘. If it is not a native class like String, use its total package path.

Example: Peter Norvig’s Spell Corrector Written in Scala

What should we do if we already have a .jar file we want to wrap? I would start with a simple one. Two years ago, I implemented Peter Norvig’s spell corrector (see his article) in Scala (which is a language for Java Virtual Machine (JVM) as well, see this entry), and posted on my Github repository: stephenhky/SpellCorrector. You may check out to your Eclipse or IntelliJ IDEA, and build a .jar file. (Or you can download the .jar file here.) For the program to run, do not forget to download his corpus named big.txt. The project has a class called SpellCorrector, which only the necessary codes are listed below:

package home.kwyho.spellcheck


import scala.collection.mutable.Map

class SpellCorrector {
 var wordCounts : Map[String, Int] = Map()
 val alphabets = ('a' to 'z').toSet

 def train(trainFile : File) = {
    val lines = Source.fromFile(trainFile) mkString
    val wordREPattern = "[A-Za-z]+"
    wordREPattern.r.findAllIn(lines).foreach( txtWord => {
       val word = txtWord.toLowerCase
       if (wordCounts.keySet contains(word)) {
          wordCounts(word) = wordCounts(word)+1
       } else {
          wordCounts += (word -> 1)

// other codes here ....

 def correct(wrongSpelling: String) : String = {
    val edit0words = Set(wrongSpelling) intersect wordCounts.keySet
    if (edit0words.size>0) return edit0words.maxBy( s => wordCounts(s))
    val edit1words = getEditOneSpellings(wrongSpelling)
    if (edit1words.size>0) return edit1words.maxBy( s => wordCounts(s))
    val edit2words = getEditTwoSpellings(wrongSpelling)
    edit2words.maxBy( s => wordCounts(s))

Putting the .jar file and big.txt into the same folder. Then initialize the JVM, and add the .jar file into the classpath:


Create an instance for SpellChecker, and train the corpus big.txt. Remember to put the whole package path as the class:

corrector <- .jnew('home/kwyho/spellcheck/SpellCorrector')
bigfile <- .jnew('java/io/File', 'big.txt')
.jcall(corrector, 'V', 'train', bigfile)

The first line create a SpellChecker instance, the second line create a File instance for big.txt, and the third line call the train() method. The JNI notation ‘V’ denotes ‘void.’ Entering ‘corrector’ will give a string indicates it is a Java object:

[1] "Java-Object{home.kwyho.spellcheck.SpellCorrector@5812f9ee}"

Then we can do spell correction by designing the following function:

correct<-function(word) {
   javaStrtext <- .jnew('java/lang/String', word)
   .jcall(corrector, 'Ljava/lang/String;', 'correct', javaStrtext)

Then you can easily perform spell correction as follow:


Some people put .class file instead of .jar file. In that case, you need to put the compiled Java class into the working directory. You can refer to an entry in Darren Wilkinson’s research blog for more details.

Building an R Package

It is another matter to build an R package that wraps a .jar file. In Hilary Parker’s entry and my previous entry, there are details about building an R package with roxygen2. There is also a¬†documentation¬†written by Tobias Verbeke.

So to start building it, in RStudio, start a project by clicking on the button “Project: (None)” on the top right corner of RStudio, choose “New Directory,” and then “R Package.” Type in the name (“RSpellCorrection” here), and specify a directory. Then click “Create Project.” A new RStudio window will show up. From the menu bar, choose “Build” > “Configure Build Tools”. Then click on “Configure…” button. There is a dialog box coming out. Check everything, and click “OK”.


The instructions above are rather detailed. But starting from now, I will skip the procedural details. Then start a file named, say, onLoad.R under the subfolder R/, and put the following codes there:

.onLoad <- function(libname, pkgname) {
  .jpackage(pkgname, lib.loc=libname)

This is a hook function that R will call when this package is being loaded. You must include it. Then in the file named DESCRIPTION, put in the relevant information:

Package: RSpellCorrection
Type: Package
Title: Spell Correction, Scala implementation run in R
Version: 0.1.0
Author: Kwan-Yuet Ho, Ph.D.
Maintainer: Kwan-Yuet Ho, Ph.D. <>
Description: Implementation of Peter Norvig's spell corrector in Scala, wrapped in R
License: N/A
LazyData: TRUE
RoxygenNote: 5.0.1
Depends: R(>= 2.7.0), rJava (>= 0.5-0)

Note the last line (“Depends…”), which you have to include because R will parse this line, and load rJava automatically. Remember there is a space between “>=” and the version number. Do not use library function in your code.

First, create a subfolder inst/java, and put the .jar file there.

Then start a file, called correct.R under subfolder R/, and write a function:

#' Retrieve a Java instance of SpellCorrector.
#' Retrieve a Java instance of SpellCorrector, with the training file
#' specified. Language model is trained before the instance is returned.
#' The spell corrector is adapted from Peter Norvig's demonstration.
#' @param filepath Path of the corpus.
#' @return a Java instance of SpellCorrector
#' @export
getcorrector<-function(filepath='big.txt') {
    .jaddLibrary('spellchecker', 'inst/java/spellcorrector.jar')
    corrector<- .jnew('home/kwyho/spellcheck/SpellCorrector')
    bigfile<- .jnew('java/io/File', filepath)
    .jcall(corrector, 'V', 'train', bigfile)

This return a Java instance of SpellCorrector as in previous section. There is a large block of text above the function, and they are for producing manual using roxygen2. The tag “@export” is important to tell roxygen2 to make this function visible to the users.

Then add another function:

#' Correct spelling.
#' Given an instance of SpellCorrector, return the most probably
#' corrected spelling of the given word.
#' @param word A token.
#' @param corrector A Java instance of SpellCorrector, given by \code{getcorrector}.
#' @return Corrected spelling
#' @export
correct<-function(word, corrector) {
    javaStrtext <- .jnew('java/lang/String', word)
    .jcall(corrector, 'Ljava/lang/String;', 'correct', javaStrtext)

Then click “Build & Reload” button on the “Build” Tab:


Then the package will be built, and reloaded. The manual documents (*.Rd) will be produced as well. You can then play with the spell corrector again like this:


Assuming you put this into the Github repository like I did (link here), you can install the new R package like this:


Then the R package will be downloaded, and installed for use. Or another option is that if you wish to install from your local directory, just enter:

install.packages('<path-to>/RSpellCorrection', repos = NULL, type = 'source')

A complete version of this R package can be found in my Github repository: stephenhky/RSpellCorrection. You may want to add a into the repository, which you need to know the Markdown language by referring to Lei Feng’s blog entry.

Continue reading “rJava: Running Java from R, and Building R Packages Wrapping a .jar”

Create a free website or blog at

Up ↑