rJava: Running Java from R, and Building R Packages Wrapping a .jar

While performing exploratory analysis, R is a good tool, although we sometimes want to invoke some stable Java tools. It is what the R Package rJava is for. To install it, simply enter on the R Console:

install.packages('rJava')

And to load it, enter:

library(rJava)

As a simple demonstration, we find the length of a strength. Start the JVM, enter:

.jinit('.')

Then we create an instance of a Java string, and find its length as follow:

s <- .jnew('java/lang/String', 'Hello World!')
.jcall(s, 'I', 'length')

The first line, with the function .jnew, create a Java string instance. It is safe to put the full package path of the class. The second line, with the function .jcall, call the method length() for String. The second parameter, ‘I’, indicates it returns an integer. The type has to follow the JNI notation for native types. If it is an integer double array, type ‘I[[‘. If it is not a native class like String, use its total package path.

Example: Peter Norvig’s Spell Corrector Written in Scala

What should we do if we already have a .jar file we want to wrap? I would start with a simple one. Two years ago, I implemented Peter Norvig’s spell corrector (see his article) in Scala (which is a language for Java Virtual Machine (JVM) as well, see this entry), and posted on my Github repository: stephenhky/SpellCorrector. You may check out to your Eclipse or IntelliJ IDEA, and build a .jar file. (Or you can download the .jar file here.) For the program to run, do not forget to download his corpus named big.txt. The project has a class called SpellCorrector, which only the necessary codes are listed below:

package home.kwyho.spellcheck

/*
 Reference: http://norvig.com/spell-correct.html
 */

import java.io.File
import scala.io.Source
import scala.collection.mutable.Map

class SpellCorrector {
 var wordCounts : Map[String, Int] = Map()
 val alphabets = ('a' to 'z').toSet

 def train(trainFile : File) = {
    val lines = Source.fromFile(trainFile) mkString
    val wordREPattern = "[A-Za-z]+"
    wordREPattern.r.findAllIn(lines).foreach( txtWord => {
       val word = txtWord.toLowerCase
       if (wordCounts.keySet contains(word)) {
          wordCounts(word) = wordCounts(word)+1
       } else {
          wordCounts += (word -> 1)
       }
    })
 }

// other codes here ....

 def correct(wrongSpelling: String) : String = {
    val edit0words = Set(wrongSpelling) intersect wordCounts.keySet
    if (edit0words.size>0) return edit0words.maxBy( s => wordCounts(s))
    val edit1words = getEditOneSpellings(wrongSpelling)
    if (edit1words.size>0) return edit1words.maxBy( s => wordCounts(s))
    val edit2words = getEditTwoSpellings(wrongSpelling)
    edit2words.maxBy( s => wordCounts(s))
 }
}

Putting the .jar file and big.txt into the same folder. Then initialize the JVM, and add the .jar file into the classpath:

.jinit('.')
.jaddClassPath('spellcorrector.jar')

Create an instance for SpellChecker, and train the corpus big.txt. Remember to put the whole package path as the class:

corrector <- .jnew('home/kwyho/spellcheck/SpellCorrector')
bigfile <- .jnew('java/io/File', 'big.txt')
.jcall(corrector, 'V', 'train', bigfile)

The first line create a SpellChecker instance, the second line create a File instance for big.txt, and the third line call the train() method. The JNI notation ‘V’ denotes ‘void.’ Entering ‘corrector’ will give a string indicates it is a Java object:

[1] "Java-Object{home.kwyho.spellcheck.SpellCorrector@5812f9ee}"

Then we can do spell correction by designing the following function:

correct<-function(word) {
   javaStrtext <- .jnew('java/lang/String', word)
   .jcall(corrector, 'Ljava/lang/String;', 'correct', javaStrtext)
}

Then you can easily perform spell correction as follow:

img

Some people put .class file instead of .jar file. In that case, you need to put the compiled Java class into the working directory. You can refer to an entry in Darren Wilkinson’s research blog for more details.

Building an R Package

It is another matter to build an R package that wraps a .jar file. In Hilary Parker’s entry and my previous entry, there are details about building an R package with roxygen2. There is also a¬†documentation¬†written by Tobias Verbeke.

So to start building it, in RStudio, start a project by clicking on the button “Project: (None)” on the top right corner of RStudio, choose “New Directory,” and then “R Package.” Type in the name (“RSpellCorrection” here), and specify a directory. Then click “Create Project.” A new RStudio window will show up. From the menu bar, choose “Build” > “Configure Build Tools”. Then click on “Configure…” button. There is a dialog box coming out. Check everything, and click “OK”.

img1.png

The instructions above are rather detailed. But starting from now, I will skip the procedural details. Then start a file named, say, onLoad.R under the subfolder R/, and put the following codes there:

.onLoad <- function(libname, pkgname) {
  .jpackage(pkgname, lib.loc=libname)
}

This is a hook function that R will call when this package is being loaded. You must include it. Then in the file named DESCRIPTION, put in the relevant information:

Package: RSpellCorrection
Type: Package
Title: Spell Correction, Scala implementation run in R
Version: 0.1.0
Author: Kwan-Yuet Ho, Ph.D.
Maintainer: Kwan-Yuet Ho, Ph.D. <stephenhky@yahoo.com.hk>
Description: Implementation of Peter Norvig's spell corrector in Scala, wrapped in R
License: N/A
LazyData: TRUE
RoxygenNote: 5.0.1
Depends: R(>= 2.7.0), rJava (>= 0.5-0)

Note the last line (“Depends…”), which you have to include because R will parse this line, and load rJava automatically. Remember there is a space between “>=” and the version number. Do not use library function in your code.

First, create a subfolder inst/java, and put the .jar file there.

Then start a file, called correct.R under subfolder R/, and write a function:

#' Retrieve a Java instance of SpellCorrector.
#'
#' Retrieve a Java instance of SpellCorrector, with the training file
#' specified. Language model is trained before the instance is returned.
#' The spell corrector is adapted from Peter Norvig's demonstration.
#'
#' @param filepath Path of the corpus.
#' @return a Java instance of SpellCorrector
#' @export
getcorrector<-function(filepath='big.txt') {
    .jaddLibrary('spellchecker', 'inst/java/spellcorrector.jar')
    .jaddClassPath('inst/java/spellcorrector.jar')
    corrector<- .jnew('home/kwyho/spellcheck/SpellCorrector')
    bigfile<- .jnew('java/io/File', filepath)
    .jcall(corrector, 'V', 'train', bigfile)
    return(corrector)
}

This return a Java instance of SpellCorrector as in previous section. There is a large block of text above the function, and they are for producing manual using roxygen2. The tag “@export” is important to tell roxygen2 to make this function visible to the users.

Then add another function:

#' Correct spelling.
#'
#' Given an instance of SpellCorrector, return the most probably
#' corrected spelling of the given word.
#'
#' @param word A token.
#' @param corrector A Java instance of SpellCorrector, given by \code{getcorrector}.
#' @return Corrected spelling
#' @export
correct<-function(word, corrector) {
    javaStrtext <- .jnew('java/lang/String', word)
    .jcall(corrector, 'Ljava/lang/String;', 'correct', javaStrtext)
}

Then click “Build & Reload” button on the “Build” Tab:

img2.png

Then the package will be built, and reloaded. The manual documents (*.Rd) will be produced as well. You can then play with the spell corrector again like this:

img3

Assuming you put this into the Github repository like I did (link here), you can install the new R package like this:

library(devtools)
install_github('stephenhky/RSpellCorrection')

Then the R package will be downloaded, and installed for use. Or another option is that if you wish to install from your local directory, just enter:

install.packages('<path-to>/RSpellCorrection', repos = NULL, type = 'source')

A complete version of this R package can be found in my Github repository: stephenhky/RSpellCorrection. You may want to add a README.md into the repository, which you need to know the Markdown language by referring to Lei Feng’s blog entry.

Continue reading “rJava: Running Java from R, and Building R Packages Wrapping a .jar”

Choices of Tools

When dealing with data analytics, what kind of things do we usually spend most of our time on?

I would say data cleaning and modeling.

Therefore, it is not merely software development. While we sometimes spend a lot of time in software architecture (which is important), before doing that, we have to explore what we want. Very often, data come in various formats, or we need to manually clean them. And very often we do not know which algorithms to use. We need to explore different ways to perform the experiments before determining what to include in the software project.

That’s why interactive programming comes into place for analytics project. R and MATLAB are these examples. However, they provide poor support for modularizing the codes. Python is a good tool that supports both modularization and interactive programming, but it takes an environment to run Python, which is very often a pain. Provided that a lot of good libraries are written in Java, having the need to perform both software development and data analytics, Scala, a JVM language that supports interactive programming, will be the next generation of programming language.

IMG_20150107_201432

Scala as the Next Influential Programming Language

I have been learning Scala. Some time ago, I doubted if it’s worth it as the learning curve is quite steep. But today I read the first chapter of my newly ordered book, titled Advanced Analytics Using Spark, a tool written in Scala for handling big data analytics, I reassured that I bet on the right thing.

I believe it will be the most common programming language the coming generation in this big data era because:

  1. It runs on JVM: a lot of libraries have been maintained as Java packages. Why do we discard Java if everything is getting more perfect from time to time? It is the same reason why we do not discard our old Fortran codes in scientific computing, but to wrap them in MATLAB or Python.
  2. It is an object-oriented: we learned about modularization and design patterns all the time. It keeps the strength of Java.
  3. It is functional: analytics involve functions. We want to handle functions flexibly. It shortens our codes, and makes our codes more readable (provided that we write appropriately). Mathematical manipulation is easier when we can handle operations with fewer codes. Lambda expressions are available.
  4. Interactive programming is available: what makes R and Python great is its availability to program interactively, especially handling data and mathematical models. And yes, this is also available in Scala.
  5. Parallel computing comes naturally: with actors or additional packages like Spark, Scala is well suited for scalable huge data computing. This is something that R and Python lack.

scalacodes

Continue reading “Scala as the Next Influential Programming Language”

Create a free website or blog at WordPress.com.

Up ↑