Essential Python Packages

Almost three years ago, I wrote a blog entry titled Useful Python Packages, which listed the essential packages that I deemed important. How has the list been changed over the past three years?

First of all, three years ago, most people were still writing Python 2.7. But now there is a trend to switch to Python 3. I admitted that I still have not started the switch yet, but in the short term, I will have no choice and I will.

What are some of the essential packages?
Numerical Packages

  • numpy: numerical Python, containing most basic numerical routines such as matrix manipulation, linear algebra, random sampling, numerical integration etc. There is a built-in wrapper for Fortran as well. Actually, numpy is so important that some Linux system includes it with Python.
  • scipy: scientific Python, containing some functions useful for scientific computing, such as sparse matrices, numerical differential equations, advanced linear algebra, special functions etc.
  • networkx: package that handles various types of networks
  • PuLP: linear programming
  • cvxopt: convex optimization

Data Visualization

  • matplotlib: basic plotting.
  • ggplot2: the ggplot2 counterpart in Python for producing quality publication plots.

Data Manipulation

  • pandas: data manipulation, working with data frames in Python, and save/load of various formats such as CSV and Excel

Machine Learning

  • scikit-learn: machine-learning library in Python, containing classes and functions for supervised and unsupervised learning

Probabilistic Programming

  • PyMC: Metropolis-Hasting algorithm
  • Edward: deep probabilistic programing

Deep Learning Frameworks

  • TensorFlow: because of Google’s marketing effort, TensorFlow is now the industrial standard for building deep learning networks, with rich source of mathematical functions, esp. for neural network cells, with GPU capability
  • Keras: containing routines of high-level layers for deep learning neural networks, with TensorFlow, Theano, or CNTK as the backbone
  • PyTorch: a rivalry against TensorFlow

Natural Language Processing

  • nltk: natural language processing toolkit for Python, containing bag-of-words model, tokenizer, stemmers, chunker, lemmatizers, part-of-speech taggers etc.
  • gensim: a useful natural language processing package useful for topic modeling, word-embedding, latent semantic indexing etc., running in a fast fashion
  • shorttext: text mining package good for handling short sentences, that provide high-level routines for training neural network classifiers, or generating feature represented by topic models or autoencodings.
  • spacy: industrial standard for natural language processing common tools


I can probably list more, but I think I covered most of them. If you do not find something useful, it is probably time for you to write a brand new package.

Document-Term Matrix: Text Mining in R and Python

In text mining, it is important to create the document-term matrix (DTM) of the corpus we are interested in. A DTM is basically a matrix, with documents designated by rows and words by columns, that the elements are the counts or the weights (usually by tf-idf). Subsequent analysis is usually based creatively on DTM.

Exploring with DTM therefore becomes an important issues with a good text-mining tool. How do we perform exploratory data analysis on DTM using R and Python? We will demonstrate it using the data set of U. S. Presidents’ Inaugural Address, preprocessed, and can be downloaded here.

R: textmineR

In R, we can use the package textmineR, which has been in introduced in a previous post. Together with other packages such as dplyr (for tidy data analysis) and snowBall (for stemming), load all of them at the beginning:


Load the datasets:

usprez.df<- read.csv('inaugural.csv', stringsAsFactors = FALSE)

Then we create the DTM, while we remove all digits and punctuations, make all letters lowercase, and stem all words using Porter stemmer.

dtm<- CreateDtm(usprez.df$speech,
                doc_names = usprez.df$yrprez,
                ngram_window = c(1, 1),
                lower = TRUE,
                remove_punctuation = TRUE,
                remove_numbers = TRUE,
                stem_lemma_function = wordStem)

Then defining a set of functions:

get.doc.tokens<- function(dtm, docid)
  dtm[docid, ] %>% %>% rename(count=".") %>%
  mutate(token=row.names(.)) %>% arrange(-count)

get.token.occurrences<- function(dtm, token)
  dtm[, token] %>% %>% rename(count=".") %>%
  mutate(token=row.names(.)) %>% arrange(-count)<- function(dtm, token) dtm[, token] %>% sum

get.doc.freq<- function(dtm, token)
  dtm[, token] %>% %>% rename(count=".") %>%
  filter(count>0) %>% pull(count) %>% length

Then we can happily extract information. For example, if we want to get the top-most common words in 2009’s Obama’s speech, enter:

dtm %>% get.doc.tokens('2009-Obama') %>% head(10)

Or which speeches have the word “change”: (but need to stem the word before extraction)

dtm %>% get.token.occurrences(wordStem('change')) %>% head(10)

You can also get the total number of occurrence of the words by:

dtm %>% get.doc.freq(wordStem('change'))   # gives 28

Python: shorttext

In Python, similar things can be done using the package shorttext, described in a previous post. It uses other packages such as pandas and stemming. Load all packages first:

import shorttext
import numpy as np
import pandas as pd
from stemming.porter import stem

import re

And define the preprocessing pipelines:

pipeline = [lambda s: re.sub('[^\w\s]', '', s),
            lambda s: re.sub('[\d]', '', s),
            lambda s: s.lower(),
            lambda s: ' '.join(map(stem, shorttext.utils.tokenize(s)))
txtpreproceesor = shorttext.utils.text_preprocessor(pipeline)

The function <code>txtpreprocessor</code> above perform the functions we talked about in R.

Load the dataset:

usprezdf = pd.read_csv('inaugural.csv')

The corpus needs to be preprocessed before putting into the DTM:

docids = list(usprezdf['yrprez'])    # defining document IDs
corpus = [txtpreproceesor(speech).split(' ') for speech in usprezdf['speech']]

Then create the DTM:

dtm = shorttext.utils.DocumentTermMatrix(corpus, docids=docids, tfidf=False)

Then we do the same thing as we have done above. To get the top-most common words in 2009’s Obama’s speech, enter:


Or we look up which speeches have the word “change”:


Or to get the document frequency of the word:


They Python and R codes give different document frequencies probably because the two stemmers work slightly differently.

Continue reading “Document-Term Matrix: Text Mining in R and Python”

Release of shorttext 0.5.4

The Python package for text mining shorttext has a new release: 0.5.4. It can be installed by typing in the command line:

pip install -U shorttext

For some people, you may need to install it from “root”, i.e., adding sudo in front of the command. Since the version 0.5 (including releases 0.5.1 and 0.5.4), there have been substantial addition of functionality, mostly about comparisons between short phrases without running a supervised or unsupervised machine learning algorithm, but calculating the “similarity” with various metrics, including:

  • soft Jaccard score (the same kind of fuzzy scores based on edit distance in SOCcer),
  • Word Mover’s distance (WMD, detailedly described in a previous post), and
  • Jaccard index due to word-embedding model.

For the soft Jaccard score due to edit distance, we can call it by:

>>> from shorttext.metrics.dynprog import soft_jaccard_score
>>> soft_jaccard_score(['book', 'seller'], ['blok', 'sellers'])     # gives 0.6716417910447762
>>> soft_jaccard_score(['police', 'station'], ['policeman'])        # gives 0.2857142857142858

The core of this code was written in C, and interfaced to Python using SWIG.

For the Word Mover’s Distance (WMD), while the source codes are the same as my previous post, it can now be called directly. First, load the modules and the word-embedding model:

>>> from shorttext.metrics.wasserstein import word_mover_distance
>>> from shorttext.utils import load_word2vec_model
>>> wvmodel = load_word2vec_model('/path/to/model_file.bin')

And compute the WMD with a single function:

>>> word_mover_distance(['police', 'station'], ['policeman'], wvmodel)                      # gives 3.060708999633789
>>> word_mover_distance(['physician', 'assistant'], ['doctor', 'assistants'], wvmodel)      # gives 2.276337146759033

And the Jaccard index due to cosine distance in Word-embedding model can be called like this:

>>> from shorttext.metrics.embedfuzzy import jaccardscore_sents
>>> jaccardscore_sents('doctor', 'physician', wvmodel)   # gives 0.6401538990056869
>>> jaccardscore_sents('chief executive', 'computer cluster', wvmodel)   # gives 0.0022515450768836143
>>> jaccardscore_sents('topological data', 'data of topology', wvmodel)   # gives 0.67588977344632573

Most new functions can be found in this tutorial.

And there are some minor bugs fixed.

Continue reading “Release of shorttext 0.5.4”

Release of shorttext 0.3.3

On November 21, 2016, the Python package `shorttext’ was published. Until today, more than seven versions have been published. There have been a drastic architecture change, but the overall purpose is still the same, as summarized in the first introduction entry:

This package `shorttext‘ was designed to tackle all these problems… It contains the following features:

  • example data provided (including subject keywords and NIH RePORT);
  • text preprocessing;
  • pre-trained word-embedding support;
  • gensim topic models (LDA, LSI, Random Projections) and autoencoder;
  • topic model representation supported for supervised learning using scikit-learn;
  • cosine distance classification; and
  • neural network classification (including ConvNet, and C-LSTM).

And since the first version, there have been updates, as summarized in the documention (News):

Version 0.3.3 (Apr 19, 2017)

  • Deleted CNNEmbedVecClassifier.
  • Added script ShortTextWord2VecSimilarity.

Version 0.3.2 (Mar 28, 2017)

  • Bug fixed for gensim model I/O;
  • Console scripts update;
  • Neural networks up to Keras 2 standard (refer to this).

Version 0.3.1 (Mar 14, 2017)

  • Compact model I/O: all models are in single files;
  • Implementation of stacked generalization using logistic regression.

Version 0.2.1 (Feb 23, 2017)

  • Removal attempts of loading GloVe model, as it can be run using gensim script;
  • Confirmed compatibility of the package with tensorflow;
  • Use of spacy for tokenization, instead of nltk;
  • Use of stemming for Porter stemmer, instead of nltk;
  • Removal of nltk dependencies;
  • Simplifying the directory and module structures;
  • Module packages updated.

Although there are still additions that I would love to add, but it would not change the overall architecture. I may add some more supervised learning algorithms, but under the same network. The upcoming big additions will be generative models or seq2seq models, but I do not see them coming in the short term. I will add corpuses.

I may add tutorials if I have time.

I am thankful that there is probably some external collaboration with other Python packages. Some people have already made some useful contributions. It will be updated if more things are confirmed.

Continue reading “Release of shorttext 0.3.3”

Release of shorttext 0.2.1

The package shorttext has received attention for the past two months. A new release is released yesterday for the following updates:

  1. Removal attempts of loading GloVe model, as it can be run using gensim script;
  2. Confirmed compatibility of the package with Tensorflow;
  3. Use of spacy for tokenization, instead of nltk;
  4. Use of stemming for Porter stemmer, instead of nltk;
  5. Removal of nltk dependencies;
  6. Simplifying the directory and module structures;
  7. Module packages updated.

For #1, it actually removed a bug in the previous release. Instead, the users should convert the GloVe models into Word2Vec using the script provided by gensim.

For #3, #4, and #5, it is basically removing any nltk dependencies, because very few functionalities of nltk was used, and it is slow. For Porter stemmer, there is a light-weighted library stemming that performs the task perfectly. For tokenization, the tokenizer in spaCy is significantly faster than nltk, as shown in this Jupyter Notebook. We can do a simple test here, by first importing:

import time
import shorttext

Then load the NIH data:

nihdata =
nihtext = ' '.join(map(lambda item: ' '.join(item[1]), nihdata.items()))

Then find the time of using the tokenizer in nltk:

from nltk import word_tokenize

nltkt0 = time.time()
tokens = word_tokenize(nihtext)
nltkt1 = time.time()
print nltkt1-nltkt0, ' sec'   # output: 0.0224239826202 sec

On the other hand, using spaCy gives:

import spacy
nlp = spacy.load('en')

spt0 = time.time()
doc = nlp(unicode(nihtext))
tokens1 = [token for token in doc]
tokens1 = map(str, tokens1)
spt1 = time.time()

print spt1-spt0, ' sec'   # output: 0.00799107551575 sec

Clearly, spaCy is three times faster.

#6 indicates a simplification of package structure. Previously, for example, the neural network framework was in shorttext.classifiers.embed.nnlib.frameworks, but now it is shorttext.classifiers.frameworks. But the old package structure is kept for backward compatibility.

Continue reading “Release of shorttext 0.2.1”

rJava: Running Java from R, and Building R Packages Wrapping a .jar

While performing exploratory analysis, R is a good tool, although we sometimes want to invoke some stable Java tools. It is what the R Package rJava is for. To install it, simply enter on the R Console:


And to load it, enter:


As a simple demonstration, we find the length of a strength. Start the JVM, enter:


Then we create an instance of a Java string, and find its length as follow:

s <- .jnew('java/lang/String', 'Hello World!')
.jcall(s, 'I', 'length')

The first line, with the function .jnew, create a Java string instance. It is safe to put the full package path of the class. The second line, with the function .jcall, call the method length() for String. The second parameter, ‘I’, indicates it returns an integer. The type has to follow the JNI notation for native types. If it is an integer double array, type ‘I[[‘. If it is not a native class like String, use its total package path.

Example: Peter Norvig’s Spell Corrector Written in Scala

What should we do if we already have a .jar file we want to wrap? I would start with a simple one. Two years ago, I implemented Peter Norvig’s spell corrector (see his article) in Scala (which is a language for Java Virtual Machine (JVM) as well, see this entry), and posted on my Github repository: stephenhky/SpellCorrector. You may check out to your Eclipse or IntelliJ IDEA, and build a .jar file. (Or you can download the .jar file here.) For the program to run, do not forget to download his corpus named big.txt. The project has a class called SpellCorrector, which only the necessary codes are listed below:

package home.kwyho.spellcheck


import scala.collection.mutable.Map

class SpellCorrector {
 var wordCounts : Map[String, Int] = Map()
 val alphabets = ('a' to 'z').toSet

 def train(trainFile : File) = {
    val lines = Source.fromFile(trainFile) mkString
    val wordREPattern = "[A-Za-z]+"
    wordREPattern.r.findAllIn(lines).foreach( txtWord => {
       val word = txtWord.toLowerCase
       if (wordCounts.keySet contains(word)) {
          wordCounts(word) = wordCounts(word)+1
       } else {
          wordCounts += (word -> 1)

// other codes here ....

 def correct(wrongSpelling: String) : String = {
    val edit0words = Set(wrongSpelling) intersect wordCounts.keySet
    if (edit0words.size>0) return edit0words.maxBy( s => wordCounts(s))
    val edit1words = getEditOneSpellings(wrongSpelling)
    if (edit1words.size>0) return edit1words.maxBy( s => wordCounts(s))
    val edit2words = getEditTwoSpellings(wrongSpelling)
    edit2words.maxBy( s => wordCounts(s))

Putting the .jar file and big.txt into the same folder. Then initialize the JVM, and add the .jar file into the classpath:


Create an instance for SpellChecker, and train the corpus big.txt. Remember to put the whole package path as the class:

corrector <- .jnew('home/kwyho/spellcheck/SpellCorrector')
bigfile <- .jnew('java/io/File', 'big.txt')
.jcall(corrector, 'V', 'train', bigfile)

The first line create a SpellChecker instance, the second line create a File instance for big.txt, and the third line call the train() method. The JNI notation ‘V’ denotes ‘void.’ Entering ‘corrector’ will give a string indicates it is a Java object:

[1] "Java-Object{home.kwyho.spellcheck.SpellCorrector@5812f9ee}"

Then we can do spell correction by designing the following function:

correct<-function(word) {
   javaStrtext <- .jnew('java/lang/String', word)
   .jcall(corrector, 'Ljava/lang/String;', 'correct', javaStrtext)

Then you can easily perform spell correction as follow:


Some people put .class file instead of .jar file. In that case, you need to put the compiled Java class into the working directory. You can refer to an entry in Darren Wilkinson’s research blog for more details.

Building an R Package

It is another matter to build an R package that wraps a .jar file. In Hilary Parker’s entry and my previous entry, there are details about building an R package with roxygen2. There is also a documentation written by Tobias Verbeke.

So to start building it, in RStudio, start a project by clicking on the button “Project: (None)” on the top right corner of RStudio, choose “New Directory,” and then “R Package.” Type in the name (“RSpellCorrection” here), and specify a directory. Then click “Create Project.” A new RStudio window will show up. From the menu bar, choose “Build” > “Configure Build Tools”. Then click on “Configure…” button. There is a dialog box coming out. Check everything, and click “OK”.


The instructions above are rather detailed. But starting from now, I will skip the procedural details. Then start a file named, say, onLoad.R under the subfolder R/, and put the following codes there:

.onLoad <- function(libname, pkgname) {
  .jpackage(pkgname, lib.loc=libname)

This is a hook function that R will call when this package is being loaded. You must include it. Then in the file named DESCRIPTION, put in the relevant information:

Package: RSpellCorrection
Type: Package
Title: Spell Correction, Scala implementation run in R
Version: 0.1.0
Author: Kwan-Yuet Ho, Ph.D.
Maintainer: Kwan-Yuet Ho, Ph.D. <>
Description: Implementation of Peter Norvig's spell corrector in Scala, wrapped in R
License: N/A
LazyData: TRUE
RoxygenNote: 5.0.1
Depends: R(>= 2.7.0), rJava (>= 0.5-0)

Note the last line (“Depends…”), which you have to include because R will parse this line, and load rJava automatically. Remember there is a space between “>=” and the version number. Do not use library function in your code.

First, create a subfolder inst/java, and put the .jar file there.

Then start a file, called correct.R under subfolder R/, and write a function:

#' Retrieve a Java instance of SpellCorrector.
#' Retrieve a Java instance of SpellCorrector, with the training file
#' specified. Language model is trained before the instance is returned.
#' The spell corrector is adapted from Peter Norvig's demonstration.
#' @param filepath Path of the corpus.
#' @return a Java instance of SpellCorrector
#' @export
getcorrector<-function(filepath='big.txt') {
    .jaddLibrary('spellchecker', 'inst/java/spellcorrector.jar')
    corrector<- .jnew('home/kwyho/spellcheck/SpellCorrector')
    bigfile<- .jnew('java/io/File', filepath)
    .jcall(corrector, 'V', 'train', bigfile)

This return a Java instance of SpellCorrector as in previous section. There is a large block of text above the function, and they are for producing manual using roxygen2. The tag “@export” is important to tell roxygen2 to make this function visible to the users.

Then add another function:

#' Correct spelling.
#' Given an instance of SpellCorrector, return the most probably
#' corrected spelling of the given word.
#' @param word A token.
#' @param corrector A Java instance of SpellCorrector, given by \code{getcorrector}.
#' @return Corrected spelling
#' @export
correct<-function(word, corrector) {
    javaStrtext <- .jnew('java/lang/String', word)
    .jcall(corrector, 'Ljava/lang/String;', 'correct', javaStrtext)

Then click “Build & Reload” button on the “Build” Tab:


Then the package will be built, and reloaded. The manual documents (*.Rd) will be produced as well. You can then play with the spell corrector again like this:


Assuming you put this into the Github repository like I did (link here), you can install the new R package like this:


Then the R package will be downloaded, and installed for use. Or another option is that if you wish to install from your local directory, just enter:

install.packages('<path-to>/RSpellCorrection', repos = NULL, type = 'source')

A complete version of this R package can be found in my Github repository: stephenhky/RSpellCorrection. You may want to add a into the repository, which you need to know the Markdown language by referring to Lei Feng’s blog entry.

Continue reading “rJava: Running Java from R, and Building R Packages Wrapping a .jar”

Developing R Packages

Because of work, I developed two R packages to host the functions that I used a lot. It did bring me a lot of convenience, such as that I don’t have to start my data analysis in a particular folder and switch later on.

To do that, you need to use RStudio. Then you have to install devtools package by calling in the R console:


and load it by simply call:


And then you have to install the roxygen2 package by calling:


There are a lot of good tutorials about writing an R package. I especially like this Youtube video clip about building an R package with RStudio and roxygen2:

And Hilary Parker’s blog entry is useful as well.

On the other hand, if you are publishing your R package onto your Github repository, it would be nice to include a README file introducing your work. You need to know the Markdown language to write the file named, and put it onto the root folder of your repository. My friend, Qianli Deng, showed me this Lei Feng’s blog entry, which I found extremely useful. Markdown is remarkably simpler than LaTeX.

Continue reading “Developing R Packages”

Useful Python Packages

(Taken from, adapted from

Python is the basic programming languages if one wants to work on data nowadays. Its popularity comes with its intuitive syntax, its support of several programming paradigms, and the package numpy (Numerical Python). Yes, if you asked which package is a “must-have” outside the standard Python packages, I would certainly name numpy.

Let me list some useful packages that I have found useful:

  1. numpy: Numerical Python. Its basic data type is ndarray, which acts like a vector with vectorized calculation support. It makes Python to perform matrix calculation efficiently like MATLAB and Octave. It supports a lot of commonly used linear algebraic algorithms, such as eigenvalue problems, SVD etc. It is the basic of a lot of other Python packages that perform heavy numerical computation. It is such an important package that, in some operating systems, numpy comes with Python as well.
  2. scipy: Scientific Python. It needs numpy, but it supports also sparse matrices, special functions, statistics, numerical integration…
  3. matplotlib: Graph plotting.
  4. scikit-learn: machine learning library. It contains a number of supervised and unsupervised learning algorithms.
  5. nltk: natural language processing. It provides not only basic tools like stemmers, lemmatizers, but also some algorithms like maximum entropy, tf-idf vectorizer etc. It provides a few corpuses, and supports WordNet dictionary.
  6. gensim: another useful natural language processing package with an emphasis on topic modeling. It mainly supports Word2Vec, latent semantic indexing (LSI), and latent Dirichlet allocation (LDA). It is convenient to construct term-document matrices, and convert them to matrices in numpy or scipy.
  7. networkx: a package that supports both undirected and directed graphs. It provides basic algorithms used in graphs.
  8. sympy: Symbolic Python. I am not good at this package, but I know mathics and SageMath are both based on it.
  9. pandas: it supports data frame handling like R. (I have not used this package as I am a heavy R user.)

Of course, if you are a numerical developer, to save you a good life, install Anaconda.

There are some other packages that are useful, such as PyCluster (clustering), xlrd (Excel files read/write), PyGame (writing games)… But since I have not used them, I would rather mention it in this last paragraph, not to endorse but avoid devaluing it.

Don’t forget to type in your IPython Notebook:

import antigravity

Continue reading “Useful Python Packages”

Create a free website or blog at

Up ↑