Release of shorttext 0.2.1

The package shorttext has received attention over the past two months. A new version was released yesterday with the following updates:

  1. Removal of the attempt to load GloVe models directly, as they can be converted using a gensim script;
  2. Confirmed compatibility of the package with TensorFlow;
  3. Use of spaCy for tokenization, instead of nltk;
  4. Use of the stemming package for the Porter stemmer, instead of nltk;
  5. Removal of nltk dependencies;
  6. Simplification of the directory and module structures;
  7. Updated module packages.

For #1, the change actually removes a bug in the previous release. Users should instead convert GloVe models into Word2Vec format using the script provided by gensim.
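
To make the conversion less of a black box: the GloVe text format differs from the Word2Vec text format mainly by a missing header line that gives the number of vectors and their dimension. The sketch below only illustrates that idea with the standard library; the function name glove_to_word2vec_text and the toy file are hypothetical, the code is written for Python 3, and in practice the gensim script (gensim.scripts.glove2word2vec) is the tool to use.

```python
import os
import tempfile


def glove_to_word2vec_text(glove_path, word2vec_path):
    """Copy a GloVe text file and prepend the '<count> <dim>' header line.

    Hypothetical helper for illustration; it is not part of shorttext or gensim.
    """
    count, dim = 0, None
    # First pass: count the vectors and read the dimension from the first line.
    with open(glove_path, encoding="utf-8") as fin:
        for line in fin:
            if dim is None:
                dim = len(line.rstrip().split(" ")) - 1
            count += 1
    if count == 0:
        raise ValueError("empty GloVe file")
    # Second pass: write the header, then copy the vectors unchanged.
    with open(glove_path, encoding="utf-8") as fin, \
            open(word2vec_path, "w", encoding="utf-8") as fout:
        fout.write("%d %d\n" % (count, dim))
        for line in fin:
            fout.write(line)
    return count, dim


# Tiny demonstration with a made-up two-word, three-dimensional file.
with tempfile.TemporaryDirectory() as tmpdir:
    glove_file = os.path.join(tmpdir, "toy_glove.txt")
    w2v_file = os.path.join(tmpdir, "toy_word2vec.txt")
    with open(glove_file, "w", encoding="utf-8") as f:
        f.write("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
    count, dim = glove_to_word2vec_text(glove_file, w2v_file)
    with open(w2v_file, encoding="utf-8") as f:
        converted = f.read().splitlines()

print(converted[0])
```

The two-pass design avoids holding the whole file in memory, which matters because real GloVe files can be several gigabytes.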

Items #3, #4, and #5 together remove all nltk dependencies, because very few functionalities of nltk were used, and it is slow. For the Porter stemmer, the lightweight library stemming performs the task well. For tokenization, the tokenizer in spaCy is significantly faster than the one in nltk, as shown in this Jupyter Notebook. We can do a simple test here, by first importing:

import time
import shorttext

Then load the NIH data:

nihdata =
nihtext = ' '.join(map(lambda item: ' '.join(item[1]), nihdata.items()))

Then find the time of using the tokenizer in nltk:

from nltk import word_tokenize

nltkt0 = time.time()
tokens = word_tokenize(nihtext)
nltkt1 = time.time()
print nltkt1-nltkt0, ' sec'   # output: 0.0224239826202 sec

On the other hand, using spaCy gives:

import spacy
nlp = spacy.load('en')

spt0 = time.time()
doc = nlp(unicode(nihtext))
tokens1 = [token for token in doc]
tokens1 = map(str, tokens1)
spt1 = time.time()

print spt1-spt0, ' sec'   # output: 0.00799107551575 sec

In this test, spaCy is about 2.8 times faster.
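
A single timed run like the ones above can be noisy, so it may be worth repeating each measurement and keeping the fastest run. The following is a minimal sketch of such a helper, written for Python 3 and using only the standard library. The name best_time is hypothetical, and the two tokenizers are simple stand-ins chosen so that the example runs without nltk or spaCy installed; they are not the tokenizers compared above.

```python
import re
import time


def best_time(tokenize, text, repeats=5):
    """Return the fastest wall-clock time (in seconds) over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        tokenize(text)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best


def whitespace_tokenize(text):
    # Stand-in tokenizer: split on whitespace only.
    return text.split()


def regex_tokenize(text):
    # Stand-in tokenizer: words and single punctuation marks as separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)


sample = "The quick brown fox jumps over the lazy dog. " * 2000
t_split = best_time(whitespace_tokenize, sample)
t_regex = best_time(regex_tokenize, sample)
print("whitespace: %.6f sec" % t_split)
print("regex:      %.6f sec" % t_regex)
```

Any callable that takes a string can be passed in, so the nltk and spaCy tokenizers could be timed the same way by wrapping them in small functions.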

#6 is a simplification of the package structure. Previously, for example, the neural network framework was in shorttext.classifiers.embed.nnlib.frameworks; it is now in shorttext.classifiers.frameworks. The old package structure is kept for backward compatibility.



Generative Adversarial Networks

Recently I have been drawn to generative models, such as LDA (latent Dirichlet allocation) and other topic models. In deep learning, there are a few examples, such as FVBN (fully visible belief networks), VAE (variational autoencoder), RBM (restricted Boltzmann machine), etc. Lately I have been reading about GAN (generative adversarial networks), first published by Ian Goodfellow and his colleagues and collaborators. Goodfellow recently posted his NIPS 2016 talk on arXiv.

Yesterday I attended an event at George Mason University organized by the Data Science DC Meetup Group. Jennifer Sleeman talked about GAN. It was a very good talk.

In GAN, there are two important functions, namely the discriminator (D) and the generator (G). As a generative model, GAN treats the distribution of the training data, all labeled positive, as the distribution that the generator is trained to produce. The discriminator distinguishes data with positive labels from data with negative labels. The generator then tries to generate data, typically from noise, that should be labeled negative but that fool the discriminator into seeing them as positive. This process repeats iteratively; eventually the generator is trained to produce data close to the distribution of the training data, and the discriminator is so confused that it classifies generated data as positive with probability \frac{1}{2}. The intuition of this competitive game comes from the minimax game in game theory. The formal algorithm is described in the original paper.
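
For reference, the competitive game can be written as the value function given in the original paper, where p_{\mathrm{data}} is the distribution of the training data, p_z is the noise distribution fed to the generator, and p_g is the distribution of the generated data. The second line is the optimal discriminator for a fixed generator, which reduces to \frac{1}{2} when p_g = p_{\mathrm{data}}:

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)} \left[ \log D(x) \right]
  + \mathbb{E}_{z \sim p_z(z)} \left[ \log \left( 1 - D(G(z)) \right) \right]

D^*_G(x) = \frac{p_{\mathrm{data}}(x)}{p_{\mathrm{data}}(x) + p_g(x)}
```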


The original paper showed that the model is optimal when the distribution of the generated data is identical to that of the training data, and argued this using the Jensen-Shannon (JS) divergence. Ferenc Huszár discussed on his blog the relations between maximum likelihood, Kullback-Leibler (KL) divergence, and Jensen-Shannon (JS) divergence.
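
The role of the JS divergence can be checked numerically on a toy example. With the optimal discriminator plugged in, the value of the game equals -\log 4 + 2 \, \mathrm{JS}(p_{\mathrm{data}} \| p_g), which is smallest when the two distributions coincide. The sketch below, in Python 3 with only the standard library, verifies this identity for small discrete distributions; the function names and the toy probabilities are made up for illustration, and all probabilities are assumed to be strictly positive.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions given as lists (natural log)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def optimal_discriminator(p_data, p_g):
    """D*(x) = p_data(x) / (p_data(x) + p_g(x)), assuming no zero entries."""
    return [pd / (pd + pg) for pd, pg in zip(p_data, p_g)]


def value_at_optimal_d(p_data, p_g):
    """V(D*, G) = E_data[log D*(x)] + E_g[log(1 - D*(x))]."""
    d = optimal_discriminator(p_data, p_g)
    real_term = sum(pd * math.log(di) for pd, di in zip(p_data, d))
    fake_term = sum(pg * math.log(1 - di) for pg, di in zip(p_g, d))
    return real_term + fake_term


# Made-up toy distributions over three outcomes.
p_data = [0.5, 0.3, 0.2]
p_g = [0.2, 0.3, 0.5]

lhs = value_at_optimal_d(p_data, p_g)
rhs = -math.log(4) + 2 * js(p_data, p_g)
print(lhs, rhs)

# When the generator matches the data, D* is 1/2 everywhere and V = -log 4.
matched_d = optimal_discriminator(p_data, p_data)
matched_value = value_at_optimal_d(p_data, p_data)
print(matched_d, matched_value)
```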

I also asked the speaker a few questions about the concepts of GAN.

GAN is not yet a very sophisticated framework, but it has already found a few industrial uses. Some of its descendants include LapGAN (Laplacian GAN) and DCGAN (deep convolutional GAN). Applications include voice generation, image super-resolution, pix2pix (image-to-image translation), text-to-image synthesis, iGAN (interactive GAN), etc.

“Adversarial training is the coolest thing since sliced bread.” – Yann LeCun


