Release of shorttext 0.2.1

The package shorttext has received attention over the past two months. A new version was released yesterday with the following updates:

  1. Removal of the attempt to load GloVe models directly, as they can be converted using a gensim script;
  2. Confirmed compatibility of the package with TensorFlow;
  3. Use of spaCy for tokenization, instead of nltk;
  4. Use of the stemming package for the Porter stemmer, instead of nltk;
  5. Removal of nltk dependencies;
  6. Simplification of the directory and module structures;
  7. Module packages updated.

For #1, this actually removes a bug in the previous release. Users should instead convert GloVe models into the Word2Vec format using the script provided by gensim.
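To make the conversion less mysterious: the GloVe text format and the Word2Vec text format differ only in that the latter starts with a header line giving the number of vectors and their dimension. The script gensim ships for this is gensim.scripts.glove2word2vec. The sketch below is not that script, only a standard-library illustration of the same idea; the function name glove_to_word2vec_text and the toy file contents are made up for this example.

```python
import io
import os
import tempfile


def glove_to_word2vec_text(glove_path, word2vec_path):
    """Copy a GloVe text file, prepending the 'count dim' header line."""
    # First pass: count the vectors and read the dimension off the first line.
    num_vectors = 0
    dim = 0
    with io.open(glove_path, encoding='utf-8') as fin:
        for line in fin:
            if num_vectors == 0:
                dim = len(line.rstrip().split(' ')) - 1
            num_vectors += 1
    # Second pass: write the header, then the vectors unchanged.
    with io.open(word2vec_path, 'w', encoding='utf-8') as fout:
        fout.write(u'{} {}\n'.format(num_vectors, dim))
        with io.open(glove_path, encoding='utf-8') as fin:
            for line in fin:
                fout.write(line)
    return num_vectors, dim


# Toy input (made up): two 3-dimensional vectors in GloVe text format.
tmpdir = tempfile.mkdtemp()
glove_file = os.path.join(tmpdir, 'toy_glove.txt')
w2v_file = os.path.join(tmpdir, 'toy_word2vec.txt')
with io.open(glove_file, 'w', encoding='utf-8') as f:
    f.write(u'cat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n')

result = glove_to_word2vec_text(glove_file, w2v_file)
print(result)
```

In practice, the gensim script should be preferred; the converted file can then be loaded like any other Word2Vec text model.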

#3, #4, and #5 together remove every nltk dependency, because very few functionalities of nltk were used, and it is slow. For the Porter stemmer, there is a lightweight library, stemming, that performs the task perfectly. For tokenization, the tokenizer in spaCy is significantly faster than the one in nltk, as shown in this Jupyter Notebook. We can do a simple test here, by first importing:

import time
import shorttext

Then load the NIH data:

nihdata = shorttext.data.nihreports()
nihtext = ' '.join(map(lambda item: ' '.join(item[1]), nihdata.items()))

Then find the time of using the tokenizer in nltk:

from nltk import word_tokenize

nltkt0 = time.time()
tokens = word_tokenize(nihtext)
nltkt1 = time.time()
print nltkt1-nltkt0, ' sec'   # output: 0.0224239826202 sec

On the other hand, using spaCy gives:

import spacy
nlp = spacy.load('en')

spt0 = time.time()
doc = nlp(unicode(nihtext))
tokens1 = [token for token in doc]
tokens1 = map(str, tokens1)
spt1 = time.time()

print spt1-spt0, ' sec'   # output: 0.00799107551575 sec

In this run, spaCy is almost three times faster (about 0.008 sec versus 0.022 sec).
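A single measurement of a few milliseconds can be noisy, so it may be worth repeating each run and keeping the fastest time. The following is a minimal sketch of that idea using only the standard library. The two tokenizers in it are simple stand-ins (not nltk or spaCy) so that the snippet runs anywhere, and best_time is a helper name invented here; to compare the real tokenizers, pass word_tokenize or a wrapper around nlp instead.

```python
import re
import timeit


def best_time(tokenize, text, repeats=5):
    """Return the fastest wall-clock time in seconds over several runs."""
    timer = timeit.Timer(lambda: tokenize(text))
    return min(timer.repeat(repeat=repeats, number=1))


# Stand-in tokenizers, used only to keep this sketch self-contained.
def whitespace_tokenize(text):
    return text.split()


def regex_tokenize(text):
    # Words, or single punctuation characters.
    return re.findall(r"\w+|[^\w\s]", text)


sample = 'Short text mining is fun, is it not? ' * 1000
t_ws = best_time(whitespace_tokenize, sample)
t_re = best_time(regex_tokenize, sample)
print('whitespace: %.6f sec, regex: %.6f sec' % (t_ws, t_re))
```

Taking the minimum rather than the mean reduces the influence of other processes running on the machine.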

#6 indicates a simplification of the package structure. Previously, for example, the neural network framework was in shorttext.classifiers.embed.nnlib.frameworks, but now it is in shorttext.classifiers.frameworks. The old package structure is kept for backward compatibility.
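One common way to keep an old import path alive after such a move is to leave a small shim module at the old location that re-exports everything from the new one. The sketch below illustrates that pattern with a throwaway package written to a temporary directory. The package demo_pkg and its function build are hypothetical, and this is not necessarily how shorttext implements its compatibility layer.

```python
import importlib
import os
import sys
import tempfile

root = tempfile.mkdtemp()


def write(relpath, source):
    """Create a file (and its parent directories) under the temp root."""
    path = os.path.join(root, relpath)
    directory = os.path.dirname(path)
    if not os.path.isdir(directory):
        os.makedirs(directory)
    with open(path, 'w') as f:
        f.write(source)


# New, flat location: holds the real code.
write('demo_pkg/__init__.py', '')
write('demo_pkg/frameworks.py', 'def build():\n    return "network"\n')

# Old, nested location: a one-line shim that re-exports the new module.
write('demo_pkg/embed/__init__.py', '')
write('demo_pkg/embed/nnlib/__init__.py', '')
write('demo_pkg/embed/nnlib/frameworks.py',
      'from demo_pkg.frameworks import *\n')

sys.path.insert(0, root)
new = importlib.import_module('demo_pkg.frameworks')
old = importlib.import_module('demo_pkg.embed.nnlib.frameworks')

# Both paths expose the very same function object.
print(old.build is new.build)
```

With this layout, existing code that imports from the old path keeps working, while new code can use the shorter path.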

  • Comparing the tokenizer in spaCy and nltk. [Jupyter]
  • “Python Package for Short Text Mining,” Everything About Data Analytics, WordPress (2016). [WordPress]
  • Python package: shorttext. [PyPI]
  • Package documentation. [PythonHosted]
  • GitHub: stephenhky/shorttext. [GitHub]
  • “Short Text Categorization using Deep Neural Networks and Word-Embedding Models,” Everything About Data Analytics, WordPress (2016). [WordPress]