The package shorttext has received attention for the past two months. A new release was published yesterday with the following updates:
1. Removal of the GloVe model loader, since GloVe models can be converted with a gensim script;
2. Confirmed compatibility of the package with TensorFlow;
3. Use of spaCy for tokenization, instead of nltk;
4. Use of the stemming library for Porter stemming, instead of nltk;
5. Removal of nltk dependencies;
6. Simplified directory and module structures;
7. Updated module packages.
For #1, this actually removes a bug in the previous release. Instead, users should convert their GloVe models into the Word2Vec format using the script provided by gensim.
For #3, #4, and #5, the goal is to remove all nltk dependencies, because very few functionalities of nltk were used, and it is slow. For the Porter stemmer, there is a lightweight library, stemming, that performs the task perfectly. For tokenization, the tokenizer in spaCy is significantly faster than nltk's, as shown in this Jupyter Notebook. We can do a simple test here, by first importing:
```python
import time
import shorttext
```
Then load the NIH data:
```python
nihdata = shorttext.data.nihreports()
nihtext = ' '.join(map(lambda item: ' '.join(item[1]), nihdata.items()))
```
Then find the time of using the tokenizer in nltk:
```python
from nltk import word_tokenize

nltkt0 = time.time()
tokens = word_tokenize(nihtext)
nltkt1 = time.time()
print(nltkt1 - nltkt0, 'sec')   # output: 0.0224239826202 sec
```
On the other hand, using spaCy gives:
```python
import spacy

nlp = spacy.load('en')
spt0 = time.time()
doc = nlp(nihtext)
tokens1 = [token.text for token in doc]
spt1 = time.time()
print(spt1 - spt0, 'sec')   # output: 0.00799107551575 sec
```
Clearly, spaCy is roughly three times faster on this corpus.
#6 indicates a simplification of the package structure. Previously, for example, the neural network frameworks lived in shorttext.classifiers.embed.nnlib.frameworks, but now they are in shorttext.classifiers.frameworks. The old package structure is kept for backward compatibility.
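One common way to keep such legacy paths working (a generic sketch with hypothetical module names, not shorttext's actual code) is to register the same module object under both dotted names in sys.modules:

```python
import importlib
import sys
import types

# Stand-in module at the new, flatter location (names are hypothetical).
frameworks = types.ModuleType('mypkg.classifiers.frameworks')
frameworks.build_network = lambda: 'network'

# Register the new path, then alias the old nested path to the same object.
sys.modules['mypkg.classifiers.frameworks'] = frameworks
sys.modules['mypkg.classifiers.embed.nnlib.frameworks'] = frameworks

# Imports through either path now resolve to the same module.
old = importlib.import_module('mypkg.classifiers.embed.nnlib.frameworks')
new = importlib.import_module('mypkg.classifiers.frameworks')
print(old is new)  # True
print(old.build_network())  # network
```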