On November 21, 2016, the Python package `shorttext` was published. Since then, more than seven versions have been released. There has been a drastic change in architecture, but the overall purpose remains the same, as summarized in the first introduction entry:
This package `shorttext` was designed to tackle all these problems… It contains the following features:
- example data provided (including subject keywords and NIH RePORT);
- text preprocessing;
- pre-trained word-embedding support;
- gensim topic models (LDA, LSI, Random Projections) and autoencoder;
- topic model representation supported for supervised learning using scikit-learn;
- cosine distance classification; and
- neural network classification (including ConvNet, and C-LSTM).
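To make the cosine-based classification in the feature list concrete, here is a minimal sketch of the general idea, not the package's own implementation. It assumes a tiny hand-made embedding table (`TOY_VECTORS`) in place of a pre-trained word-embedding model, and the helper names (`embed`, `train`, `score`) are hypothetical. Each class is represented by the sum of the embedded vectors of its training texts, and a new text is scored against each class by cosine similarity (the complement of cosine distance).

```python
import math

# Toy 3-dimensional "embeddings" (an assumption for illustration only);
# a real setup would load pre-trained word vectors instead.
TOY_VECTORS = {
    "python": [0.9, 0.1, 0.0],
    "code": [0.8, 0.2, 0.1],
    "bug": [0.7, 0.3, 0.0],
    "goal": [0.1, 0.9, 0.1],
    "match": [0.0, 0.8, 0.2],
    "team": [0.2, 0.7, 0.1],
}
DIM = 3


def embed(text):
    """Sum the vectors of all known tokens; unknown tokens are skipped."""
    total = [0.0] * DIM
    for token in text.lower().split():
        vec = TOY_VECTORS.get(token)
        if vec is not None:
            total = [t + v for t, v in zip(total, vec)]
    return total


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def train(labelled_texts):
    """Build one summed vector per class label."""
    class_vectors = {}
    for label, texts in labelled_texts.items():
        total = [0.0] * DIM
        for text in texts:
            total = [t + e for t, e in zip(total, embed(text))]
        class_vectors[label] = total
    return class_vectors


def score(class_vectors, text):
    """Return the cosine similarity of the text to every class."""
    query = embed(text)
    return {label: cosine(query, vec) for label, vec in class_vectors.items()}


training = {
    "programming": ["python code", "code bug"],
    "sports": ["goal match", "team match"],
}
class_vectors = train(training)
scores = score(class_vectors, "python bug")
predicted = max(scores, key=scores.get)
print(predicted, scores)
```

Summing word vectors ignores word order, which is usually acceptable for very short texts; the class with the highest similarity is taken as the prediction.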
Since the first version, there have been updates, as summarized in the documentation (News):
Version 0.3.3 (Apr 19, 2017)
- Deleted CNNEmbedVecClassifier.
- Added script ShortTextWord2VecSimilarity.
Version 0.3.2 (Mar 28, 2017)
- Bug fixed for gensim model I/O;
- Console scripts update;
- Neural networks up to Keras 2 standard (refer to this).
Version 0.3.1 (Mar 14, 2017)
- Compact model I/O: all models are in single files;
- Implementation of stacked generalization using logistic regression.
Version 0.2.1 (Feb 23, 2017)
- Removed attempts to load GloVe models, as this can be done using a gensim script;
- Confirmed compatibility of the package with tensorflow;
- Use of spacy for tokenization, instead of nltk;
- Use of the stemming package for the Porter stemmer, instead of nltk;
- Removal of nltk dependencies;
- Simplifying the directory and module structures;
- Module packages updated.
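The 0.3.1 entry mentions stacked generalization using logistic regression. As a rough illustration of that technique in general (not the package's code), the sketch below treats the per-class scores of two hypothetical base classifiers as features and trains one logistic regression per class on top of them. The score values, the class labels "A" and "B", and all function names are made up for the example, and the logistic regression is a plain gradient-descent version written out so the block has no dependencies.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(features, targets, lr=0.5, epochs=2000):
    """Binary logistic regression trained with batch gradient descent."""
    n_samples = len(features)
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, targets):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            error = sigmoid(z) - y
            grad_w = [g + error * xi for g, xi in zip(grad_w, x)]
            grad_b += error
        weights = [w - lr * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n_samples
    return weights, bias


def predict_proba(model, x):
    weights, bias = model
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


# Each row concatenates the outputs of two hypothetical base classifiers:
# [clf1 score for A, clf1 score for B, clf2 score for A, clf2 score for B]
stacked_features = [
    [0.9, 0.1, 0.8, 0.2],
    [0.7, 0.3, 0.4, 0.6],    # second base classifier is wrong here
    [0.6, 0.4, 0.9, 0.1],
    [0.2, 0.8, 0.3, 0.7],
    [0.4, 0.6, 0.1, 0.9],
    [0.55, 0.45, 0.2, 0.8],  # first base classifier is wrong here
]
labels = ["A", "A", "A", "B", "B", "B"]

# Meta-level: one one-vs-rest logistic regression per class label.
meta_models = {
    label: fit_logistic(
        stacked_features, [1.0 if y == label else 0.0 for y in labels]
    )
    for label in sorted(set(labels))
}


def predict(models, base_scores):
    """Pick the class whose meta-model gives the highest probability."""
    probs = {label: predict_proba(m, base_scores) for label, m in models.items()}
    return max(probs, key=probs.get)


print(predict(meta_models, [0.8, 0.2, 0.7, 0.3]))
print(predict(meta_models, [0.3, 0.7, 0.2, 0.8]))
```

In practice the base-classifier scores used to train the meta-level would normally come from held-out data, so that the meta-model does not simply learn the base classifiers' overfitting.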
There are still additions that I would love to make, but they would not change the overall architecture. I may add some more supervised learning algorithms, but within the same framework. The next big additions will be generative models or seq2seq models, but I do not see them coming in the short term. I will add corpora.
I may add tutorials if I have time.
I am thankful that there will probably be some collaboration with other Python packages. Some people have already made useful contributions. This will be updated when more is confirmed.
- Python package: shorttext. [PyPI]
- Package documentation. [PythonHosted]
- Github: stephenhky/shorttext. [Github]
- “Short Text Categorization using Deep Neural Networks and Word-Embedding Models,” Everything About Data Analytics, WordPress (2016). [WordPress]
- “Release of shorttext 0.2.1,” Everything About Data Analytics, WordPress (2017). [WordPress]