Release of shorttext 0.3.3

On November 21, 2016, the Python package `shorttext’ was published. Until today, more than seven versions have been published. There have been a drastic architecture change, but the overall purpose is still the same, as summarized in the first introduction entry:

This package `shorttext‘ was designed to tackle all these problems… It contains the following features:

  • example data provided (including subject keywords and NIH RePORT);
  • text preprocessing;
  • pre-trained word-embedding support;
  • gensim topic models (LDA, LSI, Random Projections) and autoencoder;
  • topic model representation supported for supervised learning using scikit-learn;
  • cosine distance classification; and
  • neural network classification (including ConvNet, and C-LSTM).

And since the first version, there have been updates, as summarized in the documention (News):

Version 0.3.3 (Apr 19, 2017)

  • Deleted CNNEmbedVecClassifier.
  • Added script ShortTextWord2VecSimilarity.

Version 0.3.2 (Mar 28, 2017)

  • Bug fixed for gensim model I/O;
  • Console scripts update;
  • Neural networks up to Keras 2 standard (refer to this).

Version 0.3.1 (Mar 14, 2017)

  • Compact model I/O: all models are in single files;
  • Implementation of stacked generalization using logistic regression.

Version 0.2.1 (Feb 23, 2017)

  • Removal attempts of loading GloVe model, as it can be run using gensim script;
  • Confirmed compatibility of the package with tensorflow;
  • Use of spacy for tokenization, instead of nltk;
  • Use of stemming for Porter stemmer, instead of nltk;
  • Removal of nltk dependencies;
  • Simplifying the directory and module structures;
  • Module packages updated.

Although there are still additions that I would love to add, but it would not change the overall architecture. I may add some more supervised learning algorithms, but under the same network. The upcoming big additions will be generative models or seq2seq models, but I do not see them coming in the short term. I will add corpuses.

I may add tutorials if I have time.

I am thankful that there is probably some external collaboration with other Python packages. Some people have already made some useful contributions. It will be updated if more things are confirmed.

Continue reading “Release of shorttext 0.3.3”

Useful Python Packages

python
(Taken from http://latticeqcd.org/pythonorg/static/images/antigravity.png, adapted from http://xkcd.com/353/)

Python is the basic programming languages if one wants to work on data nowadays. Its popularity comes with its intuitive syntax, its support of several programming paradigms, and the package numpy (Numerical Python). Yes, if you asked which package is a “must-have” outside the standard Python packages, I would certainly name numpy.

Let me list some useful packages that I have found useful:

  1. numpy: Numerical Python. Its basic data type is ndarray, which acts like a vector with vectorized calculation support. It makes Python to perform matrix calculation efficiently like MATLAB and Octave. It supports a lot of commonly used linear algebraic algorithms, such as eigenvalue problems, SVD etc. It is the basic of a lot of other Python packages that perform heavy numerical computation. It is such an important package that, in some operating systems, numpy comes with Python as well.
  2. scipy: Scientific Python. It needs numpy, but it supports also sparse matrices, special functions, statistics, numerical integration…
  3. matplotlib: Graph plotting.
  4. scikit-learn: machine learning library. It contains a number of supervised and unsupervised learning algorithms.
  5. nltk: natural language processing. It provides not only basic tools like stemmers, lemmatizers, but also some algorithms like maximum entropy, tf-idf vectorizer etc. It provides a few corpuses, and supports WordNet dictionary.
  6. gensim: another useful natural language processing package with an emphasis on topic modeling. It mainly supports Word2Vec, latent semantic indexing (LSI), and latent Dirichlet allocation (LDA). It is convenient to construct term-document matrices, and convert them to matrices in numpy or scipy.
  7. networkx: a package that supports both undirected and directed graphs. It provides basic algorithms used in graphs.
  8. sympy: Symbolic Python. I am not good at this package, but I know mathics and SageMath are both based on it.
  9. pandas: it supports data frame handling like R. (I have not used this package as I am a heavy R user.)

Of course, if you are a numerical developer, to save you a good life, install Anaconda.

There are some other packages that are useful, such as PyCluster (clustering), xlrd (Excel files read/write), PyGame (writing games)… But since I have not used them, I would rather mention it in this last paragraph, not to endorse but avoid devaluing it.

Don’t forget to type in your IPython Notebook:

import antigravity

Continue reading “Useful Python Packages”

Create a free website or blog at WordPress.com.

Up ↑