Release of shorttext 0.3.3

On November 21, 2016, the Python package `shorttext` was first published. Since then, more than seven versions have been released. There has been a drastic architecture change, but the overall purpose is still the same, as summarized in the first introduction entry:

This package `shorttext` was designed to tackle all these problems… It contains the following features:

  • example data provided (including subject keywords and NIH RePORT);
  • text preprocessing;
  • pre-trained word-embedding support;
  • gensim topic models (LDA, LSI, Random Projections) and autoencoder;
  • topic model representation supported for supervised learning using scikit-learn;
  • cosine distance classification; and
  • neural network classification (including ConvNet, and C-LSTM).

And since the first version, there have been updates, as summarized in the documentation (News):

Version 0.3.3 (Apr 19, 2017)

  • Deleted CNNEmbedVecClassifier.
  • Added script ShortTextWord2VecSimilarity.

Version 0.3.2 (Mar 28, 2017)

  • Bug fixed for gensim model I/O;
  • Console scripts update;
  • Neural networks up to Keras 2 standard (refer to this).

Version 0.3.1 (Mar 14, 2017)

  • Compact model I/O: all models are in single files;
  • Implementation of stacked generalization using logistic regression.

Version 0.2.1 (Feb 23, 2017)

  • Removed attempts to load GloVe models, as this can be done using a gensim script;
  • Confirmed compatibility of the package with tensorflow;
  • Use of spacy for tokenization, instead of nltk;
  • Use of the stemming package for the Porter stemmer, instead of nltk;
  • Removal of nltk dependencies;
  • Simplifying the directory and module structures;
  • Module packages updated.

There are still additions that I would love to make, but they would not change the overall architecture. I may add some more supervised learning algorithms, but under the same framework. The upcoming big additions will be generative models or seq2seq models, but I do not see them coming in the short term. I will also add more corpora.

I may add tutorials if I have time.

I am thankful that there will probably be some external collaboration with other Python packages. Some people have already made useful contributions. I will post updates when more things are confirmed.



Generative Adversarial Networks

Recently I have been drawn to generative models, such as LDA (latent Dirichlet allocation) and other topic models. In deep learning, there are a few examples, such as FVBN (fully visible belief networks), VAE (variational autoencoder), RBM (restricted Boltzmann machine), etc. Recently I have been reading about GANs (generative adversarial networks), first published by Ian Goodfellow and his colleagues and collaborators. Goodfellow recently posted his NIPS 2016 tutorial on arXiv.

Yesterday I attended an event at George Mason University organized by the Data Science DC Meetup Group. Jennifer Sleeman talked about GANs. It was a very good talk.

In a GAN, there are two important functions: the discriminator (D) and the generator (G). As a generative model, the distribution of the training data, all labeled positive, can be thought of as the distribution that the generator is trained to produce. The discriminator distinguishes data with positive labels from data with negative labels. The generator then tries to generate data, typically from noise, that should be labeled negative, in order to fool the discriminator into classifying it as positive. This process repeats iteratively, and eventually the generator is trained to produce data close to the distribution of the training data, while the discriminator is confused into classifying the generated data as positive with probability \frac{1}{2}. The intuition behind this competitive game comes from the minimax game in game theory. The formal algorithm is described in the original paper as follows:

[Figure: the GAN training algorithm, as presented in the original paper]
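In symbols, the training procedure optimizes the two-player minimax game over the value function given in the original paper:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)} [\log D(x)] + \mathbb{E}_{z \sim p_z(z)} [\log (1 - D(G(z)))]

Here p_{\text{data}} is the distribution of the training data, and p_z is the noise prior fed into the generator.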

The original paper showed that the model is optimal when the distribution of the generated data is identical to that of the training data, arguing with the Jensen-Shannon (JS) divergence. Ferenc Huszár discussed in his blog the relations between maximum likelihood, Kullback-Leibler (KL) divergence, and Jensen-Shannon (JS) divergence.

I asked the speaker a few questions about the concepts of GANs as well.

GAN is not yet a very sophisticated framework, but it has already found a few industrial uses. Some of its descendants include LapGAN (Laplacian GAN) and DCGAN (deep convolutional GAN). Applications include voice generation, image super-resolution, pix2pix (image-to-image translation), text-to-image synthesis, iGAN (interactive GAN), etc.

“Adversarial training is the coolest thing since sliced bread.” – Yann LeCun



Author-Topic Models in gensim

Recently, gensim, a Python package for topic modeling, released a new version which includes an implementation of author-topic models.

The most famous topic model is undoubtedly latent Dirichlet allocation (LDA), as proposed by David Blei and his colleagues. Such a topic model is a generative model, described by the following directed graphical model:

[Figure: the directed graphical model of LDA]

In the graph, \alpha and \beta are hyperparameters; \theta is the topic distribution of a document, z is the topic for each word in each document, \phi is the word distribution for each topic, and w is the generated word at each position in a document.

There are models similar to LDA, such as the correlated topic model (CTM), where the topic proportions \theta are drawn not from a Dirichlet distribution but from a logistic normal distribution with a covariance matrix \Sigma, so that correlations between topics can be captured.

There is also an author model, a simpler topic model in which the words of a document are generated from its authors, as in the following graphical model, where x is the author of a given word in the document.

[Figure: the directed graphical model of the author model]

Combining these two gives the author-topic model as a hybrid, as shown below:

[Figure: the directed graphical model of the author-topic model]

The new release of the Python package gensim supports the author-topic model, as demonstrated in this Jupyter Notebook.
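As a rough sketch of the workflow, here is a minimal example assuming gensim's AuthorTopicModel class; the toy documents and author assignments below are made up for illustration, and the linked notebook is the authoritative reference:

from gensim.corpora import Dictionary
from gensim.models import AuthorTopicModel

# toy corpus: each document is a list of tokens (made up for illustration)
docs = [["topic", "model", "dirichlet", "word"],
        ["author", "document", "word", "distribution"],
        ["neural", "network", "training", "word"]]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# map each author to the indices of the documents he or she wrote
author2doc = {"alice": [0, 1], "bob": [2]}

model = AuthorTopicModel(corpus=corpus, num_topics=2,
                         id2word=dictionary, author2doc=author2doc)

# topic distribution of an author, instead of a document
print(model.get_author_topics("alice"))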

P.S.:

  • I am also aware of another topic model, the structural topic model (STM), developed for the social sciences. There is no Python package supporting it yet, but an R package, called stm, is available. You can refer to its homepage too.
  • I may consider including the author-topic model and STM in the next release of the Python package shorttext.


Python Package for Short Text Mining

There are a lot of methods for natural language processing and text mining. However, in tweets, surveys, Facebook posts, and much other online data, texts are short, providing too little data to build enough information. The traditional bag-of-words (BOW) model gives sparse vector representations.

Semantic relations between words are important, because we usually do not have enough data to capture the similarity between words. We do not want “drive” and “drives,” or “driver” and “chauffeur,” to be treated as completely different.

The relations between words, and their order, become important as well. We also want to capture concepts that may be correlated in our training dataset.

We have to represent these texts in a special way and perform supervised learning with traditional machine learning algorithms or deep learning algorithms.

This package `shorttext` was designed to tackle all these problems. It is not a completely new invention, but it puts everything known together. It contains the following features:

  • example data provided (including subject keywords and NIH RePORT);
  • text preprocessing;
  • pre-trained word-embedding support;
  • gensim topic models (LDA, LSI, Random Projections) and autoencoder;
  • topic model representation supported for supervised learning using scikit-learn;
  • cosine distance classification; and
  • neural network classification (including ConvNet, and C-LSTM).

Readers can refer to the documentation.
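To give a flavor of the word-embedding and cosine-distance classification ideas listed above, here is a standalone sketch written directly with gensim and numpy rather than the shorttext API; it assumes gensim 1.0 or later, and the path to the pre-trained Google News Word2Vec binary is a placeholder:

import numpy as np
from gensim.models import KeyedVectors

# load a pre-trained word2vec model (the file path is a placeholder)
wvmodel = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)

def avg_vector(text):
    # average the embedding vectors of the known words in a short text
    vecs = [wvmodel[w] for w in text.lower().split() if w in wvmodel]
    return np.mean(vecs, axis=0)

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# class prototypes built from a few keywords each
classes = {'physics': 'quantum particle energy',
           'biology': 'cell protein gene'}
protovecs = {label: avg_vector(keywords) for label, keywords in classes.items()}

def classify(text):
    # assign the class whose keyword prototype is most similar in cosine distance
    v = avg_vector(text)
    return max(protovecs, key=lambda label: cosine(v, protovecs[label]))

print(classify('electron scattering experiment'))  # expected: physics

Each short text is represented by the average of its word vectors, and a class is assigned by the highest cosine similarity to the class keyword prototypes; shorttext packages this kind of workflow together with preprocessing, topic models, and other classifiers.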


Combining the Best of All Worlds

There are many learning algorithms that perform classification tasks. However, very often one classifier is better on certain data points, while another is better on others. It would be nice if there were ways to combine the best of all these available classifiers.

Voting

The simplest way of combining classifiers to improve classification is democracy: voting. When there are n classifiers that output the same set of classes, the final result can simply be decided by a democratic vote. This method works quite well on many problems. Sometimes we may need to give different weights to different classifiers to improve performance.
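A minimal sketch with scikit-learn; the choice of base classifiers and the weights are arbitrary, just for illustration:

from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# three different classifiers cast votes; the weights tilt the vote toward some of them
voter = VotingClassifier(
    estimators=[('lr', LogisticRegression()),
                ('rf', RandomForestClassifier()),
                ('nb', GaussianNB())],
    voting='hard',
    weights=[2, 1, 1])
voter.fit(X, y)
print(voter.predict(X[:5]))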

Bagging and Boosting

Sometimes we can generate many classifiers from the limited amount of data available with bagging and boosting. In bagging and boosting, different classifiers are built with the same learning algorithm but with different datasets. “Bagging builds different versions of the training set by sampling with replacement,” and “boosting obtains the different training sets by focusing on the instances that are misclassified by the previously trained classifiers.” [Sesmero et al. 2015]
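A minimal scikit-learn sketch of both ideas on the same toy dataset; the base learner and the number of estimators are arbitrary choices:

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# bagging: each tree sees a bootstrap resample of the training set
bagger = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50).fit(X, y)

# boosting: later trees focus on instances misclassified by earlier ones
booster = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=50).fit(X, y)

print(bagger.score(X, y), booster.score(X, y))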

Fusion

The performance of classifiers depends not only on the learning algorithms and the data, but also on the set of features used. While feature generation itself is a bigger and more important problem (not discussed here), we do have various ways to combine different features. Sometimes we feed different feature sets into different classifiers whose answers are then combined, or we combine all the features into one classifier. The former is called late fusion, and the latter early fusion.
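A small sketch contrasting the two, where the two feature blocks are random stand-ins for, say, text features and metadata features:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
features_a = rng.randn(100, 5)   # e.g. text-derived features
features_b = rng.randn(100, 3)   # e.g. metadata features
y = rng.randint(2, size=100)

# early fusion: concatenate the feature sets and train one classifier
early = LogisticRegression().fit(np.hstack([features_a, features_b]), y)

# late fusion: train one classifier per feature set, then combine their outputs
clf_a = LogisticRegression().fit(features_a, y)
clf_b = LogisticRegression().fit(features_b, y)
late_scores = (clf_a.predict_proba(features_a) + clf_b.predict_proba(features_b)) / 2
late_pred = late_scores.argmax(axis=1)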

Stacking

We can also treat the prediction results of various classifiers as the features of another classifier. This is called stacking. [Wolpert 1992] “Stacking generates the members of the Stacking ensemble using several learning algorithms and subsequently uses another algorithm to learn how to combine their outputs.” [Sesmero et al. 2015] Some recent implementations in computational epidemiology employ stacking as well. [Russ et al. 2016]
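A minimal sketch of stacking with a logistic-regression combiner, in the spirit of the stacked generalization mentioned in the shorttext release notes above; out-of-fold predictions are used so that the combiner does not simply memorize the base learners' training fits:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = load_breast_cancer(return_X_y=True)
base_learners = [RandomForestClassifier(n_estimators=50), GaussianNB()]

# out-of-fold predicted probabilities of the base learners become the meta-features
meta_features = np.column_stack([
    cross_val_predict(clf, X, y, cv=5, method='predict_proba')[:, 1]
    for clf in base_learners])

# another algorithm (here logistic regression) learns how to combine the outputs
combiner = LogisticRegression().fit(meta_features, y)
print(combiner.score(meta_features, y))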

Hidden Topics and Embedding

There is also a special type of feature generation for a classifier: using hidden topics or embeddings as latent vectors. We can generate a set of latent topics from the available data using latent Dirichlet allocation (LDA) or correlated topic models (CTM), and describe each data point with these topics as input to another classifier. [Phan et al. 2011] Another way is to represent the data with embedding vectors (such as time-series embeddings, Word2Vec, or LDA2Vec) as input to another classifier. [Czerny 2015]
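A short scikit-learn sketch of the first idea, where the LDA topic proportions of each document become the input features of a downstream classifier; the toy corpus and the number of topics are arbitrary:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# toy labeled corpus (made up for illustration)
texts = ["the rocket launch was delayed by weather",
         "nasa announced a new mission to mars",
         "the car engine needs an oil change",
         "new tires improve the handling of the car"]
labels = [0, 0, 1, 1]   # 0 = space, 1 = autos

# word counts -> LDA topic proportions -> logistic regression on the topic features
pipeline = make_pipeline(
    CountVectorizer(),
    LatentDirichletAllocation(n_components=2),
    LogisticRegression())
pipeline.fit(texts, labels)
print(pipeline.predict(["the mission will launch a rocket"]))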


Local and Global: Words and Topics

Word2Vec has been hitting the NLP world for a while, as it is a nice method for word embeddings or word representations. Its use of the skip-gram model and neural networks made a big impact too. It has been my favorite toy indeed. However, even though words do have correlations across a small segment of text, this is still only local coherence. On the other hand, topic models such as latent Dirichlet allocation (LDA) capture the distribution of words within a topic, the distribution of topics within a document, etc., and provide a representation of a new document in terms of topics.

In my previous blog entry, I introduced Chris Moody’s LDA2Vec algorithm (see: his SlideShare). Unfortunately, not many papers or blogs have covered this new algorithm much, despite its potential. The API is not completely documented yet, although you can see examples in its source code on GitHub. Its documentation gives an example of deriving topics from an array of random numbers, in the lda2vec/lda2vec.py code:

import numpy as np
from lda2vec import LDA2Vec

# sizes of the toy problem
n_words = 10
n_docs = 15
n_hidden = 8
n_topics = 2
n_obs = 300

# a random stream of word indices and the frequency count of each word
words = np.random.randint(n_words, size=(n_obs,))
_, counts = np.unique(words, return_counts=True)

# build the model: word embeddings plus a categorical document feature carrying the topics
model = LDA2Vec(n_words, n_hidden, counts)
model.add_categorical_feature(n_docs, n_topics, name='document id')
model.finalize()

# assign each observed word to a document and fit one pass
doc_ids = np.arange(n_obs) % n_docs
loss = model.fit_partial(words, 1.0, categorical_features=doc_ids)

A more comprehensive example is in examples/twenty_newsgroup/lda.py.

Besides LDA2Vec, there is some related research work on topical word embeddings too. A group of Australian and American scientists studied topic modeling with pre-trained Word2Vec (or GloVe) embeddings before performing LDA. (See: their paper and code.) On the other hand, another group of Chinese and Singaporean scientists performs LDA first, and then trains a Word2Vec model. (See: their paper and code.) LDA2Vec, in contrast, concatenates the Word2Vec and LDA representations, like an early fusion.

No matter what, representations from LDA models (or related topic models such as correlated topic models (CTM)) can be useful even outside NLP. I have found them useful at some intermediate layer of calculation lately.

[Figure: graphical representation of the LDA model. Source: http://salsahpc.indiana.edu/b649proj/images/proj3_LDA%20structure.png]


LDA2Vec: a hybrid of LDA and Word2Vec

LDA (latent Dirichlet allocation) and Word2Vec are two important algorithms in natural language processing (NLP). LDA is a widely used topic modeling algorithm, which seeks the topic distributions in a corpus, and the corresponding word distributions within each topic, with a Dirichlet prior. Word2Vec is a vector-representation model, trained with a shallow neural network (the skip-gram or CBOW architecture), that seeks a continuous representation for words.

They are both very useful, but LDA deals with words and documents globally, while Word2Vec deals with them locally (depending on adjacent words in the training data). An LDA vector is sparse enough that users can interpret the topics easily, but it is inflexible. Word2Vec’s representation is not human-interpretable, but it is easy to use. In his slides, Chris Moody recently devised a topic modeling algorithm called LDA2Vec, which is a hybrid of the two, to get the best out of both algorithms.

Honestly, I have never used this algorithm. I rarely talk about something I have not even tried, but I want to raise awareness so that more people know about it by the time I come to use it. To me, it looks like concatenating two vectors with some hyperparameters, but the source code rejects this claim. It is a topic model algorithm.

There are not many blogs or papers talking about LDA2Vec yet. I am looking forward to learning more about it when there is more awareness.

Jupyter Notebook for LDA2Vec Demonstration [link]

Computational Journalism

We “sensed” what the hot issues were in the past (and we still often do today). The methods of “sensing,” or “detecting,” are now more sophisticated, however, as computational technologies have become more advanced. The methods involved can be collected into a field called “computational journalism.”

Recently, there was a blog post by Jeiran about understanding the public impression of Iran using computational methods. She divided the question into temporal and topical perspectives. The temporal perspective is about time-varying patterns in the number of related news articles; the topical perspective is about the distribution of various topics, using latent Dirichlet allocation (LDA) and Bayes’ theorem. The blog post is worth reading.

In February last year, a video clip was posted online of Daeil Kim, a data scientist at The New York Times, speaking at the NYC Data Science Meetup. Honestly, I still have not watched it (but I think I should have). His work is also about computational journalism, covering his own algorithm and LDA.

Of course, computational journalism is the application of natural language processing and machine learning on news articles… However, as a computational physicist has to know physics, a computational journalist has to know journalism. A data scientist has to be someone who knows the technology and the subject matter.


Useful Python Packages

[Image: the xkcd “antigravity” comic. Taken from http://latticeqcd.org/pythonorg/static/images/antigravity.png, adapted from http://xkcd.com/353/]

Python is the basic programming language if one wants to work on data nowadays. Its popularity comes from its intuitive syntax, its support of several programming paradigms, and the package numpy (Numerical Python). Yes, if you asked which package is a “must-have” outside the standard Python packages, I would certainly name numpy.

Let me list some packages that I have found useful:

  1. numpy: Numerical Python. Its basic data type is ndarray, which acts like a vector with vectorized calculation support. It lets Python perform matrix calculations efficiently, like MATLAB and Octave. It supports a lot of commonly used linear algebra algorithms, such as eigenvalue problems, SVD, etc. (see the short sketch after this list). It is the basis of a lot of other Python packages that perform heavy numerical computation. It is such an important package that, in some operating systems, numpy comes with Python as well.
  2. scipy: Scientific Python. It needs numpy, but it supports also sparse matrices, special functions, statistics, numerical integration…
  3. matplotlib: Graph plotting.
  4. scikit-learn: machine learning library. It contains a number of supervised and unsupervised learning algorithms.
  5. nltk: natural language processing. It provides not only basic tools like stemmers and lemmatizers, but also some algorithms like maximum entropy classifiers and tf-idf vectorizers. It provides a few corpora, and supports the WordNet dictionary.
  6. gensim: another useful natural language processing package with an emphasis on topic modeling. It mainly supports Word2Vec, latent semantic indexing (LSI), and latent Dirichlet allocation (LDA). It is convenient to construct term-document matrices, and convert them to matrices in numpy or scipy.
  7. networkx: a package that supports both undirected and directed graphs. It provides basic algorithms used in graphs.
  8. sympy: Symbolic Python. I am not good at this package, but I know mathics is based on it, and SageMath makes use of it.
  9. pandas: it supports data frame handling like R. (I have not used this package as I am a heavy R user.)
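As a tiny illustration of items 1 and 2 above (the matrices are arbitrary):

import numpy as np
from scipy import sparse
from scipy.linalg import eigh

# ndarray with vectorized arithmetic
a = np.array([[2.0, 1.0], [1.0, 3.0]])
v = np.array([1.0, 2.0])
print(a.dot(v))                  # matrix-vector product

# eigenvalue problem and SVD, as mentioned in item 1
eigenvalues, eigenvectors = eigh(a)
u, s, vt = np.linalg.svd(a)

# scipy sparse matrix, as mentioned in item 2
m = sparse.csr_matrix(a)
print(m.dot(v))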

Of course, if you are a numerical developer, to make your life easier, install Anaconda.

There are some other packages that are useful, such as PyCluster (clustering), xlrd (reading Excel files), PyGame (writing games)… But since I have not used them, I would rather just mention them in this last paragraph, neither to endorse nor to devalue them.

Don’t forget to type in your IPython Notebook:

import antigravity

