Short Text Mining using Advanced Keras Layers and Maxent: shorttext 0.4.1

On 07/28/2017, shorttext published its release 0.4.1, with a few important updates. To install it, type the following in the OS X / Linux command line:

>>> pip install -U shorttext

The documentation in PythonHosted.org has been abandoned. It has been migrated to readthedocs.org. (URL: http://shorttext.readthedocs.io/ or http:// shorttext.rtfd.io)

Exploiting the Word-Embedding Layer

This update is mainly due to an important update in gensim, motivated by earlier shorttext‘s effort in integrating scikit-learn and keras. And gensim also provides a keras layer, on the same footing as other neural networks, activation function, or dropout layers, for Word2Vec models. Because shorttext has been making use of keras layers for categorization, such advance in gensim in fact makes it a natural step to add an embedding layer of all neural networks provided in shorttext. How to do it? (See shorttext tutorial for “Deep Neural Networks with Word Embedding.”)

import shorttext
wvmodel = shorttext.utils.load_word2vec_model('/path/to/GoogleNews-vectors-negative300.bin.gz')   # load the pre-trained Word2Vec model
trainclassdict = shorttext.data.subjectkeywords()   # load an example data set

 

To train a model, you can do it the old way, or do it the new way with additional gensim function:

kmodel = shorttext.classifiers.frameworks.CNNWordEmbed(wvmodel=wvmodel, nb_labels=len(trainclassdict.keys()), vecsize=100, with_gensim=True)   # keras model, setting with_gensim=True
classifier = shorttext.classifiers.VarNNEmbeddedVecClassifier(wvmodel, with_gensim=True, vecsize=100)   # instantiate the classifier, setting with_gensim=True
classifier.train(trainclassdict, kmodel)

The parameters with_gensim in both CNNWordEmbed and VarNNEmbeddedVecClassifier are set to be False by default, because of backward compatibility. However, setting it to be True will enable it to use the new gensim Word2Vec layer.

These change in gensim and shorttext are the works mainly contributed by Chinmaya Pancholi, a very bright student at Indian Institute of Technology, Kharagpur, and a GSoC (Google Summer of Code) student in 2017. He revolutionized gensim by integrating scikit-learn and keras into gensim. He also used what he did in gensim to improve the pipelines of shorttext. He provided valuable technical suggestions. You can read his GSoC proposal, and his blog posts in RaRe Technologies, Inc. Chinmaya has been diligently mentored by Ivan Menshikh and Lev Konstantinovskiy of RaRe Technologies.

Maxent Classifier

Another important update is the adding of maximum entropy (maxent) classifier. (See the corresponding tutorial on “Maximum Entropy (MaxEnt) Classifier.”) I will devote a separate entry on the theory, but it is very easy to use it,

import shorttext
from shorttext.classifiers import MaxEntClassifier

classifier = MaxEntClassifier()

Use the NIHReports dataset as the example:

classdict = shorttext.data.nihreports()
classifier.train(classdict, nb_epochs=1000)

The classification is just like other classifiers provided by shorttext:

classifier.score('cancer immunology') # NCI tops the score
classifier.score('children health') # NIAID tops the score
classifier.score('Alzheimer disease and aging') # NIAID tops the score

Continue reading “Short Text Mining using Advanced Keras Layers and Maxent: shorttext 0.4.1”

NSFW Image Classification

At the end of last month, Yahoo opened the sources of training a model to classify not suitable/safe for work (NSFW) images, particularly pornographic images, using convolutional neural network (CNN). It was implemented with Caffe. Users need to supply the training data, positive being the NSFW images, and negative being the suitable/safe for work (SFW) images, to train the model. The model takes an image as the input, and output a score between 0 and 1.

The codes are available on Github, with the README.md about the installation.

tumblr_inline_oebl0inwrm1rilvr1_500

Continue reading “NSFW Image Classification”

Dream of Automation

It is a fantasy for a lot of entrepreneurs, scientists and engineers to develop a software project that can automatically perform feature generation, training, and prediction automatically.

Of course it is a wishful thinking. There is no free lunch.

In big companies that have abundant resources (training data, brains, clusters), they can probably so something like deep learning to get the relevant features, and build classification models. It is almost automatic. It virtually takes no manual addition of human knowledge. Some scientists and engineers are enjoying the strength of word2vec, but it takes a lot of computer resources to even train a word2vec model.

If we do not have enough training data or computing resources, to get a good classifier, we ought to add human knowledge to generate features. We might even need to impose some rules to convert the raw data to sensible features. The rules might be regular expressions, or some calculations, or some filters, or it involves a knowledge database (like WordNet). Things might be simplified if the problem we are dealing with is in a specific domain, that reduces the amount of human knowledge we need to add.

Blog at WordPress.com.

Up ↑