Short Text Mining using Advanced Keras Layers and Maxent: shorttext 0.4.1

On 07/28/2017, shorttext published its release 0.4.1, with a few important updates. To install it, type the following in the OS X / Linux command line:

>>> pip install -U shorttext

The documentation in PythonHosted.org has been abandoned. It has been migrated to readthedocs.org. (URL: http://shorttext.readthedocs.io/ or http:// shorttext.rtfd.io)

Exploiting the Word-Embedding Layer

This update is mainly due to an important update in gensim, motivated by earlier shorttext‘s effort in integrating scikit-learn and keras. And gensim also provides a keras layer, on the same footing as other neural networks, activation function, or dropout layers, for Word2Vec models. Because shorttext has been making use of keras layers for categorization, such advance in gensim in fact makes it a natural step to add an embedding layer of all neural networks provided in shorttext. How to do it? (See shorttext tutorial for “Deep Neural Networks with Word Embedding.”)

import shorttext
wvmodel = shorttext.utils.load_word2vec_model('/path/to/GoogleNews-vectors-negative300.bin.gz')   # load the pre-trained Word2Vec model
trainclassdict = shorttext.data.subjectkeywords()   # load an example data set

 

To train a model, you can do it the old way, or do it the new way with additional gensim function:

kmodel = shorttext.classifiers.frameworks.CNNWordEmbed(wvmodel=wvmodel, nb_labels=len(trainclassdict.keys()), vecsize=100, with_gensim=True)   # keras model, setting with_gensim=True
classifier = shorttext.classifiers.VarNNEmbeddedVecClassifier(wvmodel, with_gensim=True, vecsize=100)   # instantiate the classifier, setting with_gensim=True
classifier.train(trainclassdict, kmodel)

The parameters with_gensim in both CNNWordEmbed and VarNNEmbeddedVecClassifier are set to be False by default, because of backward compatibility. However, setting it to be True will enable it to use the new gensim Word2Vec layer.

These change in gensim and shorttext are the works mainly contributed by Chinmaya Pancholi, a very bright student at Indian Institute of Technology, Kharagpur, and a GSoC (Google Summer of Code) student in 2017. He revolutionized gensim by integrating scikit-learn and keras into gensim. He also used what he did in gensim to improve the pipelines of shorttext. He provided valuable technical suggestions. You can read his GSoC proposal, and his blog posts in RaRe Technologies, Inc. Chinmaya has been diligently mentored by Ivan Menshikh and Lev Konstantinovskiy of RaRe Technologies.

Maxent Classifier

Another important update is the adding of maximum entropy (maxent) classifier. (See the corresponding tutorial on “Maximum Entropy (MaxEnt) Classifier.”) I will devote a separate entry on the theory, but it is very easy to use it,

import shorttext
from shorttext.classifiers import MaxEntClassifier

classifier = MaxEntClassifier()

Use the NIHReports dataset as the example:

classdict = shorttext.data.nihreports()
classifier.train(classdict, nb_epochs=1000)

The classification is just like other classifiers provided by shorttext:

classifier.score('cancer immunology') # NCI tops the score
classifier.score('children health') # NIAID tops the score
classifier.score('Alzheimer disease and aging') # NIAID tops the score

Continue reading “Short Text Mining using Advanced Keras Layers and Maxent: shorttext 0.4.1”

NLP in Data Science MD

P.S.: this blog entry is long overdue.

On June 16, there was an event held by Data Science MD on natural language processing (NLP). The first speaker was Brian Sacash, a data scientist at Deloitte, and his talk was titled NLP and Sentiment Analysis, which is a good demonstration on the Python package nltk, and its application on sentiment analysis. His approach is knowledge-based, and its quite different from the talk given by Michael Cherny, as presented in his talk in DCNLP and his blog. (See his article.) Brian has a lot of demonstration codes in Jupyter notebook in his Github.

The second speaker was Dr. Daniel Russ, a staff scientist at National Institutes of Health (NIH) and my colleague. His talk was titled It Takes a Village To Solve A Problem in Data Science, stressing the amount of brains and powers involved in solving a data science problem in businesses. He focused on the SOCcer project, (see a previous blog post) which I am also a part of the team, and also the interaction with Apache OpenNLP project. (Slideshare: It Takes a Village To Solve A Problem in Data Science from DataScienceMD)

Continue reading “NLP in Data Science MD”

Blog at WordPress.com.

Up ↑