People have been upset about the prevalence of fake news since the election season last year. Election has been a year, but fake news is still around because the society is still politically charged. Some tech companies vowed to fight against fake news, but, easy to imagine, this is a tough task.

On Aug 9, 2017, Data Science DC held an event titled “Fake News as a Data Science Challenge, ” spoken by Professor Jen Golbeck from University of Maryland. It is an interesting talk.

Fake news itself is a big problem. It has philosophical, social, political, or psychological aspects, but Prof. Golbeck focused on its data science aspect. But to make it a computational problem, a clear and succinct definition of “fake news” has to be present, but it is already challenging. Some “fake news” is pun intended, or sarcasm, or jokes (like The Onion). Some misinformation is shared through Twitter or Facebook not because of deceiving purpose. Then a line to draw is difficult. But the undoubtable part is that we want to fight against news with malicious intent.

To fight fake news, as Prof. Golbeck has pointed out, there are three main tasks:

1. detecting the content;
2. detecting the source; and
3. modifying the intent.

Statistical tools can be exploited too. She talked about Benford’s law, which states that, in naturally occurring systems, the frequency of numbers’ first digits is not evenly distributed. Anomaly in the distribution of some news can be used as a first step of fraud detection. (Read her paper.)

There are also efforts, Fake News Challenge for example, in building corpus for fake news, for further machine learning model building.

However, I am not sure fighting fake news is enough. Many Americans are not simply concerned by the prevalence of fake news, but also the narration because of our ideological bias. Sometimes we are not satisfied because we think the news is not “neutral” enough, or, it does not fit our worldview.

The slides can be found here, and the video of the talk can be found here.

On 07/28/2017, shorttext published its release 0.4.1, with a few important updates. To install it, type the following in the OS X / Linux command line:

>>> pip install -U shorttext

The documentation in PythonHosted.org has been abandoned. It has been migrated to readthedocs.org. (URL: http://shorttext.readthedocs.io/ or http:// shorttext.rtfd.io)

## Exploiting the Word-Embedding Layer

This update is mainly due to an important update in gensim, motivated by earlier shorttext‘s effort in integrating scikit-learn and keras. And gensim also provides a keras layer, on the same footing as other neural networks, activation function, or dropout layers, for Word2Vec models. Because shorttext has been making use of keras layers for categorization, such advance in gensim in fact makes it a natural step to add an embedding layer of all neural networks provided in shorttext. How to do it? (See shorttext tutorial for “Deep Neural Networks with Word Embedding.”)

import shorttext
trainclassdict = shorttext.data.subjectkeywords() &nbsp; # load an example data set


To train a model, you can do it the old way, or do it the new way with additional gensim function:

kmodel = shorttext.classifiers.frameworks.CNNWordEmbed(wvmodel=wvmodel, nb_labels=len(trainclassdict.keys()), vecsize=100, with_gensim=True) &nbsp; # keras model, setting with_gensim=True
classifier = shorttext.classifiers.VarNNEmbeddedVecClassifier(wvmodel, with_gensim=True, vecsize=100) &nbsp; # instantiate the classifier, setting with_gensim=True
classifier.train(trainclassdict, kmodel)


The parameters with_gensim in both CNNWordEmbed and VarNNEmbeddedVecClassifier are set to be False by default, because of backward compatibility. However, setting it to be True will enable it to use the new gensim Word2Vec layer.

These change in gensim and shorttext are the works mainly contributed by Chinmaya Pancholi, a very bright student at Indian Institute of Technology, Kharagpur, and a GSoC (Google Summer of Code) student in 2017. He revolutionized gensim by integrating scikit-learn and keras into gensim. He also used what he did in gensim to improve the pipelines of shorttext. He provided valuable technical suggestions. You can read his GSoC proposal, and his blog posts in RaRe Technologies, Inc. Chinmaya has been diligently mentored by Ivan Menshikh and Lev Konstantinovskiy of RaRe Technologies.

## Maxent Classifier

Another important update is the adding of maximum entropy (maxent) classifier. (See the corresponding tutorial on “Maximum Entropy (MaxEnt) Classifier.”) I will devote a separate entry on the theory, but it is very easy to use it,

import shorttext
from shorttext.classifiers import MaxEntClassifier

classifier = MaxEntClassifier()


Use the NIHReports dataset as the example:

classdict = shorttext.data.nihreports()
classifier.train(classdict, nb_epochs=1000)


The classification is just like other classifiers provided by shorttext:

classifier.score('cancer immunology') # NCI tops the score
classifier.score('children health') # NIAID tops the score
classifier.score('Alzheimer disease and aging') # NIAID tops the score


P.S.: this blog entry is long overdue.

On June 16, there was an event held by Data Science MD on natural language processing (NLP). The first speaker was Brian Sacash, a data scientist at Deloitte, and his talk was titled NLP and Sentiment Analysis, which is a good demonstration on the Python package nltk, and its application on sentiment analysis. His approach is knowledge-based, and its quite different from the talk given by Michael Cherny, as presented in his talk in DCNLP and his blog. (See his article.) Brian has a lot of demonstration codes in Jupyter notebook in his Github.

The second speaker was Dr. Daniel Russ, a staff scientist at National Institutes of Health (NIH) and my colleague. His talk was titled It Takes a Village To Solve A Problem in Data Science, stressing the amount of brains and powers involved in solving a data science problem in businesses. He focused on the SOCcer project, (see a previous blog post) which I am also a part of the team, and also the interaction with Apache OpenNLP project. (Slideshare: It Takes a Village To Solve A Problem in Data Science from DataScienceMD)

A preprint on arXiv recently caught a lot of attentions. While deep learning is successful in various types of neural networks, it had not been so for feed-forward neural networks. The authors of this paper proposed normalizing the network with a new activation function, called “selu” (scaled exponential linear units):

$\text{selu}(x) =\lambda \left\{ \begin{array}{cc} x & \text{if } x>0 \\ \alpha e^x - \alpha & \text{if } x \leq 0 \end{array} \right.$.

which is an improvement to the existing “elu” function.

Despite this achievement, what caught the eyeballs is not the activation function, but the 93-page appendix of mathematical proof:

And this is one of the pages in the appendix:

Some scholars teased at it on Twitter too:

Embedding algorithms, especially word-embedding algorithms, have been one of the recurrent themes of this blog. Word2Vec has been mentioned in a few entries (see this); LDA2Vec has been covered (see this); the mathematical principle of GloVe has been elaborated (see this); I haven’t even covered Facebook’s fasttext; and I have not explained the widely used t-SNE and Kohonen’s map (self-organizing map, SOM).

I have also described the algorithm of Sammon Embedding, (see this) which attempts to capture the likeliness of pairwise Euclidean distances, and I implemented it using Theano. This blog entry is about its implementation in Tensorflow as a demonstration.

Let’s recall the formalism of Sammon Embedding, as outlined in the previous entry:

Assume there are high dimensional data described by $d$-dimensional vectors, $X_i$ where $i=1, 2, \ldots, N$. And they will be mapped into vectors $Y_i$, with dimensions 2 or 3. Denote the distances to be $d_{ij}^{*} = \sqrt{| X_i - X_j|^2}$ and $d_{ij} = \sqrt{| Y_i - Y_j|^2}$. In this problem, $Y_i$ are the variables to be learned. The cost function to minimize is

$E = \frac{1}{c} \sum_{i,

where $c = \sum_{i.

Unlike in previous entry and original paper, I am going to optimize it using first-order gradient optimizer. If you are not familiar with Tensorflow, take a look at some online articles, for example, “Tensorflow demystified.” This demonstration can be found in this Jupyter Notebook in Github.

First of all, import all the libraries required:

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf


Like previously, we want to use the points clustered around at the four nodes of a tetrahedron as an illustration, which is expected to give equidistant clusters. We sample points around them, as shown:

tetrahedron_points = [np.array([0., 0., 0.]), np.array([1., 0., 0.]), np.array([np.cos(np.pi/3), np.sin(np.pi/3), 0.]), np.array([0.5, 0.5/np.sqrt(3), np.sqrt(2./3.)])]

sampled_points = np.concatenate([np.random.multivariate_normal(point, np.eye(3)*0.0001, 10) for point in tetrahedron_points])

init_points = np.concatenate([np.random.multivariate_normal(point[:2], np.eye(2)*0.0001, 10) for point in tetrahedron_points])


Retrieve the number of points, N, and the resulting dimension, d:

N = sampled_points.shape[0]
d = sampled_points.shape[1]


One of the most challenging technical difficulties is to calculate the pairwise distance. Inspired by this StackOverflow thread and Travis Hoppe’s entry on Thomson’s problem, we know it can be computed. Assuming Einstein’s convention of summation over repeated indices, given vectors $a_{ik}$, the distance matrix is:

$D_{ij} = (a_{ik}-a_{jk}) (a_{ik} - a_{jk})^T = a_{ik} a_{ik} + a_{jk} a_{jk} - 2 a_{ik} a_{jk}$,

where the first and last terms are simply the norms of the vectors. After computing the matrix, we will flatten it to vectors, for technical reasons omitted to avoid gradient overflow:

X = tf.placeholder('float')
Xshape = tf.shape(X)

sqX = tf.reduce_sum(X*X, 1)
sqX = tf.reshape(sqX, [-1, 1])
sqDX = sqX - 2*tf.matmul(X, tf.transpose(X)) + tf.transpose(sqX)
sqDXarray = tf.stack([sqDX[i, j] for i in range(N) for j in range(i+1, N)])
DXarray = tf.sqrt(sqDXarray)

Y = tf.Variable(init_points, dtype='float')
sqY = tf.reduce_sum(Y*Y, 1)
sqY = tf.reshape(sqY, [-1, 1])
sqDY = sqY - 2*tf.matmul(Y, tf.transpose(Y)) + tf.transpose(sqY)
sqDYarray = tf.stack([sqDY[i, j] for i in range(N) for j in range(i+1, N)])
DYarray = tf.sqrt(sqDYarray)


And DXarray and DYarray are the vectorized pairwise distances. Then we defined the cost function according to the definition:

Z = tf.reduce_sum(DXarray)*0.5
numerator = tf.reduce_sum(tf.divide(tf.square(DXarray-DYarray), DXarray))*0.5
cost = tf.divide(numerator, Z)


update_rule = tf.assign(Y, Y-0.01*grad_cost/lapl_cost)
init = tf.global_variables_initializer()


The last line initializes all variables in the Tensorflow session when it is run. Then start a Tensorflow session, and initialize all variables globally:

sess = tf.Session()
sess.run(init)


Then run the algorithm:

nbsteps = 1000
c = sess.run(cost, feed_dict={X: sampled_points})
print "epoch: ", -1, " cost = ", c
for i in range(nbsteps):
sess.run(train, feed_dict={X: sampled_points})
c = sess.run(cost, feed_dict={X: sampled_points})
print "epoch: ", i, " cost =


Then extract the points and close the Tensorflow session:

calculated_Y = sess.run(Y, feed_dict={X: sampled_points})
sess.close()


Plot it using matplotlib:

embed1, embed2 = calculated_Y.transpose()
plt.plot(embed1, embed2, 'ro')


This gives, as expected,

This code for Sammon Embedding has been incorporated into the Python package mogu, which is a collection of numerical routines. You can install it, and call:

from mogu.embed import sammon_embedding
calculated_Y = sammon_embedding(sampled_points, init_points)


In these few days, Facebook published a new research paper, regarding the use of sequence to sequence (seq2seq) model for machine translation. What is special about this seq2seq model is that it uses convolutional neural networks (ConvNet, or CNN), instead of recurrent neural networks (RNN).

The original seq2seq model is implemented with Long Short-Term Memory (LSTM) model, published by Google.(see their paper) It is basically a character-based model that generates texts according to a sequence of input characters. And the same author constructed a neural conversational model, (see their paper) as mentioned in a previous blog post. Daewoo Chong, from Booz Allen Hamilton, presented its implementation using Tensorflow in DC Data Education Meetup on April 13, 2017. Johns Hopkins also published a spell correction algorithm implemented in seq2seq. (see their paper) The real advantage of RNN over CNN is that there is no limit about the size of the tokens input or output.

While the fixing of the size of vectors for CNN is obvious, using CNN serves the purpose of limiting the size of input vectors, and thus limiting the size of contexts. This limits the contents, and speeds up the training process. RNN is known to be trained slow. Facebook uses this CNN seq2seq model for their machine translation model. For more details, take a look at their paper and their Github repository.

On November 21, 2016, the Python package shorttext’ was published. Until today, more than seven versions have been published. There have been a drastic architecture change, but the overall purpose is still the same, as summarized in the first introduction entry:

This package shorttext‘ was designed to tackle all these problems… It contains the following features:

• example data provided (including subject keywords and NIH RePORT);
• text preprocessing;
• pre-trained word-embedding support;
• gensim topic models (LDA, LSI, Random Projections) and autoencoder;
• topic model representation supported for supervised learning using scikit-learn;
• cosine distance classification; and
• neural network classification (including ConvNet, and C-LSTM).

And since the first version, there have been updates, as summarized in the documention (News):

## Version 0.3.3 (Apr 19, 2017)

• Deleted CNNEmbedVecClassifier.

## Version 0.3.2 (Mar 28, 2017)

• Bug fixed for gensim model I/O;
• Console scripts update;
• Neural networks up to Keras 2 standard (refer to this).

## Version 0.3.1 (Mar 14, 2017)

• Compact model I/O: all models are in single files;
• Implementation of stacked generalization using logistic regression.

## Version 0.2.1 (Feb 23, 2017)

• Removal attempts of loading GloVe model, as it can be run using gensim script;
• Confirmed compatibility of the package with tensorflow;
• Use of spacy for tokenization, instead of nltk;
• Use of stemming for Porter stemmer, instead of nltk;
• Removal of nltk dependencies;
• Simplifying the directory and module structures;
• Module packages updated.

Although there are still additions that I would love to add, but it would not change the overall architecture. I may add some more supervised learning algorithms, but under the same network. The upcoming big additions will be generative models or seq2seq models, but I do not see them coming in the short term. I will add corpuses.

I may add tutorials if I have time.

I am thankful that there is probably some external collaboration with other Python packages. Some people have already made some useful contributions. It will be updated if more things are confirmed.

I have worked a lot on text categorization in the past few months, and I started to get bored. I started to become more interested in generative models, and generating texts.

Generative models are not new. Topic models such as LDA, or STM are generative models. However, I have been using the topic vectors or other topic models such as LDA2Vec as the feature of another supervised algorithm. And it is basically the design of my shorttext package.

I attended a meetup event held by DC Data Science and Data Education DC. The speaker, Daewoo Chong, is a senior Data Scientist at Booz Allen Hamilton. He talked about chatbot, building on RNN models on characters. His talk was not exactly about generative models, but it is indeed about generating texts. With the sophistication of GANs (see my entry on GAN and WGAN), it will surely be my next focus of my toy projects.

Ran Chen wrote a blog on his company homepage about natural language generation in his system, Trulia.

And there are a few GAN applications on text:

• “Generating Text via Adversarial Learning” [PDF]
• Lantao Yu, Weinan Zhang, Jun Wang, Yong Yu, “SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient,” arXiv:1609.05473 [arXiv]
• Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, Dan Jurafsky, “Adversarial Learning for Neural Dialogue Generation,” arXiv:1701.06547 [arXiv]
• Matt J. Kusner, José Miguel Hernández-Lobato, “GANs for sequence of discrete elements with the Gumbel-softmax distribution,” arXiv:1611.04051 [arXiv]
• David Pfau, Oriol Vinyals, “Connecting generative adversarial network and actor-critic methods,” arXiv:1610.01945 [arXiv]
• Xuerong Xiao, “Text Generation usingGenerative Adversarial Training” [PDF]

A few weeks ago, I introduced the generative model called generative adversarial networks (GAN), and stated the difficulties of training it. Not long after the post, a group of scientists from Facebook and Courant introduced Wasserstein GAN, which uses Wasserstein distance, or the Earth Mover (EM) distance, instead of Jensen-Shannon (JS) divergence as the final cost function.

In their paper (arXiv:1701.07875), they proposed the following to improve GAN:

• do not use sigmoid function as the activation function;
• the cost functions of the discriminator and generator must not have logarithms;
• cap the parameters at each step of training; and
• do not use momentum-based optimizer such as momentum or Adam; instead, RMSprop or SGD are recommended.

These do not have a theoretical reason, but rather empirical. However, the most important change is to use the Wasserstein distance as the cost function, which the authors explained in detail. There are many metrics to measure the differences between probability distributions, as summarized by Gibbs and Su in the paper, (arXiv:math/0209021) the authors of Wasserstein GAN discussed four of them, namely, the total variation (TV) distance, the Kullback-Leibler (KL) divergence, the Jensen-Shannon (JS) divergence, and the Earth-Mover (EM, Wasserstein) distance. They used an example of two parallel uniform distributions in a line to illustrate that only the EM (or Wasserstein) distance captured the continuity of the distribution distances, solving the problem of zero-change when the derivative of the generator becoming too small. The EM distance indicates, intuitively, how much “mass” must be transported from one distribution to another.

Formally, the EM distance is

$W(\mathbb{P}_r, \mathbb{P}_g) = \inf_{\gamma \in (\mathbb{P}_r, \mathbb{P}_g)} \mathbb{E}_{(x, y) \sim \gamma} [ || x-y||]$,

and the training involves finding the optimal transport path $\gamma$.

Unfortunately, the EM distance cannot be computed directly from the definition. However, as an optimization problem, there is a dual problem corresponding to this. While the author did not explain too much, Vincent Herrmann explained about the dual problem in detail in his blog.

The algorithm is described as follow: