The formulation of WMD is beautiful. Consider the embedded word vectors $\mathbf{X} \in \mathbb{R}^{d \times n}$, where $d$ is the dimension of the embeddings, and $n$ is the number of words. For each document, there is a normalized BOW vector $d \in \mathbb{R}^n$, with $d_i = \frac{c_i}{\sum_k c_k}$, where the $i$'s denote the word tokens. The distance between two words is the Euclidean distance of their embedded word vectors, denoted by $c(i, j) = \| \mathbf{x}_i - \mathbf{x}_j \|_2$, where $i$ and $j$ denote word tokens. The document distance, which is the WMD here, is defined by $\sum_{i, j} \mathbf{T}_{ij} \, c(i, j)$, where $\mathbf{T}$ is an $n \times n$ matrix. Each element $\mathbf{T}_{ij} \geq 0$ denotes how much of word $i$ in the first document (denoted by $d$) travels to word $j$ in the second document (denoted by $d'$).

Then the problem becomes the minimization of the document distance, the WMD, formulated as:

$\text{WMD} = \min_{\mathbf{T} \geq 0} \sum_{i, j = 1}^{n} \mathbf{T}_{ij} \, c(i, j)$,

given the constraints:

$\sum_{j=1}^{n} \mathbf{T}_{ij} = d_i$, and

$\sum_{i=1}^{n} \mathbf{T}_{ij} = d'_j$.

This is essentially a simplified case of the Earth Mover’s distance (EMD), or the Wasserstein distance. (See the review by Gibbs and Su.)
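As a sanity check on the formulation: when each document contains exactly one word, the constraints force the transport matrix to be $\mathbf{T} = [1]$, so the WMD collapses to the Euclidean distance between the two word vectors. A minimal sketch with made-up 3-dimensional embeddings (the vectors are illustrative, not from a trained model):

```python
import numpy as np

# hypothetical word vectors, 3-dimensional for illustration only
wordvecs = {
    'president': np.array([0.9, 0.1, 0.0]),
    'speech':    np.array([0.1, 0.8, 0.2]),
}

# both documents contain a single word, so d = d' = [1.0],
# and the constraints force the flow matrix to be T = [[1.0]]
T = np.array([[1.0]])
cost = np.linalg.norm(wordvecs['president'] - wordvecs['speech'])
wmd = np.sum(T * cost)

assert np.isclose(wmd, cost)  # WMD collapses to the word distance
```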

The WMD is essentially a linear optimization problem. There are many optimization packages on the market, and my stance is that, among the common ones, no single package is superior to the others. In my job, I once handled a missing-data problem that in turn became a non-linear optimization problem with linear constraints, and after shopping around I chose limSolve, though I like a lot of other packages too. For the WMD problem, I first tried cvxopt, which can solve exactly this kind of problem, but the indexing is hard to maintain. Because I am dealing with words, it is good to have a direct hash map, or a dictionary (I could use the Dictionary class in gensim). I later found that I should use PuLP, as it allows variables indexed by words as a hash map (a dict in Python), and WMD is a linear programming problem, making PuLP a perfect choice, considering coding efficiency.

An example of using PuLP is the British 1997 UG Exam, the first problem of this link, with a Jupyter Notebook demonstrating it.

Load the necessary packages:

from itertools import product
from collections import defaultdict

import numpy as np
from scipy.spatial.distance import euclidean
import pulp
import gensim

Then define the function that gives the normalized BOW document vectors:

def tokens_to_fracdict(tokens):
    cntdict = defaultdict(lambda: 0)
    for token in tokens:
        cntdict[token] += 1
    totalcnt = sum(cntdict.values())
    return {token: float(cnt)/totalcnt for token, cnt in cntdict.items()}
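To see what this function produces, a quick self-contained check (the function is reproduced verbatim from above):

```python
from collections import defaultdict

def tokens_to_fracdict(tokens):
    # count each token, then normalize the counts to fractions summing to 1
    cntdict = defaultdict(lambda: 0)
    for token in tokens:
        cntdict[token] += 1
    totalcnt = sum(cntdict.values())
    return {token: float(cnt)/totalcnt for token, cnt in cntdict.items()}

# a toy document with a repeated token
fracdict = tokens_to_fracdict(['doctor', 'assistant', 'doctor'])
# 'doctor' gets weight 2/3, 'assistant' gets 1/3
assert abs(sum(fracdict.values()) - 1.0) < 1e-12
```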

Then implement the core calculation. Note that PuLP is actually a symbolic computing package. The function below returns an instance of the `pulp.LpProblem` class:

def word_mover_distance_probspec(first_sent_tokens, second_sent_tokens, wvmodel, lpFile=None):
    all_tokens = list(set(first_sent_tokens+second_sent_tokens))
    wordvecs = {token: wvmodel[token] for token in all_tokens}

    first_sent_buckets = tokens_to_fracdict(first_sent_tokens)
    second_sent_buckets = tokens_to_fracdict(second_sent_tokens)

    T = pulp.LpVariable.dicts('T_matrix', list(product(all_tokens, all_tokens)), lowBound=0)

    prob = pulp.LpProblem('WMD', sense=pulp.LpMinimize)
    prob += pulp.lpSum([T[token1, token2]*euclidean(wordvecs[token1], wordvecs[token2])
                        for token1, token2 in product(all_tokens, all_tokens)])
    for token2 in second_sent_buckets:
        prob += pulp.lpSum([T[token1, token2] for token1 in first_sent_buckets]) == second_sent_buckets[token2]
    for token1 in first_sent_buckets:
        prob += pulp.lpSum([T[token1, token2] for token2 in second_sent_buckets]) == first_sent_buckets[token1]

    if lpFile is not None:
        prob.writeLP(lpFile)

    prob.solve()

    return prob

To extract the optimized objective value, just run `pulp.value(prob.objective)`.

We use the pre-trained Google Word2Vec model. Refer to the matrices in the Jupyter Notebook. Running a few examples:

- document1 = President, talk, Chicago; document2 = President, speech, Illinois; WMD = 2.88587622936
- document1 = physician, assistant; document2 = doctor; WMD = 2.8760048151
- document1 = physician, assistant; document2 = doctor, assistant; WMD = 1.00465738773 (compare with example 2!)
- document1 = doctors, assistant; document2 = doctor, assistant; WMD = 1.02825379372 (compare with example 3!)
- document1 = doctor, assistant; document2 = doctor, assistant; WMD = 0.0 (totally identical; compare with example 3!)

There are more examples in the notebook.

WMD is a good metric for comparing two documents or sentences, because it captures the semantic meanings of the words. It is more powerful than the BOW model, as it captures similarity of meaning; it is more powerful than the cosine distance between averaged word vectors, as it accounts for the transfer of meaning from the words of one document to those of the other. But it is not immune to the problem of misspellings.

This algorithm works well for short texts. However, when the documents become large, this formulation becomes computationally expensive. The authors actually suggested a few modifications, such as relaxing the problem by removing constraints, and using word centroid distances.

Example codes can be found in my Github repository: stephenhky/PyWMD.

- Matt Kusner, Yu Sun, Nicholas Kolkin, Kilian Weinberger, “From Word Embeddings To Document Distances,” *Proceedings of the 32nd International Conference on Machine Learning*, PMLR 37:957-966 (2015). [PMLR]
- Github: mkusner/wmd. [Github]
- Kwan-Yuet Ho, “Toying with Word2Vec,” *Everything About Data Analytics*, WordPress (2015). [WordPress]
- Kwan-Yuet Ho, “On Wasserstein GAN,” *Everything About Data Analytics*, WordPress (2017). [WordPress]
- Martin Arjovsky, Soumith Chintala, Léon Bottou, “Wasserstein GAN,” arXiv:1701.07875 (2017). [arXiv]
- Alison L. Gibbs, Francis Edward Su, “On Choosing and Bounding Probability Metrics,” arXiv:math/0209021 (2002). [arXiv]
- cvxopt: Python Software for Convex Optimization. [HTML]
- gensim: Topic Modeling for Humans. [HTML]
- PuLP: Optimization for Python. [PythonHosted]
- Demonstration of PuLP: Github: stephenhky/PyWMD. [Jupyter]
- Implementation of WMD: Github: stephenhky/PyWMD. [Jupyter]
- Github: stephenhky/PyWMD. [Github]

Feature image adapted from the original paper by Kusner *et al.*

]]>

On Aug 9, 2017, Data Science DC held an event titled “Fake News as a Data Science Challenge,” presented by Professor Jen Golbeck from the University of Maryland. It was an interesting talk.

Fake news itself is a big problem. It has philosophical, social, political, and psychological aspects, but Prof. Golbeck focused on its data science aspect. To make it a computational problem, a clear and succinct definition of “fake news” has to be present, and even that is already challenging. Some “fake news” is satire, sarcasm, or jokes (like The Onion). Some misinformation is shared through Twitter or Facebook without any intent to deceive. Drawing a line is therefore difficult. But the undoubtable part is that we want to fight against news with *malicious intent*.

To fight fake news, as Prof. Golbeck has pointed out, there are three main tasks:

- detecting the content;
- detecting the source; and
- modifying the intent.

Statistical tools can be exploited too. She talked about Benford’s law, which states that, in naturally occurring systems, the frequencies of numbers’ first digits are not evenly distributed. An anomaly in such distributions for some news data can be used as a first step of fraud detection. (Read her paper.)
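Benford's law predicts that the first digit d occurs with probability log10(1 + 1/d), so 1 leads roughly 30% of the time while 9 leads under 5%. A quick computation of the expected distribution:

```python
import math

# expected first-digit frequencies under Benford's law: P(d) = log10(1 + 1/d)
benford = {d: math.log10(1 + 1.0 / d) for d in range(1, 10)}

# digit 1 leads about 30.1% of the time; digit 9 under 5%
assert benford[1] > 0.30 and benford[9] < 0.05
# the nine probabilities sum to 1
assert abs(sum(benford.values()) - 1.0) < 1e-12
```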

There are also efforts, such as the Fake News Challenge, to build corpora of fake news for further machine learning model building.

However, I am not sure fighting fake news is enough. Many Americans are concerned not simply by the prevalence of fake news, but also by the framing of the news through our ideological biases. Sometimes we are not satisfied because we think the news is not “neutral” enough, or that it does not fit our worldview.

The slides can be found here, and the video of the talk can be found here.

- “Fake News as a Data Science Challenge,” Data Science DC (Aug 9, 2017). [Meetup] [slides on Google Drive] [Video on Facebook]
- Jennifer Golbeck. [HTML]
- Benford’s Law. [Wikipedia]
- Jennifer Golbeck, “Benford’s Law Applies to Online Social Networks,” *PLoS ONE* 10(8): e0135169 (2015). [PLoS]
- Fake News Challenge. [HTML]

Featured image taken from http://www.livingroomconversations.org/fake_news

]]>

`shorttext` published its release 0.4.1, with a few important updates. To install it, type the following in the OS X / Linux command line:

`pip install -U shorttext`

The documentation on PythonHosted.org has been abandoned; it has been migrated to readthedocs.org. (URL: http://shorttext.readthedocs.io/ or http://shorttext.rtfd.io)

This update is mainly due to an important update in `gensim`, motivated by earlier efforts in `shorttext` to integrate `scikit-learn` and `keras`. `gensim` now also provides a `keras` layer for Word2Vec models, on the same footing as other neural network, activation, or dropout layers. Because `shorttext` has been making use of `keras` layers for categorization, this advance in `gensim` makes it a natural step to add an embedding layer to all neural networks provided in `shorttext`. How to do it? (See the `shorttext` tutorial on “Deep Neural Networks with Word Embedding.”)

import shorttext

wvmodel = shorttext.utils.load_word2vec_model('/path/to/GoogleNews-vectors-negative300.bin.gz')  # load the pre-trained Word2Vec model
trainclassdict = shorttext.data.subjectkeywords()  # load an example data set

To train a model, you can do it the old way, or the new way with the additional `gensim` functionality:

kmodel = shorttext.classifiers.frameworks.CNNWordEmbed(wvmodel=wvmodel, nb_labels=len(trainclassdict.keys()), vecsize=100, with_gensim=True)  # keras model, setting with_gensim=True
classifier = shorttext.classifiers.VarNNEmbeddedVecClassifier(wvmodel, with_gensim=True, vecsize=100)  # instantiate the classifier, setting with_gensim=True
classifier.train(trainclassdict, kmodel)

The parameter `with_gensim` in both `CNNWordEmbed` and `VarNNEmbeddedVecClassifier` is set to `False` by default, for backward compatibility. However, setting it to `True` enables the new `gensim` Word2Vec layer.

These changes in `gensim` and `shorttext` are works mainly contributed by Chinmaya Pancholi, a very bright student at the Indian Institute of Technology, Kharagpur, and a GSoC (Google Summer of Code) student in 2017. He revolutionized `gensim` by integrating `scikit-learn` and `keras` into it, and used this work to improve the pipelines of `shorttext`. He also provided valuable technical suggestions. You can read his GSoC proposal, and his blog posts at RaRe Technologies, Inc. Chinmaya has been diligently mentored by Ivan Menshikh and Lev Konstantinovskiy of RaRe Technologies.

Another important update is the addition of the maximum entropy (maxent) classifier. (See the corresponding tutorial on “Maximum Entropy (MaxEnt) Classifier.”) I will devote a separate entry to the theory, but it is very easy to use:

import shorttext
from shorttext.classifiers import MaxEntClassifier

classifier = MaxEntClassifier()

Use the NIHReports dataset as the example:

classdict = shorttext.data.nihreports()
classifier.train(classdict, nb_epochs=1000)

The classification is just like other classifiers provided by `shorttext`:

classifier.score('cancer immunology')  # NCI tops the score
classifier.score('children health')  # NIAID tops the score
classifier.score('Alzheimer disease and aging')  # NIAID tops the score

- `shorttext` 0.4.1. [PyPI]
- Documentation of `shorttext`. [ReadTheDocs]
- `gensim`: Topic Modeling for Humans. [RaRe]
- Chinmaya Pancholi, “Gensim integration with scikit-learn and Keras,” *Google Summer of Code* (GSoC) proposal (2017). [Github]
- Chinmaya Pancholi, Student Incubator, Google Summer of Code 2017. [RaRe]
- Adam L. Berger, Stephen A. Della Pietra, Vincent J. Della Pietra, “A Maximum Entropy Approach to Natural Language Processing,” *Computational Linguistics* 22(1): 39-72 (1996). [ACM]
- Daniel Russ, Kwan-Yuet Ho, Melissa Friesen, “It Takes a Village To Solve A Problem in Data Science,” Data Science Maryland, presentation at the Applied Physics Laboratory (APL), Johns Hopkins University, June 19, 2017. [Slideshare]
- Kwan-Yuet Ho, “Python Package for Short Text Mining,” *Everything About Data Analytics*, WordPress (2016). [WordPress]
- Kwan-Yuet Ho, “Short Text Categorization using Deep Neural Networks and Word-Embedding Models,” *Everything About Data Analytics*, WordPress (2016). [WordPress]

]]>

On June 16, there was an event held by Data Science MD on natural language processing (NLP). The first speaker was Brian Sacash, a data scientist at Deloitte. His talk, titled *NLP and Sentiment Analysis*, was a good demonstration of the Python package nltk and its application to sentiment analysis. His approach is knowledge-based, and it is quite different from the talk given by Michael Cherny at DCNLP and on his blog. (See his article.) Brian has a lot of demonstration code in Jupyter notebooks on his Github.

The second speaker was Dr. Daniel Russ, a staff scientist at the National Institutes of Health (NIH) and my colleague. His talk was titled *It Takes a Village To Solve A Problem in Data Science*, stressing the number of brains and the amount of effort involved in solving a data science problem in business. He focused on the SOCcer project (see a previous blog post), of which I am also a team member, and on the interaction with the Apache OpenNLP project. (Slideshare: *It Takes a Village To Solve A Problem in Data Science*, from DataScienceMD.)

- Data Science MD. [Meetup]
- Brian Sacash, “Introduction to NLP.” on his Github: bsacash/Introduction-to-NLP. [Github]
- Brian Sacash. [bsacash]
- Natural Language Toolkit: nltk. [nltk]
- Dr. Daniel Russ. [NIH]
- SOCcer. [NIH]
- OpenNLP. [Apache]
- Daniel Russ, Kwan-Yuet Ho, Melissa Friesen, *It Takes a Village To Solve A Problem in Data Science*. [Slideshare]
- Kwan-Yuet Ho, “SOCcer: Computerized Coding in Epidemiology,” *Everything About Data Analytics*, WordPress (2016). [WordPress]

]]>

$\text{selu}(x) = \lambda \begin{cases} x & \text{if } x > 0 \\ \alpha (e^x - 1) & \text{if } x \leq 0 \end{cases}$, with $\lambda \approx 1.0507$ and $\alpha \approx 1.6733$,

which is an improvement on the existing “elu” function.
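The “selu” activation can be sketched in NumPy as follows (the constants λ ≈ 1.0507 and α ≈ 1.6733 are from the paper; this is my transcription, not the authors' code):

```python
import numpy as np

# constants from Klambauer et al. (2017), arXiv:1706.02515
LAMBDA = 1.0507009873554805
ALPHA = 1.6732632423543772

def selu(x):
    # selu(x) = lambda * x                   for x > 0
    #         = lambda * alpha * (e^x - 1)   for x <= 0
    x = np.asarray(x, dtype=float)
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(x) - 1))

assert selu(0.0) == 0.0          # continuous at zero
assert np.isclose(selu(1.0), LAMBDA)  # linear with slope lambda for x > 0
```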

Despite this achievement, what caught everyone’s eyes was not the activation function itself, but the 93-page appendix of mathematical proofs:

And this is one of the pages in the appendix:

Some scholars teased at it on Twitter too:

The “Whoa are you serious” award for an Appendix goes to “Self-Normalizing Neural Networks” https://t.co/YHLDtiKmXv proposes “selu” nonlin

— Andrej Karpathy (@karpathy) June 9, 2017

- Günter Klambauer, Thomas Unterthiner, Andreas Mayr, Sepp Hochreiter, “Self-Normalizing Neural Networks,” arXiv:1706.02515 (2017). [arXiv]
- Github: bioinf-jku/SNNs. [Github]

]]>

I have also described the algorithm of Sammon Embedding (see this), which attempts to preserve the pairwise Euclidean distances between points, and I implemented it using Theano. This blog entry is about its implementation in Tensorflow as a demonstration.

Let’s recall the formalism of Sammon Embedding, as outlined in the previous entry:

Assume there are $N$ high-dimensional data points described by $d$-dimensional vectors $X_i$, where $i = 1, 2, \ldots, N$. They will be mapped onto vectors $Y_i$, with dimension 2 or 3. Denote the distances as $d_{ij}^{*} = \| X_i - X_j \|$ and $d_{ij} = \| Y_i - Y_j \|$. In this problem, the $Y_i$'s are the variables to be learned. The cost function to minimize is

$E = \frac{1}{c} \sum_{i<j} \frac{(d_{ij}^{*} - d_{ij})^2}{d_{ij}^{*}}$,

where $c = \sum_{i<j} d_{ij}^{*}$.
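The cost function is straightforward to evaluate directly; a minimal NumPy sketch (the helper name `sammon_stress` is mine, not from any package) that also checks the zero-stress case of a distance-preserving map:

```python
import numpy as np
from itertools import combinations

def sammon_stress(X, Y):
    # E = (1/c) * sum_{i<j} (d*_ij - d_ij)^2 / d*_ij,  with c = sum_{i<j} d*_ij
    pairs = list(combinations(range(len(X)), 2))
    dstar = np.array([np.linalg.norm(X[i] - X[j]) for i, j in pairs])
    d = np.array([np.linalg.norm(Y[i] - Y[j]) for i, j in pairs])
    return np.sum((dstar - d) ** 2 / dstar) / dstar.sum()

# three points whose third coordinate is zero: projecting onto the
# first two dimensions preserves all pairwise distances exactly
X = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.]])
assert np.isclose(sammon_stress(X, X[:, :2]), 0.0)
```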

Unlike the previous entry and the original paper, I am going to optimize it using a first-order gradient optimizer. If you are not familiar with Tensorflow, take a look at some online articles, for example, “Tensorflow demystified.” This demonstration can be found in this Jupyter Notebook in Github.

First of all, import all the libraries required:

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

As previously, we use points clustered around the four vertices of a tetrahedron as an illustration, which is expected to give equidistant clusters. We sample points around them, as shown:

tetrahedron_points = [np.array([0., 0., 0.]),
                      np.array([1., 0., 0.]),
                      np.array([np.cos(np.pi/3), np.sin(np.pi/3), 0.]),
                      np.array([0.5, 0.5/np.sqrt(3), np.sqrt(2./3.)])]

sampled_points = np.concatenate([np.random.multivariate_normal(point, np.eye(3)*0.0001, 10)
                                 for point in tetrahedron_points])
init_points = np.concatenate([np.random.multivariate_normal(point[:2], np.eye(2)*0.0001, 10)
                              for point in tetrahedron_points])

Retrieve the number of points, *N*, and the resulting dimension, *d*:

N = sampled_points.shape[0]
d = sampled_points.shape[1]

One of the most challenging technical difficulties is to calculate the pairwise distances. Inspired by this StackOverflow thread and Travis Hoppe’s entry on Thomson’s problem, we know it can be computed. Assuming Einstein’s convention of summation over repeated indices, given vectors $x_{ik}$, the squared distance matrix is

$D_{ij}^2 = x_{ik} x_{ik} - 2 x_{ik} x_{jk} + x_{jk} x_{jk}$,

where the first and last terms are simply the squared norms of the vectors. After computing the matrix, we will flatten it to vectors, for technical reasons (omitted here) to avoid gradient overflow:

X = tf.placeholder('float')
Xshape = tf.shape(X)
sqX = tf.reduce_sum(X*X, 1)
sqX = tf.reshape(sqX, [-1, 1])
sqDX = sqX - 2*tf.matmul(X, tf.transpose(X)) + tf.transpose(sqX)
sqDXarray = tf.stack([sqDX[i, j] for i in range(N) for j in range(i+1, N)])
DXarray = tf.sqrt(sqDXarray)

Y = tf.Variable(init_points, dtype='float')
sqY = tf.reduce_sum(Y*Y, 1)
sqY = tf.reshape(sqY, [-1, 1])
sqDY = sqY - 2*tf.matmul(Y, tf.transpose(Y)) + tf.transpose(sqY)
sqDYarray = tf.stack([sqDY[i, j] for i in range(N) for j in range(i+1, N)])
DYarray = tf.sqrt(sqDYarray)
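The norm-and-Gram-matrix trick for pairwise distances can be verified in plain NumPy against a direct pairwise computation (a standalone check, not part of the Tensorflow graph):

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(5, 3)  # 5 points in 3 dimensions

# vectorized: D^2_ij = |x_i|^2 - 2 x_i . x_j + |x_j|^2
sq = np.sum(X * X, axis=1).reshape(-1, 1)
sqD = sq - 2 * np.dot(X, X.T) + sq.T

# direct pairwise computation for comparison
direct = np.array([[np.sum((X[i] - X[j]) ** 2) for j in range(5)] for i in range(5)])
assert np.allclose(sqD, direct)
```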

And `DXarray` and `DYarray` are the vectorized pairwise distances. Then we define the cost function according to the definition:

Z = tf.reduce_sum(DXarray)*0.5
numerator = tf.reduce_sum(tf.divide(tf.square(DXarray-DYarray), DXarray))*0.5
cost = tf.divide(numerator, Z)

As we said, we use a first-order gradient optimizer. For unknown reasons, the usually well-performing Adam optimizer gives an overflow. I then picked Adagrad:

train = tf.train.AdagradOptimizer(0.01).minimize(cost)
init = tf.global_variables_initializer()

The last line defines the operation that initializes all variables once it is run in a Tensorflow session. So we start a session and initialize the variables:

sess = tf.Session()
sess.run(init)

Then run the algorithm:

nbsteps = 1000
c = sess.run(cost, feed_dict={X: sampled_points})
print "epoch: ", -1, " cost = ", c
for i in range(nbsteps):
    sess.run(train, feed_dict={X: sampled_points})
    c = sess.run(cost, feed_dict={X: sampled_points})
    print "epoch: ", i, " cost = ", c

Then extract the points and close the Tensorflow session:

calculated_Y = sess.run(Y, feed_dict={X: sampled_points})
sess.close()

Plot it using matplotlib:

embed1, embed2 = calculated_Y.transpose()
plt.plot(embed1, embed2, 'ro')

This gives, as expected,

This code for Sammon Embedding has been incorporated into the Python package `mogu`, which is a collection of numerical routines. You can install it, and call:

from mogu.embed import sammon_embedding

calculated_Y = sammon_embedding(sampled_points, init_points)

- Kwan-Yuet Ho, “Sammon Embedding,” *Everything About Data Analytics*, WordPress (2016). [WordPress]
- Kwan-Yuet Ho, “Word Embedding Algorithms,” *Everything About Data Analytics*, WordPress (2016). [WordPress]
- Kwan-Yuet Ho, “Toying with Word2Vec,” *Everything About Data Analytics*, WordPress (2015). [WordPress]
- Kwan-Yuet Ho, “LDA2Vec: a hybrid of LDA and Word2Vec,” *Everything About Data Analytics*, WordPress (2016). [WordPress]
- John W. Sammon, Jr., “A Nonlinear Mapping for Data Structure Analysis,” *IEEE Transactions on Computers* **18**, 401-409 (1969).
- Wikipedia: Sammon Mapping. [Wikipedia]
- Github repository: stephenhky/SammonEmbedding. [Github]
- Theano. [link]
- NumPy (Numerical Python). [link]
- Laurens van der Maaten, Geoffrey Hinton, “Visualizing Data using t-SNE,” *Journal of Machine Learning Research* **9**, 2579-2605 (2008). [PDF]
- Teuvo Kohonen, “Self-Organizing Maps,” Springer (2000). [Amazon]
- GloVe: Global Vectors for Word Representation. [StanfordNLP]
- Tensorflow.org. [link]
- gk_, “Tensorflow demystified,” *Medium* (2017). [Medium]
- Notebook of this demonstration: stephenhky/TensorFlowToyCodes/SammonEmbedding.ipynb. [Jupyter]
- Travis Hoppe, “Stupid Tensorflow tricks,” *Medium* (2017). [Medium]
- “Compute pairwise distance in a batch without replicating tensor in Tensorflow?” [StackOverflow]
- Sebastian Ruder, “An overview of gradient descent optimization algorithms” (2016). [Ruder]
- Python package: mogu. [PyPI] Github: stephenhky/MoguNumerics. [Github]

]]>

The original seq2seq model was implemented with the Long Short-Term Memory (LSTM) model, published by Google (see their paper). It is basically a model that generates an output sequence of tokens according to a sequence of input tokens. The same authors constructed a neural conversational model (see their paper), as mentioned in a previous blog post. Daewoo Chong, from Booz Allen Hamilton, presented its implementation using Tensorflow at the DC Data Education Meetup on April 13, 2017. Johns Hopkins also published a spell-correction algorithm implemented in seq2seq (see their paper). The real advantage of RNN over CNN is that there is no limit on the number of input or output tokens.

While the need to fix the size of input vectors for a CNN is obvious, using a CNN deliberately limits the size of the input vectors, and thus the size of the context. This restricts the content, but speeds up the training process; RNNs are known to be slow to train. Facebook uses this CNN seq2seq model for its machine translation. For more details, take a look at their paper and their Github repository.

- Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, Yann N. Dauphin, “Convolutional Sequence to Sequence Learning.” (2017) [PDF]
- Ilya Sutskever, Oriol Vinyals, Quoc V. Le, “Sequence to Sequence Learning with Neural Networks,” arXiv:1409.3215 (2014). [arXiv]
- Oriol Vinyals, Quoc V. Le, “A Neural Conversational Model,” arXiv:1506.05869 (2015). [arXiv]
- Kwan-Yuet Ho, “Chatbots,” *Everything About Data Analytics*, WordPress (2015). [WordPress]
- “Training a Chatbot with a Recurrent Neural Network,” DC Data Education (April 13, 2017). [Meetup]
- Keisuke Sakaguchi, Kevin Duh, Matt Post, Benjamin Van Durme, “Robsut Wrod Reocginiton via semi-Character Recurrent Neural Network,” arXiv:1608.02214 (2016). [arXiv]
- Github repository: facebookresearch/fairseq. [Github]
- “Facebook proposes a brand-new CNN machine translation model: more accurate than Google’s, and nine times faster,” 机器之心 (Synced) (2017). [Zhihu]

]]>

This package `shorttext` was designed to tackle all these problems… It contains the following features:

- example data provided (including subject keywords and NIH RePORT);
- text preprocessing;
- pre-trained word-embedding support;
- gensim topic models (LDA, LSI, Random Projections) and autoencoder;
- topic model representation supported for supervised learning using scikit-learn;
- cosine distance classification; and
- neural network classification (including ConvNet, and C-LSTM).

And since the first version, there have been updates, as summarized in the documentation (News):

## Version 0.3.3 (Apr 19, 2017)

- Deleted CNNEmbedVecClassifier.
- Added script ShortTextWord2VecSimilarity.
## Version 0.3.2 (Mar 28, 2017)

- Bug fixed for gensim model I/O;
- Console scripts update;
- Neural networks up to Keras 2 standard (refer to this).
## Version 0.3.1 (Mar 14, 2017)

- Compact model I/O: all models are in single files;
- Implementation of stacked generalization using logistic regression.
## Version 0.2.1 (Feb 23, 2017)

- Removal attempts of loading GloVe model, as it can be run using gensim script;
- Confirmed compatibility of the package with tensorflow;
- Use of spacy for tokenization, instead of nltk;
- Use of stemming for Porter stemmer, instead of nltk;
- Removal of nltk dependencies;
- Simplifying the directory and module structures;
- Module packages updated.

Although there are still additions that I would love to make, they would not change the overall architecture. I may add some more supervised learning algorithms, but within the same framework. The upcoming big additions will be generative models or seq2seq models, but I do not see them coming in the short term. I will also add more corpora.

I may add tutorials if I have time.

I am thankful that there will probably be some external collaboration with other Python packages. Some people have already made useful contributions. This will be updated as more things are confirmed.

- Python package: shorttext. [PyPI]
- Package documentation. [PythonHosted]
- Github: stephenhky/shorttext. [Github]
- “Short Text Categorization using Deep Neural Networks and Word-Embedding Models,” *Everything About Data Analytics*, WordPress (2016). [WordPress]
- “Release of shorttext 0.2.1,” *Everything About Data Analytics*, WordPress (2017). [WordPress]

]]>

Generative models are not new. Topic models such as LDA or STM are generative models. However, I have been using the topic vectors from these models, or from others such as LDA2Vec, as features for another supervised algorithm. That is basically the design of my shorttext package.

I attended a meetup event held by DC Data Science and Data Education DC. The speaker, Daewoo Chong, is a senior Data Scientist at Booz Allen Hamilton. He talked about chatbots, built on character-level RNN models. His talk was not exactly about generative models, but it was indeed about generating texts. With the sophistication of GANs (see my entries on GAN and WGAN), text generation will surely be the next focus of my toy projects.

Ran Chen wrote a blog on his company homepage about natural language generation in his system, Trulia.

And there are a few GAN applications on text:

- “Generating Text via Adversarial Learning” [PDF]
- Lantao Yu, Weinan Zhang, Jun Wang, Yong Yu, “SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient,” arXiv:1609.05473 [arXiv]
- Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, Dan Jurafsky, “Adversarial Learning for Neural Dialogue Generation,” arXiv:1701.06547 [arXiv]
- Matt J. Kusner, José Miguel Hernández-Lobato, “GANs for sequence of discrete elements with the Gumbel-softmax distribution,” arXiv:1611.04051 [arXiv]
- David Pfau, Oriol Vinyals, “Connecting generative adversarial network and actor-critic methods,” arXiv:1610.01945 [arXiv]
- Xuerong Xiao, “Text Generation using Generative Adversarial Training” [PDF]

]]>

In their paper (arXiv:1701.07875), they proposed the following to improve GAN:

- do not use sigmoid function as the activation function;
- the cost functions of the discriminator and generator must not have logarithms;
- cap the parameters at each step of training; and
- do not use momentum-based optimizer such as momentum or Adam; instead, RMSprop or SGD are recommended.

These changes are empirical rather than theoretically motivated. However, the most important change is to use the Wasserstein distance as the cost function, which the authors explained in detail. There are many metrics for measuring the difference between probability distributions, as summarized by Gibbs and Su in their paper (arXiv:math/0209021). The authors of Wasserstein GAN discussed four of them, namely, the total variation (TV) distance, the Kullback-Leibler (KL) divergence, the Jensen-Shannon (JS) divergence, and the Earth-Mover (EM, Wasserstein) distance. They used an example of two parallel uniform distributions on a line to illustrate that only the EM (or Wasserstein) distance captures the continuity of the distance between distributions, avoiding the problem of zero change when the gradient of the generator becomes too small. The EM distance indicates, intuitively, how much “mass” must be transported from one distribution to another.

Formally, the EM distance is

$W(\mathbb{P}_r, \mathbb{P}_g) = \inf_{\gamma \in \Pi(\mathbb{P}_r, \mathbb{P}_g)} \mathbb{E}_{(x, y) \sim \gamma} \left[ \| x - y \| \right]$,

and the training involves finding the optimal transport plan $\gamma$.
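For intuition, in one dimension the EM distance between two equal-size empirical samples has a closed form: sort both samples and average the absolute differences (the optimal transport is monotone). A NumPy sketch of the paper's shifted-uniform example, where the EM distance equals the offset θ (the helper `wasserstein1_1d` is mine, for illustration):

```python
import numpy as np

def wasserstein1_1d(xs, ys):
    # for equal-size 1D samples, W1 is the mean absolute difference
    # of the sorted samples (the optimal transport is monotone)
    return np.mean(np.abs(np.sort(xs) - np.sort(ys)))

theta = 0.7
xs = np.linspace(0., 1., 100)   # uniform grid on [0, 1]
ys = xs + theta                 # the same distribution shifted by theta

assert np.isclose(wasserstein1_1d(xs, ys), theta)  # EM distance = |theta|
```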

Unfortunately, the EM distance cannot be computed directly from the definition. However, as an optimization problem, it has a corresponding dual problem. While the authors did not explain it in much detail, Vincent Herrmann explains the dual problem thoroughly on his blog.

The algorithm is described as follows:

- Martin Arjovsky, Soumith Chintala, Léon Bottou, “Wasserstein GAN,” arXiv:1701.07875 (2017). [arXiv]
- Kwan-Yuet Ho, “Generative Adversarial Networks,” *Everything About Data Analytics*, WordPress (2017). [WordPress]
- Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio, “Generative Adversarial Networks,” arXiv:1406.2661 (2014). [arXiv]
- Ian Goodfellow, “NIPS 2016 Tutorial: Generative Adversarial Networks,” arXiv:1701.00160 (2017). [arXiv]
- Zheng Huabin (郑华滨), “The Astounding Wasserstein GAN” (令人拍案叫绝的Wasserstein GAN), Zhihu Zhuanlan (2017). [Zhihu] (in simplified Chinese)
- Alison L. Gibbs, Francis Edward Su, “On Choosing and Bounding Probability Metrics,” arXiv:math/0209021 (2002) [arXiv]
- Vincent Herrmann, “Wasserstein GAN and the Kantorovich-Rubinstein Duality.” (2017) [GithubIO]

]]>