
which is an improvement over the existing “elu” activation function.

Despite this achievement, what caught people’s attention is not the activation function itself, but the 93-page appendix of mathematical proofs:

And this is one of the pages in the appendix:

Some scholars teased it on Twitter too:

The “Whoa are you serious” award for an Appendix goes to “Self-Normalizing Neural Networks” https://t.co/YHLDtiKmXv proposes “selu” nonlin

— Andrej Karpathy (@karpathy) June 9, 2017

- Günter Klambauer, Thomas Unterthiner, Andreas Mayr, Sepp Hochreiter, “Self-Normalizing Neural Networks,” arXiv:1706.02515 (2017). [arXiv]
- Github: bioinf-jku/SNNs. [Github]

]]>

I have also described the algorithm of Sammon Embedding (see this entry), which attempts to preserve the pairwise Euclidean distances, and I implemented it using Theano. This blog entry is about its implementation in Tensorflow as a demonstration.

Let’s recall the formalism of Sammon Embedding, as outlined in the previous entry:

Assume there are $N$ high-dimensional data points described by $d$-dimensional vectors $X_i$, where $i = 1, 2, \ldots, N$. They will be mapped into vectors $Y_i$, with dimension 2 or 3. Denote the pairwise distances by $d_{ij}^{*} = \| X_i - X_j \|$ and $d_{ij} = \| Y_i - Y_j \|$. In this problem, the $Y_i$ are the variables to be learned. The cost function to minimize is

<img class="latex" title="E = \frac{1}{c} \sum_{i<j} \frac{(d_{ij}^{*} - d_{ij})^2}{d_{ij}^{*}}" src="https://s0.wp.com/latex.php?latex=E+%3D+%5Cfrac%7B1%7D%7Bc%7D+%5Csum_%7Bi%3Cj%7D+%5Cfrac%7B%28d_%7Bij%7D%5E%7B%2A%7D+-+d_%7Bij%7D%29%5E2%7D%7Bd_%7Bij%7D%5E%7B%2A%7D%7D&bg=ffffff&fg=888888&s=0" alt="E = \frac{1}{c} \sum_{i<j} \frac{(d_{ij}^{*} - d_{ij})^2}{d_{ij}^{*}}" />

where <img class="latex" title="c = \sum_{i<j} d_{ij}^{*}" src="https://s0.wp.com/latex.php?latex=c+%3D+%5Csum_%7Bi%3Cj%7D+d_%7Bij%7D%5E%7B%2A%7D&bg=ffffff&fg=888888&s=0" alt="c = \sum_{i<j} d_{ij}^{*}" />.
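For concreteness, this cost (often called the Sammon stress) can be evaluated directly in NumPy. This is a sketch of my own (the function name `sammon_stress` is mine), independent of the Tensorflow implementation below:

```python
import numpy as np

def sammon_stress(X, Y):
    """Sammon stress between high-dimensional points X and their embeddings Y."""
    n = X.shape[0]
    iu = np.triu_indices(n, k=1)                              # index pairs with i < j
    dstar = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))[iu]
    d = np.sqrt(((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1))[iu]
    c = dstar.sum()
    return ((dstar - d) ** 2 / dstar).sum() / c

np.random.seed(42)
X = np.random.randn(10, 3)
print(sammon_stress(X, X))    # 0.0: a perfect embedding has zero stress
```

A perfect embedding (mapping each point to itself) has zero stress, and any distortion of the pairwise distances makes the stress strictly positive.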

Unlike the previous entry and the original paper, I am going to optimize it using a first-order gradient optimizer. If you are not familiar with Tensorflow, take a look at some online articles, for example, “Tensorflow demystified.” This demonstration can be found in this Jupyter Notebook on Github.

First of all, import all the libraries required:

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

As previously, we use points clustered around the four vertices of a tetrahedron as an illustration, which is expected to give four equidistant clusters. We sample points around them, as shown:

tetrahedron_points = [np.array([0., 0., 0.]),
                      np.array([1., 0., 0.]),
                      np.array([np.cos(np.pi/3), np.sin(np.pi/3), 0.]),
                      np.array([0.5, 0.5/np.sqrt(3), np.sqrt(2./3.)])]
sampled_points = np.concatenate([np.random.multivariate_normal(point, np.eye(3)*0.0001, 10)
                                 for point in tetrahedron_points])
init_points = np.concatenate([np.random.multivariate_normal(point[:2], np.eye(2)*0.0001, 10)
                              for point in tetrahedron_points])
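As a quick sanity check (this snippet is my own, not from the original notebook), the four tetrahedron vertices used above are indeed pairwise equidistant, so four equidistant clusters in the embedding is the right expectation:

```python
import numpy as np

pts = np.array([[0., 0., 0.],
                [1., 0., 0.],
                [np.cos(np.pi/3), np.sin(np.pi/3), 0.],
                [0.5, 0.5/np.sqrt(3), np.sqrt(2./3.)]])
# all six pairwise distances between the four vertices
dists = [np.linalg.norm(pts[i] - pts[j]) for i in range(4) for j in range(i+1, 4)]
print(np.allclose(dists, 1.0))   # True: a regular tetrahedron with unit edges
```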

Retrieve the number of points, *N*, and the resulting dimension, *d*:

N = sampled_points.shape[0]
d = sampled_points.shape[1]

One of the most challenging technical difficulties is to calculate the pairwise distances. Inspired by this StackOverflow thread and Travis Hoppe’s entry on the Thomson problem, we know how it can be computed. Assuming Einstein’s convention of summation over repeated indices, given vectors $X_{ik}$, the matrix of squared distances is

$D_{ij}^2 = X_{ik} X_{ik} - 2 X_{ik} X_{jk} + X_{jk} X_{jk}$,

where the first and last terms are simply the squared norms of the vectors. After computing the matrix, we flatten it into a vector of the entries above the diagonal; the zero diagonal entries are dropped because the gradient of the square root blows up at zero, causing overflow:
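The identity behind this trick is simply $\| x_i - x_j \|^2 = \| x_i \|^2 - 2 x_i \cdot x_j + \| x_j \|^2$. Here is a quick NumPy check of my own; the Tensorflow code below vectorizes it the same way:

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(5, 3)
sq = (X * X).sum(axis=1).reshape(-1, 1)    # squared norms, as a column vector
sqD = sq - 2 * X @ X.T + sq.T              # broadcasting yields the full matrix
# compare against the naive double loop
naive = np.array([[((X[i] - X[j]) ** 2).sum() for j in range(5)] for i in range(5)])
print(np.allclose(sqD, naive))             # True
```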

X = tf.placeholder('float')
Xshape = tf.shape(X)
sqX = tf.reduce_sum(X*X, 1)
sqX = tf.reshape(sqX, [-1, 1])
sqDX = sqX - 2*tf.matmul(X, tf.transpose(X)) + tf.transpose(sqX)
sqDXarray = tf.stack([sqDX[i, j] for i in range(N) for j in range(i+1, N)])
DXarray = tf.sqrt(sqDXarray)

Y = tf.Variable(init_points, dtype='float')
sqY = tf.reduce_sum(Y*Y, 1)
sqY = tf.reshape(sqY, [-1, 1])
sqDY = sqY - 2*tf.matmul(Y, tf.transpose(Y)) + tf.transpose(sqY)
sqDYarray = tf.stack([sqDY[i, j] for i in range(N) for j in range(i+1, N)])
DYarray = tf.sqrt(sqDYarray)

DXarray and DYarray are the vectorized pairwise distances. Then we define the cost function according to the definition:

Z = tf.reduce_sum(DXarray)*0.5
numerator = tf.reduce_sum(tf.divide(tf.square(DXarray-DYarray), DXarray))*0.5
cost = tf.divide(numerator, Z)

As mentioned, we use a first-order gradient optimizer. For unknown reasons, the usually well-performing Adam optimizer gives an overflow. I then picked Adagrad:

train = tf.train.AdagradOptimizer(0.01).minimize(cost)
init = tf.global_variables_initializer()

The last line defines the operation that initializes all variables when run in a session. Then start a Tensorflow session, and initialize all the variables:

sess = tf.Session()
sess.run(init)

Then run the algorithm:

nbsteps = 1000
c = sess.run(cost, feed_dict={X: sampled_points})
print "epoch: ", -1, " cost = ", c
for i in range(nbsteps):
    sess.run(train, feed_dict={X: sampled_points})
    c = sess.run(cost, feed_dict={X: sampled_points})
    print "epoch: ", i, " cost = ", c

Then extract the points and close the Tensorflow session:

calculated_Y = sess.run(Y, feed_dict={X: sampled_points})
sess.close()

Plot it using matplotlib:

embed1, embed2 = calculated_Y.transpose()
plt.plot(embed1, embed2, 'ro')

This gives, as expected,

This code for Sammon Embedding has been incorporated into the Python package `mogu`, which is a collection of numerical routines. You can install it, and call:

from mogu.embed import sammon_embedding
calculated_Y = sammon_embedding(sampled_points, init_points)

- Kwan-Yuet Ho, “Sammon Embedding,” *Everything About Data Analytics*, WordPress (2016). [WordPress]
- Kwan-Yuet Ho, “Word Embedding Algorithms,” *Everything About Data Analytics*, WordPress (2016). [WordPress]
- Kwan-Yuet Ho, “Toying with Word2Vec,” *Everything About Data Analytics*, WordPress (2015). [WordPress]
- Kwan-Yuet Ho, “LDA2Vec: a hybrid of LDA and Word2Vec,” *Everything About Data Analytics*, WordPress (2016). [WordPress]
- John W. Sammon, Jr., “A Nonlinear Mapping for Data Structure Analysis,” *IEEE Transactions on Computers* **18**, 401-409 (1969).
- Wikipedia: Sammon Mapping. [Wikipedia]
- Github repository: stephenhky/SammonEmbedding. [Github]
- Theano. [link]
- NumPy (Numerical Python). [link]
- Laurens van der Maaten, Geoffrey Hinton, “Visualizing Data using t-SNE,” *Journal of Machine Learning Research* **9**, 2579-2605 (2008). [PDF]
- Teuvo Kohonen, “Self-Organizing Maps,” Springer (2000). [Amazon]
- GloVe: Global Vectors for Word Representation. [StanfordNLP]
- Tensorflow.org. [link]
- gk_, “Tensorflow demystified,” *Medium* (2017). [Medium]
- Notebook of this demonstration: stephenhky/TensorFlowToyCodes/SammonEmbedding.ipynb. [Jupyter]
- Travis Hoppe, “Stupid Tensorflow tricks,” *Medium* (2017). [Medium]
- “Compute pairwise distance in a batch without replicating tensor in Tensorflow?” [StackOverflow]
- Sebastian Ruder, “An overview of gradient descent optimization algorithms.” (2016) [Ruder]
- Python package: mogu. [PyPI] Github: stephenhky/MoguNumerics. [Github]

]]>

The original seq2seq model was implemented with the Long Short-Term Memory (LSTM) model, published by Google (see their paper). It is a sequence model that generates output text according to a sequence of input tokens. The same authors constructed a neural conversational model (see their paper), as mentioned in a previous blog post. Daewoo Chong, from Booz Allen Hamilton, presented its implementation using Tensorflow at the DC Data Education Meetup on April 13, 2017. Johns Hopkins also published a spell-correction algorithm implemented in seq2seq (see their paper). The real advantage of RNNs over CNNs is that there is no limit on the length of the input or output sequences.

A CNN requires fixed-size input vectors, which limits the size of the context window. This constraint, however, speeds up the training process, whereas RNNs are known to train slowly. Facebook uses this CNN seq2seq model for its machine translation system. For more details, take a look at their paper and their Github repository.

- Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, Yann N. Dauphin, “Convolutional Sequence to Sequence Learning.” (2017) [PDF]
- Ilya Sutskever, Oriol Vinyals, Quoc V. Le, “Sequence to Sequence Learning with Neural Networks,” arXiv:1409.3215 (2014). [arXiv]
- Oriol Vinyals, Quoc V. Le, “A Neural Conversational Model,” arXiv:1506.05869 (2015). [arXiv]
- Kwan-Yuet Ho, “Chatbots,” *Everything About Data Analytics*, WordPress (2015). [WordPress]
- “Training a Chatbot with a Recurrent Neural Network,” DC Data Education (April 13, 2017). [Meetup]
- Keisuke Sakaguchi, Kevin Duh, Matt Post, Benjamin Van Durme, “Robsut Wrod Reocginiton via semi-Character Recurrent Neural Network,” arXiv:1608.02214 (2016). [arXiv]
- Github repository: facebookresearch/fairseq. [Github]
- “Facebook proposes a brand-new CNN machine translation model: more accurate than Google’s and nine times faster” (in Chinese), 机器之心 (2017). [Zhihu]

]]>

This package `shorttext` was designed to tackle all these problems. It contains the following features:

- example data provided (including subject keywords and NIH RePORT);
- text preprocessing;
- pre-trained word-embedding support;
- gensim topic models (LDA, LSI, Random Projections) and autoencoder;
- topic model representation supported for supervised learning using scikit-learn;
- cosine distance classification; and
- neural network classification (including ConvNet, and C-LSTM).

And since the first version, there have been updates, as summarized in the documentation (News):

## Version 0.3.3 (Apr 19, 2017)

- Deleted CNNEmbedVecClassifier.
- Added script ShortTextWord2VecSimilarity.
## Version 0.3.2 (Mar 28, 2017)

- Bug fixed for gensim model I/O;
- Console scripts update;
- Neural networks up to Keras 2 standard (refer to this).
## Version 0.3.1 (Mar 14, 2017)

- Compact model I/O: all models are in single files;
- Implementation of stacked generalization using logistic regression.
## Version 0.2.1 (Feb 23, 2017)

- Removed attempts to load GloVe models directly, as they can be converted using a gensim script;
- Confirmed compatibility of the package with tensorflow;
- Use of spacy for tokenization, instead of nltk;
- Use of the stemming package for the Porter stemmer, instead of nltk;
- Removal of nltk dependencies;
- Simplifying the directory and module structures;
- Module packages updated.

Although there are still additions that I would love to make, they would not change the overall architecture. I may add some more supervised learning algorithms, but under the same framework. The upcoming big additions will be generative models or seq2seq models, but I do not see them coming in the short term. I will also add corpora.

I may add tutorials if I have time.

I am thankful that there will probably be some external collaboration with other Python packages. Some people have already made useful contributions. This entry will be updated as more things are confirmed.

- Python package: shorttext. [PyPI]
- Package documentation. [PythonHosted]
- Github: stephenhky/shorttext. [Github]
- “Short Text Categorization using Deep Neural Networks and Word-Embedding Models,” *Everything About Data Analytics*, WordPress (2016). [WordPress]
- “Release of shorttext 0.2.1,” *Everything About Data Analytics*, WordPress (2017). [WordPress]

]]>

Generative models are not new. Topic models such as LDA or STM are generative models. However, I have been using the topic vectors, or other topic models such as LDA2Vec, as features of another supervised algorithm. That is basically the design of my shorttext package.

I attended a meetup event held by DC Data Science and Data Education DC. The speaker, Daewoo Chong, is a senior data scientist at Booz Allen Hamilton. He talked about chatbots, built on character-level RNN models. His talk was not exactly about generative models, but it was indeed about generating texts. With the sophistication of GANs (see my entries on GAN and WGAN), generative models will surely be the next focus of my toy projects.

Ran Chen wrote a blog post on his company’s homepage about natural language generation in their system at Trulia.

And there are a few GAN applications on text:

- “Generating Text via Adversarial Learning” [PDF]
- Lantao Yu, Weinan Zhang, Jun Wang, Yong Yu, “SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient,” arXiv:1609.05473 [arXiv]
- Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, Dan Jurafsky, “Adversarial Learning for Neural Dialogue Generation,” arXiv:1701.06547 [arXiv]
- Matt J. Kusner, José Miguel Hernández-Lobato, “GANs for sequence of discrete elements with the Gumbel-softmax distribution,” arXiv:1611.04051 [arXiv]
- David Pfau, Oriol Vinyals, “Connecting generative adversarial network and actor-critic methods,” arXiv:1610.01945 [arXiv]
- Xuerong Xiao, “Text Generation using Generative Adversarial Training” [PDF]

]]>

In their paper (arXiv:1701.07875), they proposed the following to improve GAN:

- do not use the sigmoid function as the output activation of the discriminator;
- the cost functions of the discriminator and generator must not contain logarithms;
- cap (clip) the parameters at each step of training; and
- do not use momentum-based optimizers such as momentum or Adam; instead, RMSprop or SGD are recommended.
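The “capping” in the third point is weight clipping: after every update, each parameter of the critic is clamped into a small interval such as [-0.01, 0.01] (the value used in the paper). A schematic NumPy illustration with made-up weights:

```python
import numpy as np

c = 0.01                                    # clipping threshold from the paper
w = np.array([0.5, -0.003, -2.0, 0.008])    # pretend critic weights after an update
w = np.clip(w, -c, c)                       # clamp every parameter into [-c, c]
print(w)                                    # values: 0.01, -0.003, -0.01, 0.008
```

Weights already inside the interval are untouched; only the out-of-range ones are clamped to the boundary.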

These changes are motivated empirically rather than theoretically. The most important change, however, is to use the Wasserstein distance as the cost function, which the authors explained in detail. There are many metrics to measure the difference between probability distributions, as summarized by Gibbs and Su in their paper (arXiv:math/0209021). The authors of Wasserstein GAN discussed four of them, namely, the total variation (TV) distance, the Kullback-Leibler (KL) divergence, the Jensen-Shannon (JS) divergence, and the Earth-Mover (EM, or Wasserstein) distance. They used an example of two parallel uniform distributions on a line to illustrate that only the EM (or Wasserstein) distance captures the continuity of the change in distance between distributions, which solves the problem of vanishing gradients when the two distributions do not overlap. Intuitively, the EM distance indicates how much “mass” must be transported from one distribution to the other.

Formally, the EM distance is

$W(\mathbb{P}_r, \mathbb{P}_g) = \inf_{\gamma \in \Pi(\mathbb{P}_r, \mathbb{P}_g)} \mathbb{E}_{(x, y) \sim \gamma} [\, \| x - y \| \,]$,

and the training involves finding the optimal transport plan $\gamma$.
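The parallel-uniform example can be checked numerically: for one-dimensional empirical distributions with equally many samples, the optimal transport simply matches sorted samples, so the EM distance is the mean absolute difference of the sorted values. A small sketch of my own:

```python
import numpy as np

def w1(x, y):
    """Earth-Mover (Wasserstein-1) distance between two equal-size 1-D samples."""
    return np.abs(np.sort(x) - np.sort(y)).mean()

theta = 0.7
u = np.linspace(0., 1., 1000)    # samples from U[0, 1]
v = u + theta                    # the same distribution shifted by theta
print(w1(u, v))                  # ~0.7: the EM distance tracks the shift smoothly
```

The EM distance varies smoothly with the shift even when the supports are disjoint, whereas the JS divergence would stay constant, which is exactly the point of the paper’s example.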

Unfortunately, the EM distance cannot be computed directly from the definition. However, as an optimization problem, it has a corresponding dual problem. While the authors did not explain it in much detail, Vincent Herrmann explained the dual problem in detail on his blog.

The algorithm is described as follows:

- Martin Arjovsky, Soumith Chintala, Léon Bottou, “Wasserstein GAN,” arXiv:1701.07875 (2017). [arXiv]
- Kwan-Yuet Ho, “Generative Adversarial Networks,” *Everything about Data Analytics*, WordPress (2017). [WordPress]
- Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio, “Generative Adversarial Networks,” arXiv:1406.2661 (2014). [arXiv]
- Ian Goodfellow, “NIPS 2016 Tutorial: Generative Adversarial Networks,” arXiv:1701.00160 (2017). [arXiv]
- 郑华滨, “The Astounding Wasserstein GAN” (令人拍案叫绝的Wasserstein GAN), Zhihu Zhuanlan (2017). [Zhihu] (in simplified Chinese)
- Alison L. Gibbs, Francis Edward Su, “On Choosing and Bounding Probability Metrics,” arXiv:math/0209021 (2002) [arXiv]
- Vincent Herrmann, “Wasserstein GAN and the Kantorovich-Rubinstein Duality.” (2017) [GithubIO]

]]>

- Removal attempts of loading GloVe model, as it can be run using *gensim* script;
- Confirmed compatibility of the package with Tensorflow;
- Use of *spacy* for tokenization, instead of *nltk*;
- Use of *stemming* for Porter stemmer, instead of *nltk*;
- Removal of *nltk* dependencies;
- Simplifying the directory and module structures;
- Module packages updated.

For #1, it actually removes a bug in the previous release. Instead, users should convert GloVe models into Word2Vec format using the script provided by *gensim*.

For #3, #4, and #5, it is basically about removing the *nltk* dependencies, because very few functionalities of *nltk* were used, and it is slow. For the Porter stemmer, there is a lightweight library, *stemming*, that performs the task perfectly. For tokenization, the tokenizer in *spaCy* is significantly faster than *nltk*, as shown in this Jupyter Notebook. We can do a simple test here, by first importing:

import time
import shorttext

Then load the NIH data:

nihdata = shorttext.data.nihreports()
nihtext = ' '.join(map(lambda item: ' '.join(item[1]), nihdata.items()))

Then find the time of using the tokenizer in *nltk*:

from nltk import word_tokenize

nltkt0 = time.time()
tokens = word_tokenize(nihtext)
nltkt1 = time.time()
print nltkt1-nltkt0, ' sec'   # output: 0.0224239826202 sec

On the other hand, using *spaCy* gives:

import spacy

nlp = spacy.load('en')
spt0 = time.time()
doc = nlp(unicode(nihtext))
tokens1 = [token for token in doc]
tokens1 = map(str, tokens1)
spt1 = time.time()
print spt1-spt0, ' sec'   # output: 0.00799107551575 sec

Clearly, *spaCy* is about three times faster.

#6 indicates a simplification of the package structure. Previously, for example, the neural network framework was in `shorttext.classifiers.embed.nnlib.frameworks`, but now it is in `shorttext.classifiers.frameworks`. The old package structure is kept for backward compatibility.

- Comparing the tokenizers in *spaCy* and *nltk*. [Jupyter]
- “Python Package for Short Text Mining,” *Everything About Data Analytics*, WordPress (2016). [WordPress]
- Python package: shorttext. [PyPI]
- Package documentation. [PythonHosted]
- Github: stephenhky/shorttext. [Github]
- “Short Text Categorization using Deep Neural Networks and Word-Embedding Models,” *Everything About Data Analytics*, WordPress (2016). [WordPress]

]]>

Yesterday I attended an event at George Mason University organized by Data Science DC Meetup Group. Jennifer Sleeman talked about GAN. It was a very good talk.

In GAN, there are two important functions, namely, the discriminator (D) and the generator (G). As a generative model, the distribution of the training data, all labeled positive, can be thought of as the distribution that the generator is trained to produce. The discriminator discriminates between the data with positive labels and those with negative labels. The generator then tries to generate data, usually from noise, which should be labeled negative, to fool the discriminator into classifying them as positive. This process repeats iteratively, and eventually the generator is trained to produce data that are close to the distribution of the training data, while the discriminator is confused enough to classify the generated data as positive with probability $\frac{1}{2}$. The intuition of this competitive game comes from the minimax game in game theory. The formal algorithm is described in the original paper as follows:

The original paper showed that the model is optimal when the distribution of the finally generated data is identical to that of the training data, and argued this using the Jensen-Shannon (JS) divergence. Ferenc Huszár discussed on his blog the relations between maximum likelihood, the Kullback-Leibler (KL) divergence, and the Jensen-Shannon (JS) divergence.
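For reference, the value function of this minimax game, as given in the original paper, is

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\!\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z(z)}\!\left[\log\left(1 - D(G(z))\right)\right],
```

and at the optimum the discriminator outputs $D(x) = \tfrac{1}{2}$ everywhere.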

I asked the speaker a few questions about the concepts of GAN as well.

GAN is not yet a very sophisticated framework, but it has already found a few industrial uses. Some of its descendants include LapGAN (Laplacian GAN) and DCGAN (deep convolutional GAN). Applications include voice generation, image super-resolution, pix2pix (image-to-image translation), text-to-image synthesis, iGAN (interactive GAN), etc.

“Adversarial training is the coolest thing since sliced bread.” – Yann LeCun

- Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio, “Generative Adversarial Networks,” arXiv:1406.2661 (2014). [arXiv]
- Ian Goodfellow, “NIPS 2016 Tutorial: Generative Adversarial Networks,” arXiv:1701.00160 (2017). [arXiv]
- Ferenc Huszár, “How to Train your Generative Models? And why does Adversarial Training work so well?” inFERENCe (November 2015). [inFERENCe]
- “Generative Adversarial Networks, An Introduction,” Data Science DC. [Meetup] The presentation material of Jennifer Sleeman can be found in this Github repository: jennsleeman/introtogans_dcdatascience_2017.
- Emily Denton, Soumith Chintala, Arthur Szlam, Rob Fergus, “Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks,” arXiv:1506.05751 (2015). [arXiv]
- Alec Radford, Luke Metz, Soumith Chintala, “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks,” arXiv:1511.06434 (2015). [arXiv]
- Kwan-Yuet Ho, “Generative-Discriminative Pair,” *Everything About Data Analytics*, WordPress (2016). [WordPress]

]]>

The most famous topic model is undoubtedly latent Dirichlet allocation (LDA), proposed by David Blei and his colleagues. Such a topic model is a generative model, described by the following directed graphical model:

In the graph, $\alpha$ and $\eta$ are hyperparameters. $\theta$ is the topic distribution of a document, $z$ is the topic for each word in each document, $\beta$ is the word distribution for each topic, and $w$ is the generated word for a place in a document.
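As a toy illustration (the sizes and hyperparameter values are my own choices; the symbols follow the standard LDA notation), the generative process for one document can be sketched in NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, n_words = 3, 8, 20                 # topics, vocabulary size, document length
alpha, eta = 0.5, 0.5                    # Dirichlet hyperparameters

beta = rng.dirichlet([eta] * V, size=K)  # beta: word distribution for each topic
theta = rng.dirichlet([alpha] * K)       # theta: topic distribution of the document
z = rng.choice(K, size=n_words, p=theta) # z: topic assignment for each word slot
w = np.array([rng.choice(V, p=beta[zi]) for zi in z])   # w: the generated words
print(w.shape)   # (20,)
```

Each word is drawn in two steps: pick a topic from the document’s topic proportions, then pick a word from that topic’s word distribution.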

There are models similar to LDA, such as the correlated topic model (CTM), where the topic proportions $\theta$ are generated not from a Dirichlet distribution but from a logistic normal distribution with a covariance matrix $\Sigma$, which allows topics to be correlated.
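In the CTM, the Dirichlet draw of the topic proportions is replaced by a logistic-normal draw, so the covariance matrix can encode which topics tend to co-occur. A schematic sketch with toy parameters of my own:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.zeros(3)                          # mean of the topic log-proportions
Sigma = np.array([[1.0, 0.8, 0.0],        # positive covariance: topics 0 and 1
                  [0.8, 1.0, 0.0],        # tend to co-occur in a document
                  [0.0, 0.0, 1.0]])
eta = rng.multivariate_normal(mu, Sigma)  # Gaussian draw
theta = np.exp(eta) / np.exp(eta).sum()   # logistic (softmax) map to the simplex
print(theta.sum())                        # ~1.0: valid topic proportions
```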

There also exists an author model, which is a simpler topic model. The difference is that the words in a document are generated from its authors, as in the following graphical model, where $x$ is the author of a given word in the document.

Combining these two, it gives the author-topic model as a hybrid, as shown below:

The new release of the Python package gensim supports the author-topic model, as demonstrated in this Jupyter Notebook.

P.S.:

- I am also aware that there is another topic model called the structural topic model (STM), developed for the social sciences. However, there is no Python package supporting it; an R package, called stm, is available. You can refer to its homepage too.
- I may consider including author-topic model and STM in the next release of the Python package shorttext.

- gensim: Topic Modeling for Humans. [gensim]
- Ólavur Mortensen, “New Gensim feature: Author-topic modeling. LDA with metadata,” *RaRE Technologies Blog* (Jan 2017). [RaRE]
- David M. Blei, Andrew Y. Ng, Michael I. Jordan, “Latent Dirichlet Allocation,” *Journal of Machine Learning Research* **3**, 993-1022 (2003). [JMLR]
- Michal Rosen-Zvi, Thomas Griffiths, Mark Steyvers, Padhraic Smyth, “The Author-Topic Model for Authors and Documents,” Proceedings of UAI ’04, 487-494 (2004). [ACL] [arXiv]
- David Blei, John D. Lafferty, “Correlated Topic Models.” (2006) [CiteSeer]
- “The author-topic model: LDA with metadata.” (Jan 2017) [Jupyter]
- Margaret E. Roberts, Brandon M. Stewart, Edoardo M. Airoldi, “A Model of Text for Experimentation in the Social Sciences,” *Journal of the American Statistical Association* **111** (515), 988-1003 (2016). [PDF]
- structuraltopicmodel.com
- “stm: Estimation of the Structural Topic Model.” [CRAN]
- PyPI: shorttext. [PyPI] [WordPress]

]]>

Semantic relations between words are important, because we usually do not have enough data for a model to learn the similarity between words on its own. We do not want “drive” and “drives,” or “driver” and “chauffeur,” to be treated as completely different.

The relations between words, and their order, become important as well. We also want to capture concepts that may be correlated in our training dataset.

We have to represent these texts in a special way and perform supervised learning with traditional machine learning algorithms or deep learning algorithms.

This package `shorttext` was designed to tackle all these problems. It is not a completely new invention, but puts everything known together. It contains the following features:

- example data provided (including subject keywords and NIH RePORT);
- text preprocessing;
- pre-trained word-embedding support;
- gensim topic models (LDA, LSI, Random Projections) and autoencoder;
- topic model representation supported for supervised learning using scikit-learn;
- cosine distance classification; and
- neural network classification (including ConvNet, and C-LSTM).

Readers can refer to the documentation.

- Python package: shorttext. [PyPI]
- Package documentation. [PythonHosted]
- Github: stephenhky/shorttext. [Github]
- “Short Text Categorization using Deep Neural Networks and Word-Embedding Models,” *Everything About Data Analytics*, WordPress (2016). [WordPress]

]]>