Recently, Geoffrey Hinton and his colleagues made their paper on capsules available. Hinton has long been known to heavily criticize the use of pooling and backpropagation.

“A capsule is a group of neurons whose activity vector represents the instantiation parameters of a specific type of entity such as an object or object part.” The inputs and outputs of a capsule are vectors, instead of scalars as in traditional neural networks. A cheat sheet comparing traditional neurons and capsules is as follows:

Based on the capsule, the authors proposed a new type of network called CapsNet.
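To make the "vector instead of scalar" idea concrete, here is a minimal sketch of the "squashing" nonlinearity that the capsules paper applies to a capsule's output vector: it keeps the direction of the vector and shrinks its length into the range [0, 1), so the length can be read as a probability that the entity is present. The function name `squash` and the plain-list representation are my own choices for illustration, not code from the paper or from any CapsNet implementation.

```python
import math


def squash(s):
    """Squash a capsule's raw output vector s (a list of floats).

    Returns a vector with the same direction as s and length
    |s|^2 / (1 + |s|^2), which always lies in [0, 1).
    """
    norm_sq = sum(x * x for x in s)
    if norm_sq == 0.0:
        # The zero vector has no direction; leave it as zero.
        return [0.0 for _ in s]
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * x for x in s]


v = squash([3.0, 4.0])          # input length 5
print(math.hypot(v[0], v[1]))   # output length 25/26, just below 1
```

A long input vector ends up with length close to 1, and a short one is pushed towards 0, which is the vector analogue of a sigmoid on a scalar neuron.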

Huadong Liao implemented CapsNet with TensorFlow according to the paper. (Refer to his repository.)

The theory and the interpretability of deep neural networks have always been called into question. In recent years, there have been several ideas that shed light on the theory of neural networks.

# Renormalization Group (RG)

Mehta and Schwab analytically connected the renormalization group (RG) with one particular type of deep learning network, the restricted Boltzmann machine (RBM). (See their paper and a previous post.) The RBM is similar to the Ising model in statistical physics. The weakness of this work is that it can explain only one type of deep learning algorithm.
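The resemblance to a spin model in statistical physics is easiest to see from the energy function of an RBM. The block below writes it in the standard textbook notation, which is my assumption and not taken from the post: $v_i$ are the visible units, $h_j$ the hidden units, $a_i$ and $b_j$ their biases, $W_{ij}$ the couplings, and $Z$ the partition function.

```latex
E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i W_{ij} h_j ,
\qquad
p(v, h) = \frac{e^{-E(v, h)}}{Z} ,
\qquad
Z = \sum_{v, h} e^{-E(v, h)} .
```

The bias terms play the role of external fields and the $W_{ij}$ term plays the role of a spin-spin coupling, restricted to pairs made of one visible and one hidden unit.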

However, this insight gave rise to subsequent work: using the density matrix renormalization group (DMRG), entanglement renormalization (in quantum information), and tensor networks, a new supervised learning algorithm was invented. (See their paper and a previous post.)

# Neural Networks as Polynomial Approximation

Lin and Tegmark were not satisfied with the RG intuition, and pointed out a special case that RG does not explain. However, they argue that neural networks are good approximations to several polynomial and asymptotic behaviors of the physical universe, which is why neural networks work so well in predictive analytics. (See their paper, Lin’s reply on Quora, and a previous post.)

# Information Bottleneck (IB)

Tishby and his colleagues have been promoting the information bottleneck as a backing theory of deep learning. (See previous post.) In recent papers such as arXiv:1612.00410, an algorithm using variational inference was devised on top of the information bottleneck.
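For readers who have not seen it, the information bottleneck is usually stated as the optimization problem below. The notation is the common textbook one and is my assumption here: $X$ is the input, $Y$ the target, $T$ the compressed representation, $I(\cdot\,;\cdot)$ the mutual information, and $\beta > 0$ a trade-off parameter.

```latex
\min_{p(t \mid x)} \; I(X; T) - \beta \, I(T; Y)
```

The first term rewards compressing the input, and the second rewards keeping whatever in $T$ is informative about the target. Variational approaches replace the two mutual information terms with tractable bounds.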

# Generalization

Recently, Kawaguchi, Kaelbling, and Bengio suggested that “deep model classes have an exponential advantage to represent certain natural target functions when compared to shallow model classes.” (See their paper and a previous post.) They provided their proof using generalization theory. With this, they introduced a new family of regularization methods.

# Geometric View on Generative Adversarial Networks (GAN)

Recently, Lei, Su, Cui, Yau, and Gu tried to offer a geometric view of generative adversarial networks (GAN), and provided a simpler method of training the discriminator and generator for a large class of transportation problems. (See their paper.) Their work is very mathematical, and I have yet to fully understand it. Note also that their experiments were done on low-dimensional feature spaces.

In their paper, Kawaguchi, Kaelbling, and Bengio explored the theory of why generalization in deep learning is so good. Based on their theoretical insights, they proposed a new regularization method, called Directly Approximately Regularizing Complexity (DARC), in addition to commonly used Lp-regularization and dropout methods.

“This paper explains why deep learning can generalize well, despite large capacity and possible algorithmic instability, nonrobustness, and sharp minima, effectively addressing an open problem in the literature. Based on our theoretical insight, this paper also proposes a family of new regularization methods. Its simplest member was empirically shown to improve base models and achieve state-of-the-art performance on MNIST and CIFAR-10 benchmarks. Moreover, this paper presents both data-dependent and data-independent generalization guarantees with improved convergence rates. Our results suggest several new open areas of research.”

There are many embedding algorithms for representations. Sammon embedding is one of the oldest, and for word embeddings we have Word2Vec, GloVe, FastText, etc. Embeddings are useful for dimensionality reduction.

Traditionally, quantum many-body states are represented by Fock states, which are useful when the excitations of quasi-particles are the concern. But to capture the quantum entanglement between many solitons or particles in a statistical system, it is important not to lose the topological correlation between the states. Restricted Boltzmann machines (RBM) have been used to represent such states, but they have their limitations, as Xun Gao and Lu-Ming Duan stated in their article published in Nature Communications:

There exist states, which can be generated by a constant-depth quantum circuit or expressed as PEPS (projected entangled pair states) or ground states of gapped Hamiltonians, but cannot be efficiently represented by any RBM unless the polynomial hierarchy collapses in the computational complexity theory.

PEPS is a generalization of matrix product states (MPS) to higher dimensions. (See this.)

However, Gao and Duan were able to prove that a deep Boltzmann machine (DBM) can close this gap left by the RBM, as stated in their article:

Any quantum state of n qubits generated by a quantum circuit of depth T can be represented exactly by a sparse DBM with O(nT) neurons.

(diagram adapted from Gao and Duan’s article)

A preprint on arXiv recently caught a lot of attention. While deep learning has been successful with various types of neural networks, it had not been so for feed-forward neural networks. The authors of this paper proposed normalizing the network with a new activation function, called “selu” (scaled exponential linear units):

$\text{selu}(x) = \lambda \begin{cases} x & \text{if } x > 0 \\ \alpha e^x - \alpha & \text{if } x \leq 0 \end{cases}$,

which is an improvement to the existing “elu” function.
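The formula is short enough to write out directly. Below is a minimal scalar sketch in plain Python. The two constants are the values of $\alpha$ and $\lambda$ reported in the paper for zero-mean, unit-variance self-normalization; they are not stated in this post, and the function and constant names are mine.

```python
import math

# Constants from the SELU paper (not given in this post).
ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805


def selu(x, alpha=ALPHA, lam=LAMBDA):
    """Scaled exponential linear unit for a single float x."""
    if x > 0:
        return lam * x
    # alpha * e^x - alpha, written with expm1 for accuracy near x = 0.
    return lam * alpha * math.expm1(x)


for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(x, selu(x))
```

For large negative inputs the function saturates at $-\lambda\alpha$, and for positive inputs it is a straight line with slope $\lambda$ slightly larger than 1.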

Despite this achievement, what caught people's attention was not the activation function, but the 93-page appendix of mathematical proofs:

And this is one of the pages in the appendix:

Some scholars teased it on Twitter too:

In the past few days, Facebook published a new research paper on the use of the sequence-to-sequence (seq2seq) model for machine translation. What is special about this seq2seq model is that it uses convolutional neural networks (ConvNet, or CNN), instead of recurrent neural networks (RNN).

The original seq2seq model, published by Google, is implemented with the Long Short-Term Memory (LSTM) model. (See their paper.) It is basically a character-based model that generates texts according to a sequence of input characters. The same author also constructed a neural conversational model (see their paper), as mentioned in a previous blog post. Daewoo Chong, from Booz Allen Hamilton, presented its implementation using TensorFlow at the DC Data Education Meetup on April 13, 2017. Johns Hopkins also published a spell correction algorithm implemented in seq2seq. (See their paper.) The real advantage of RNN over CNN is that there is no limit on the length of the input or output token sequences.

A CNN obviously works with inputs of fixed size, but this serves a purpose: it limits the size of the input vectors, and thus the size of the context each output can see. This limited context speeds up the training process, whereas RNNs are known to be slow to train. Facebook uses this CNN seq2seq model for their machine translation model. For more details, take a look at their paper and their GitHub repository.
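The "limited context" can be made precise with a little arithmetic. As a sketch under the assumption of plain stacked 1-D convolutions with stride 1 and no dilation (the function name is hypothetical, and this is not code from Facebook's repository), each additional layer of kernel width k lets an output position see k - 1 more input tokens:

```python
def receptive_field(num_layers, kernel_width):
    """Number of input tokens that one output position can see after
    stacking `num_layers` stride-1, undilated 1-D convolutions."""
    if num_layers < 0 or kernel_width < 1:
        raise ValueError("need num_layers >= 0 and kernel_width >= 1")
    return num_layers * (kernel_width - 1) + 1


for layers in (1, 2, 6):
    print(layers, receptive_field(layers, kernel_width=5))
```

So the context grows only linearly with depth and is always bounded, unlike an RNN, whose hidden state can in principle carry information from arbitrarily far back. In exchange, all positions in a layer can be computed in parallel.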

On November 21, 2016, the Python package shorttext was published. Since then, more than seven versions have been released. There has been a drastic architecture change, but the overall purpose is still the same, as summarized in the first introduction entry:

This package shorttext was designed to tackle all these problems… It contains the following features:

• example data provided (including subject keywords and NIH RePORT);
• text preprocessing;
• pre-trained word-embedding support;
• gensim topic models (LDA, LSI, Random Projections) and autoencoder;
• topic model representation supported for supervised learning using scikit-learn;
• cosine distance classification; and
• neural network classification (including ConvNet, and C-LSTM).
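To give a flavor of what cosine distance classification in this feature list means, here is a minimal, self-contained sketch of the idea: each class is represented by one vector, and a new text vector is assigned to the class whose vector it is most similar to. This is only an illustration of the technique, not the shorttext API; the function names and the toy three-dimensional vectors are made up.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)


def classify(vector, class_vectors):
    """Return the label whose class vector is most similar to `vector`."""
    return max(class_vectors,
               key=lambda label: cosine_similarity(vector, class_vectors[label]))


# Hypothetical class vectors, e.g. averaged word embeddings per class.
class_vectors = {
    "physics": [0.9, 0.1, 0.0],
    "biology": [0.1, 0.8, 0.3],
}
print(classify([0.7, 0.2, 0.1], class_vectors))
```

In practice the vectors would come from a pre-trained word-embedding model or a topic model rather than being typed in by hand.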

And since the first version, there have been updates, as summarized in the documentation (News):

## Version 0.3.3 (Apr 19, 2017)

• Deleted CNNEmbedVecClassifier.

## Version 0.3.2 (Mar 28, 2017)

• Bug fixed for gensim model I/O;
• Console scripts update;
• Neural networks up to Keras 2 standard (refer to this).

## Version 0.3.1 (Mar 14, 2017)

• Compact model I/O: all models are in single files;
• Implementation of stacked generalization using logistic regression.

## Version 0.2.1 (Feb 23, 2017)

• Removal attempts of loading GloVe model, as it can be run using gensim script;
• Confirmed compatibility of the package with tensorflow;
• Use of spacy for tokenization, instead of nltk;
• Use of stemming for Porter stemmer, instead of nltk;
• Removal of nltk dependencies;
• Simplifying the directory and module structures;
• Module packages updated.

There are still additions that I would love to make, but they would not change the overall architecture. I may add some more supervised learning algorithms, but within the same architecture. The upcoming big additions will be generative models or seq2seq models, but I do not see them coming in the short term. I will add corpora.

I may add tutorials if I have time.

I am thankful that there will probably be some external collaboration with other Python packages. Some people have already made useful contributions. This post will be updated when more things are confirmed.

I have worked a lot on text categorization in the past few months, and I started to get bored. I started to become more interested in generative models, and generating texts.

Generative models are not new. Topic models such as LDA or STM are generative models. However, I have been using the topic vectors, or other topic models such as LDA2Vec, as the features for another supervised algorithm. That is basically the design of my shorttext package.

I attended a meetup event held by DC Data Science and Data Education DC. The speaker, Daewoo Chong, is a senior Data Scientist at Booz Allen Hamilton. He talked about chatbots built on character-level RNN models. His talk was not exactly about generative models, but it was indeed about generating texts. With the sophistication of GANs (see my entry on GAN and WGAN), text generation will surely be the next focus of my toy projects.

Ran Chen wrote a blog post on his company homepage about natural language generation in his system, Trulia.

And there are a few GAN applications on text:

• “Generating Text via Adversarial Learning” [PDF]
• Lantao Yu, Weinan Zhang, Jun Wang, Yong Yu, “SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient,” arXiv:1609.05473 [arXiv]
• Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, Dan Jurafsky, “Adversarial Learning for Neural Dialogue Generation,” arXiv:1701.06547 [arXiv]
• Matt J. Kusner, José Miguel Hernández-Lobato, “GANs for sequence of discrete elements with the Gumbel-softmax distribution,” arXiv:1611.04051 [arXiv]
• David Pfau, Oriol Vinyals, “Connecting generative adversarial network and actor-critic methods,” arXiv:1610.01945 [arXiv]
• Xuerong Xiao, “Text Generation using Generative Adversarial Training” [PDF]

The package shorttext has received attention over the past two months. A new version was released yesterday with the following updates:

1. Removal attempts of loading GloVe model, as it can be run using gensim script;
2. Confirmed compatibility of the package with Tensorflow;
3. Use of spacy for tokenization, instead of nltk;
4. Use of stemming for Porter stemmer, instead of nltk;
5. Removal of nltk dependencies;
6. Simplifying the directory and module structures;
7. Module packages updated.

For #1, it actually removed a bug in the previous release. Instead, the users should convert the GloVe models into Word2Vec using the script provided by gensim.
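As far as I know, the gensim script in question is `glove2word2vec` in `gensim.scripts`. The conversion itself is simple: the Word2Vec text format is the GloVe text format with one extra header line giving the vocabulary size and the vector dimension. The sketch below does that conversion with the standard library only, to show what happens; it is not gensim's code, and the function name is hypothetical.

```python
import os
import tempfile


def glove_to_word2vec_text(glove_path, w2v_path):
    """Copy a GloVe text file to w2v_path, prepending the
    '<vocab_size> <dimension>' header that the Word2Vec text format needs.
    Returns (vocab_size, dimension)."""
    with open(glove_path, encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f if line.strip()]
    if not lines:
        raise ValueError("empty GloVe file")
    # Each line is: word value_1 value_2 ... value_dim
    dim = len(lines[0].split(" ")) - 1
    with open(w2v_path, "w", encoding="utf-8") as f:
        f.write("{} {}\n".format(len(lines), dim))
        for line in lines:
            f.write(line + "\n")
    return len(lines), dim


# Tiny made-up example file, just to exercise the function.
workdir = tempfile.mkdtemp()
glove_file = os.path.join(workdir, "toy_glove.txt")
w2v_file = os.path.join(workdir, "toy_w2v.txt")
with open(glove_file, "w", encoding="utf-8") as f:
    f.write("cat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")
print(glove_to_word2vec_text(glove_file, w2v_file))
```

For real GloVe files, which are large, the gensim script is the better choice since it avoids reading the whole file into memory the way this sketch does.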

For #3, #4, and #5, the point is to remove all nltk dependencies, because very few nltk functionalities were used, and nltk is slow. For the Porter stemmer, there is a lightweight library, stemming, that performs the task perfectly. For tokenization, the tokenizer in spaCy is significantly faster than the one in nltk, as shown in this Jupyter Notebook. We can do a simple test here, by first importing:

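import time
import shorttext

# Load the NIH RePORT example data: a dictionary mapping each label to a list of texts
nihdata = shorttext.data.nihreports()
# Concatenate all the texts into one long string for the timing test
nihtext = ' '.join(' '.join(texts) for texts in nihdata.values())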


Then find the time of using the tokenizer in nltk:

from nltk import word_tokenize

# Time the nltk tokenizer on the concatenated text
nltkt0 = time.time()
tokens = word_tokenize(nihtext)
nltkt1 = time.time()
print(nltkt1 - nltkt0, 'sec')   # output: 0.0224239826202 sec


On the other hand, using spaCy gives:

import spacy