The package shorttext has received attention for the past two months. A new release is released yesterday for the following updates:

1. Removal attempts of loading GloVe model, as it can be run using gensim script;
2. Confirmed compatibility of the package with Tensorflow;
3. Use of spacy for tokenization, instead of nltk;
4. Use of stemming for Porter stemmer, instead of nltk;
5. Removal of nltk dependencies;
6. Simplifying the directory and module structures;
7. Module packages updated.

For #1, it actually removed a bug in the previous release. Instead, the users should convert the GloVe models into Word2Vec using the script provided by gensim.

For #3, #4, and #5, it is basically removing any nltk dependencies, because very few functionalities of nltk was used, and it is slow. For Porter stemmer, there is a light-weighted library stemming that performs the task perfectly. For tokenization, the tokenizer in spaCy is significantly faster than nltk, as shown in this Jupyter Notebook. We can do a simple test here, by first importing:

import time
import shorttext


nihdata = shorttext.data.nihreports()
nihtext = ' '.join(map(lambda item: ' '.join(item[1]), nihdata.items()))


Then find the time of using the tokenizer in nltk:

from nltk import word_tokenize

nltkt0 = time.time()
tokens = word_tokenize(nihtext)
nltkt1 = time.time()
print nltkt1-nltkt0, ' sec'   # output: 0.0224239826202 sec


On the other hand, using spaCy gives:

import spacy

spt0 = time.time()
doc = nlp(unicode(nihtext))
tokens1 = [token for token in doc]
tokens1 = map(str, tokens1)
spt1 = time.time()

print spt1-spt0, ' sec'   # output: 0.00799107551575 sec


Clearly, spaCy is three times faster.

#6 indicates a simplification of package structure. Previously, for example, the neural network framework was in shorttext.classifiers.embed.nnlib.frameworks, but now it is shorttext.classifiers.frameworks. But the old package structure is kept for backward compatibility.

A while ago, Mehta and Schwab drew a connection between Restricted Boltzmann Machine (RBM), a type of deep learning algorithm, and renormalization group (RG), a theoretical tool in physics applied on critical phenomena. [Mehta & Schwab, 2014; see previous entry] Can RG be able to relate to other deep leaning algorithms?

Schwab wrote a paper on a new machine learning algorithm that directly exploit a type of RG in physics: the density matrix renormalization group (DMRG). DMRG is used in condensed matter physics for low-dimensional (d=1 or 2) lattice systems. DMRG was invented by Steve White, using diagonalization of reduced density matrices on each site. [White 1992] However, now it was performed using singular value decomposition for each successive pair of lattice sites.

DMRG is related to quantum entanglement, which is a two-site quantum system, and the entanglement can be characterized by any of its reduced density matrix. However, DMRG deals with reduced density matrix of all sites. Traditionally, this kind of many body systems can be represented by the kets:

$|\Psi \rangle = \sum_{\sigma_1 \ldots \sigma_L} c^{\sigma_1} \ldots c^{\sigma_L} |\sigma_1 \ldots \sigma_L \rangle$.

These c‘s are c-numbers. To describe the entanglement of these states but to remain numerically convenient, it is desirable to convert these c-numbers into matrices: [Schollwöck 2013]

$c^{\sigma_1} \ldots c^{\sigma_L} \rightarrow M^{\sigma_1} \ldots M^{\sigma_L}$.

And these are tensor networks. DMRG aims at finding a good description of the states with these tensor networks. These tensor networks have nice graphical representation, as in the appendix of the paper by Stoudenmire and Schwab. The training is also described in their paper elegantly using these tensor network diagrams. Their new algorithm proves to be a good new machine learning algorithm, probably fit for small data but complicated features. This is a direct application of real-space RG in machine learning algorithm. Stoudenmire wrote in Quora about the value of this work:

“In our work… we reached state-of-the-art accuracy for the MNIST dataset without needing extra techniques such as convolutional layers. One exciting aspect of these proposals is that their cost scales at most linearly in the number of training examples, versus quadratically for most kernel methods. Representing parameters by a tensor network gives them a structure that can be analyzed to better understand the model and what it has learned. Also tensor network optimization methods are adaptive, automatically selecting the minimum number of parameters necessary for the optimal solution within a certain tensor network class.” – Miles Stoudenmire, in Quora

There are some extension algorithms from DMRG, such as multiscale entanglement renormalization ansatz (MERA), developed by Vidal and his colleagues. [Vidal 2008]

Steve R. White (adapted from his faculty homepage)

Tensor Diagram of the Training of this New Algorithm. (Take from arXiv:1605.05775)

Ever since Mehta and Schwab laid out the relationship between restricted Boltzmann machines (RBM) and deep learning mathematically (see my previous entry), scientists have been discussing why deep learning works so well. Recently, Henry Lin and Max Tegmark put a preprint on arXiv (arXiv:1609.09225), arguing that deep learning works because it captures a few essential physical laws and properties. Tegmark is a cosmologist.

Physical laws are simple in a way that a few properties, such as locality, symmetry, hierarchy etc., lead to large-scale, universal, and often complex phenomena. A lot of machine learning algorithms, including deep learning algorithms, have deep relations with formalisms outlined in statistical mechanics.

A lot of machine learning algorithms are basically probability theory. They outlined a few types of algorithms that seek various types of probabilities. They related the probabilities to Hamiltonians in many-body systems.

They argued why neural networks can approximate functions (polynomials) so well, giving a simple neural network performing multiplication. With central limit theorem or Jaynes’ arguments (see my previous entry), a lot of multiplications, they said, can be approximated by low-order polynomial Hamiltonian. This is like a lot of many-body systems that can be approximated by 4-th order Landau-Ginzburg-Wilson (LGW) functional.

Properties such as locality reduces the number of hyper-parameters needed because it restricts to interactions among close proximities. Symmetry further reduces it, and also computational complexities. Symmetry and second order phase transition make scaling hypothesis possible, leading to the use of the tools such as renormalization group (RG). As many people have been arguing, deep learning resembles RG because it filters out unnecessary information and maps out the crucial features. Tegmark use classifying cats vs. dogs as an example, as in retrieving temperatures of a many-body systems using RG procedure. They gave a counter-example to Schwab’s paper with the probabilities cannot be preserved by RG procedure, but while it is sound, but it is not the point of the RG procedure anyway.

They also discussed about the no-flattening theorems for neural networks.

Embedding has been hot in recent years partly due to the success of Word2Vec, (see demo in my previous entry) although the idea has been around in academia for more than a decade. The idea is to transform a vector of integers into continuous, or embedded, representations. Keras, a Python package that implements neural network models (including the ANN, RNN, CNN etc.) by wrapping Theano or TensorFlow, implemented it, as shown in the example below (which converts a vector of 200 features into a continuous vector of 10):

from keras.layers import Embedding
from keras.models import Sequential

# define and compile the embedding model
model = Sequential()
model.compile('rmsprop', 'mse')  # optimizer: rmsprop; loss function: mean-squared error


We can then convert any features from 0 to 199 into vectors of 20, as shown below:

import numpy as np

model.predict(np.array([10, 90, 151]))


It outputs:

array([[[ 0.02915354,  0.03084954, -0.04160764, -0.01752155, -0.00056815,
-0.02512387, -0.02073313, -0.01154278, -0.00389587, -0.04596512]],

[[ 0.02981793, -0.02618774,  0.04137352, -0.04249889,  0.00456919,
0.04393572,  0.04139435,  0.04415271,  0.02636364, -0.04997493]],

[[ 0.00947296, -0.01643104, -0.03241419, -0.01145032,  0.03437041,
0.00386361, -0.03124221, -0.03837727, -0.04804075, -0.01442516]]])


Of course, one must not omit a similar algorithm called GloVe, developed by the Stanford NLP group. Their codes have been wrapped in both Python (package called glove) and R (library called text2vec).

Besides Word2Vec, there are other word embedding algorithms that try to complement Word2Vec, although many of them are more computationally costly. Previously, I introduced LDA2Vec in my previous entry, an algorithm that combines the locality of words and their global distribution in the corpus. And in fact, word embedding algorithms with a similar ideas are also invented by other scientists, as I have introduced in another entry.

However, there are word embedding algorithms coming out. Since most English words carry more than a single sense, different senses of a word might be best represented by different embedded vectors. Incorporating word sense disambiguation, a method called sense2vec has been introduced by Trask, Michalak, and Liu. (arXiv:1511.06388). Matthew Honnibal wrote a nice blog entry demonstrating its use.

There are also other related work, such as wang2vec that is more sensitive to word orders.

Big Bang Theory (Season 2, Episode 5): Euclid Alternative

DMV staff: Application?
Sheldon: I’m actually more or a theorist.

Note: feature image taken from Big Bang Theory (CBS).

The descriptive power of deep learning has bothered a lot of scientists and engineers, despite its powerful applications in data cleaning, natural language processing, playing Go, computer vision etc. A while ago, as stated in my previous blog entry, Mehta and Schwab discussed the mathematical equivalence between renormalization group (RG) and restricted Boltzmann machines (RBM), a type of deep learning algorithm. [Mehta & Schwab, 2014] I think it is insightful in a way that in each round of calculation, irrelevant information is filtered out by diminishing the weight. Each step is sort of like an RG step. However, this work has two weaknesses: 1) it is restricted to one specific type of deep learning, i.e., RBM; 2) it does not provide insight of how to choose the hyperparameters. It offers an insightful explanation, but it is not useful.

A few weeks ago, one of my friends introduced to me the work by Tishby and his colleagues. It does not only provide insights to why deep learning works, but also sheds light on how to choose hyperparameters. It makes use of the concept of information bottleneck (IB). Information bottleneck is a technique in information theory that aims at capturing the relevant information in the input variables x so that the output variable y can be most accurately predicted. A technique derived by Tishby, [Tishby & Pereira, 1999] it is proposed to use in choosing the hyperparameters of deep neural networks (DNN). [Tishby & Zalavsky, 2015] The idea is to get a functional, the DNN itself in this context, that captures the most relevant information in x to output y. So instead of coarse-graining information in each step as in RG, the algorithm is to have the most compact form before it was even trained. It is not only insightful, but sounds practical.

But its practicality needs to be tested over time.

On October 14, 2015, I attended the regular meeting of the DCNLP meetup group, a group on natural language processing (NLP) in Washington, DC area. The talk was titled “Deep Learning for Question Answering“, spoken by Mr. Mohit Iyyer, a Ph.D. student in Department of Computer Science, University of Maryland (my alma mater!). He is a very good speaker.

I have no experience on deep learning at all although I did write a blog post remotely related. I even didn’t start training my first neural network until the next day after the talk. However, Mr. Iyyer explained what recurrent neural network (RNN), recursive neural network, and deep averaging network (DAN) are. This helped me a lot in order to understanding more about the principles of the famous word2vec model (which is something I am going to write about soon!). You can refer to his slides for more details. There are really a lot of talents in College Park, like another expert, Joe Yue Hei Ng, who is exploiting deep learning a lot as well.

The applications are awesome: with external knowledge to factual question answering, reasoning-based question answering, and visual question answering, with increasing order of challenging levels.

Mr. Iyyer and the participants discussed a lot about different packages. Mr. Iyyer uses Theano, a Python package for deep learning, which is good for model building and other analytical work. Some prefer Caffe. Some people, who are Java developers, also use deeplearning4j.

This meetup was a sacred one too, because it is the last time it was held in Stetsons Famous Bar & Grill at U Street, which is going to permanently close on Halloween this year. The group is eagerly looking for a new venue for the upcoming meetup. This meeting was a crowded one. I sincerely thank the organizers, Charlie Greenbacker and Liz Merkhofer, for hosting all these meetings, and Chris Phipps (a linguist from IBM Watson) for recording.

Deep learning, a collection of related neural network algorithms, has been proved successful in certain types of machine learning tasks in computer vision, speech recognition, data cleaning, and natural language processing (NLP). [Mikolov et. al. 2013] However, it was unclear how deep learning can be so successful. It looks like a black box with messy inputs and excellent outputs. So why is it so successful?

A friend of mine showed me this article in the preprint (arXiv:1410.3831) [Mehta & Schwab 2014] last year, which mathematically shows the equivalence of deep learning and renormalization group (RG). RG is a concept in theoretical physics that has been widely applied in different problems, including critical phenomena, self-organized criticality, particle physics, polymer physics, and strongly correlated electronic systems. And now, Mehta and Schwab showed that an explanation to the performance of deep learning is available through RG.

So what is RG? Before RG, Leo Kadanoff, a physics professor in University of Chicago, proposed an idea of coarse-graining in studying many-body problems in 1966. [Kadanoff 1966] In 1972, Kenneth Wilson and Michael Fisher succeeded in applying ɛ-expansion in perturbative RG to explain the critical exponents in Landau-Ginzburg-Wilson (LGW) Hamiltonian. [Wilson & Fisher 1972] This work has been the standard material of graduate physics courses. In 1974, Kenneth Wilson applied RG to explain the Kondo problem, which led to his Nobel Prize in Physics in 1982. [Wilson 1983]

RG assumes a system of scale invariance, which means the system are similar in whatever scale you are seeing. One example is the chaotic system as in Fig. 1. The system looks the same when you zoom in. We call this scale-invariant system self-similar. And physical systems closed to phase transition are self-similar. And if it is self-similar, Kadanoff’s idea of coarse-graining is then applicable, as in Fig. 2. Four spins can be viewed as one spin that “summarizes” the four spins in that block without changing the description of the physical system. This is somewhat like we “zoom out” the picture on Photoshop or Web Browser.

[Taken from [Singh 2014]]

So what’s the point of zooming out? Physicists care about the Helmholtz free energies of physical systems, which are similar to cost functions to the computer scientists and machine learning specialists. Both are to be minimized. However, whatever scale we are viewing at, the energy of the system should be scale-invariant. Therefore, as we zoom out, the system “changes” yet “looks the same” due to self-similarity, but the energy stays the same. The form of the model is unchanged, but the parameters change as the scale changes.

This is important, because this process tells us which parameters are relevant, and which others are irrelevant. Why? Think of it this way: we have an awesome computer to simulate a glass of water that contains 1023 water molecules. To describe the systems, you have all parameters, including the position of molecules, strength of Van der Waals force, orbital angular momentum of each atom, strength of the covalent bonds, velocities of the molecules… You might have 1025 parameters. However, this awesome computer cannot handle such a system with so many parameters. Then you try to coarse-grain the system, and you discard some parameters in each step of coarse-graining. After numerous steps, it turns out that the temperature and the pressure are the only relevant parameters.

RG helps you identify the relevant parameters.

And it is exactly what happened in deep learning. In each convolutional cycle, features that are not important are gradually discarded, and those that are important are kept and enhanced. Indeed, in computer vision and NLP, the data are so noisy that there are a lot of unnecessary information. Deep learning gradually discards these information. As Mehta and Schwab stated, [Mehta & Schwab 2014]

Our results suggests that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.

So what is the point of understanding this? Unlike other machine algorithms, we did not know how it works, which sometimes makes model building very difficult because we have no idea how to adjust parameters. I believe understanding its equivalence to RG helps guide us to build a model that works.

Charles Martin also wrote a blog entry with more demonstration about the equivalence of deep learning and RG. [Martin 2015]

Chatbots are not something new: I encountered them online since 2004, when I started graduate school. You can find some of them under this link: https://sites.google.com/site/webtoolsbox/bots . There is another one called CleverBot: http://www.cleverbot.com/.

However, you can see that they are pretty dumb when you try them out. I have a feeling that those conversational models are Markov models, and the chatbots basically forget what they previously said.

A few days ago, Google put a paper on arXiv preprint that described how they applied neural network conversational models in chatbots. [Vinyals & Le 2015] In short they look for more previous sentences in the conversations. It is certainly inspired by some previous natural language processing (NLP) work from Google, especially Word2Vec that employs skip-gram models. [Mikolov et. al. 2013]

Deep learning is what a lot of big guys are trying now.

I wish to play with this chatbot.

Eminem (taken from web)

When I saw the “standardized” style of writing written in the classic book “Elements of Style” written by William Strunk, I have been wondering if the style of writing can be programmable. And now, with artificial intelligence, people can write automated codes that generate lyrics. In a paper written by Eric Malmi and his collaborators [Malmi, Takala, Toivonen, Raiko, Gionis 2015], the system DopeLearning, which generates rap lyrics to certain complexities, was introduced. It applies two machine learning techniques, namely the RankSVM and the deep neural network. It is fascinating that automated codes can be creative to produce complex artistic things, as the abstract says:

Writing rap lyrics requires both creativity, to construct a meaningful and an interesting story, and lyrical skills, to produce complex rhyme patterns, which are the cornerstone of a good flow.

What does DopeLearning produce? See the example the paper gives:

For a chance at romance I would love to enhance (Big Daddy Kane – The Day You’re Mine)
But everything I love has turned to a tedious task (Jedi Mind Tricks – Black Winter Day)
One day we gonna have to leave our love in the past (Lil Wayne – Marvin’s Room)
I love my fans but no one ever puts a grasp (Eminem – Say Goodbye Hollywood)
I love you momma I love my momma – I love you momma (Snoop Dogg – I Love My Momma)
And I would love to have a thing like you on my team you take care (Ghostface Killah – Paragraphs Of Love)
I love it when it’s sunny Sonny girl you could be my Cher (Common – Make My Day)
I’m in a love affair I can’t share it ain’t fair (Snoop Dogg – Show Me Love)
Haha I’m just playin’ ladies you know I love you. (Eminem – Kill You)
I know my love is true and I know you love me too (Everlast – On The Edge)
Girl I’m down for whatever cause my love is true (Lil Wayne – Sean Kingston I’m At War)
This one goes to my man old dirty one love we be swigging brew (Big Daddy Kane – Entaprizin)
My brother I love you Be encouraged man And just know (Tech N9ne – Need More Angels)
When you done let me know cause my love make you be like WHOA (Missy Elliot – Dog In Heat)
If I can’t do it for the love then do it I won’t (KRS One – Take It To God)
All I know is I love you too much to walk away though (Eminem – Love The Way You Lie)

There are similar work for Chinese Mandopop, using RNN. Chinese readers can refer to this blog post: http://phunters.lofter.com/post/86d56_732209b.