GANs can be used for the word translation problem too. In a recent arXiv preprint (arXiv:1710.04087), adversarial training was used to learn a translation mapping between the word embeddings of two languages, given that there are no parallel data between them. The translation mapping is treated as a generator, while a discriminator tries to distinguish mapped source embeddings from target embeddings. Nearest-neighbor retrieval is then refined with cross-domain similarity local scaling (CSLS). Their experiments include English-Russian and English-Chinese mappings.
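As a sketch of how CSLS works (my own minimal numpy rendition of the formula; the function name is mine), the criterion subtracts from each cosine similarity the average similarity of each word to its k nearest neighbors in the other language, penalizing “hub” words that are close to everything:

```python
import numpy as np

def csls_scores(src_emb, tgt_emb, k=10):
    # CSLS(x, y) = 2*cos(x, y) - r_T(x) - r_S(y), where r_T(x) is the mean
    # cosine similarity of x to its k nearest neighbors among the target
    # embeddings, and r_S(y) is defined likewise on the source side
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = src @ tgt.T                       # cosine similarity matrix
    r_src = np.mean(np.sort(sims, axis=1)[:, -k:], axis=1, keepdims=True)
    r_tgt = np.mean(np.sort(sims, axis=0)[-k:, :], axis=0, keepdims=True)
    return 2 * sims - r_src - r_tgt
```

The translation of a source word is then retrieved as the target word with the highest CSLS score.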

It seems to work. Given that GANs sometimes fail for unknown reasons, it is exciting that this one works.

- “Generative Adversarial Networks,” *Everything About Data Analytics*, WordPress (2017). [WordPress]
- Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio, “Generative Adversarial Networks,” arXiv:1406.2661 (2014). [arXiv]
- Ian Goodfellow, “NIPS 2016 Tutorial: Generative Adversarial Networks,” arXiv:1701.00160 (2017). [arXiv]
- Na Lei, Kehua Su, Li Cui, Shing-Tung Yau, David Xianfeng Gu, “A Geometric View of Optimal Transportation and Generative Model,” arXiv:1710.05488 (2017). [arXiv]
- “On Wasserstein GAN,” *Everything About Data Analytics*, WordPress (2017). [WordPress]
- “Interpretability of Neural Networks,” *Everything About Data Analytics*, WordPress (2017). [WordPress]
- Alexis Conneau, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic Denoyer, Hervé Jégou, “Word Translation Without Parallel Data,” arXiv:1710.04087 (2017). [arXiv]
- “Word Mover’s Distance as a Linear Programming Problem,” *Everything About Data Analytics*, WordPress (2017). [WordPress]
- 罗若天, “Paper Notes: ‘Word Translation Without Parallel Data’ (Unsupervised Word Translation),” *RT’s Paper Notes and Other Miscellany* (2017). [Zhihu] (in Chinese)
- “Word Embedding Algorithms,” *Everything About Data Analytics*, WordPress (2016). [WordPress]

]]>

“A capsule is a group of neurons whose activity vector represents the instantiation parameters of a specific type of entity such as an object or object part.” The nodes of inputs and outputs are vectors, instead of scalars as in traditional neural networks. A cheat sheet comparing traditional neurons and capsules is as follows:

Based on the capsule, the authors suggested a new architecture called CapsNet, built from layers of capsules.
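One concrete ingredient in the paper is the squashing nonlinearity, which rescales a capsule's output vector so that its length lies between 0 and 1 and can be read as a probability, while its direction is preserved. A minimal numpy sketch (the function name is mine):

```python
import numpy as np

def squash(s, eps=1e-9):
    # short vectors shrink toward zero length, long vectors approach
    # unit length; the direction of s is preserved
    sq_norm = np.sum(s**2, axis=-1, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)
```

A vector of length 10, for example, is squashed to length 100/101 ≈ 0.99, while a vector of length 0.1 shrinks to about 0.01.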

Huadong Liao implemented CapsNet with TensorFlow according to the paper. (Refer to his repository.)

- Sara Sabour, Nicholas Frosst, Geoffrey E Hinton, “Dynamic Routing Between Capsules,” arXiv:1710.09829 (2017). [arXiv]
- “A Brief Analysis of Hinton’s Recently Proposed Capsule Plan” (2017). [Zhihu] (in Chinese)
- “What Do You Think of Hinton’s Paper *Dynamic Routing Between Capsules*?” (2017). [Zhihu] (in Chinese)
- Github: naturomics/CapsNet-Tensorflow [Github]
- Nick Bourdakos, “Capsule Networks Are Shaking up AI — Here’s How to Use Them,” Medium (2017). [Medium]

]]>

Mehta and Schwab analytically connected the renormalization group (RG) with one particular type of deep learning network, the restricted Boltzmann machine (RBM). (See their paper and a previous post.) The RBM resembles the Ising model in statistical physics. The weakness of this work is that it explains only one type of deep learning algorithm.
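To make the analogy concrete, here is the RBM energy function in a minimal numpy sketch (the notation is mine, not Mehta and Schwab's): with binary spin-like units it is a classical spin model on a bipartite graph, with the biases playing the role of external fields.

```python
import numpy as np

def rbm_energy(v, h, a, b, W):
    # RBM energy E(v, h) = -a.v - b.h - v.W.h for visible units v and
    # hidden units h; a and b are the biases (external fields), and W
    # couples each visible unit to each hidden unit
    return -np.dot(a, v) - np.dot(b, h) - v @ W @ h
```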

However, this insight gave rise to subsequent work: using the density matrix renormalization group (DMRG), entanglement renormalization (from quantum information), and tensor networks, a new supervised learning algorithm was invented. (See their paper and a previous post.)

Lin and Tegmark were not satisfied with the RG intuition, and pointed out a special case that RG does not explain. However, they argued that neural networks are good approximations to several polynomial and asymptotic behaviors of the physical universe, which is why neural networks work so well in predictive analytics. (See their paper, Lin’s reply on Quora, and a previous post.)

Tishby and his colleagues have been promoting the information bottleneck as a backing theory of deep learning. (See a previous post.) In recent papers such as arXiv:1612.00410, an algorithm using variational inference was devised on top of the information bottleneck.

Recently, Kawaguchi, Kaelbling, and Bengio suggested that “deep model classes have an exponential advantage to represent certain natural target functions when compared to shallow model classes.” (See their paper and a previous post.) They provided their proof using generalization theory. With this, they introduced a new family of regularization methods.

Recently, Lei, Su, Cui, Yau, and Gu offered a geometric view of generative adversarial networks (GAN), and provided a simpler method of training the discriminator and generator through a large class of transportation problems. However, I have yet to fully understand their work, and their experimental results were done on low-dimensional feature spaces. (See their paper.) Their work is very mathematical.

- Pankaj Mehta, David J. Schwab, “An exact mapping between the Variational Renormalization Group and Deep Learning,” arXiv:1410.3831. (2014) [arXiv]
- E. Miles Stoudenmire, David J. Schwab, “Supervised Learning With Quantum-Inspired Tensor Networks,” arXiv:1605.05775 (2016). [arXiv]
- Cédric Bény, “Deep learning and the renormalization group,” arXiv:1301.3124 (2013). [arXiv]
- Charles H. Martin, “On Cheap Learning: Partition Functions and RBMs,” *Machine Learning*, WordPress (2016). [WordPress]
- Henry W. Lin, Max Tegmark, “Why does deep and cheap learning work so well?” arXiv:1608.08225 (2016). [arXiv]
- Alexander A. Alemi, Ian Fischer, Joshua V. Dillon, Kevin Murphy, “Deep Variational Information Bottleneck,” arXiv:1612.00410 (2016). [arXiv]
- Kenji Kawaguchi, Leslie Pack Kaelbling, Yoshua Bengio, “Generalization in Deep Learning,” arXiv:1710.05468 (2017). [arXiv]
- Na Lei, Kehua Su, Li Cui, Shing-Tung Yau, David Xianfeng Gu, “A Geometric View of Optimal Transportation and Generative Model,” arXiv:1710.05488 (2017). [arXiv]

]]>

This paper explains why deep learning can generalize well, despite large capacity and possible algorithmic instability, nonrobustness, and sharp minima, effectively addressing an open problem in the literature. Based on our theoretical insight, this paper also proposes a family of new regularization methods. Its simplest member was empirically shown to improve base models and achieve state-of-the-art performance on MNIST and CIFAR-10 benchmarks. Moreover, this paper presents both data-dependent and data-independent generalization guarantees with improved convergence rates. Our results suggest several new open areas of research.

- Kenji Kawaguchi, Leslie Pack Kaelbling, Yoshua Bengio, “Generalization in Deep Learning,” arXiv:1710.05468 (2017). [arXiv]

]]>

Google published a paper about the big picture of the computational model in TensorFlow:

TensorFlow is a powerful, programmable system for machine learning. This paper aims to provide the basics of a conceptual framework for understanding the behavior of TensorFlow models during training and inference: it describes an operational semantics, of the kind common in the literature on programming languages. More broadly, the paper suggests that a programming-language perspective is fruitful in designing and in explaining systems such as TensorFlow.

Beware that this model is not limited to deep learning.

- Coursera: Deep Learning Specialization. [Coursera]
- TensorFlow. [TensorFlow]
- Martin Abadi, Michael Isard, Derek G. Murray, “A Computational Model in TensorFlow,” MAPL 2017; *Google Research Blog*. [GoogleResearch]

]]>

Traditionally, quantum many-body states are represented by Fock states, which are useful when the excitations of quasi-particles are the concern. But to capture the quantum entanglement between many solitons or particles in statistical systems, it is important not to lose the topological correlations between the states. Restricted Boltzmann machines (RBM) have been used to represent such states, but they have their limitations, which Xun Gao and Lu-Ming Duan stated in their article published in *Nature Communications*:

There exist states, which can be generated by a constant-depth quantum circuit or expressed as PEPS (projected entangled pair states) or ground states of gapped Hamiltonians, but cannot be efficiently represented by any RBM unless the polynomial hierarchy collapses in the computational complexity theory.

PEPS is a generalization of matrix product states (MPS) to higher dimensions. (See this.)

However, Gao and Duan were able to prove that the deep Boltzmann machine (DBM) can close this loophole of the RBM, as stated in their article:

Any quantum state of *n* qubits generated by a quantum circuit of depth *T* can be represented exactly by a sparse DBM with *O*(*nT*) neurons.

(diagram adapted from Gao and Duan’s article)

- Xun Gao, Lu-Ming Duan, “Efficient representation of quantum many-body states with deep neural networks,” *Nature Communications* 8:662 (2017), or arXiv:1701.05039 (2017). [NatureComm] [arXiv]
- Kwan-Yuet Ho, “Sammon Embedding with TensorFlow,” *Everything About Data Analytics*, WordPress (2017). [WordPress]
- Kwan-Yuet Ho, “Word Embedding Algorithms,” *Everything About Data Analytics*, WordPress (2017). [WordPress]
- FastText. [Facebook]
- Kwan-Yuet Ho, “Tensor Networks and Density Matrix Renormalization Group,” *Everything About Data Analytics*, WordPress (2016). [WordPress]

]]>

Recently, Rigetti, a quantum computing startup in the Bay Area, announced that they have opened their cloud server to the public, so that users can experiment with their quantum instruction language, as described in their blog and their White Paper. It is free.

Go to their homepage, http://rigetti.com/, click on “Get Started,” and fill in your information and e-mail address. Then you will be e-mailed the keys to your cloud account. Copy that information to a file `.pyquil_config`, and in your `.bash_profile`, add the line:

`export PYQUIL_CONFIG="$HOME/.pyquil_config"`

More information can be found in their installation tutorial. Then install the Python package `pyquil` by typing in the command line:

`pip install -U pyquil`

Some of you may need root privileges (adding `sudo` in front).

Then we can go ahead and open Python, or iPython, or a Jupyter notebook, to play with it. For the time being, let me play with creating the entangled singlet state, (|10> - |01>)/√2. The corresponding quantum circuit is like this:

First of all, import all necessary libraries:

```python
import numpy as np
from pyquil.quil import Program
import pyquil.api as api
from pyquil.gates import H, X, Z, CNOT
```

You can see that the package includes a lot of quantum gates. First, we need to instantiate a quantum simulator:

```python
# starting the quantum simulator
quantum_simulator = api.SyncConnection()
```

Then we implement the quantum circuit with a “program” as follow:

```python
# generating the singlet state by applying
# 1. Hadamard gate
# 2. Pauli-Z
# 3. CNOT
# 4. NOT
p = Program(H(0), Z(0), CNOT(0, 1), X(1))
wavefunc, _ = quantum_simulator.wavefunction(p)
```

The last line gives the final wavefunction after running the quantum circuit, or “program.” In the ket, the rightmost qubit is qubit 0, the one to its left is qubit 1, and so on. Therefore, the first gate of the program, `H`, the Hadamard gate, acts on qubit 0, i.e., the rightmost qubit. Running a simple print statement:

print wavefunc

gives

(-0.7071067812+0j)|01> + (0.7071067812+0j)|10>

The coefficients are complex, with the imaginary part denoted by `j`. You can extract them as a `numpy` array:

wavefunc.amplitudes

If we want to calculate a metric of entanglement, we can use the Python package `pyqentangle`, which can be installed by running on the console:

`pip install -U pyqentangle`

Import them:

```python
from pyqentangle import schmidt_decomposition
from pyqentangle.schmidt import bipartitepurestate_reduceddensitymatrix
from pyqentangle.metrics import entanglement_entropy, negativity
```

Because `pyqentangle` does not recognize the coefficients in the same way as `pyquil`, but sees each element as a coefficient of the product basis |i>|j>, we need to reshape the final state first, by:

tensorcomp = wavefunc.amplitudes.reshape((2, 2))

Then perform the Schmidt decomposition (in which the Schmidt modes are actually trivial in this example):

```python
# Schmidt decomposition
schmidt_modes = schmidt_decomposition(tensorcomp)
for prob, modeA, modeB in schmidt_modes:
    print prob, ' : ', modeA, ' ', modeB
```

This outputs:

```
0.5  :  [ 0.+0.j  1.+0.j]   [ 1.+0.j  0.+0.j]
0.5  :  [-1.+0.j  0.+0.j]   [ 0.+0.j  1.+0.j]
```

Calculate the entanglement entropy and negativity from its reduced density matrix:

```python
print 'Entanglement entropy = ', entanglement_entropy(bipartitepurestate_reduceddensitymatrix(tensorcomp, 0))
print 'Negativity = ', negativity(bipartitepurestate_reduceddensitymatrix(tensorcomp, 0))
```

which prints:

```
Entanglement entropy =  0.69314718056
Negativity =  -1.11022302463e-16
```
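As a quick sanity check (plain numpy, independent of `pyqentangle`): the Schmidt probabilities of the singlet are (0.5, 0.5), so the entanglement entropy should be exactly ln 2, matching the printed value; the reduced density matrix is positive, so its negativity should vanish up to floating-point noise.

```python
import numpy as np

# Schmidt probabilities of the singlet state
probs = np.array([0.5, 0.5])

# entanglement entropy S = -sum_i p_i ln p_i
entropy = -np.sum(probs * np.log(probs))
print(entropy)   # 0.6931471805599453, i.e., ln 2
```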

The calculation can be found in this thesis.

P.S.: The circuit was drawn using the tool on this website, introduced in Marco Cerezo’s blog post. The corresponding JSON for the circuit is:

{"gate":[], "circuit": [{"type":"h", "time":0, "targets":[0], "controls":[]}, {"type":"z", "time":1, "targets":[0], "controls":[]}, {"type":"x", "time":2, "targets":[1], "controls":[0]}, {"type":"x", "time":3, "targets":[1], "controls":[]}], "qubits":2, "input":[0,0]}

- Kwan-Yuet Ho, “On Quantum Computing,” *Everything About Data Analytics*, WordPress (2016). [WordPress]
- Jacob Biamonte, Peter Wittek, Nicola Pancotti, Patrick Rebentrost, Nathan Wiebe, Seth Lloyd, “Quantum Machine Learning,” *Nature* 549:195-202 (2017). [Nature] [arXiv]
- Rigetti Computing. [Rigetti]
- Madhav Thattai, Will Zeng, “Rigetti Partners with CDL to Drive Quantum Machine Learning,” *Rigetti Computing*, Medium (2017). [Medium]
- Robert S. Smith, Michael J. Curtis, William J. Zeng, “A Practical Quantum Instruction Set Architecture,” arXiv:1608.03355 (2016). [arXiv] (White Paper)
- Homepage of pyQuil. [RTFD]
- Github: rigetticomputing/pyquil. [Github]
- hahakity, “A Guide to Trying Out the Free Cloud Quantum Computer,” 知乎专栏 (2017). [Zhihu] (in Chinese)
- Homepage of PyQEntangle. [RTFD]
- Github: stephenhky/pyqentangle. [Github]
- Kwan-Yuet Ho, “Quantum Entanglement in Continuous Systems,” *BSc Thesis*, Department of Physics, Chinese University of Hong Kong (2004). [ResearchGate]
- Kwan-Yuet Ho, “The Legacy of Entropy,” *Everything About Data Analytics*, WordPress (2015). [WordPress]
- Marco Cerezo, “Tools for Drawing Quantum Circuits,” *Entangled Physics: Quantum Information & Quantum Computation* (2016). [WordPress]

]]>

`shorttext` has a new release: 0.5.4. It can be installed by typing in the command line:

`pip install -U shorttext`

For some people, you may need to install it as root, i.e., adding `sudo` in front of the command. Since version 0.5 (including releases 0.5.1 and 0.5.4), there have been substantial additions of functionality, mostly about comparisons between short phrases without running a supervised or unsupervised machine learning algorithm, calculating their “similarity” with various metrics instead, including:

- the soft Jaccard score (the same kind of fuzzy score, based on edit distance, as in SOCcer),
- the Word Mover’s distance (WMD, described in detail in a previous post), and
- the Jaccard index based on a word-embedding model.

For the soft Jaccard score due to edit distance, we can call it by:

```python
>>> from shorttext.metrics.dynprog import soft_jaccard_score
>>> soft_jaccard_score(['book', 'seller'], ['blok', 'sellers'])   # gives 0.6716417910447762
>>> soft_jaccard_score(['police', 'station'], ['policeman'])      # gives 0.2857142857142858
```

The core of this code was written in C, and interfaced to Python using SWIG.

For the Word Mover’s Distance (WMD), while the source codes are the same as my previous post, it can now be called directly. First, load the modules and the word-embedding model:

```python
>>> from shorttext.metrics.wasserstein import word_mover_distance
>>> from shorttext.utils import load_word2vec_model
>>> wvmodel = load_word2vec_model('/path/to/model_file.bin')
```

And compute the WMD with a single function:

```python
>>> word_mover_distance(['police', 'station'], ['policeman'], wvmodel)                   # gives 3.060708999633789
>>> word_mover_distance(['physician', 'assistant'], ['doctor', 'assistants'], wvmodel)   # gives 2.276337146759033
```

And the Jaccard index due to cosine distance in Word-embedding model can be called like this:

```python
>>> from shorttext.metrics.embedfuzzy import jaccardscore_sents
>>> jaccardscore_sents('doctor', 'physician', wvmodel)                   # gives 0.6401538990056869
>>> jaccardscore_sents('chief executive', 'computer cluster', wvmodel)   # gives 0.0022515450768836143
>>> jaccardscore_sents('topological data', 'data of topology', wvmodel)  # gives 0.67588977344632573
```

Most new functions can be found in this tutorial.

And some minor bugs have been fixed.

- PyPI: shorttext. [PyPI]
- Homepage of shorttext. [RTFD]
- Tutorial of Metrics in Shorttext. [RTFD]
- “Short Text Categorization using Deep Neural Networks and Word-Embedding Models,” *Everything About Data Analytics*, WordPress (2016). [WordPress]
- “Python Package for Short Text Mining,” *Everything About Data Analytics*, WordPress (2016). [WordPress]

]]>

The formulation of WMD is beautiful. Consider the matrix of embedded word vectors $X \in \mathbb{R}^{d \times n}$, where $d$ is the dimension of the embeddings, and $n$ is the number of words. For each phrase, there is a normalized BOW vector $d \in \mathbb{R}^n$, with $d_i = \frac{c_i}{\sum_j c_j}$, where the $c_i$'s denote the word-token counts. The distance between words is the Euclidean distance between their embedded word vectors, denoted by $c(i, j) = \| x_i - x_j \|_2$, where $i$ and $j$ denote word tokens. The document distance, which is the WMD here, is defined by $\sum_{i,j} T_{ij} \, c(i, j)$, where $T$ is a non-negative $n \times n$ matrix. Each element $T_{ij}$ denotes how much of word $i$ in the first document (with weight $d_i$) travels to word $j$ in the new document (with weight $d'_j$).

Then the problem becomes the minimization of the document distance, or the WMD, and is formulated as:

$$\min_{T \geq 0} \sum_{i,j=1}^{n} T_{ij} \, c(i, j),$$

given the constraints:

$$\sum_{j=1}^{n} T_{ij} = d_i \quad \text{for all } i, \text{ and}$$

$$\sum_{i=1}^{n} T_{ij} = d'_j \quad \text{for all } j.$$

This is essentially a simplified case of the Earth Mover’s distance (EMD), or the Wasserstein distance. (See the review by Gibbs and Su.)
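Before turning to PuLP, the same transportation LP can be sketched generically with `scipy.optimize.linprog` (an illustration with made-up distributions, not the word-based code of this post):

```python
import numpy as np
from scipy.optimize import linprog

def transport_lp(d1, d2, cost):
    # minimize sum_ij T_ij * cost_ij subject to the row sums of T
    # equaling d1, the column sums equaling d2, and T >= 0
    m, n = len(d1), len(d2)
    A_eq, b_eq = [], []
    for i in range(m):                 # all mass of source i is shipped out
        row = np.zeros((m, n)); row[i, :] = 1.0
        A_eq.append(row.ravel()); b_eq.append(d1[i])
    for j in range(n):                 # sink j receives exactly its mass
        col = np.zeros((m, n)); col[:, j] = 1.0
        A_eq.append(col.ravel()); b_eq.append(d2[j])
    res = linprog(np.asarray(cost, dtype=float).ravel(),
                  A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None))
    return res.fun

# shipping masses 0.5 and 0.5 to one sink at unit costs 1 and 3:
print(transport_lp([0.5, 0.5], [1.0], [[1.0], [3.0]]))   # 2.0
```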

The WMD is essentially a linear optimization problem. There are many optimization packages on the market, and my stance is that, among the common ones, no package is clearly superior to the others. In my job, I happened to handle a missing-data problem that turned into a non-linear optimization problem with linear constraints, and I chose limSolve after shopping around. But I actually like a lot of other packages too. For the WMD problem, I first tried out cvxopt, which should solve the exact same problem, but the indexing is hard to maintain. Because I am dealing with words, it is good to have a direct hash map, or a dictionary; I could use the Dictionary class in gensim. But I later found that I should use PuLP, as it allows indexing with words as a hash map (a dict in Python), and since WMD is a linear programming problem, PuLP is a perfect choice, considering code efficiency.

An example of using PuLP is the British 1997 UG Exam, as in the first problem of this link; the demonstration can be found in the Jupyter Notebook.

Load the necessary packages:

```python
from itertools import product
from collections import defaultdict

import numpy as np
from scipy.spatial.distance import euclidean
import pulp
import gensim
```

Then define the function that gives the normalized BOW document vectors:

```python
def tokens_to_fracdict(tokens):
    cntdict = defaultdict(lambda: 0)
    for token in tokens:
        cntdict[token] += 1
    totalcnt = sum(cntdict.values())
    return {token: float(cnt)/totalcnt for token, cnt in cntdict.items()}
```

Then implement the core calculation. Note that PuLP is actually a symbolic computing package. This function returns a `pulp.LpProblem` instance:

```python
def word_mover_distance_probspec(first_sent_tokens, second_sent_tokens, wvmodel, lpFile=None):
    all_tokens = list(set(first_sent_tokens+second_sent_tokens))
    wordvecs = {token: wvmodel[token] for token in all_tokens}

    first_sent_buckets = tokens_to_fracdict(first_sent_tokens)
    second_sent_buckets = tokens_to_fracdict(second_sent_tokens)

    T = pulp.LpVariable.dicts('T_matrix', list(product(all_tokens, all_tokens)), lowBound=0)
    prob = pulp.LpProblem('WMD', sense=pulp.LpMinimize)
    prob += pulp.lpSum([T[token1, token2]*euclidean(wordvecs[token1], wordvecs[token2])
                        for token1, token2 in product(all_tokens, all_tokens)])
    for token2 in second_sent_buckets:
        prob += pulp.lpSum([T[token1, token2] for token1 in first_sent_buckets]) == second_sent_buckets[token2]
    for token1 in first_sent_buckets:
        prob += pulp.lpSum([T[token1, token2] for token2 in second_sent_buckets]) == first_sent_buckets[token1]
    if lpFile != None:
        prob.writeLP(lpFile)

    prob.solve()
    return prob
```

To extract the value, just run `pulp.value(prob.objective)`. We use the Google Word2Vec model; refer to the matrices in the Jupyter Notebook. Running this on a few examples:

- document1 = {President, talk, Chicago}; document2 = {President, speech, Illinois}; WMD = 2.88587622936
- document1 = {physician, assistant}; document2 = {doctor}; WMD = 2.8760048151
- document1 = {physician, assistant}; document2 = {doctor, assistant}; WMD = 1.00465738773 (compare with example 2!)
- document1 = {doctors, assistant}; document2 = {doctor, assistant}; WMD = 1.02825379372 (compare with example 3!)
- document1 = {doctor, assistant}; document2 = {doctor, assistant}; WMD = 0.0 (totally identical; compare with example 3!)

There are more examples in the notebook.

WMD is a good metric for comparing two documents or sentences, as it captures the semantic meanings of the words. It is more powerful than the BOW model because it captures similarity of meaning; it is more powerful than the cosine distance between average word vectors because the transfer of meaning from words in one document to words in the other is considered. But it is not immune to the problem of misspelling.

This algorithm works well for short texts. However, when the documents become large, this formulation becomes computationally expensive. The authors actually suggested a few modifications, such as relaxing the constraints, and using word centroid distances.
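The word centroid distance (WCD), for instance, replaces the whole LP by the distance between the documents' mean word vectors, which serves as a cheap lower bound on WMD. A minimal sketch, assuming `wvmodel` maps a token to its numpy vector and weighting every token equally:

```python
import numpy as np

def word_centroid_distance(tokens1, tokens2, wvmodel):
    # Euclidean distance between the centroids of the two documents'
    # word vectors; fast, and never larger than the full WMD
    c1 = np.mean([wvmodel[t] for t in tokens1], axis=0)
    c2 = np.mean([wvmodel[t] for t in tokens2], axis=0)
    return float(np.linalg.norm(c1 - c2))
```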

Example codes can be found in my Github repository: stephenhky/PyWMD.

- Matt Kusner, Yu Sun, Nicholas Kolkin, Kilian Weinberger, “From Word Embeddings To Document Distances,” *Proceedings of the 32nd International Conference on Machine Learning*, PMLR 37:957-966 (2015). [PMLR]
- Github: mkusner/wmd. [Github]
- Kwan-Yuet Ho, “Toying with Word2Vec,” *Everything About Data Analytics*, WordPress (2015). [WordPress]
- Kwan-Yuet Ho, “On Wasserstein GAN,” *Everything About Data Analytics*, WordPress (2017). [WordPress]
- Martin Arjovsky, Soumith Chintala, Léon Bottou, “Wasserstein GAN,” arXiv:1701.07875 (2017). [arXiv]
- Alison L. Gibbs, Francis Edward Su, “On Choosing and Bounding Probability Metrics,” arXiv:math/0209021 (2002) [arXiv]
- cvxopt: Python Software for Convex Optimization. [HTML]
- gensim: Topic Modeling for Humans. [HTML]
- PuLP: Optimization for Python. [PythonHosted]
- Demonstration of PuLP: Github: stephenhky/PyWMD. [Jupyter]
- Implementation of WMD: Github: stephenhky/PyWMD. [Jupyter]
- Github: stephenhky/PyWMD. [Github]

Feature image adapted from the original paper by Kusner *et al.*

]]>

On Aug 9, 2017, Data Science DC held an event titled “Fake News as a Data Science Challenge,” presented by Professor Jen Golbeck from the University of Maryland. It was an interesting talk.

Fake news itself is a big problem. It has philosophical, social, political, and psychological aspects, but Prof. Golbeck focused on its data science aspect. To make it a computational problem, a clear and succinct definition of “fake news” has to be in place, but that is already challenging. Some “fake news” is satire, sarcasm, or jokes (like The Onion). Some misinformation is shared through Twitter or Facebook without any deceptive purpose. Drawing a line is therefore difficult. But the indisputable part is that we want to fight against news with *malicious intent*.

To fight fake news, as Prof. Golbeck has pointed out, there are three main tasks:

- detecting the content;
- detecting the source; and
- modifying the intent.

Statistical tools can be exploited too. She talked about Benford’s law, which states that, in naturally occurring systems, the frequencies of numbers’ first digits are not evenly distributed. Anomalies in this distribution in some news sources can be used as a first step of fraud detection. (Read her paper.)
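Benford's law predicts the first-digit probabilities P(d) = log10(1 + 1/d), so about 30.1% of leading digits should be 1 and only about 4.6% should be 9. A short numpy sketch for comparing a sample against that baseline (the helper names are mine; plain decimal notation is assumed):

```python
import numpy as np

def benford_expected():
    # Benford's first-digit probabilities P(d) = log10(1 + 1/d), d = 1..9
    digits = np.arange(1, 10)
    return np.log10(1.0 + 1.0 / digits)

def first_digit_freqs(numbers):
    # observed first-digit frequencies of positive numbers written in
    # plain decimal notation (no scientific notation)
    digits = [int(str(abs(x)).lstrip('0.')[0]) for x in numbers]
    return np.array([digits.count(d) / float(len(digits)) for d in range(1, 10)])

print(benford_expected()[0])   # 0.3010..., the expected share of leading 1s
```

A large deviation between the observed and expected frequencies is the anomaly signal mentioned above.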

There are also efforts, the Fake News Challenge for example, in building corpora of fake news for further machine learning model building.

However, I am not sure fighting fake news is enough. Many Americans are concerned not simply by the prevalence of fake news, but also by its narrative, because of our ideological biases. Sometimes we are not satisfied because we think the news is not “neutral” enough, or it does not fit our worldview.

The slides can be found here, and the video of the talk can be found here.

- “Fake News as a Data Science Challenge,” Data Science DC (Aug 9, 2017). [Meetup] [slides on Google Drive] [Video on Facebook]
- Jennifer Golbeck. [HTML]
- Benford’s Law. [Wikipedia]
- Jennifer Golbeck, “Benford’s Law Applies to Online Social Networks,” *PLoS ONE* 10.8: e0135169 (2015). [PLoS]
- Fake News Challenge. [HTML]

Featured image taken from http://www.livingroomconversations.org/fake_news

]]>