If you have been taking Andrew Ng’s deeplearning.ai course on Coursera, you will have learned in Course 1 about computational graphs, and how backpropagation computes derivatives on such a graph. This is in fact the basis of TensorFlow, a Python package commonly used in deep learning. Because it is based on the graph model of computation, we can see it as a “programming language.”

Google published a paper giving the big picture of the computational model in TensorFlow:

TensorFlow is a powerful, programmable system for machine learning. This paper aims to provide the basics of a conceptual framework for understanding the behavior of TensorFlow models during training and inference: it describes an operational semantics, of the kind common in the literature on programming languages. More broadly, the paper suggests that a programming-language perspective is fruitful in designing and in explaining systems such as TensorFlow.

Beware that this model is not limited to deep learning.

There are many embedding algorithms for representations. Sammon embedding is one of the oldest, and for word embeddings we have Word2Vec, GloVe, FastText, etc. Embeddings are useful for dimensionality reduction.

Traditionally, quantum many-body states are represented by Fock states, which are useful when the excitations of quasi-particles are the concern. But to capture the quantum entanglement between many solitons or particles in a statistical system, it is important not to lose the topological correlation between the states. Restricted Boltzmann machines (RBM) have been used to represent such states, but they have their limitations, as Xun Gao and Lu-Ming Duan stated in their article published in Nature Communications:

There exist states, which can be generated by a constant-depth quantum circuit or expressed as PEPS (projected entangled pair states) or ground states of gapped Hamiltonians, but cannot be efficiently represented by any RBM unless the polynomial hierarchy collapses in the computational complexity theory.

PEPS is a generalization of matrix product states (MPS) to higher dimensions. (See this.)

However, Gao and Duan were able to prove that deep Boltzmann machines (DBM) can close the gap left by RBM, as stated in their article:

Any quantum state of n qubits generated by a quantum circuit of depth T can be represented exactly by a sparse DBM with O(nT) neurons.

(diagram adapted from Gao and Duan’s article)

Quantum computing has been picking up momentum, and many startups and scholars are discussing quantum machine learning. A basic knowledge of two-level quantum computation is worth acquiring.

Recently, Rigetti, a quantum computing startup in the Bay Area, announced that it has opened its cloud server to the public, so that users can simulate programs written in its quantum instruction language, as described in their blog and their White Paper. It is free.

Go to their homepage, http://rigetti.com/, click on “Get Started,” and fill in your information and e-mail address. You will then be e-mailed the keys for your cloud account. Copy the information to a file .pyquil_config, and in your .bash_profile, add the line

export PYQUIL_CONFIG="$HOME/.pyquil_config"

More information can be found in their Installation tutorial. Then install the Python package pyquil, by typing in the command line:

pip install -U pyquil

Some of you may need to run this as root (adding sudo in front).

Then we can open Python, IPython, or a Jupyter notebook to play with it. For the time being, let me create an entangled singlet state, $\frac{1}{\sqrt{2}} (|01\rangle - |10\rangle )$. The corresponding quantum circuit is like this:

First of all, import all necessary libraries:

import numpy as np

from pyquil.quil import Program
import pyquil.api as api
from pyquil.gates import H, X, Z, CNOT


You can see that the package includes a lot of quantum gates. First, we need to instantiate a quantum simulator:

# starting the quantum simulator
quantum_simulator = api.SyncConnection()


Then we implement the quantum circuit with a “program” as follows:

# generating singlet state
# 1. Hadamard gate
# 2. Pauli-Z
# 3. CNOT
# 4. NOT
p = Program(H(0), Z(0), CNOT(0, 1), X(1))
wavefunc, _ = quantum_simulator.wavefunction(p)


The last line gives the final wavefunction after running the quantum circuit, or “program.” In the ket, the rightmost qubit is qubit 0, the one to its left is qubit 1, and so on. Therefore, in the first step of the program, H, the Hadamard gate, acts on qubit 0, i.e., the rightmost qubit. Running a simple print statement:

print wavefunc


gives

(-0.7071067812+0j)|01> + (0.7071067812+0j)|10>

The coefficients are complex, and the imaginary unit is denoted by j. You can extract them as a numpy array:

wavefunc.amplitudes
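
If you do not have access to the cloud simulator, the same circuit can be cross-checked with plain numpy. The following is only a sketch of such a check, not how pyquil works internally. The helper names (on_qubit0, on_qubit1, CNOT_01) are my own, and it assumes the same convention as above, i.e., qubit 0 is the rightmost bit of the ket.

```python
import numpy as np

# single-qubit gates as 2x2 matrices
I2 = np.eye(2)
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
X = np.array([[0, 1], [1, 0]])
Z = np.array([[1, 0], [0, -1]])

# Basis ordering |q1 q0>: index = 2*q1 + q0, so a gate acting on
# qubit 0 is the right-hand factor of the Kronecker product.
def on_qubit0(gate):
    return np.kron(I2, gate)

def on_qubit1(gate):
    return np.kron(gate, I2)

# CNOT with control qubit 0 and target qubit 1: swaps |01> and |11>
CNOT_01 = np.array([[1, 0, 0, 0],
                    [0, 0, 0, 1],
                    [0, 0, 1, 0],
                    [0, 1, 0, 0]])

# start from |00> and apply the gates in program order
state = np.array([1.0, 0.0, 0.0, 0.0])
for gate in (on_qubit0(H), on_qubit0(Z), CNOT_01, on_qubit1(X)):
    state = gate @ state

# amplitudes of |00>, |01>, |10>, |11>
print(state)
```

The amplitudes of |01> and |10> come out as -1/√2 and +1/√2, in agreement with the wavefunction printed above.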


If we want to calculate measures of entanglement, we can use the Python package pyqentangle, which can be installed by running on the console:

pip install -U pyqentangle

Import them:

from pyqentangle import schmidt_decomposition
from pyqentangle.schmidt import bipartitepurestate_reduceddensitymatrix
from pyqentangle.metrics import entanglement_entropy, negativity


Because pyqentangle does not interpret the coefficients in the same way as pyquil, but sees each element as the coefficient of $|j i \rangle$, we need to reshape the final state first:

tensorcomp = wavefunc.amplitudes.reshape((2, 2))


Then perform the Schmidt decomposition (the Schmidt modes are actually trivial in this example):

# Schmidt decomposition
schmidt_modes = schmidt_decomposition(tensorcomp)
for prob, modeA, modeB in schmidt_modes:
    print prob, ' : ', modeA, ' ', modeB


This outputs:

0.5  :  [ 0.+0.j  1.+0.j]   [ 1.+0.j  0.+0.j]
0.5  :  [-1.+0.j  0.+0.j]   [ 0.+0.j  1.+0.j]

Calculate the entanglement entropy and negativity from its reduced density matrix:

print 'Entanglement entropy = ', entanglement_entropy(bipartitepurestate_reduceddensitymatrix(tensorcomp, 0))
print 'Negativity = ', negativity(bipartitepurestate_reduceddensitymatrix(tensorcomp, 0))


which prints:

Entanglement entropy =  0.69314718056
Negativity =  -1.11022302463e-16
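
As a sanity check on the entropy, the Schmidt coefficients of a bipartite pure state are the singular values of its coefficient matrix, so the same number can be reproduced with numpy alone. This is a sketch under that assumption, not pyqentangle’s own implementation; the matrix below is the reshaped singlet wavefunction typed in by hand.

```python
import numpy as np

# coefficient matrix of the singlet state, indexed as [q1, q0]
tensorcomp = np.array([[0.0, -1.0],
                       [1.0, 0.0]]) / np.sqrt(2)

# singular values are the Schmidt coefficients; their squares are the
# eigenvalues of either reduced density matrix
schmidt_coefs = np.linalg.svd(tensorcomp, compute_uv=False)
probs = schmidt_coefs ** 2

# von Neumann entropy in nats (natural logarithm)
entropy = -np.sum(probs * np.log(probs))
print(probs)
print(entropy)
```

Both probabilities are 1/2, so the entropy is ln 2 ≈ 0.693, matching the value printed above.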

The calculation can be found in this thesis.

P.S.: The circuit was drawn using the tool on this website, introduced in Marco Cezero’s blog post. The corresponding JSON for the circuit is:

{"gate":[], "circuit": [{"type":"h", "time":0, "targets":[0], "controls":[]}, {"type":"z", "time":1, "targets":[0], "controls":[]}, {"type":"x", "time":2, "targets":[1], "controls":[0]}, {"type":"x", "time":3, "targets":[1], "controls":[]}], "qubits":2, "input":[0,0]}

The Python package for text mining shorttext has a new release: 0.5.4. It can be installed by typing in the command line:

pip install -U shorttext

You may need to install it as root, i.e., by adding sudo in front of the command. Since version 0.5 (including releases 0.5.1 and 0.5.4), there has been a substantial addition of functionality, mostly for comparing short phrases without running a supervised or unsupervised machine learning algorithm, by calculating the “similarity” with various metrics, including:

• soft Jaccard score (the same kind of fuzzy scores based on edit distance in SOCcer),
• Word Mover’s distance (WMD, described in detail in a previous post), and
• Jaccard index due to word-embedding model.

For the soft Jaccard score due to edit distance, we can call it by:

>>> from shorttext.metrics.dynprog import soft_jaccard_score
>>> soft_jaccard_score(['book', 'seller'], ['blok', 'sellers'])     # gives 0.6716417910447762
>>> soft_jaccard_score(['police', 'station'], ['policeman'])        # gives 0.2857142857142858

The core of this code was written in C, and interfaced to Python using SWIG.

For the Word Mover’s Distance (WMD), the source code is the same as in my previous post, but it can now be called directly. First, load the modules and the word-embedding model:

>>> from shorttext.utils import load_word2vec_model
>>> from shorttext.metrics.wasserstein import word_mover_distance
>>> wvmodel = load_word2vec_model('/path/to/model_file.bin')

And compute the WMD with a single function:

>>> word_mover_distance(['police', 'station'], ['policeman'], wvmodel)                      # gives 3.060708999633789
>>> word_mover_distance(['physician', 'assistant'], ['doctor', 'assistants'], wvmodel)      # gives 2.276337146759033

And the Jaccard index based on cosine distance in a word-embedding model can be called like this:

>>> from shorttext.metrics.embedfuzzy import jaccardscore_sents
>>> jaccardscore_sents('doctor', 'physician', wvmodel)   # gives 0.6401538990056869
>>> jaccardscore_sents('chief executive', 'computer cluster', wvmodel)   # gives 0.0022515450768836143
>>> jaccardscore_sents('topological data', 'data of topology', wvmodel)   # gives 0.67588977344632573

Most new functions can be found in this tutorial.

Some minor bugs have also been fixed.

Much has been covered about the use of word-embedding models such as Word2Vec and GloVe. However, how do we measure the similarity between phrases or documents? One natural choice is the cosine similarity, as I have toyed with in a previous post. However, it smooths out the influence of each word. Two years ago, a group at Washington University in St. Louis proposed the Word Mover’s Distance (WMD) in a PMLR paper. It captures the relations between words not simply by distance, but also by the “transportation” from one phrase to another conveyed by each word. WMD can be seen as a special case of the Earth Mover’s Distance (EMD), or Wasserstein distance, the one people talk about in Wasserstein GAN. It is better than the bag-of-words (BOW) model in that the word vectors capture the semantic similarities between words.

Word Mover’s Distance (WMD)

The formulation of WMD is beautiful. Consider the embedded word vectors $\mathbf{X} \in R^{d \times n}$, where $d$ is the dimension of the embeddings, and $n$ is the number of words. For each phrase, there is a normalized BOW vector $\mathbf{d} \in R^n$, with $d_i = \frac{c_i}{\sum_j c_j}$, where $c_i$ is the count of word token $i$ in the phrase. The distance between words is the Euclidean distance between their embedded word vectors, denoted by $c(i, j) = || \mathbf{x}_i - \mathbf{x}_j ||_2$, where $i$ and $j$ denote word tokens. The document distance, which is the WMD here, is defined by $\sum_{i, j} \mathbf{T}_{i j} c(i, j)$, where $\mathbf{T}$ is an $n \times n$ matrix. Each element $\mathbf{T}_{ij} \geq 0$ denotes how much of word $i$ in the first document (denoted by $\mathbf{d}$) travels to word $j$ in the new document (denoted by $\mathbf{d}'$).

Then the problem becomes the minimization of the document distance, or the WMD, and is formulated as:

$\text{min}_{\mathbf{T} \geq 0} \sum_{i, j=1}^n \mathbf{T}_{ij} c(i, j)$,

given the constraints:

$\sum_{j=1}^n \mathbf{T}_{ij} = d_i$, and

$\sum_{i=1}^n \mathbf{T}_{ij} = d_j'$.

This is essentially a simplified case of the Earth Mover’s distance (EMD), or the Wasserstein distance. (See the review by Gibbs and Su.)
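
To make the formulation concrete, here is a minimal sketch that solves the linear program above for a toy case with scipy.optimize.linprog. The three 2-D “word vectors” and the two BOW vectors are made up for illustration, and the word vectors are stored one per row rather than one per column. This is not the implementation discussed in this post.

```python
import numpy as np
from scipy.optimize import linprog

# made-up 2-D "embeddings" for three word tokens (illustration only)
X = np.array([[0.0, 0.0],    # token 0
              [1.0, 0.0],    # token 1
              [0.0, 2.0]])   # token 2
n = X.shape[0]

d1 = np.array([0.5, 0.5, 0.0])   # normalized BOW of the first document
d2 = np.array([0.0, 0.5, 0.5])   # normalized BOW of the second document

# c(i, j): Euclidean distance between word vectors i and j
cost = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

# equality constraints on the flattened T (row-major, index i*n + j)
A_eq = np.zeros((2 * n, n * n))
for k in range(n):
    A_eq[k, k * n:(k + 1) * n] = 1.0   # sum_j T_kj = d_k
    A_eq[n + k, k::n] = 1.0            # sum_i T_ik = d'_k
b_eq = np.concatenate([d1, d2])

# minimize sum_ij T_ij c(i, j) subject to T >= 0 and the marginals
res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
wmd = res.fun
T = res.x.reshape(n, n)
print(wmd)
```

In this toy case the optimal plan leaves the mass of token 1 in place and moves the mass of token 0 to token 2, so the distance is 0.5 × 2 = 1.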

Using PuLP

The WMD is essentially a linear optimization problem. There are many optimization packages on the market, and my stance is that, among the common ones, no package is superior to the others. In my job, I happened to handle a missing data problem, which turned into a non-linear optimization problem with linear constraints, and I chose limSolve after shopping around. But I actually like a lot of other packages too. For the WMD problem, I first tried cvxopt, which should solve the exact same problem, but the indexing is hard to maintain. Because I am dealing with words, it is good to have a direct hash map, or a dictionary. I could use the Dictionary class in gensim. But I later found out that I should use PuLP, as it allows indices with words as a hash map (dict in Python), and since WMD is a linear programming problem, PuLP is a perfect choice, considering code efficiency.

An example of using PuLP can be demonstrated by the British 1997 UG Exam, as in the first problem of this link, with the Jupyter Notebook demonstrating this.

Implementation of WMD using PuLP

The demonstration can be found in the Jupyter Notebook.

from itertools import product
from collections import defaultdict

import numpy as np
from scipy.spatial.distance import euclidean
import pulp
import gensim

Then define the function that gives the BOW document vectors:

def tokens_to_fracdict(tokens):
    cntdict = defaultdict(lambda : 0)
    for token in tokens:
        cntdict[token] += 1
    totalcnt = sum(cntdict.values())
    return {token: float(cnt)/totalcnt for token, cnt in cntdict.items()}
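
To see what such a function returns, here is an equivalent sketch written with collections.Counter. The name tokens_to_fracdict_counter is my own and this is a restatement for illustration, not the code used in the notebook.

```python
from collections import Counter

def tokens_to_fracdict_counter(tokens):
    # count each token, then divide by the total number of tokens
    counts = Counter(tokens)
    total = sum(counts.values())
    return {token: float(count) / total for token, count in counts.items()}

fracs = tokens_to_fracdict_counter(['doctor', 'assistant', 'doctor', 'nurse'])
print(fracs)
```

Here 'doctor' gets the weight 0.5, and 'assistant' and 'nurse' get 0.25 each, so the weights sum to one.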

Then implement the core calculation. Note that PuLP is actually a symbolic computing package. This function returns a pulp.LpProblem object:

def word_mover_distance_probspec(first_sent_tokens, second_sent_tokens, wvmodel, lpFile=None):
    all_tokens = list(set(first_sent_tokens+second_sent_tokens))
    wordvecs = {token: wvmodel[token] for token in all_tokens}

    first_sent_buckets = tokens_to_fracdict(first_sent_tokens)
    second_sent_buckets = tokens_to_fracdict(second_sent_tokens)

    T = pulp.LpVariable.dicts('T_matrix', list(product(all_tokens, all_tokens)), lowBound=0)

    prob = pulp.LpProblem('WMD', sense=pulp.LpMinimize)
    prob += pulp.lpSum([T[token1, token2]*euclidean(wordvecs[token1], wordvecs[token2])
                        for token1, token2 in product(all_tokens, all_tokens)])
    for token2 in second_sent_buckets:
        prob += pulp.lpSum([T[token1, token2] for token1 in first_sent_buckets])==second_sent_buckets[token2]
    for token1 in first_sent_buckets:
        prob += pulp.lpSum([T[token1, token2] for token2 in second_sent_buckets])==first_sent_buckets[token1]

    if lpFile is not None:
        prob.writeLP(lpFile)

    prob.solve()

    return prob

To extract the value, just run pulp.value(prob.objective)

We use the Google Word2Vec model. Refer to the Jupyter Notebook for the $\mathbf{T}$ matrices. Running this on a few examples:

1. document1 = President, talk, Chicago
   document2 = President, speech, Illinois
   WMD = 2.88587622936
2. document1 = physician, assistant
   document2 = doctor
   WMD = 2.8760048151
3. document1 = physician, assistant
   document2 = doctor, assistant
   WMD = 1.00465738773
   (compare with example 2!)
4. document1 = doctors, assistant
   document2 = doctor, assistant
   WMD = 1.02825379372
   (compare with example 3!)
5. document1 = doctor, assistant
   document2 = doctor, assistant
   WMD = 0.0
   (totally identical; compare with example 3!)

There are more examples in the notebook.

Conclusion

WMD is a good metric for comparing two documents or sentences, as it captures the semantic meanings of the words. It is more powerful than the BOW model because it captures similarities in meaning; it is more powerful than the cosine distance between average word vectors because the transfer of meaning through words from one document to another is considered. But it is not immune to the problem of misspelling.

This algorithm works well for short texts. However, when the documents become large, this formulation becomes computationally expensive. The authors actually suggested a few modifications, such as the removal of constraints, and the word centroid distance.
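
The two cheaper alternatives can be sketched with numpy alone. The word centroid distance compares the weighted mean word vectors of the two documents, and the relaxed version drops one of the two marginal constraints, so that each word simply sends all its mass to the nearest word in the other document. Both are lower bounds of the exact WMD. The toy vectors and the function names below are made up for illustration, with one word vector per row.

```python
import numpy as np

# made-up 2-D word vectors, one row per token (illustration only)
X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 2.0]])
d1 = np.array([0.5, 0.5, 0.0])   # normalized BOW of the first document
d2 = np.array([0.0, 0.5, 0.5])   # normalized BOW of the second document

# pairwise Euclidean distances c(i, j)
cost = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

def word_centroid_distance(X, d1, d2):
    # distance between the weighted mean word vectors
    return np.linalg.norm(X.T.dot(d1) - X.T.dot(d2))

def relaxed_wmd(cost, d1, d2):
    # with one marginal constraint dropped, every source word moves all
    # of its mass to the nearest word present in the other document
    def one_sided(src, dst):
        present = dst > 0
        return np.sum(src * cost[:, present].min(axis=1))
    # take the tighter of the two one-sided lower bounds
    return max(one_sided(d1, d2), one_sided(d2, d1))

wcd = word_centroid_distance(X, d1, d2)
rwmd = relaxed_wmd(cost, d1, d2)
print(wcd)
print(rwmd)
```

Neither quantity needs a linear programming solver, which is what makes them attractive for large documents.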

Example codes can be found in my Github repository: stephenhky/PyWMD.

People have been upset about the prevalence of fake news since the election season last year. It has been a year since the election, but fake news is still around because society is still politically charged. Some tech companies vowed to fight against fake news but, as is easy to imagine, this is a tough task.

On Aug 9, 2017, Data Science DC held an event titled “Fake News as a Data Science Challenge,” with Professor Jen Golbeck from the University of Maryland as the speaker. It was an interesting talk.

Fake news itself is a big problem. It has philosophical, social, political, and psychological aspects, but Prof. Golbeck focused on its data science aspect. To make it a computational problem, a clear and succinct definition of “fake news” has to be present, and that is already challenging. Some “fake news” is an intended pun, sarcasm, or a joke (like The Onion). Some misinformation is shared through Twitter or Facebook without any intent to deceive. It is therefore difficult to draw a line. What is beyond doubt is that we want to fight against news with malicious intent.

To fight fake news, as Prof. Golbeck has pointed out, there are three main tasks:

1. detecting the content;
2. detecting the source; and
3. modifying the intent.

Statistical tools can be exploited too. She talked about Benford’s law, which states that, in naturally occurring systems, the first digits of numbers are not evenly distributed. An anomaly in the distribution for some news can be used as a first step of fraud detection. (Read her paper.)
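
Quantitatively, Benford’s law says that the leading digit $d$ appears with probability $\log_{10}(1 + 1/d)$, so 1 leads about 30% of the time and 9 less than 5%. The sketch below compares this expectation with the leading digits of the powers of 2, a textbook sequence that follows the law. The function names and the choice of sample are my own, not something from the talk.

```python
import math
from collections import Counter

def benford_expected():
    # expected probability of each leading digit 1..9
    return {d: math.log10(1 + 1.0 / d) for d in range(1, 10)}

def first_digit_frequencies(numbers):
    # observed relative frequency of the leading digit of positive integers
    digits = [int(str(n)[0]) for n in numbers]
    counts = Counter(digits)
    total = float(len(digits))
    return {d: counts.get(d, 0) / total for d in range(1, 10)}

expected = benford_expected()
observed = first_digit_frequencies([2 ** k for k in range(200)])

for d in range(1, 10):
    print(d, round(expected[d], 3), round(observed[d], 3))
```

A data set whose observed frequencies deviate strongly from the expected ones is a candidate for a closer look, which is the “first step” mentioned above.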

There are also efforts, the Fake News Challenge for example, to build a corpus of fake news for building machine learning models.

However, I am not sure fighting fake news is enough. Many Americans are concerned not simply by the prevalence of fake news, but also by the way news is narrated, because of our ideological bias. Sometimes we are not satisfied because we think the news is not “neutral” enough, or it does not fit our worldview.

The slides can be found here, and the video of the talk can be found here.

On 07/28/2017, shorttext published its release 0.4.1, with a few important updates. To install it, type the following in the OS X / Linux command line:

pip install -U shorttext

The documentation on PythonHosted.org has been abandoned. It has been migrated to readthedocs.org. (URL: http://shorttext.readthedocs.io/ or http://shorttext.rtfd.io)

Exploiting the Word-Embedding Layer

This update is mainly due to an important update in gensim, motivated by shorttext’s earlier effort in integrating scikit-learn and keras. gensim now also provides a keras layer for Word2Vec models, on the same footing as other neural network, activation function, or dropout layers. Because shorttext has been making use of keras layers for categorization, this advance in gensim makes it a natural step to add an embedding layer to all the neural networks provided in shorttext. How to do it? (See the shorttext tutorial on “Deep Neural Networks with Word Embedding.”)

import shorttext
trainclassdict = shorttext.data.subjectkeywords()   # load an example data set


To train a model, you can do it the old way, or the new way with the additional gensim functionality:

kmodel = shorttext.classifiers.frameworks.CNNWordEmbed(wvmodel=wvmodel, nb_labels=len(trainclassdict.keys()), vecsize=100, with_gensim=True)   # keras model, setting with_gensim=True
classifier = shorttext.classifiers.VarNNEmbeddedVecClassifier(wvmodel, with_gensim=True, vecsize=100)   # instantiate the classifier, setting with_gensim=True
classifier.train(trainclassdict, kmodel)


The parameter with_gensim in both CNNWordEmbed and VarNNEmbeddedVecClassifier is set to False by default, for backward compatibility. Setting it to True enables the new gensim Word2Vec layer.

These changes in gensim and shorttext were mainly contributed by Chinmaya Pancholi, a very bright student at the Indian Institute of Technology, Kharagpur, and a GSoC (Google Summer of Code) student in 2017. He revolutionized gensim by integrating scikit-learn and keras into it. He also used what he did in gensim to improve the pipelines of shorttext, and provided valuable technical suggestions. You can read his GSoC proposal, and his blog posts at RaRe Technologies, Inc. Chinmaya has been diligently mentored by Ivan Menshikh and Lev Konstantinovskiy of RaRe Technologies.

Maxent Classifier

Another important update is the addition of a maximum entropy (maxent) classifier. (See the corresponding tutorial on “Maximum Entropy (MaxEnt) Classifier.”) I will devote a separate entry to the theory, but it is very easy to use:

import shorttext
from shorttext.classifiers import MaxEntClassifier

classifier = MaxEntClassifier()


Use the NIHReports dataset as the example:

classdict = shorttext.data.nihreports()
classifier.train(classdict, nb_epochs=1000)


The classification is just like other classifiers provided by shorttext:

classifier.score('cancer immunology') # NCI tops the score
classifier.score('children health') # NIAID tops the score
classifier.score('Alzheimer disease and aging') # NIAID tops the score


P.S.: this blog entry is long overdue.

On June 16, Data Science MD held an event on natural language processing (NLP). The first speaker was Brian Sacash, a data scientist at Deloitte. His talk, titled NLP and Sentiment Analysis, was a good demonstration of the Python package nltk and its application to sentiment analysis. His approach is knowledge-based, and it is quite different from the one presented by Michael Cherny in his talk at DCNLP and on his blog. (See his article.) Brian has a lot of demonstration code in Jupyter notebooks on his Github.

The second speaker was Dr. Daniel Russ, a staff scientist at the National Institutes of Health (NIH) and my colleague. His talk was titled It Takes a Village To Solve A Problem in Data Science, stressing the amount of brains and power involved in solving a data science problem in businesses. He focused on the SOCcer project (see a previous blog post), of which I am also a team member, and on the interaction with the Apache OpenNLP project. (Slideshare: It Takes a Village To Solve A Problem in Data Science from DataScienceMD)

A preprint on arXiv recently caught a lot of attention. While deep learning has been successful with various types of neural networks, it had not been so for feed-forward neural networks. The authors of this paper proposed normalizing the network with a new activation function, called “selu” (scaled exponential linear units):

$\text{selu}(x) =\lambda \left\{ \begin{array}{cc} x & \text{if } x>0 \\ \alpha e^x - \alpha & \text{if } x \leq 0 \end{array} \right.$.

which is an improvement to the existing “elu” function.
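
The formula translates directly into a few lines of numpy. This is a minimal sketch; the default values of $\alpha$ and $\lambda$ are the constants given in the preprint, quoted here from memory rather than from the text above, so please check them against the paper before relying on them.

```python
import numpy as np

# constants from the preprint (quoted from memory; verify before use)
ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805

def selu(x, alpha=ALPHA, scale=LAMBDA):
    x = np.asarray(x, dtype=float)
    # negative branch: alpha * (exp(x) - 1); clipping at 0 avoids
    # overflow in exp for large positive inputs that take the other branch
    negative_part = alpha * np.expm1(np.minimum(x, 0.0))
    return scale * np.where(x > 0, x, negative_part)

values = selu([-5.0, -1.0, 0.0, 1.0, 5.0])
print(values)
```

For positive inputs the function is just a scaled identity, while for very negative inputs it saturates at $-\lambda \alpha$.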

Despite this achievement, what caught people’s eyes was not the activation function, but the 93-page appendix of mathematical proofs:

And this is one of the pages in the appendix:

Some scholars joked about it on Twitter too: