Deep learning has achieved a big success in the past few years, but its interpretive power is limited. They work largely because of the abundance of data. On the other hand, traditional machine learning algorithms are much better in interpretive power, but manual feature engineering costs a lot, due to the lack of data in earlier era. In light of this, a group of scientists initiated the work of graph networks, aiming at devising new artificial intelligence algorithms that exploits the advantages of two worlds, while still holding the principle of combinatorial generalization in constructing methods by using known building blocks to build new methods. Graph is good at interpretation as it is good for relational representation.

The use of graph networks is more than the graph convolutional neural networks (GCN) in the previous two blog entries. (part I and part II) However, to achieve relational inductive biases, an entity (an element with attributes), a relation, (a property between entities) and a rule. (a function that maps entities and relations to other entities and relations) This can be realized using graph, which is a mathematical structure that contains nodes and edges (that connect nodes.) To generalize the use of graph networks in various machine learning and deep learning methods, they reviewed the graph block, which is basically a function, or a mapping, from a graph to another graph, as shown in the algorithm below:

Works of graph networks are not non-existent; the authors listed previous works that can be seen as graph networks, for example:

• Message-passing neural network (MPNN) (2017);
• Non-local neural networks (NLNN) (2018).

The use of graph networks, I believe, is the next trend. There have been works regarding the graph-powered machine learning. (see Google AI blog, GraphAware Slideshare) I recently started an open-source project, a Python package called graphflow, to explore various algorithms using graphs, including PageRank, HITS, resistance, and non-linear resistance.

It has been a while since I wrote about topological data analysis (TDA). For pedagogical reasons, a lot of the codes were demonstrated in the Github repository PyTDA. However, it is not modularized as a package, and those codes run in Python 2.7 only.

Upon a few inquiries, I decided to release the codes as a PyPI package, and I named it mogutda, under the MIT license. It is open-source, and the codes can be found at the Github repository MoguTDA. It runs in Python 2.7, 3.5, and 3.6.

The first presidential debate 2016 was held on September 26, 2016 in Hofstra University in New York. An interesting analysis will be the literacy level demonstrated by the two candidates using Flesch readability ease and Flesch-Kincaid grade level, demonstrated in my previous blog entry and my Github: stephenhky/PyReadability.

First, we need to get the transcript of the debate, which can be found in an article in New York Times. Copy and paste the text into a file called first_debate_transcript.txt. Then we want to extract out speech of each person. To do this, store the following Python code in first_debate_segment.py.

# Trump and Clinton 1st debate on Sept 26, 2016

from nltk import word_tokenize
from collections import defaultdict
import re

def untokenize(words):
"""
Untokenizing a text undoes the tokenizing operation, restoring
punctuation and spaces to the places that people expect them to be.
Ideally, untokenize(tokenize(text)) should be identical to text,
except for line breaks.
"""
text = ' '.join(words)
step1 = text.replace(" ", '"').replace(" ''", '"').replace('. . .',  '...')
step2 = step1.replace(" ( ", " (").replace(" ) ", ") ")
step3 = re.sub(r' ([.,:;?!%]+)([ \'"])', r"\1\2", step2)
step4 = re.sub(r' ([.,:;?!%]+)$', r"\1", step3) step5 = step4.replace(" '", "'").replace(" n't", "n't").replace( "can not", "cannot") step6 = step5.replace("  ", " '") return step6.strip() ignored_phrases = ['(APPLAUSE)', '(CROSSTALK)'] persons = ['TRUMP', 'CLINTON', 'HOLT'] fin = open('first_debate_transcript.txt', 'rb') lines = fin.readlines() fin.close() lines = filter(lambda s: len(s)>0, map(lambda s: s.strip(), lines)) speeches = defaultdict(lambda : '') person = None for line in lines: tokens = word_tokenize(line.strip()) ignore_colon = False added_tokens = [] for token in tokens: if token in ignored_phrases: pass elif token in persons: person = token ignore_colon = True elif token == ':': ignore_colon = False else: added_tokens += [token] speeches[person] += ' ' + untokenize(added_tokens) for person in persons: fout = open('speeches_'+person+'.txt', 'wb') fout.write(speeches[person]) fout.close()  There is an untokenize function adapted from a code in StackOverflow. This segmented the transcript into the individual speech of Lester Holt (the host of the debate), Donald Trump (GOP presidential candidate), and Hillary Clinton (DNC presidential candidate) in separate files. Then, on UNIX or Linux command line, run score_readability.py on each person’s script, by, for example, for Holt’s speech, python score_readability.py speeches_HOLT.txt --utf8 Beware that it is encoded in UTF-8. For Lester Holt, we have Word count = 1935 Sentence count = 157 Syllable count = 2732 Flesch readability ease = 74.8797052289 Flesch-Kincaid grade level = 5.87694629602 For Donald Trump, Word count = 8184 Sentence count = 693 Syllable count = 10665 Flesch readability ease = 84.6016324536 Flesch-Kincaid grade level = 4.3929136992 And for Hillary Clinton, Word count = 6179 Sentence count = 389 Syllable count = 8395 Flesch readability ease = 75.771973015 Flesch-Kincaid grade level = 6.63676650035 Apparently, compared to Donald Trump, Hillary Clinton has a higher literary level, but her speech is less easy to understand. Recalling from my previous entry, for Shakespeare’s MacBeth, the Flesch readability ease is 112.278048591, and Flesch-Kincard grade level 0.657934056288; for King James Version Bible (KJV), they are 79.6417489428 and 9.0085275366 respectively. This is just a simple text analytics. However, the content is not analyzed here. Augustine of Hippo wrote in his Book IV of On Christian Teaching (Latin: De doctrina christiana) about rhetoric and eloquence: “… wisdom without eloquence is of little value to the society… eloquence without wisdom is… a great nuisance, and never beneficial.” — Augustine of Hippo, Book IV of On Christian Teaching Ever since Mehta and Schwab laid out the relationship between restricted Boltzmann machines (RBM) and deep learning mathematically (see my previous entry), scientists have been discussing why deep learning works so well. Recently, Henry Lin and Max Tegmark put a preprint on arXiv (arXiv:1609.09225), arguing that deep learning works because it captures a few essential physical laws and properties. Tegmark is a cosmologist. Physical laws are simple in a way that a few properties, such as locality, symmetry, hierarchy etc., lead to large-scale, universal, and often complex phenomena. A lot of machine learning algorithms, including deep learning algorithms, have deep relations with formalisms outlined in statistical mechanics. A lot of machine learning algorithms are basically probability theory. They outlined a few types of algorithms that seek various types of probabilities. They related the probabilities to Hamiltonians in many-body systems. They argued why neural networks can approximate functions (polynomials) so well, giving a simple neural network performing multiplication. With central limit theorem or Jaynes’ arguments (see my previous entry), a lot of multiplications, they said, can be approximated by low-order polynomial Hamiltonian. This is like a lot of many-body systems that can be approximated by 4-th order Landau-Ginzburg-Wilson (LGW) functional. Properties such as locality reduces the number of hyper-parameters needed because it restricts to interactions among close proximities. Symmetry further reduces it, and also computational complexities. Symmetry and second order phase transition make scaling hypothesis possible, leading to the use of the tools such as renormalization group (RG). As many people have been arguing, deep learning resembles RG because it filters out unnecessary information and maps out the crucial features. Tegmark use classifying cats vs. dogs as an example, as in retrieving temperatures of a many-body systems using RG procedure. They gave a counter-example to Schwab’s paper with the probabilities cannot be preserved by RG procedure, but while it is sound, but it is not the point of the RG procedure anyway. They also discussed about the no-flattening theorems for neural networks. There are many tasks that involve coding, for example, putting kids into groups according to their age, labeling the webpages about their kinds, or putting students in Hogwarts into four colleges… And researchers or lawyers need to code people, according to their filled-in information, into occupations. Melissa Friesen, an investigator in Division of Cancer Epidemiology and Genetics (DCEG), National Cancer Institute (NCI), National Institutes of Health (NIH), saw the need of large-scale coding. Many researchers are dealing with big data concerning epidemiology. She led a research project, in collaboration with Office of Intramural Research (OIR), Center for Information Technology (CIT), National Institutes of Health (NIH), to develop an artificial intelligence system to cope with the problem. This leads to a publicly available tool called SOCcer, an acronym for “Standardized Occupation Coding for Computer-assisted Epidemiological Research.” (URL: http://soccer.nci.nih.gov/soccer/) The system was initially developed in an attempt to find the correlation between the onset of cancers and other diseases and the occupation. “The application is not intended to replace expert coders, but rather to prioritize which job descriptions would benefit most from expert review,” said Friesen in an interview. She mainly works with Daniel Russ in CIT. SOCcer takes job title, industry codes (in terms of SIC, Standard Industrial Classification), and job duties, and gives an occupational code called SOC 2010 (Standard Occupational Classification), used by U. S. federal government agencies. The data involves short text, often messy. There are 840 codes in SOC 2010 systems. Conventional natural language processing (NLP) methods may not apply. Friesen, Russ, and Kwan-Yuet (Stephen) Ho (also in OIR, CIT; a CSRA staff) use fuzzy logic, and maximum entropy (maxent) methods, with some feature engineering, to build various classifiers. These classifiers are aggregated together, as in stacked generalization (see my previous entry), using logistic regression, to give a final score. SOCcer has a companion software, called SOCAssign, for expert coders to prioritize the codings. It was awarded with DCEG Informatics Tool Challenge 2015. SOCcer itself was awarded in 2016. And the SOCcer team was awarded for Scientific Award of Merit by CIT/OCIO in 2016 as well (see this). Their work was published in Occup. Environ. Med. Biomedical research and clinical data are often collected on the same sample of data at different points in time. These data are called “longitudinal data.” (See the definition by BLS.) When performing supervised learning (e.g., SVM) on data of this kind, the impact of time-varying correlation of the features on the outcomes / predictions may be blurred. In order to smoothing out the temporal effect of the features, changes to the original learning algorithms are necessary. In a study conducted by Center for Information Technology (CIT), and National Institutes on Aging (NIA) in National Institutes of Health (NIH), with some clinical data as the training data, a longitudinal support vector regression (LSVR) algorithm was presented, and shown to outperform other machine learning methods. [Du et. al. 2015] Their results were published in IEEE BIBM (Bioinformatics and Biomedicine) conference. Their work is adapted from an earlier work by Chen and Bowman. [Chen & Bowman 2011] The dataset is a longitudinal, because it contains N patients with p features, taken at T points in time. Traditional support vector regression (SVR) is to solve the following optimization problem: $\text{min} \frac{1}{2} || \mathbf{w} ||^2$, where $\mathbf{w}$ is a hyperplane surface, under the constraints: $y_i - \mathbf{w^T} \Phi (\mathbf{x}_i) - b \leq \epsilon$, $\mathbf{w^T} \Phi (\mathbf{x}_i) + b - y_i \leq \epsilon$. However, in LSVR, the data points are more complicated. For each patient s, its features at time t is given by a vector $\mathbf{x}_s^{(t)}$. The first goal of LSVR is to assign each patient a T-by-p matrix $\tilde{\mathbf{x}}_s = \left[ \mathbf{x}_s^{(1)}, \mathbf{x}_s^{(2)}, \ldots, \mathbf{x}_s^{(T)} \right]^T$, and a T-by-1 vector $\tilde{y}_s =\left[ y_s^{(1)}, y_s^{(2)}, \ldots, y_s^{(T)} \right]^T$, with an unknown parameter vector $\mathbf{\beta} = \left[ 1, \beta_1, \beta_2, \ldots, \beta_{T-1} \right]^T$ such that the constraints becomes: $\tilde{y}_i^T \beta - \langle \mathbf{w^T}, \mathbf{\tilde{x}_i}^T \beta \rangle - b \leq \epsilon + \xi_i$, $\langle \mathbf{w^T}, \mathbf{\tilde{x}_i}^T \beta \rangle+ b -\tilde{y}_i^T \beta \leq \epsilon + \xi_i^{*}$, where $\xi_i$‘s are additional regularization parameters. The parameters $\beta$‘s can be found by iteratively quadratic optimization. The constraints are handled with Lagrangian’s multipliers. For details, please refer to [Du et. al. 2015]. This way decouples, or smoothes out, the temporal covariation within the patients. A better prediction can be made. Previously, I have went through heuristically the description of topology using homology groups in this entry. [Ho 2015] This is the essence of algebraic topology. We describe the topology using Betti numbers, the rank of the homolog groups. What they mean can be summarized as: [Bubenik 2015] “… homology in degree 0 describes the connectedness of the data; homology in degree 1 detects holes and tunnels; homology in degree 2 captures voids; and so on. ## Concept of Persistence However, in computational problems, it is the discrete points that we are dealing with. We formulate their connectedness through constructing complexes, as described by my another blog entry. [Ho 2015] From the Wolfram Demonstration that I quoted previously, connectedness depends on some parameters, such as the radii of points that are considered connected. Whether it is Čech Complex, RP complex, or Alpha complex, the idea is similar. With discrete data, therefore, there is no definite answer how the connectedness among the points are, as it depends on the parameters. Therefore, the concept of persistence has been developed to tackle this problem. This is the core concept for computational topology. There are a lot of papers about persistence, but the most famous work is done by Zomorodian and Carlsson, who algebraically studied it. [Zomorodian & Carlsson 2005] The idea is that as one increases the radii of points, the complexes change, and so do the homology groups. By varying the radii, we can observe which topology persists. From the diagram above, we can see that as the radii ε increase, the diagram becomes more connected. To understand the changes of homologies, there are a few ways. In the diagram above, barcodes represent the “life span” of a connected component as ε increases. The Betti numbers of a certain degree (0, 1, or 2 in this example) at a certain value of ε is the number of barcodes at that degree. For example, look at the left most vertical dashed line, $\beta_0=10$, as there are 10 barcodes existing for $H_0$. Note there are indeed 10 separate connected components. For the second leftmost vertical dashed line, $\beta_0=6$ (6 connected components), and $\beta_1=2$ (2 holes). Another way is using the persistence diagram, basically plotting the “birth” and “death” times of all the barcodes above. For an explanation of persistence diagram, please refer to this blog entry by Sebastien Bubeck, [Bubeck 2013] or the paper by Fasy et. al. [Fasy et. al. 2014] Another way to describe persistent topology is the persistence landscape. [Bubenik 2015] ## TDA Package in R There are a lot of tools to perform topological data analysis. Ayasdi Core is a famous one. There are open sources C++ libraries such as Dionysus, or PHAT. There is a Python binding for Dionysus too. There is a package in R that wraps Dionysus and PHAT, called TDA. To install it, simply open an R session, and enter install.package('TDA')  To load it, simply enter library(TDA)  We know that for a circle, $\beta_0=\beta_1=1$, as it has on connected components, and a hole. Prepare the circle and store it in X by the function circleUnif: X<- circleUnif(n=1000, r=1) plot(X)  Then we can see a 2-dimensional circle like this: To calculate the persistent homology, use the function gridDiag: diag.info<- gridDiag(X=X, FUN=kde, h=0.3, lim=cbind(c(-1, 1), c(-1, 1)), by=0.01, sublevel = FALSE, library = 'PHAT', printProgress=FALSE)  To plot the barcodes and persistence diagram, enter: par(mfrow=c(2,1)) plot(diag.info$diagram)
plot(diag.info\$diagram, barcode=TRUE)


In the plots, black refers to degree 0, and red refers to degree 1.

We can play the same game by adding a horizontal line to cut the circle into two halves:

X<- circleUnif(n=1000, r=1)
hl<- matrix(c(seq(-1, 1, 0.02), rep(0, 201)), ncol=2)
X<- rbind(X, hl)
plot(X)


And the barcodes and persistence diagram are:

We can try this with three-dimensional objects like sphere, or torus, but I never finished the calculation in reasonable speeds.

Text mining can be applied on rap lyrics.

Today I attended an event organized by Data Science MD Meetup Group, a talk titled “Lose Yourself in Rapalytics,” by Abhay, a PhD student in University of Maryland, Baltimore County (UMBC). Rapalytics is an online tool analyzing raps.

It is another task of text mining and natural language processing. He mentioned a few common tools. However, he also specifically looked at rhymes (as rhyme is an important element of rap lyrics), and profanity (as rap music is commonly, or stereotypically, dirty).

Play with it!

Topology has been shown to reveal important information about geometry and shape from data, [Carlsson 2015][Carlsson 2009] as I have talked about in various TDA blog entries. I have also demonstrated how to describe the topology if discrete data points by constructing simplicial complexes, and then calculated the homology and Betti numbers. (I will talk about persistent homology in the future.) Dealing with discrete data points in our digital storage devices, homology is the best way to describe it.

But if you are from a physics background, you may be familiar with the concept of homotopy and fundamental group. Some physicists deal with topology without digging into advanced mathematical tools but simply through solitons. There is a well-written introduction in this blog. In the physical world, an object is said to be topological if:

• there is a singular point that cannot be removed by a continuous deformation of field; [Mermin 1979]
• it has a saddle-point equation of the model that is different from another object of another topology, [Rajaraman 1987] inducing different kinds of physical dynamics; [Bray 1994]
• it can only be removed by crossing an energy barrier, which can be described by an instanton; [Calzetta, Ho, Hu 2010]
• it can proliferate through Kosterlitz-Thouless (BKT) phase transition; [Kosterliz, Thouless 1973]
• it can form in a system through a second-order phase transition at a finite rate, a process known as Kibble-Zurek mechanism (KZM); [Kibble 1976] [Zurek 1985] and
• its topology can be described by a winding number. (c.f. Betti numbers in homology)

Topological objects include vortices in magnets, superfluids, superconductors, or Skyrmions in helimagnets. [Mühlbauer et. al. 2009] [Ho et. al. 2010] They may come in honeycomb order, like Abrikosov vortices in type-II superconductors, [Abrikosov 1957] and helical nanofilaments in smectics. [Matsumoto et. al. 2009] It is widely used in fractional quantum Hall effect [Tsui et. al. 1982] and topological insulators (a lot of references can be found…). They can all be described using homotopy and winding numbers. We can see that topology is useful to describe the physical world for the complexities and patterns. There are ideas in string-net theory to use topology to describe the emergence of patterns and new phases of quantum matter. [Zeng et. al. 2015] Of course, I must not omit topological quantum computing that makes the qubits immune to environmental noise. [Das Sarma, Freedman, Nayak 2005]

However in data analytics, we do not use homotopy, albeit its beauty and usefulness in the physical world. Here are some of the reasons:

• In using homotopy, sometimes it takes decades for a lot of brains to figure out which homotopy groups to use. But in data analysis, we want to grasp the topology simply from data.
• Homotopy deals with continuous mappings, but data are discrete. Simplicial homology captures it more easily.
• In a physical system, we deal with usually one type of homotopy groups. But in data, we often deal with various topologies which we are not aware of in advance. Betti numbers can describe the topology easily by looking at data.
• Of course, homotopy is difficult to compute numerically.

Afra Zomorodian argued the use of homology over homotopy in his book as well. [Zomorodian 2009]