Text mining can be applied on rap lyrics.

Today I attended an event organized by Data Science MD Meetup Group, a talk titled “Lose Yourself in Rapalytics,” by Abhay, a PhD student in University of Maryland, Baltimore County (UMBC). Rapalytics is an online tool analyzing raps.

It is another task of text mining and natural language processing. He mentioned a few common tools. However, he also specifically looked at rhymes (as rhyme is an important element of rap lyrics), and profanity (as rap music is commonly, or stereotypically, dirty).

Play with it!

Paris attack on Nov 13, 2015 shocked the Earth. And the discussions about terrorism, nationalism, imperialism, religious institutions and fundamentalism is getting heated again. However, there have been some social scientists who argue that these are not the core roots, but the social networks.

Cass Sunstein, a Professor at Harvard University , discussed about it right after the 9/11 attack. He recently reposted his article, titled “Why They Hate Us,” on LinkedIn. [Sunstein 2015] He argued that terrorism, driven by an ideology in a group, is made, not born. And the influence is through social networks. It is the corporate ideology that induces polarization. Polarization is intentional. Terrorism is a by-product of freedom of speech.

The widespread use of social media like Twitter and Facebook enhanced the polarization (of any political or religious ideologies). Social media might account a lot the political polarization happening in the States too. Social networks are significant in spreading ideas, as discussed in my previous blog entry. [Ho 2015] [Centola 2015] [Granovetter 1973] There are certainly a lot of values in this viewpoint, although I am not capable of making a judgement.

I highly recommend you to read Sunstein’s article.

Scientists can computationally analyze social networks using the Python package networkx. [Tsvetovat & Kouznetsov 2011]

Topology has been shown to reveal important information about geometry and shape from data, [Carlsson 2015][Carlsson 2009] as I have talked about in various TDA blog entries. I have also demonstrated how to describe the topology if discrete data points by constructing simplicial complexes, and then calculated the homology and Betti numbers. (I will talk about persistent homology in the future.) Dealing with discrete data points in our digital storage devices, homology is the best way to describe it.

But if you are from a physics background, you may be familiar with the concept of homotopy and fundamental group. Some physicists deal with topology without digging into advanced mathematical tools but simply through solitons. There is a well-written introduction in this blog. In the physical world, an object is said to be topological if:

• there is a singular point that cannot be removed by a continuous deformation of field; [Mermin 1979]
• it has a saddle-point equation of the model that is different from another object of another topology, [Rajaraman 1987] inducing different kinds of physical dynamics; [Bray 1994]
• it can only be removed by crossing an energy barrier, which can be described by an instanton; [Calzetta, Ho, Hu 2010]
• it can proliferate through Kosterlitz-Thouless (BKT) phase transition; [Kosterliz, Thouless 1973]
• it can form in a system through a second-order phase transition at a finite rate, a process known as Kibble-Zurek mechanism (KZM); [Kibble 1976] [Zurek 1985] and
• its topology can be described by a winding number. (c.f. Betti numbers in homology)

Topological objects include vortices in magnets, superfluids, superconductors, or Skyrmions in helimagnets. [Mühlbauer et. al. 2009] [Ho et. al. 2010] They may come in honeycomb order, like Abrikosov vortices in type-II superconductors, [Abrikosov 1957] and helical nanofilaments in smectics. [Matsumoto et. al. 2009] It is widely used in fractional quantum Hall effect [Tsui et. al. 1982] and topological insulators (a lot of references can be found…). They can all be described using homotopy and winding numbers. We can see that topology is useful to describe the physical world for the complexities and patterns. There are ideas in string-net theory to use topology to describe the emergence of patterns and new phases of quantum matter. [Zeng et. al. 2015] Of course, I must not omit topological quantum computing that makes the qubits immune to environmental noise. [Das Sarma, Freedman, Nayak 2005]

However in data analytics, we do not use homotopy, albeit its beauty and usefulness in the physical world. Here are some of the reasons:

• In using homotopy, sometimes it takes decades for a lot of brains to figure out which homotopy groups to use. But in data analysis, we want to grasp the topology simply from data.
• Homotopy deals with continuous mappings, but data are discrete. Simplicial homology captures it more easily.
• In a physical system, we deal with usually one type of homotopy groups. But in data, we often deal with various topologies which we are not aware of in advance. Betti numbers can describe the topology easily by looking at data.
• Of course, homotopy is difficult to compute numerically.

Afra Zomorodian argued the use of homology over homotopy in his book as well. [Zomorodian 2009]

What should a data scientist know? What are the core skills of a data scientist? I have not seen another job title so vague and ambiguous that arouses so many debates and discussions. BD2K (Big Data to Knowledge) Centers of NIH (National Institutes of Health) [Ohno-Machado 2014] have issued funding to a few tertiary colleges in the United States to develop data science curricula, which carries on such discussions.

This is an interdisciplinary field. Around 15 years ago, I was still a matriculation student in Hong Kong. The University of Hong Kong (HKU) started a major called bioinformatics. People were puzzled about what it was indeed, because it looked like a melting pot of several unrelated disciplines (which actually a lot of freshmen complained as they did not understand the purpose of the undergraduate program). But we now understand how it is important.

So what should the students learn? It was suggested in the following figure:

You can see that the core competencies include statistics, machine learning, software engineering, reproducible research, and data visualization. Some of them are math and computer, some sciences, and some arts. And of course, individual data scientist jobs require the corresponding business knowledge.

Honestly, I do not excel in all of them. I have a physics background, which makes it easy for me to learn machine learning and research. Software engineering is not hard to pick up. But statistics is an alien theory to me, and visualization requires the artistic sense that I don’t possess.

Anyway, a lot to learn. Stay humble.

We have been talking about the elements of topological data analysis. In my previous post, I introduced simplicial complexes, concerning the ways to connect points together. In topology, it is the shape and geometry, not distances, which matter ( although while constructing the distance does play a role).

With the simplicial complexes, we can go ahead to describe its topology. We will use the techniques in algebraic topology without going into too much details. The techniques involves homology, but a full explanation of it requires the concepts of normal subgroup, kernel, image, quotient group in group theory. I will not talk about them, although I admit that there is no easy ways to talk about computational topology without touching them. I highly recommend the readers can refer to Zomorodian’s textbook for more details. [Zomorodian 2009]

I will continue with the Python class

SimplicialComplex

that I wrote in the previous blog post. Suppose we have an k-simplex, then the n-th face is any combinations with n+1 vertices. A simplicial complex is such that a face contained in a face is also a face of the complex. In this, we can define the boundary operator by

$\partial_k \sigma = \sum_i (-1)^i [v_0 v_1 \ldots \hat{v}_i \ldots v_k]$,

where $\hat{v}_i$ indicates the i-th vertex be removed. This operator gives all the boundary faces of a face $\sigma$. The faces being operated are k-faces, and this operator will be mapped to a (k-1)-faces. Then the boundary operator can be seen as a $(n_k \times n_{k-1})$-matrix, where $n_k$ is the number of k-faces. This can be easily calculated with the following method:

class SimplicialComplex:
...
def boundary_operator(self, i):
source_simplices = self.n_faces(i)
target_simplices = self.n_faces(i-1)

if len(target_simplices)==0:
S = dok_matrix((1, len(source_simplices)), dtype=np.float32)
S[0, 0:len(source_simplices)] = 1
else:
source_simplices_dict = {}
for j in range(len(source_simplices)):
source_simplices_dict[source_simplices[j]] = j
target_simplices_dict = {}
for i in range(len(target_simplices)):
target_simplices_dict[target_simplices[i]] = i

S = dok_matrix((len(target_simplices), len(source_simplices)), dtype=np.float32)
for source_simplex in source_simplices:
for a in range(len(source_simplex)):
target_simplex = source_simplex[:a]+source_simplex[(a+1):]
i = target_simplices_dict[target_simplex]
j = source_simplices_dict[source_simplex]
S[i, j] = -1 if a % 2==1 else 1 # S[i, j] = (-1)**a
return S


With the boundary operator, we can calculate the Betti numbers that characterize uniquely the topology of the shapes. Actually it involves the concept of homology groups that we are going to omit. To calculate the k-th Betti numbers, we calculate:

$\beta_k = \text{rank} (\text{ker} \partial_k) - \text{rank} (\text{Im} \partial_{k+1})$.

By rank-nullity theorem, [Jackson]

$\text{rank} (\text{ker} \partial_k) +\text{rank} (\text{Im} \partial_k) = \text{dim} (\partial_k)$

the Betti number is then

$\beta_k = \text{dim} (\partial_k) - \text{rank}(\text{Im} \partial_k)) - \text{rank} (\text{Im} \partial_{k+1})$

where the rank of the image of an operator can be easily computed using the rank method available in numpy. Then the method of calculating the Betti number is

class SimplicialComplex:
...
def betti_number(self, i):
boundop_i = self.boundary_operator(i)
boundop_ip1 = self.boundary_operator(i+1)

if i==0:
boundop_i_rank = 0
else:
try:
boundop_i_rank = np.linalg.matrix_rank(boundop_i.toarray())
except np.linalg.LinAlgError:
boundop_i_rank = boundop_i.shape[1]
try:
boundop_ip1_rank = np.linalg.matrix_rank(boundop_ip1.toarray())
except np.linalg.LinAlgError:
boundop_ip1_rank = boundop_ip1.shape[1]

return ((boundop_i.shape[1]-boundop_i_rank)-boundop_ip1_rank)


If we draw a simplicial complex on a 2-dimensional plane, we almost have $\beta_0$, $\beta_1$ and $\beta_2$. $\beta_0$ indicates the number of components, $\beta_1$ the number of bases for a tunnel, and $\beta_2$ the number of voids.

Let’s have some examples. Suppose we have a triangle, not filled.

e1 = [(0, 1), (1, 2), (2, 0)]
sc1 = SimplicialComplex(e1)


Then the Betti numbers are:


In [5]: sc1.betti_number(0)
Out[5]: 1

In [6]: sc1.betti_number(1)
Out[6]: 1

In [7]: sc1.betti_number(2)
Out[7]: 0


Let’s try another example with multiple components.

e2 = [(1,2), (2,3), (3,1), (4,5,6), (6,7), (7,4)]
sc2 = SimplicialComplex(e2)


We can graphically represent it using networkx:

import networkx as nx
import matplotlib.pyplot as plt
n2 = nx.Graph()
nx.draw(n2)
plt.show()


And its Betti numbers are as follow:


In [13]: sc2.betti_number(0)
Out[13]: 2

In [14]: sc2.betti_number(1)
Out[14]: 2

In [15]: sc2.betti_number(2)
Out[15]: 0


A better illustration is the Wolfram Demonstration, titled “Simplicial Homology of the Alpha Complex”.

On top of the techniques in this current post, we can describe the homology of discrete points using persistent homology, which I will describe in my future posts. I will probably spend a post on homotopy in comparison to other types of quantitative problems.