Whether the data are measured observations, images (pixels), free text, factors, or shapes, they can be categorized into the following four types:

- Categorical data
- Binary data
- Numerical data
- Graphical data

The most primitive representation of a feature vector is a flat array that simply concatenates the values of all the fields.

Numerical data can be represented directly as individual elements of such a vector (like the Tweet GRU and Query GRU features), and I am not going to talk too much about it.

However, how do we represent categorical data? The most basic way is one-hot encoding:

For each type of categorical data, each category is assigned an integer code. For example, each color may have a code (0 for red, 1 for orange, etc.), and each record is transformed into a feature vector whose length is the total number of categories found in the data; an element is set to 1 if the record belongs to that category. This gives a natural way of dealing with missing data (all elements 0) and multiple categories (multiple non-zero elements).
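As a minimal sketch (the color categories here are illustrative), one-hot encoding with all-zeros for missing values might look like:

```python
# Illustrative one-hot encoding of a categorical "color" field.
categories = ["red", "orange", "yellow", "green"]
code = {c: i for i, c in enumerate(categories)}

def one_hot(value):
    """Return a one-hot vector; a missing/unknown value maps to all zeros."""
    vec = [0] * len(categories)
    if value in code:
        vec[code[value]] = 1
    return vec

print(one_hot("orange"))  # [0, 1, 0, 0]
print(one_hot(None))      # missing data: [0, 0, 0, 0]
```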

In natural language processing, the bag-of-words model is often used to represent free-text data; it is essentially the one-hot encoding above with words as the categories. It works well as long as the order of the words does not matter.
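A bag-of-words vector can be built the same way, with word counts in place of 0/1 flags; this sketch assumes a small fixed vocabulary for illustration:

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    # count word occurrences; the word order is discarded
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocabulary]

vocab = ["data", "science", "is", "fun"]
print(bag_of_words("Data science is fun and data is everywhere", vocab))
# [2, 1, 2, 1]
```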

Binary data can be easily represented by a single element, either 1 or 0.

Graphical data are best represented by the graph Laplacian and the adjacency matrix. Refer to a previous blog article for more information.

A feature vector can then be a concatenation of features of all these types, except graphical data.

However, such a representation, concatenating all the categorical, binary, and numerical fields, has a lot of shortcomings:

- Data in different categories are treated as orthogonal, i.e., perfectly dissimilar; the correlations between different variables are ignored. This is a very strong assumption.
- The weights of different fields are not considered.
- Very large numerical values can outweigh the categorical data in terms of influence on the computation.
- The data are very sparse, wasting a lot of memory and computing time.
- It is unknown whether some of the data are irrelevant.

In light of these shortcomings, there are three main ways of modifying the feature vectors:

- Rescaling: rescaling, or reweighing, all or some of the elements to adjust the influence of different variables.
- Embedding: condensing the information into vectors of smaller lengths.
- Sparse coding: deliberately extending the vectors to a larger length.

Rescaling means rescaling all or some of the elements in the vectors. Usually there are two ways:

- Normalization: normalizing all the categories of one feature so that they sum to 1.
- Term frequency-inverse document frequency (tf-idf): weighting the elements so that an element carries a heavier weight if its frequency is higher and it appears in relatively few documents or class labels.
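There are several tf-idf weighting variants; here is a sketch of one common smoothed form (the toy corpus is made up for illustration):

```python
import math

def tf_idf(term, doc, corpus):
    tf = doc.count(term) / len(doc)               # term frequency
    df = sum(1 for d in corpus if term in d)      # document frequency
    idf = math.log(len(corpus) / (1 + df)) + 1    # smoothed inverse doc freq
    return tf * idf

docs = [["apple", "banana"],
        ["apple", "apple", "cherry"],
        ["banana", "cherry"]]
print(tf_idf("apple", docs[1], docs))  # about 0.6667
```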

Embedding means condensing a sparse vector into a smaller, dense vector: many sparse elements disappear, and the information is encoded across the remaining elements. There is a rich body of work on this.

- Topic models: finding the topics (latent Dirichlet allocation (LDA), structural topic models (STM), etc.) and encoding the vectors with topics instead;
- Global dimensionality reduction algorithms: reducing the dimensions by retaining the principal components of the vectors of all the data, e.g., principal component analysis (PCA), independent component analysis (ICA), multi-dimensional scaling (MDS) etc;
- Local dimensionality reduction algorithms: same as the global, but these are good for finding local patterns, where examples include t-Distributed Stochastic Neighbor Embedding (tSNE) and Uniform Manifold Approximation and Projection (UMAP);
- Representation learned from deep neural networks: embeddings learned from encoding using neural networks, such as auto-encoders, Word2Vec, FastText, BERT etc.
- Mixture Models: Gaussian mixture models (GMM), Dirichlet multinomial mixture (DMM) etc.
- Others: Tensor decomposition (Schmidt decomposition, Jennrich algorithm etc.), GloVe etc.
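As a sketch of the global dimensionality-reduction flavor of embedding, here is a bare-bones PCA via SVD (generic numpy, not tied to any particular library's API):

```python
import numpy as np

def pca_embed(X, k):
    """Project the rows of X onto the top-k principal components."""
    Xc = X - X.mean(axis=0)                        # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                           # k-dimensional embedding

X = np.random.RandomState(0).rand(100, 8)          # 100 vectors of length 8
Z = pca_embed(X, 2)
print(Z.shape)  # (100, 2)
```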

Sparse coding is good for finding basis vectors for dense vectors.

- tf-idf. [StanfordNLP]
- “Graph Convolutional Neural Network (Part I),” *Everything About Data Analytics*, WordPress (2018). [WordPress]
- David M. Blei, Andrew Y. Ng, Michael I. Jordan, “Latent Dirichlet Allocation,” *Journal of Machine Learning Research* **3**, 993-1022 (2003). [JMLR]
- Ian Holmes, Keith Harris, Christopher Quince, “Dirichlet Multinomial Mixtures: Generative Models for Microbial Metagenomics,” *PLoS ONE* **7**(2): e30126 (2012). [PLoSOne]
- Roberts, Stewart, Tingley, and Airoldi, “The Structural Topic Model and Applied Social Science,” *Advances in Neural Information Processing Systems Workshop on Topic Models: Computation, Application, and Evaluation* (2013). [Princeton] [STM]
- Leland McInnes, John Healy, James Melville, “UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction,” arXiv:1802.03426 (2018). [arXiv]
- PyPI: pyqentangle. [PyPI]
- PyPI: mogu. [PyPI] Source code for the Jennrich algorithm: https://github.com/stephenhky/MoguNumerics/blob/master/mogu/tensor/decompose.py

The rise of GANs has also led to the re-emergence of adversarial learning for handling unbalanced data or sensitive data. (For example, see arXiv:1707.00075.)

GANs are particularly useful for computer vision problems. However, they are not very good for natural language problems, because text data are discrete and cannot be generated continuously. In this context, a modification of GAN was developed, called discriminative adversarial networks (DAN, see arXiv:1707.02198). Unlike GANs, which have a discriminator training a generator to produce good data, DAN has two discriminators: one, usually denoted as the predictor *P*, predicts labels for the unlabeled data, and another, usually denoted as the judge *J*, classifies whether a label is a human label or a machine-predicted label.

The loss function of DAN is very similar to that of GAN: minimizing the entropy difference for the judge *J* for labeled data, but minimizing that for predictions for unlabeled data for the predictor *P*.

However, GAN and DAN are not generative-discriminative pairs.

- “Generative Adversarial Networks,” *Everything About Data Analytics*, WordPress (2017). [WordPress]
- Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio, “Generative Adversarial Networks,” arXiv:1406.2661 (2014). [arXiv]
- Ian Goodfellow, “NIPS 2016 Tutorial: Generative Adversarial Networks,” arXiv:1701.00160 (2017). [arXiv]
- “Application of Wasserstein GAN,” *Everything About Data Analytics*, WordPress (2017). [WordPress]
- Alex Beutel, Jilin Chen, Zhe Zhao, Ed H. Chi, “Data Decisions and Theoretical Implications when Adversarially Learning Fair Representations,” 2017 Workshop on Fairness, Accountability, and Transparency in Machine Learning (FAT/ML 2017) / arXiv:1707.00075 (2017). [arXiv]
- Cicero Nogueira dos Santos, Kahini Wadhawan, Bowen Zhou, “Learning Loss Functions for Semi-supervised Learning via Discriminative Adversarial Networks,” arXiv:1707.02198 (2017). [arXiv]

Recommendation systems employ one or more of the following strategies:

- Collaborative Filtering (CF);
- Content-based Filtering (CBF);
- Demographic Filtering (DF); and
- Knowledge-Based Filtering (KBF).

CF recommends similar items to users of similar tastes. Whether it is user-based filtering or item-based filtering, the same assumption holds. Similarity between users or items is calculated by Pearson correlation or cosine similarity.
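A minimal sketch of user-based similarity with cosine similarity (the rating vectors below are hypothetical):

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# hypothetical rating vectors of three users over the same four items
alice = [5, 3, 0, 1]
bob = [4, 3, 0, 1]
eve = [0, 1, 5, 4]
print(cosine_similarity(alice, bob) > cosine_similarity(alice, eve))  # True
```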

Matrix factorization (MF), like the similar latent semantic indexing (LSI), is actually a kind of collaborative filtering, except that the users and items are converted to encoded vectors, and the recommendation scores are given by the cosine similarity between the encoded vectors of the users and the items.
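A sketch of the matrix-factorization idea via truncated SVD on a toy rating matrix (the ratings are made up; real systems use dedicated factorization algorithms):

```python
import numpy as np

# toy user-item rating matrix; 0 means unrated
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)

k = 2                                        # rank of the latent space
U, s, Vt = np.linalg.svd(R, full_matrices=False)
user_vecs = U[:, :k] * s[:k]                 # encoded user vectors
item_vecs = Vt[:k].T                         # encoded item vectors
scores = user_vecs @ item_vecs.T             # predicted recommendation scores
print(scores.shape)  # (4, 4)
```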

Such recommendation systems suffer from the cold-start problem: new users or new items cannot be accounted for when giving recommendations.

CBF employs common machine learning algorithms to learn a user’s preferences based on their consumption/purchase history and their profiles. Embedded vectors can be used too. However, CBF also suffers from the cold-start problem.

The DF strategy makes use of users’ profiles, such as age, sex, and other information, to make recommendations. The algorithms might be rule-based or machine-learned. However, nowadays it might give rise to issues regarding fairness, equal opportunity, privacy, or ethics, in the era of GDPR and CCPA.

KBF makes recommendations based on expert knowledge of the subject matter, known reasoning, or statistics. Recommendations may be made using a rule-based approach or a predefined probabilistic model (built, for example, from census data). Some systems may even employ a knowledge database. Big data may not be necessary in this kind of system, as the reasoning has been manually built in.

Hybrid recommendation systems employ more than one of the above strategies. To combine them, one might apply a voting system to all the results to give an aggregated result, or a weighting scheme, or stacked generalization to combine all these methods together.
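As a tiny illustration of the weighting-scheme option, with hypothetical per-strategy scores and weights:

```python
def hybrid_score(scores, weights):
    """Weighted combination of the scores from individual strategies."""
    total = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total

scores = {"cf": 0.8, "cbf": 0.6, "df": 0.3}      # hypothetical strategy outputs
weights = {"cf": 0.5, "cbf": 0.3, "df": 0.2}     # hypothetical weights
print(round(hybrid_score(scores, weights), 2))   # 0.64
```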

- Chhavi Saluja, “Recommendation Systems Exemplified,” *Towards Data Science* (2018). [Medium]
- Toby Segaran, *Programming Collective Intelligence: Building Smart Web 2.0 Applications* (2008). [O’Reilly]
- Erion Çano, Maurizio Morisio, “Hybrid Recommender Systems: A Systematic Literature Review,” arXiv:1901.03888 (2019). [arXiv]
- Ankur Moitra, *Algorithmic Aspects of Machine Learning* (2018). [Cambridge]

It was introduced in a previous entry that, in order to detect fake news using machine learning, we have to provide a corpus. This paves the way for training supervised learning models.

Recently, some people have looked at this problem in a different way, from a sociological point of view. To evaluate the impact imposed by fake news, instead of a machine learning model, we need a dynamical model of the spread of fake news. In their preprint on arXiv, Brody and Meier studied it using tools from communication theory. They proposed that the flow of information is:

$latex \eta_t = \sigma t X + B_t + F_t$,

where $latex t$ is time, $latex X$ is either 0 or 1, indicating that the news is false or true respectively, $latex B_t$ is noise (a Brownian motion), and $latex F_t$ is the fake news. The first term indicates that we have more information over time, and thus the flow of true information increases with time. This model assumes a linear evolution. The authors define $latex F_t$ to be fake news if its expectation $latex \mathop{\mathbb{E}} (F_t) \neq 0$. Depending on the situation, $latex B_t$ and $latex F_t$ are either independent or correlated.
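A minimal Euler-type simulation of this information process might look like the following; modeling the fake-news term as a simple deterministic drift, and all parameter values, are my own illustrative assumptions, not taken from the paper:

```python
import random

def simulate_information(X, sigma=1.0, T=1.0, n=1000, fake_drift=0.0, seed=42):
    """Simulate eta_t = sigma*t*X + B_t + F_t on a time grid, with F_t
    modeled here as a deterministic drift (an illustrative assumption)."""
    rng = random.Random(seed)
    dt = T / n
    path, B = [], 0.0
    for i in range(1, n + 1):
        B += rng.gauss(0.0, dt ** 0.5)   # Brownian increment
        t = i * dt
        path.append(sigma * t * X + B + fake_drift * t)
    return path

path = simulate_information(X=1)
print(len(path))  # 1000
```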

To study the impact, the authors categorized voters into three categories:

- Category I: unaware of the existence of fake news, but act rationally;
- Category II: aware of the existence of fake news, but unaware of the time point when fake news emerges;
- Category III: fully aware of the existence of fake news, and fully eliminated them when making a judgement.

There are further mathematical models for the election dynamics, but readers can refer to the preprint for the details. With a piece of fake news in favor of candidate B, this model gives the influence of fake news on one’s judgment, as plotted in the preprint.

This dynamical model confirms that by eliminating the fake news, the voters make better judgment. However, with the awareness of the piece of fake news emerging at a time unknown to the voter, the impact is still disastrous.

To me, this study actually confirms that fact checking is useless in terms of curbing the turmoil introduced by fake news. The flow of information nowadays has so little viscosity that ways to eliminate fake news have to be devised. However, we know censorship is not the way to go, as it is a highway to a totalitarian government. The future of democracy is dim.

- “Data Science of Fake News,” *Everything about Data Analytics*, WordPress (2017). [WordPress]
- “A mathematical model captures the political impact of fake news,” *MIT Technology Review* (September 2018). [MITTechReview]
- Dorje C. Brody, David M. Meier, “How to model fake news,” arXiv:1809.00965 (2018). [arXiv]
- Github: dmmeier/FakeNews. [Github]

The use of graph networks goes beyond the graph convolutional neural networks (GCN) covered in the previous two blog entries (part I and part II). To achieve relational inductive biases, one needs an entity (an element with attributes), a relation (a property between entities), and a rule (a function that maps entities and relations to other entities and relations). These can be realized using a graph, a mathematical structure that contains nodes and edges (which connect nodes). To generalize the use of graph networks across various machine learning and deep learning methods, the authors reviewed the graph block, which is basically a function, or a mapping, from one graph to another graph.

Graph networks are not entirely new; the authors listed previous works that can be seen as graph networks, for example:

- Message-passing neural network (MPNN) (2017);
- Non-local neural networks (NLNN) (2018).

The use of graph networks, I believe, is the next trend. There has already been work on graph-powered machine learning (see the Google AI blog and the GraphAware SlideShare). I recently started an open-source project, a Python package called graphflow, to explore various algorithms using graphs, including PageRank, HITS, resistance, and non-linear resistance.
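As an illustration of one such graph algorithm, here is a generic power-iteration PageRank sketch (not the graphflow implementation; the tiny adjacency matrix is made up):

```python
import numpy as np

def pagerank(A, d=0.85, tol=1e-10):
    """Power-iteration PageRank on an adjacency matrix A (row -> column edges)."""
    n = A.shape[0]
    out_deg = A.sum(axis=1, keepdims=True)
    M = (A / np.where(out_deg > 0, out_deg, 1.0)).T   # column-stochastic matrix
    r = np.ones(n) / n
    while True:
        r_new = (1 - d) / n + d * (M @ r)
        if np.abs(r_new - r).sum() < tol:
            return r_new
        r = r_new

A = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)
r = pagerank(A)
print(round(float(r.sum()), 6))  # 1.0 -- ranks form a probability vector
```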

- Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, Caglar Gulcehre, Francis Song, Andrew Ballard, Justin Gilmer, George Dahl, Ashish Vaswani, Kelsey Allen, Charles Nash, Victoria Langston, Chris Dyer, Nicolas Heess, Daan Wierstra, Pushmeet Kohli, Matt Botvinick, Oriol Vinyals, Yujia Li, Razvan Pascanu, “Relational inductive biases, deep learning, and graph networks,” arXiv:1806.01261 (2018). [arXiv]
- “Graph Convolutional Neural Network (Part I),” *Everything About Data Analytics*, WordPress (2018). [WordPress]
- “Graph Convolutional Neural Network (Part II),” *Everything About Data Analytics*, WordPress (2018). [WordPress]
- Sujith Ravi, “Graph-powered Machine Learning at Google,” Google AI Research Blog (2016). [Google]
- Sujith Ravi, Qiming Diao, “Large Scale Distributed Semi-Supervised Learning Using Streaming Approximation,” arXiv:1512.01752 (2015). [arXiv]
- Vlasta Kus, “Graph-Powered Machine Learning,” *GraphAware*. [SlideShare]
- Maksim Tsvetovat, Alexander Kouznetsov, *Social Network Analysis for Startups*, O’Reilly (2011). [O’Reilly]
- PyPI: graphflow [PyPI]; Github: stephenhky/GraphFlow [Github]

Kipf and Welling proposed ChebNet (arXiv:1609.02907), which approximates the spectral filter with Chebyshev polynomials, using a result proved by Hammond *et al.* (arXiv:0912.3848). With a convolutional layer $latex g_{\theta} * x$, instead of computing it directly by $latex U g_{\theta}(\Lambda) U^T x$, it approximates the filter in terms of Chebyshev polynomials:

$latex g_{\theta}(\Lambda) \approx \sum_{k=0}^{K} \theta_k T_k (\tilde{\Lambda})$, with $latex \tilde{\Lambda} = \frac{2}{\lambda_{\max}} \Lambda - I$,

where the $latex T_k$'s are the Chebyshev polynomials. It has been proved by Hammond *et al.* that, with a suitable truncation, Chebyshev polynomials approximate the filter function very well. Then the convolution is:

$latex g_{\theta} * x \approx \sum_{k=0}^{K} \theta_k T_k (\tilde{L}) x$, with $latex \tilde{L} = \frac{2}{\lambda_{\max}} L - I$.

Here, the $latex \theta_k$'s are the parameters to be trained. This fixes the basis of the representation, and it speeds up the computation. The disadvantage is that the eigenvalues cluster around a few values, with large gaps between them.
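A sketch of applying such a Chebyshev-approximated filter using the three-term recurrence $latex T_k(x) = 2 x T_{k-1}(x) - T_{k-2}(x)$ (generic numpy; $latex \lambda_{\max}$ is computed exactly here for simplicity, and the path-graph Laplacian is just an illustrative example):

```python
import numpy as np

def cheb_filter(L, x, theta):
    """Compute sum_k theta_k T_k(L_tilde) x for a graph Laplacian L."""
    lam_max = np.linalg.eigvalsh(L).max()
    L_tilde = 2.0 * L / lam_max - np.eye(L.shape[0])  # spectrum mapped to [-1, 1]
    T_prev, T_curr = x, L_tilde @ x                   # T_0 x and T_1 x
    out = theta[0] * T_prev + (theta[1] * T_curr if len(theta) > 1 else 0.0)
    for k in range(2, len(theta)):
        T_next = 2.0 * L_tilde @ T_curr - T_prev      # Chebyshev recurrence
        out = out + theta[k] * T_next
        T_prev, T_curr = T_curr, T_next
    return out

L = np.array([[1., -1., 0.], [-1., 2., -1.], [0., -1., 1.]])  # 3-node path graph
print(cheb_filter(L, np.ones(3), [1.0, 0.5, 0.2]))  # [0.7 0.7 0.7]
```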

This problem of ChebNet led to the work of Levie *et al.* (arXiv:1705.07664), who proposed another approximation using the Cayley transform. They made use of the Cayley function:

$latex C(\lambda) = \frac{\lambda - i}{\lambda + i}$,

which is a bijective function from $latex \mathbb{R}$ to the complex unit half-circle. Instead of Chebyshev polynomials, the filter is approximated as:

$latex g_{c, h}(\lambda) = c_0 + 2 \, \mathrm{Re} \sum_{j=1}^{r} c_j C(h \lambda)^j$,

where $latex c_0$ is real and the other $latex c_j$'s are generally complex, $latex h$ is a zoom parameter, and the $latex \lambda$'s are the eigenvalues of the graph Laplacian. Tuning $latex h$ allows one to find the zoom that best spreads out the top eigenvalues. The $latex c_j$'s are computed by training. This solves the problem of unfavorable eigenvalue clustering in ChebNet.

All the previous works deal with undirected graphs. How do we deal with a directed graph, whose graph Laplacian is asymmetric? Benson, Gleich, and Leskovec published an important work in *Science* in 2016 (arXiv:1612.08447) to address this problem. Their approach reduces a directed graph to higher-order structures called network motifs, of which there are 13. For each network motif $latex M$, one can define a motif adjacency matrix $latex A_M$, whose element $latex (A_M)_{ij}$ is the number of instances of motif $latex M$ in the graph that the edge between nodes $latex i$ and $latex j$ belongs to.

Then one computes 13 graph Laplacians from these 13 adjacency matrices. These graph Laplacians are symmetric, like those of undirected graphs. Any filter can then be approximated by the following multivariate matrix polynomial, as suggested by Monti, Otness, and Bronstein in their MotifNet paper (arXiv:1802.01572):

$latex \hat{f}_{\theta} (\Delta_1, \ldots, \Delta_{13}) = \sum_{j=0}^{P} \sum_{k_1, \ldots, k_j \in \{1, \ldots, 13\}} \theta_{k_1 \cdots k_j} \Delta_{k_1} \cdots \Delta_{k_j}$.

Applications include image processing, citation networks, etc.

- “Graph Convolutional Neural Network (Part I),” *Everything About Data Analytics*, WordPress (2018). [WordPress]
- Joan Bruna, Wojciech Zaremba, Arthur Szlam, Yann LeCun, “Spectral Networks and Locally Connected Networks on Graphs,” arXiv:1312.6203 (2013). [arXiv]
- Eric Weisstein, “Chebyshev Polynomial of the First Kind,” Wolfram MathWorld. [Wolfram]
- “Cayley transform.” [Wikipedia]
- Thomas N. Kipf, Max Welling, “Semi-Supervised Classification with Graph Convolutional Networks,” arXiv:1609.02907 (2016). [arXiv]
- David K. Hammond, Pierre Vandergheynst, Rémi Gribonval, “Wavelets on Graphs via Spectral Graph Theory,” arXiv:0912.3848 (2009). [arXiv]
- Ron Levie, Federico Monti, Xavier Bresson, Michael M. Bronstein, “CayleyNets: Graph Convolutional Neural Networks with Complex Rational Spectral Filters,” arXiv:1705.07664 (2017). [arXiv]
- Austin R. Benson, David F. Gleich, Jure Leskovec, “Higher-order organization of complex networks,” *Science* **353**(6295): 163-166 (2016). (arXiv:1612.08447) [arXiv] [Science]
- Federico Monti, Karl Otness, Michael M. Bronstein, “MotifNet: a motif-based Graph Convolutional Network for directed graphs,” arXiv:1802.01572 (2018). [arXiv]
- Michael Bronstein, “Geometric Deep Learning on Graphs and Manifolds,” *Graph Signal Processing Workshop* (June 7, 2018, Lausanne). [PDF]

It would be helpful to review some basic graph theory:

- A graph is represented by $latex G = (V, E)$, where $latex V$ and $latex E$ are the nodes (vertices) and edges respectively.
- A graph can be directed or undirected. In graph convolutional neural networks, they are usually undirected.
- The adjacency matrix $latex A$ describes how nodes are connected: $latex A_{ij} = 1$ if there is an edge connecting node $latex i$ to node $latex j$, and $latex A_{ij} = 0$ otherwise. $latex A$ is a symmetric matrix for an undirected graph.
- The incidence matrix $latex B$ is another way to describe how nodes are connected: $latex B_{ie} = 1$ if node $latex i$ is connected with edge $latex e$. This is useful for undirected graphs.
- The degree matrix $latex D$ is a diagonal matrix, with element $latex D_{ii}$ being the number of neighbors of node $latex i$ in an undirected graph.
- The function $latex f$ acting on the nodes is called the filter.
- The graph Laplacian, or Kirchhoff matrix, is defined by $latex L = D - A$, and the normalized graph Laplacian is $latex \tilde{L} = I - D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$.
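These definitions translate directly into code; a sketch with plain numpy (the triangle graph is just an illustrative example):

```python
import numpy as np

def graph_laplacians(A):
    """Return L = D - A and the normalized I - D^{-1/2} A D^{-1/2}."""
    deg = A.sum(axis=1)
    L = np.diag(deg) - A
    d_inv_sqrt = np.zeros_like(deg, dtype=float)
    nz = deg > 0
    d_inv_sqrt[nz] = deg[nz] ** -0.5              # guard against isolated nodes
    L_norm = np.eye(A.shape[0]) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    return L, L_norm

A = np.ones((3, 3)) - np.eye(3)   # triangle graph
L, L_norm = graph_laplacians(A)
print(L.sum(axis=1))  # [0. 0. 0.] -- every Laplacian row sums to zero
```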

The graph Laplacian is the most important matrix in graph convolutional neural networks. It is analogous to the Laplacian operator $latex \nabla^2$ in Euclidean space. The reader can easily verify this by constructing a graph for a 2D lattice, computing its graph Laplacian matrix, and finding that it is the same (up to a sign) as the discretized Laplacian operator.

We can also get some insight from the Euclidean analogue. In physics, the solutions to the Laplace equation are harmonic: the basis of the solutions can be described in the spectral/Fourier space, as

$latex \nabla^2 \psi = - \omega^2 \psi$,

and $latex \psi \propto e^{\pm i \omega x}$. In graph convolutional neural networks, as Bruna *et al.* suggested in 2013, the convolution is calculated in the graph Fourier space, instead of directly dealing with the Laplacian matrix in all layers of the network.

On the other hand, we know that for the convolution

$latex (f * g)(x) = \int f(\tau) \, g(x - \tau) \, d\tau$,

its Fourier transform is given by

$latex \widehat{f * g}(\omega) = \hat{f}(\omega) \, \hat{g}(\omega)$.

In Fourier space, the convolution of two functions is just the product of their transforms. Similarly, in graph convolutional neural networks, convolutions can be computed in the Fourier space as the mere product of two filters in the Fourier space. More specifically, for finding the convolution of the filters $latex f$ and $latex g$, with $latex U$ being the unitary eigenmatrix of the graph Laplacian,

$latex f * g = U \left( (U^T f) \odot (U^T g) \right)$.

However, such general description is basis-dependent, and it is still computationally expensive. More work has been proposed to smooth the representation, which will be covered in the upcoming blogs.
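This spectral-domain convolution can be sketched directly with numpy's eigendecomposition (the triangle-graph Laplacian and the two signals are just illustrative):

```python
import numpy as np

def graph_convolve(L, f, g):
    """Graph convolution f * g = U ((U^T f) . (U^T g)), elementwise product."""
    _, U = np.linalg.eigh(L)       # L = U diag(lambda) U^T
    return U @ ((U.T @ f) * (U.T @ g))

L = np.array([[2., -1., -1.],
              [-1., 2., -1.],
              [-1., -1., 2.]])     # Laplacian of a triangle graph
f = np.array([1.0, 2.0, 3.0])
g = np.array([0.5, 0.1, 0.4])
print(np.allclose(graph_convolve(L, f, g), graph_convolve(L, g, f)))  # True
```

Since the elementwise product in the spectral domain commutes, the convolution is symmetric in its two arguments, as the check above illustrates.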

As a side note, the readers can verify these relations themselves.

- Python package: graphflow. [PyPI] [Github]
- Python package: mogutda. [PyPI] [Github]
- Joan Bruna, Wojciech Zaremba, Arthur Szlam, Yann LeCun, “Spectral Networks and Locally Connected Networks on Graphs,” arXiv:1312.6203 (2013). [arXiv]
- Blake Nelson, Deve Palakkattukudy, “Driving Predictive Analytics with the Power of Neo4j,” neo4j blog (2018). [neo4j]

I do not think one can really draw a parallel between computer vision and natural language processing. Computer vision is challenging, but natural language processing is even more difficult, because linguistic tasks are not limited to object or meaning recognition; they also involve human psychology, cultures, and linguistic diversity. The objectives are far from identical.

However, the transferable use of embedded language models is definitely a big step forward. Ruder quoted three articles, which I summarize below in a few words.

- Embeddings from Language Models (ELMo, arXiv:1802.05365): based on the successful bidirectional LSTM language models, the authors developed deep contextualized embedding models by collapsing all the layers in the neural network architecture.
- Universal Language Model Fine-Tuning for Text Classification (ULMFiT, arXiv:1801.06146): the authors proposed an architecture that learns representations for specific tasks, involving three steps in training: a) LM pre-training: learning from an unlabeled corpus with abundant data; b) LM fine-tuning: learning from a labeled corpus; and c) classifier fine-tuning: transferred training for specific classification tasks.
- OpenAI Transformer (article still in progress): the authors proposed a simple generative language model with three steps similar to ULMFiT’s: a) unsupervised pre-training: training a language model that maximizes the likelihood of a sequence of tokens within a context window; b) supervised fine-tuning: supervised classification training that maximizes the likelihood using a Bayesian approach; c) task-specific input transformations: training the classifiers on a specific task.

These three articles are intricately related to each other. Without abundant data and good hardware, it is almost impossible to produce these language models. As Ruder suggested, we will probably use a model pre-trained up to the second step of the ULMFiT and OpenAI Transformer recipes, and train our own task-specific model on top. We have been doing this with word-embedding models, and this approach has been common in computer vision too.

- Sebastian Ruder, “NLP’s ImageNet moment has arrived,” *The Gradient* (July 2018). [Gradient] (Chinese translation on Zhihu)
- Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer, “Deep contextualized word representations,” arXiv:1802.05365 (2018). [arXiv]
- Jeremy Howard, Sebastian Ruder, “Universal Language Model Fine-tuning for Text Classification,” arXiv:1801.06146 (2018). [arXiv]
- Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, “Improving Language Understanding by Generative Pre-Training.” (work in progress) [pdf]

Note: the feature image is adapted from Figure 1 of ULMFiT paper.

Upon a few inquiries, I decided to release the codes as a PyPI package, and I named it mogutda, under the MIT license. It is open-source, and the codes can be found at the Github repository MoguTDA. It runs in Python 2.7, 3.5, and 3.6.

For more information and simple tutorial, please refer to the documentation, or the Github page.

- Github: stephenhky/PyTDA. [Github]
- PyPI: mogutda. [PyPI]
- Github: stephenhky/MoguTDA. [Github]
- mogutda’s documentation. [RTFD]
- “Starting the Journey of Topological Data Analysis (TDA),” *Everything About Data Analytics*, WordPress (2015). [WordPress]
- “Constructing Connectivities,” *Everything About Data Analytics*, WordPress (2015). [WordPress]
- “Homology and Betti Numbers,” *Everything About Data Analytics*, WordPress (2015). [WordPress]

Some took a statistical approach. Many studies concluded that election forecasting models often did not take into account the correlations between the predictions for individual states. However, classical computation limits this type of model, which connects individual states (a fully-connected model). Hence, a group from QxBranch and Standard Cognition resorted to adiabatic quantum computation. (See: arXiv:1802.00069.)

D-Wave computers are adiabatic quantum computers that perform quantum annealing. A D-Wave 2X has 1152 qubits and can naturally describe a Boltzmann machine (BM) model, equivalent to the Ising model in statistical physics. The energy function is described by:

$latex E(\mathbf{q}) = \sum_i a_i q_i + \sum_{i < j} b_{ij} q_i q_j$,

where the $latex q_i$'s are the values of all the qubits (0, 1, or their superpositions). The field strengths $latex a_i$ and the coupling constants $latex b_{ij}$ can be tuned. Classical models can handle the first term, which is linear; but the correlations, described by the second term, can be computationally costly for classical computers. Hence, the authors used a D-Wave quantum computer to train election models every two weeks from June 30, 2016 to November 11, 2016, retrieving the correlations between individual states. The model correctly simulated that Mr. Trump would win the election.
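For intuition, this energy function can be evaluated classically on deterministic qubit values; the following toy brute-force search over a 3-qubit instance (with made-up coefficients) shows what the annealer minimizes:

```python
import itertools

def ising_energy(q, a, b):
    """E(q) = sum_i a_i q_i + sum_{i<j} b_ij q_i q_j for binary values q_i."""
    n = len(q)
    linear = sum(a[i] * q[i] for i in range(n))
    pairs = sum(b[i][j] * q[i] * q[j] for i in range(n) for j in range(i + 1, n))
    return linear + pairs

a = [1.0, -0.5, 0.2]                              # made-up field strengths
b = [[0, -1.0, 0.5], [0, 0, -0.8], [0, 0, 0]]     # made-up coupling constants
best = min(itertools.product([0, 1], repeat=3),
           key=lambda q: ising_energy(q, a, b))
print(best)  # (0, 1, 1)
```

The brute-force enumeration is exponential in the number of qubits, which is exactly why the quadratic coupling term becomes intractable classically at scale.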

This Ising model of the election was devised after the election, so it is open to the suspicion that the problem was fitted to the known results. However, this work demonstrated the power of a quantum computer: it can solve political modeling problems that are too complicated for classical computers.

- Wikipedia: Quantum Annealing. [Wikipedia]
- Maxwell Henderson, John Novak, Tristan Cook, “Leveraging Adiabatic Quantum Computation for Election Forecasting,” arXiv:1802.00069 (2018). [arXiv]
- “On Quantum Computer,” *Everything About Data Analytics* (2016). [WordPress]
- QxBranch. [QxBranch]
- Standard Cognition. [StandardCognition]