Generative adversarial networks (GANs) have made a big impact on the world of machine learning. They are particularly useful for generating sample data when there are insufficient data for certain purposes. They are also useful for training on data that are partly labeled and partly unlabeled, i.e., semi-supervised learning (SSL).

The rise of GANs also led to the re-emergence of adversarial learning for handling unbalanced or sensitive data. (For example, see arXiv:1707.00075.)

GANs are particularly useful for computer vision problems. However, they are not very good for natural language problems, because text is discrete and the data cannot be generated continuously. In this context, a modification of the GAN was developed, called discriminative adversarial networks (DAN; see arXiv:1707.02198). Unlike a GAN, which has a discriminator to train a generator to produce good data, a DAN has two discriminators: one, usually denoted as the predictor P, predicts labels for the unlabeled data, and another, usually denoted as the judge J, classifies whether a label is a human label or a machine-predicted label.

The loss function of DAN is very similar to that of GAN: the judge J is trained by minimizing the entropy difference on labeled data, while the predictor P is trained by minimizing that of its predictions on unlabeled data.

However, GAN and DAN are not generative-discriminative pairs.

Systems developed by enterprises such as Netflix produce recommendations. Good recommendations lead to a good user experience and higher return rates. Humans give recommendations based on experience, knowledge, worldviews, wisdom, etc., while automatic recommendation systems do so based on big data and machine learning.

# Recommendation Strategies

Recommendation systems employ one or more of the following strategies:

1. Collaborative Filtering (CF);
2. Content-Based Filtering (CBF);
3. Demographic Filtering (DF); and
4. Knowledge-Based Filtering (KBF).

## 1. Collaborative Filtering (CF)

CF recommends similar items to users of similar tastes. Whether it is user-based filtering or item-based filtering, the same assumption holds. Similarity between users or items is calculated by Pearson correlation or cosine similarity.
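As a minimal sketch of the two similarity measures, the snippet below compares users by their rating vectors. The ratings matrix and the function names are made up for illustration; a real system would also have to handle missing ratings.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pearson_correlation(a, b):
    """Pearson correlation is the cosine similarity of mean-centered vectors."""
    return cosine_similarity(a - a.mean(), b - b.mean())

# Hypothetical ratings: rows are users, columns are items.
ratings = np.array([
    [5.0, 3.0, 4.0, 4.0],  # user 0
    [3.0, 1.0, 2.0, 3.0],  # user 1: similar taste to user 0, but a harsher rater
    [1.0, 5.0, 5.0, 2.0],  # user 2: rather different taste
])

sim_01 = pearson_correlation(ratings[0], ratings[1])
sim_02 = pearson_correlation(ratings[0], ratings[2])
print(f"user 0 vs user 1: {sim_01:.3f}")
print(f"user 0 vs user 2: {sim_02:.3f}")
```

In user-based filtering, items liked by the most similar users would then be recommended; item-based filtering applies the same functions to the columns instead of the rows.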

Matrix factorization (MF), like the similar latent semantic indexing (LSI), is actually a kind of collaborative filtering, although the users and items are converted to encoded vectors, and the recommendation scores are given by the cosine similarity between the encoded vectors of the users and the items.
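To make the encoded vectors concrete, here is a small sketch that factorizes a toy ratings matrix with a truncated SVD and scores every user-item pair by cosine similarity. The matrix, the rank `k`, and the choice of plain SVD (rather than a factorization trained only on observed ratings) are assumptions for illustration.

```python
import numpy as np

# Hypothetical ratings: rows are users, columns are items, 0 means "not rated".
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [0.0, 1.0, 5.0, 4.0],
    [1.0, 0.0, 4.0, 5.0],
])

k = 2  # number of latent dimensions to keep
U, s, Vt = np.linalg.svd(ratings, full_matrices=False)

# Split the singular values evenly between the user and item encodings.
user_vecs = U[:, :k] * np.sqrt(s[:k])
item_vecs = Vt[:k, :].T * np.sqrt(s[:k])

def cosine_scores(users, items):
    """Cosine similarity between every user vector and every item vector."""
    u = users / np.linalg.norm(users, axis=1, keepdims=True)
    v = items / np.linalg.norm(items, axis=1, keepdims=True)
    return u @ v.T

scores = cosine_scores(user_vecs, item_vecs)
print(np.round(scores, 3))
```

Each row of `scores` ranks the items for one user; in this toy matrix, the first two users and the last two users form two taste groups, and the scores reflect that.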

Such recommendation systems suffer from the cold-start problem: new users or new items cannot be accounted for when giving recommendations.

## 2. Content-Based Filtering (CBF)

CBF employs common machine learning algorithms to learn a user’s preferences based on their consumption/purchase history and their profile. Embedded vectors are used too. However, this approach also suffers from the cold-start problem.

## 3. Demographic Filtering (DF)

The DF strategy makes use of users’ profiles, such as age, sex, and other information, to make recommendations. The algorithms might be rule-based or machine-learning-based. However, in the era of GDPR and CCPA, it might give rise to issues regarding fairness, equal opportunities, privacy, or ethics.

## 4. Knowledge-Based Filtering (KBF)

KBF makes recommendations based on expert knowledge of the subject matter, known reasoning, or statistics. Recommendations may be made using a rule-based approach, or a predefined probabilistic model (such as one built from census data). Some systems might even employ a knowledge database. Big data may not be necessary in this kind of system as the reasoning has been manually built in.

# Hybrid Recommendation Systems

Hybrid recommendation systems employ more than one of the above strategies. To combine these strategies, one might apply a voting system to all the results to give an aggregated result, use a weighting scheme, or use stacked generalization to combine all these methods together.
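The weighting scheme is the simplest of these to sketch. In the snippet below, the per-strategy scores and the weights are invented numbers, and the function name `weighted_hybrid` is hypothetical; in practice the weights would be tuned on held-out data.

```python
import numpy as np

def weighted_hybrid(score_lists, weights):
    """Combine per-strategy item scores with a normalized weighting scheme."""
    scores = np.asarray(score_lists, dtype=float)  # shape: (strategies, items)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                # make the weights sum to 1
    return w @ scores

# Hypothetical scores for three items from three strategies.
cf_scores = [0.9, 0.2, 0.4]
cbf_scores = [0.6, 0.8, 0.1]
kbf_scores = [0.5, 0.5, 0.9]

combined = weighted_hybrid([cf_scores, cbf_scores, kbf_scores], weights=[0.5, 0.3, 0.2])
ranking = np.argsort(-combined)  # item indices, best first
print(combined, ranking)
```

A voting system would replace the weighted sum with a count of how many strategies place an item in their top picks, and stacked generalization would learn the combination with another model.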

Fake news is not something new, but it has caught our attention since the 2016 presidential election campaign. Some people label a piece of news fake news if its analysis does not align with their own ideology. Fake news is necessarily biased, although truth is inevitably not neutral either. While a lot of technology tycoons want to handle fake news appropriately on their platforms, the lack of a formal definition makes it difficult.

A previous entry introduced the idea that, in order to detect fake news using machine learning, we have to provide a corpus. This paves the way to training supervised learning models.

Recently, some people have looked at this problem in a different way, from the sociological point of view. In order to evaluate the impact of fake news, instead of a machine learning model, we need a dynamical model of the spread of fake news. In their preprint on arXiv, Brody and Meier studied it using the tools of communication theory. They proposed that the flow of information $\eta_t$ is:

$\eta_t = \sigma X t + B_t + F_t$,

where $t$ is time, $X$ is either 0 or 1, indicating that the news is false or true respectively, $B_t$ is noise (Brownian motion), and $F_t$ is the fake news. The first term indicates that we have more information over time, and thus the flow of true information increases with time. This model assumes a linear evolution. The authors define $F_t$ to be fake news if its expectation $\mathbb{E}(F_t) \neq 0$. Depending on the situation, $X$ and $F_t$ are either independent or correlated.
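To get a feel for this information process, one can simulate a single path of $\eta_t$. The sketch below is only illustrative: the specific form of $F_t$ (a linear drift that switches on at time `fake_start`) is an assumption chosen so that $\mathbb{E}(F_t) \neq 0$ after release, and the function and parameter names are hypothetical, not taken from the preprint.

```python
import numpy as np

def simulate_information(sigma, x, fake_drift, fake_start,
                         t_max=1.0, n_steps=1000, seed=0):
    """Simulate one path of eta_t = sigma * X * t + B_t + F_t."""
    rng = np.random.default_rng(seed)
    dt = t_max / n_steps
    t = np.linspace(0.0, t_max, n_steps + 1)

    # Brownian motion: cumulative sum of independent N(0, dt) increments.
    increments = rng.normal(0.0, np.sqrt(dt), size=n_steps)
    brownian = np.concatenate([[0.0], np.cumsum(increments)])

    # Assumed fake-news term: zero before fake_start, then a linear drift.
    fake = fake_drift * np.clip(t - fake_start, 0.0, None)

    eta = sigma * x * t + brownian + fake
    return t, eta, fake

t, eta, fake = simulate_information(sigma=1.0, x=1, fake_drift=-0.5, fake_start=0.3)
print(eta[-1], fake[-1])
```

With a negative `fake_drift`, the fake-news term pulls the observed signal away from the true value $X = 1$, which is the kind of bias a voter unaware of the fake news would absorb.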

To study the impact, the authors categorized voters into three categories:

• Category I: unaware of the existence of fake news, but act rationally;
• Category II: aware of the existence of fake news, but unaware of the time point when fake news emerges;
• Category III: fully aware of the existence of fake news, and fully eliminate it when making a judgement.

There are further mathematical models for the election dynamics, but readers can refer to the preprint for the details. With a piece of fake news in favor of candidate B, this model gives the influence of fake news on one’s judgment, as plotted below:

This dynamical model confirms that by eliminating the fake news, the voters make better judgments. However, when voters are aware of the fake news but do not know when it emerges, the impact is still disastrous.

To me, this study actually confirms that fact-checking is useless in terms of curbing the turmoil introduced by fake news. Information nowadays flows with so little viscosity that ways to eliminate fake news have to be devised. However, we know censorship is not the way to go, as it is a highway to a totalitarian government. The future of democracy is dim.

Deep learning has achieved big success in the past few years, but its interpretive power is limited. It works largely because of the abundance of data. On the other hand, traditional machine learning algorithms are much better in interpretive power, but they rely on costly manual feature engineering, a legacy of the lack of data in earlier eras. In light of this, a group of scientists initiated the work on graph networks, aiming at devising new artificial intelligence algorithms that exploit the advantages of both worlds, while still holding to the principle of combinatorial generalization, that is, constructing new methods from known building blocks. Graphs are good for interpretation as they are good for relational representation.

The use of graph networks goes beyond the graph convolutional neural networks (GCN) in the previous two blog entries (part I and part II). To achieve relational inductive biases, we need an entity (an element with attributes), a relation (a property between entities), and a rule (a function that maps entities and relations to other entities and relations). This can be realized using a graph, which is a mathematical structure that contains nodes and edges (that connect nodes). To generalize the use of graph networks in various machine learning and deep learning methods, they reviewed the graph block, which is basically a function, or a mapping, from one graph to another graph, as shown in the algorithm below:
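A graph-to-graph mapping of this kind can be sketched in a few lines. The version below is an assumption-laden illustration, not the authors' algorithm: it updates edges first, then nodes, then a global attribute, which is my reading of the usual ordering, and it uses plain sums as placeholder update and aggregation functions where a real graph network would use learned functions. The name `graph_block` and the requirement that all attributes share one dimension are simplifications.

```python
import numpy as np

def graph_block(nodes, edges, senders, receivers, global_attr):
    """Map a graph (nodes, edges, global attribute) to an updated graph.

    nodes: (n, d) array, edges: (m, d) array, global_attr: (d,) array.
    senders / receivers: (m,) integer arrays with the endpoints of each edge.
    """
    # 1. Update each edge from its own attribute, its endpoints, and the global.
    new_edges = edges + nodes[senders] + nodes[receivers] + global_attr

    # 2. For each node, aggregate (sum) the updated edges pointing to it.
    incoming = np.zeros_like(nodes)
    np.add.at(incoming, receivers, new_edges)

    # 3. Update each node from its attribute, its incoming edges, and the global.
    new_nodes = nodes + incoming + global_attr

    # 4. Update the global attribute from all updated edges and nodes.
    new_global = global_attr + new_edges.sum(axis=0) + new_nodes.sum(axis=0)
    return new_nodes, new_edges, new_global

# A toy chain graph 0 -> 1 -> 2 with one-dimensional attributes.
nodes = np.array([[1.0], [2.0], [3.0]])
edges = np.zeros((2, 1))
senders = np.array([0, 1])
receivers = np.array([1, 2])
new_nodes, new_edges, new_global = graph_block(
    nodes, edges, senders, receivers, np.zeros(1))
print(new_nodes.ravel(), new_edges.ravel(), new_global)
```

The output has the same structure as the input, so blocks like this can be stacked, which is where the combinatorial generalization comes from.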

Prior work on graph networks does exist; the authors listed previous works that can be seen as graph networks, for example:

• Message-passing neural network (MPNN) (2017);
• Non-local neural networks (NLNN) (2018).

The use of graph networks, I believe, is the next trend. There has been work on graph-powered machine learning (see Google AI blog, GraphAware Slideshare). I recently started an open-source project, a Python package called graphflow, to explore various algorithms using graphs, including PageRank, HITS, resistance, and non-linear resistance.

Sebastian Ruder recently wrote an article on The Gradient and asserted that the oracle of natural language processing is emerging. While I am not sure whether such a confident statement is overstated, I do look forward to the moment when we can download pre-trained embedded language models and transfer them to our use cases, just as we use pre-trained word-embedding models such as Word2Vec and FastText.

I do not think one can really draw a parallel between computer vision and natural language processing. Computer vision is challenging, but natural language processing is even more difficult because the tasks regarding linguistics are not limited to object or meaning recognition, but also involve human psychology, cultures, and linguistic diversity. The objectives are far from being identical.

However, the transferable use of embedded language models is definitely a big step forward. Ruder quoted three articles, which I summarize below in a few words.

• Embeddings from Language Models (ELMo, arXiv:1802.05365): based on the successful bidirectional LSTM language models, the authors developed deep contextualized embedding models by collapsing all layers in the neural network architecture.
• Universal Language Model Fine-Tuning for Text Classification (ULMFiT, arXiv:1801.06146): the authors proposed a type of architecture that learns representations for specific tasks, which involves three steps in training: a) LM pre-training: learning from an unlabeled corpus with abundant data; b) LM fine-tuning: learning from the corpus of the target task; and c) classifier fine-tuning: transferred training for specific classification tasks.
• OpenAI Transformer (article still in progress): the author proposed a simple generative language model with three steps similar to those in ULMFiT: a) unsupervised pre-training: training a language model that maximizes the likelihood of a sequence of tokens within a context window; b) supervised fine-tuning: a supervised classification training that maximizes the likelihood using the Bayesian approach; c) task-specific input transformations: training the classifiers on a specific task.

These three articles are intricately related to each other. Without abundant data and good hardware, it is almost impossible to produce the language models. As Ruder suggested, we will probably have a pre-trained model up to the second step of the ULMFiT and OpenAI Transformer papers, and then train our own specific model for our use. We have been doing this for word-embedding models, and this approach has been common in computer vision too.

The 2016 US Presidential Election ended with the surprise that Mr. Donald Trump won, despite the overwhelming prediction of a Clinton victory. There have been many studies challenging the theories in traditional political forecasting.

Some took a statistical approach. Many studies concluded that many election forecasting models did not take into account the correlations between the predictions for individual states. However, classical computation methods limit this type of model, which connects individual states (fully-connected models). Hence, a group from QxBranch and Standard Cognition resorted to adiabatic quantum computation. (See: arXiv:1802.00069.)

D-Wave computers are adiabatic quantum computers that perform quantum annealing. A D-Wave 2X has 1152 qubits, and can naturally describe a Boltzmann Machine (BM) model, equivalent to the Ising model in statistical physics. The energy function is described by:

$E[\mathbf{s}] = -\sum_{s_i \in \mathbf{S}} b_i s_i - \sum_{s_i, s_j \in \mathbf{S}} W_{ij} s_i s_j$ ,

where $\mathbf{s}$ are the values of all qubits (0, 1, or their superpositions). The field strengths $b_i$ and coupling constants $W_{ij}$ can be tuned. Classical models can handle the first term, which is linear; but the correlations, described by the second term, can be computationally costly for classical computers. Hence, the authors used a D-Wave quantum computer to train the election models every two weeks from June 30, 2016 to November 11, 2016, and retrieved the correlations between individual states. They then correctly simulated that Mr. Trump would win the election.
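For classical 0/1 configurations, the energy function is easy to evaluate directly, which also shows where the cost comes from. In the sketch below, the field strengths and couplings are made-up numbers, and I assume each pair $(i, j)$ is counted once with $i < j$, so only the upper triangle of `W` is used.

```python
import itertools
import numpy as np

def ising_energy(s, b, W):
    """Energy E[s] = -sum_i b_i s_i - sum_{i<j} W_ij s_i s_j for a 0/1 state s."""
    s = np.asarray(s, dtype=float)
    linear = -np.dot(b, s)
    # Keep only the upper triangle so that each pair (i, j) is counted once.
    pairwise = -s @ np.triu(W, k=1) @ s
    return float(linear + pairwise)

# Hypothetical field strengths and coupling constants for three units.
b = np.array([0.5, -0.2, 0.1])
W = np.array([
    [0.0, 1.0, -0.5],
    [0.0, 0.0, 0.3],
    [0.0, 0.0, 0.0],
])

energy = ising_energy([1, 0, 1], b, W)

# Brute-force search over all 2**n states: feasible only for tiny n.
best_state = min(itertools.product([0, 1], repeat=len(b)),
                 key=lambda state: ising_energy(state, b, W))
print(energy, best_state, ising_energy(best_state, b, W))
```

The brute-force search visits $2^n$ configurations, which is hopeless for a model with one unit per state of the union; this is the kind of cost that motivates sampling the model on an annealer instead.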

This Ising model of the election was devised after the election, so it is open to the suspicion that the model was fitted to the known results. However, this work demonstrated the power of a quantum computer in solving political modeling problems that can be too complicated for classical computers.

Quantum computation was initially proposed partly to simulate the physical universe, because of the likeness between nature and quantum systems. Some experimental simulations of Hawking radiation or Kibble-Zurek mechanisms were carried out in condensed matter systems, but they are simply too expensive to carry out. However, some scientists performed simulations of molecular systems using a quantum computer with an array of superconducting qubits. They performed the electronic structure calculation, as reported in “Scalable Quantum Simulation of Molecular Energies,” published in Physical Review X. Later, Google’s Quantum AI Team, Microsoft’s QuArC Team, and Caltech reported their work on simulating electronic structure using a quantum computer, which reduces running time while increasing accuracy. Their work was reported in “Low-Depth Quantum Simulation of Materials,” also published in Physical Review X. The same team, joined by a group from Harvard, further studied the application of these molecular systems, lined up as a linear array, to the design of algorithms for quantum computers. It is reported in “Quantum Simulation of Electronic Structure with Linear Depth and Connectivity,” published in Physical Review Letters.

These researchers published an open-source software package, a Python library called OpenFermion. It facilitates the simulation of quantum algorithms for fermionic systems.

For completeness, a few years ago, another group of scientists published a Python package, QuTiP, that helps simulate open quantum systems.

Google launched its AutoML project last year, in an effort to automate the process of seeking the most appropriate neural net designs for a particular classification problem. Designing neural networks has been time-consuming, despite the use of TensorFlow / Keras or other deep learning frameworks nowadays. Therefore, the Google Brain team devised the Neural Architecture Search (NAS), using a recurrent neural network to perform reinforcement learning. (See their blog entry.) It is used to find the neural networks for image classifiers. (See their blog entry.)

Apparently, with state-of-the-art hardware, it is to Google’s advantage to perform such an experiment on the CIFAR-10 dataset using 450 GPUs for 3-4 days. But this makes the work inaccessible to small companies or anyone with only a personal computer.

Then came an improvement to NAS: the Efficient Neural Architecture Search via Parameter Sharing (ENAS), which is a much more efficient method to search for a neural network, by narrowing down the search to a subgraph. It reduces the need for GPUs.

While I do not think it is a threat to machine learning engineers, it is a great algorithm to note. It looks to me like a brute-force algorithm, but it still needs scientists and engineers to gain insights. Still, I believe development of the theory behind neural networks is much needed.

There are many tasks in natural language processing that are challenging. This blog entry is on text summarization, and briefly summarizes the survey article on this topic (arXiv:1707.02268). The authors of the article defined the task as follows:

Automatic text summarization is the task of producing a concise and fluent summary while preserving key information content and overall meaning.

There are basically two approaches to this task:

• extractive summarization: identifying important sections of the text, and extracting them; and
• abstractive summarization: producing summary text in a new way.

Most algorithmic methods developed are of the extractive type, while most human writers summarize using an abstractive approach. There are many methods in the extractive approach, such as identifying given keywords, identifying sentences similar to the title, or taking the text at the beginning of the documents.

How do we instruct the machines to perform extractive summarization? The authors mentioned two representations: topic and indicator. In topic representations, frequencies, tf-idf, latent semantic indexing (LSI), or topic models (such as latent Dirichlet allocation, LDA) are used. However, simply extracting sentences with these algorithms may not generate a readable summary. Employing knowledge bases or considering contexts (from web search, e-mail conversation threads, scientific articles, author styles, etc.) is useful.

In indicator representation, the authors mentioned the graph methods, inspired by PageRank. (see this) “Sentences form vertices of the graph and edges between the sentences indicate how similar the two sentences are.” And the key sentences are identified with ranking algorithms. Of course, machine learning methods can be used too.
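The graph method can be sketched with a small PageRank-style iteration over a sentence-similarity graph. Everything specific here is an assumption for illustration: similarity is measured by word overlap (Jaccard), the damping factor is the conventional 0.85, and the example sentences and function names are made up.

```python
import re
import numpy as np

def tokenize(sentence):
    """Lower-case a sentence and return its set of words."""
    return set(re.findall(r"[a-z]+", sentence.lower()))

def similarity_matrix(sentences):
    """Edge weights: Jaccard overlap between the word sets of two sentences."""
    tokens = [tokenize(s) for s in sentences]
    n = len(tokens)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and tokens[i] and tokens[j]:
                sim[i, j] = len(tokens[i] & tokens[j]) / len(tokens[i] | tokens[j])
    return sim

def rank_sentences(sentences, damping=0.85, n_iter=100):
    """PageRank-style power iteration on the sentence graph."""
    sim = similarity_matrix(sentences)
    n = len(sentences)
    col_sums = sim.sum(axis=0)
    # Normalize each column; a sentence with no edges links uniformly to all.
    safe_sums = np.where(col_sums > 0, col_sums, 1.0)
    transition = np.where(col_sums > 0, sim / safe_sums, 1.0 / n)
    rank = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        rank = (1 - damping) / n + damping * transition @ rank
    return rank

sentences = [
    "Graph networks pass messages between nodes.",
    "Nodes in graph networks exchange messages along edges.",
    "The weather was pleasant yesterday.",
]
rank = rank_sentences(sentences)
print(np.round(rank, 3))
```

The sentences with the highest rank would be extracted as the summary; here the unrelated third sentence receives the lowest score because no other sentence links to it.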

Evaluating the performance of text summarization is difficult. Human evaluation is unavoidable, but with manual approaches, some statistics can be calculated, such as ROUGE.
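As a rough illustration of how such a statistic works, the sketch below computes an n-gram overlap score in the spirit of ROUGE-N between a machine summary and a human-written reference. It is a simplified version under stated assumptions (whitespace tokenization, a single reference, no stemming), and the function name `rouge_n` is my own, not that of an official implementation.

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """N-gram overlap between a candidate summary and one reference summary."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Clipped overlap: each n-gram counts at most as often as in both texts.
    overlap = sum((cand & ref).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}

reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
rouge_1 = rouge_n(candidate, reference, n=1)
rouge_2 = rouge_n(candidate, reference, n=2)
print(rouge_1, rouge_2)
```

The reference summary still has to be written by a human, which is why such statistics complement human evaluation rather than replace it.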