Embedded Language Models

Sebastian Ruder recently wrote an article on The Gradient and asserted that the oracle of natural language processing is emerging. While I am not sure whether such a confident statement is overstated, I do look forward to the moment when we can download pre-trained embedded language models and transfer them to our use cases, just as we use pre-trained word-embedding models such as Word2Vec and FastText.

I do not think one can really draw a parallel between computer vision and natural language processing. Computer vision is challenging, but natural language processing is even more difficult because linguistic tasks are not limited to object or meaning recognition; they also involve human psychology, cultures, and linguistic diversity. The objectives are far from identical.

However, the transferable use of embedded language models is definitely a big step forward. Ruder quoted three articles, which I summarize below in a few words.

  • Embeddings from Language Models (ELMo, arXiv:1802.05365): based on the successful bidirectional LSTM language models, the authors developed deep contextualized embeddings by collapsing all layers of the neural network architecture.
  • Universal Language Model Fine-Tuning for Text Classification (ULMFiT, arXiv:1801.06146): the authors proposed a type of architecture that learns representations for specific tasks, with training in three steps: a) LM pre-training: learning from a large unlabeled general-domain corpus; b) LM fine-tuning: further training the language model on the corpus of the target task; and c) classifier fine-tuning: training the classifier for the specific classification task.
  • OpenAI Transformer (article still in progress): the authors proposed a simple generative language model trained in three steps similar to those of ULMFiT: a) unsupervised pre-training: training a language model that maximizes the likelihood of a sequence of tokens within a context window; b) supervised fine-tuning: a supervised classification training that maximizes the likelihood using the Bayesian approach; c) task-specific input transformations: training the classifiers on a specific task.
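
As a rough illustration of step a) of the OpenAI Transformer, the pre-training objective can be written as the log-likelihood below. The symbols are my own labels and are not defined in the summary above: u_1, ..., u_n are the tokens of the unlabeled corpus, k is the size of the context window, and \Theta denotes the model parameters.

```latex
L(\Theta) = \sum_{i} \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)
```

Training adjusts \Theta so that this sum is as large as possible, i.e. each token becomes as predictable as possible from the k tokens before it.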

These three articles are intricately related to each other. Without abundant data and good hardware, it is almost impossible to produce these language models. As Ruder suggested, we will probably get pre-trained models covering up to the second step of the ULMFiT and OpenAI Transformer papers, and then train our own task-specific models on top of them. We have been doing this with word-embedding models, and this approach has been common in computer vision too.

moguTDA: Python package for Simplicial Complex

It has been a while since I wrote about topological data analysis (TDA). For pedagogical reasons, a lot of the code was demonstrated in the GitHub repository PyTDA. However, it is not modularized as a package, and the code runs in Python 2.7 only.

After a few inquiries, I decided to release the code as a PyPI package named mogutda, under the MIT license. It is open-source, and the code can be found in the GitHub repository MoguTDA. It runs in Python 2.7, 3.5, and 3.6.

For more information and a simple tutorial, please refer to the documentation or the GitHub page.

Election Forecasting using Quantum Computers

The 2016 US Presidential Election ended with a surprise: Mr. Donald Trump won, despite the overwhelming prediction of a Clinton victory. There have since been many studies challenging the theories of traditional political forecasting.

Some took a statistical approach. Many studies concluded that many election forecasting models did not take into account the correlations between the predictions for individual states. However, classical computation limits the use of models that connect the individual states (fully-connected models). Hence, a group from QxBranch and Standard Cognition resorted to adiabatic quantum computation. (See: arXiv:1802.00069.)

D-Wave computers are adiabatic quantum computers that perform quantum annealing. A D-Wave 2X has 1152 qubits, and can naturally describe a Boltzmann Machine (BM) model, equivalent to the Ising model in statistical physics. The energy function is described by:

E[\mathbf{s}] = -\sum_{s_i \in \mathbf{S}} b_i s_i - \sum_{s_i, s_j \in \mathbf{S}} W_{ij} s_i s_j ,

where \mathbf{s} are the values of all qubits (0, 1, or their superpositions). The field strengths b_i and coupling constants W_{ij} can be tuned. Classical models can handle the first term, which is linear; but the correlations, described by the second term, can be computationally costly for classical computers. Hence, the authors used a D-Wave quantum computer to train the election models every two weeks from June 30, 2016 to November 11, 2016, and retrieved the correlations between individual states. The models correctly simulated that Mr. Trump would win the election.
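
To make the energy function above concrete, here is a minimal classical sketch that evaluates it for one definite configuration of 0/1 values. It is only an illustration and not the authors' code: the function name ising_energy, the three-state toy model, and all the numbers are made up, and I assume that each pair (i, j) is counted once in the second sum.

```python
import numpy as np

def ising_energy(s, b, W):
    """Evaluate E[s] = -sum_i b_i s_i - sum_{i<j} W_ij s_i s_j for one configuration."""
    s = np.asarray(s, dtype=float)
    b = np.asarray(b, dtype=float)
    # Assumption: each pair (i, j) is counted once, so keep the strict upper triangle.
    W_upper = np.triu(np.asarray(W, dtype=float), k=1)
    return float(-b @ s - s @ W_upper @ s)

# Hypothetical toy model with three "states"; the numbers are made up.
b = np.array([0.5, -0.2, 0.1])           # field strengths
W = np.array([[0.0, 1.0, 0.0],           # coupling constants
              [0.0, 0.0, -0.5],
              [0.0, 0.0, 0.0]])
s = np.array([1, 1, 0])                  # one classical configuration
energy = ising_energy(s, b, W)
```

The linear term costs one pass over the variables, while the pairwise term grows with the square of their number, and the number of configurations to compare grows exponentially. That is the part the annealer is meant to help with.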

This Ising model of the election was devised after the election, so it is open to the suspicion that the model was fitted to the known result. However, this work demonstrated that a quantum computer can solve some political modeling problems that may be too complicated for classical computers.

Quantum Chemistry Simulation on Quantum Computers

Quantum computation was initially proposed partly to simulate the physical universe, because of the likeness between nature and quantum systems. Some experimental simulations of Hawking radiation or the Kibble-Zurek mechanism have been carried out in condensed matter systems, but such experiments are simply too expensive. However, some scientists performed simulations of molecular systems using a quantum computer with an array of superconducting qubits. They performed an electronic structure calculation, as reported in “Scalable Quantum Simulation of Molecular Energies,” published in Physical Review X. Later, Google’s Quantum AI Team, Microsoft’s QuArC Team, and Caltech reported their work on simulating electronic structure using a quantum computer, which reduces the running time while increasing the accuracy. Their work was reported in “Low-Depth Quantum Simulation of Materials,” also published in Physical Review X. The same team, joined by a group from Harvard, further studied algorithms for simulating these systems on quantum computers whose qubits are lined up as a linear array. It is reported in “Quantum Simulation of Electronic Structure with Linear Depth and Connectivity,” published in Physical Review Letters.

These researchers published an open-source Python library called OpenFermion. It facilitates the simulation of quantum algorithms for fermionic systems.

For completeness, a few years ago, another group of scientists published a Python package, QuTiP, that helps simulate open quantum systems.

Development of Neural Architecture Search

Google launched its AutoML project last year, in an effort to automate the search for the most appropriate neural network design for a particular classification problem. Designing neural networks has been time-consuming, despite the use of TensorFlow, Keras, or other deep learning frameworks nowadays. Therefore, the Google Brain team devised Neural Architecture Search (NAS), which uses a recurrent neural network trained with reinforcement learning to propose architectures. (See their blog entry.) It is used to find neural networks for image classifiers. (See their blog entry.)

With its state-of-the-art hardware, Google has the advantage of being able to run such an experiment on the CIFAR-10 dataset using 450 GPUs for 3-4 days. But this makes the work inaccessible to small companies or to anyone with only a personal computer.

Then came an improvement to NAS: Efficient Neural Architecture Search via Parameter Sharing (ENAS), a much more efficient method to search for a neural network by narrowing the search down to a subgraph. It reduces the need for GPUs.

While I do not think it is a threat to machine learning engineers, it is a great algorithm to note. It looks to me like a brute-force algorithm, but it still needs scientists and engineers to gain insights from it. Still, I believe the development of the theory behind neural networks is much needed.

Summarizing Text Summarization

There are many challenging tasks in natural language processing. This blog entry briefly summarizes a survey article on one of them, text summarization (arXiv:1707.02268). The authors of the article defined the task as follows:

Automatic text summarization is the task of producing a concise and fluent summary while preserving key information content and overall meaning.

There are basically two approaches to this task:

  • extractive summarization: identifying important sections of the text, and extracting them; and
  • abstractive summarization: producing summary text in a new way.

Most algorithmic methods developed so far are of the extractive type, while most human writers summarize using the abstractive approach. There are many methods in the extractive approach, such as identifying given keywords, identifying sentences similar to the title, or wrangling the text at the beginning of the documents.

How do we instruct machines to perform extractive summarization? The authors mentioned two representations: topic and indicator. In topic representations, frequencies, tf-idf, latent semantic indexing (LSI), or topic models (such as latent Dirichlet allocation, LDA) are used. However, simply extracting sentences with these algorithms may not generate a readable summary. Employing knowledge bases or considering contexts (from web search, e-mail conversation threads, scientific articles, author styles etc.) is useful.

In indicator representations, the authors mentioned graph methods inspired by PageRank: “Sentences form vertices of the graph and edges between the sentences indicate how similar the two sentences are.” The key sentences are then identified with ranking algorithms. Of course, machine learning methods can be used too.
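
The graph idea can be sketched in a few lines. This is a minimal illustration of a PageRank-style ranking by power iteration, not the method of any particular paper: the function name rank_sentences, the damping factor of 0.85, and the similarity matrix (which in practice would come from something like tf-idf cosine similarity) are all assumptions of mine.

```python
import numpy as np

def rank_sentences(similarity, damping=0.85, n_iter=100):
    """Score sentences by PageRank-style power iteration on a similarity graph."""
    sim = np.array(similarity, dtype=float)
    np.fill_diagonal(sim, 0.0)               # ignore self-similarity
    n = sim.shape[0]
    col_sums = sim.sum(axis=0)
    col_sums[col_sums == 0] = 1.0            # avoid division by zero for isolated sentences
    transition = sim / col_sums              # each column now sums to 1
    scores = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        scores = (1 - damping) / n + damping * (transition @ scores)
    return scores

# Hypothetical similarity matrix for four sentences (symmetric, made-up numbers).
similarity = [[1.0, 0.8, 0.6, 0.0],
              [0.8, 1.0, 0.5, 0.1],
              [0.6, 0.5, 1.0, 0.0],
              [0.0, 0.1, 0.0, 1.0]]
scores = rank_sentences(similarity)
ranking = np.argsort(-scores)                # sentence indices, most central first
```

An extractive summary would then take the first few sentences of ranking and present them in their original order.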

Evaluating the performance of text summarization is difficult. Human evaluation is unavoidable, but given manually written reference summaries, some statistics can be calculated, such as ROUGE.
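
To give a feeling for what such a statistic measures, here is a simplified sketch of the recall-oriented ROUGE-N score: the fraction of n-grams in a reference summary that also appear in the machine-generated one. The function name rouge_n_recall and the two example sentences are made up, and real ROUGE implementations add more (stemming, multiple references, precision and F-measures).

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also appear in the candidate (clipped counts)."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Each reference n-gram is matched at most as often as it occurs in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total

reference = "the cat sat on the mat"     # hypothetical human-written summary
candidate = "the cat lay on the mat"     # hypothetical machine summary
rouge1 = rouge_n_recall(candidate, reference, n=1)   # 5 of 6 unigrams match
rouge2 = rouge_n_recall(candidate, reference, n=2)   # 3 of 5 bigrams match
```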

Essential Python Packages

Almost three years ago, I wrote a blog entry titled Useful Python Packages, which listed the packages that I deemed essential. How has the list changed over the past three years?

First of all, three years ago, most people were still writing Python 2.7, but now there is a trend to switch to Python 3. I admit that I have not started the switch yet, but in the short term I will have no choice but to do so.

What are some of the essential packages?

Numerical Packages

  • numpy: numerical Python, containing most basic numerical routines such as matrix manipulation, linear algebra, random sampling, numerical integration etc. There is a built-in wrapper for Fortran as well. Actually, numpy is so important that some Linux distributions include it with Python.
  • scipy: scientific Python, containing some functions useful for scientific computing, such as sparse matrices, numerical differential equations, advanced linear algebra, special functions etc.
  • networkx: package that handles various types of networks
  • PuLP: linear programming
  • cvxopt: convex optimization

Data Visualization

  • matplotlib: basic plotting.
  • ggplot: the Python counterpart of R's ggplot2, for producing publication-quality plots.

Data Manipulation

  • pandas: data manipulation, working with data frames in Python, and save/load of various formats such as CSV and Excel

Machine Learning

  • scikit-learn: machine-learning library in Python, containing classes and functions for supervised and unsupervised learning

Probabilistic Programming

  • PyMC: Metropolis-Hastings algorithm
  • Edward: deep probabilistic programming

Deep Learning Frameworks

  • TensorFlow: because of Google’s marketing effort, TensorFlow is now the industry standard for building deep learning networks, with a rich set of mathematical functions, especially for neural network cells, and with GPU capability
  • Keras: containing routines for high-level layers of deep learning neural networks, with TensorFlow, Theano, or CNTK as the backend
  • PyTorch: a rival to TensorFlow

Natural Language Processing

  • nltk: natural language processing toolkit for Python, containing bag-of-words model, tokenizer, stemmers, chunker, lemmatizers, part-of-speech taggers etc.
  • gensim: a fast natural language processing package useful for topic modeling, word embedding, latent semantic indexing etc.
  • shorttext: text mining package good for handling short sentences, which provides high-level routines for training neural network classifiers, or generating features represented by topic models or autoencodings.
  • spacy: industry-standard toolkit for common natural language processing tasks

GUI

I can probably list more, but I think I covered most of them. If you do not find something useful, it is probably time for you to write a brand new package.

Document-Term Matrix: Text Mining in R and Python

In text mining, it is important to create the document-term matrix (DTM) of the corpus we are interested in. A DTM is basically a matrix, with documents designated by rows and words by columns, whose elements are the counts or the weights (usually tf-idf). Subsequent analysis is usually based on the DTM.
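
To make the definition concrete, here is a tiny DTM of raw counts built with nothing but the Python standard library. The three-document corpus and the variable names are made up for illustration only.

```python
from collections import Counter

# Hypothetical toy corpus; document IDs and texts are made up.
docs = {
    'doc1': 'we the people',
    'doc2': 'the people change',
    'doc3': 'change we need change',
}

# Count the tokens in each document.
counts = {docid: Counter(text.split()) for docid, text in docs.items()}

# Columns of the matrix: the sorted vocabulary of the whole corpus.
vocab = sorted({token for counter in counts.values() for token in counter})

# Rows are documents (in the order of `docs`), columns are terms (in the order of `vocab`).
dtm = [[counts[docid][token] for token in vocab] for docid in docs]

col = vocab.index('change')
total_freq = sum(row[col] for row in dtm)           # total occurrences of "change"
doc_freq = sum(1 for row in dtm if row[col] > 0)    # number of documents containing "change"
```

Real corpora give very sparse matrices, which is why dedicated packages store the DTM in a sparse format rather than as nested lists.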

Exploring the DTM therefore becomes an important task for a good text-mining tool. How do we perform exploratory data analysis on a DTM using R and Python? We will demonstrate it using the data set of U. S. Presidents’ Inaugural Addresses, which has been preprocessed and can be downloaded here.

R: textmineR

In R, we can use the package textmineR, which was introduced in a previous post. We use it together with other packages such as dplyr (for tidy data analysis) and SnowballC (for stemming). Load all of them at the beginning:

library(dplyr)
library(textmineR)
library(SnowballC)

Load the datasets:

usprez.df<- read.csv('inaugural.csv', stringsAsFactors = FALSE)

Then we create the DTM, removing all digits and punctuation, making all letters lowercase, and stemming all words using the Porter stemmer.

dtm<- CreateDtm(usprez.df$speech,
                doc_names = usprez.df$yrprez,
                ngram_window = c(1, 1),
                lower = TRUE,
                remove_punctuation = TRUE,
                remove_numbers = TRUE,
                stem_lemma_function = wordStem)

Then we define a set of helper functions:

get.doc.tokens<- function(dtm, docid)
  dtm[docid, ] %>% as.data.frame() %>% rename(count=".") %>%
  mutate(token=row.names(.)) %>% arrange(-count)

get.token.occurrences<- function(dtm, token)
  dtm[, token] %>% as.data.frame() %>% rename(count=".") %>%
  mutate(token=row.names(.)) %>% arrange(-count)

get.total.freq<- function(dtm, token) dtm[, token] %>% sum

get.doc.freq<- function(dtm, token)
  dtm[, token] %>% as.data.frame() %>% rename(count=".") %>%
  filter(count>0) %>% pull(count) %>% length

Then we can happily extract information. For example, if we want to get the most common words in Obama’s 2009 speech, enter:

dtm %>% get.doc.tokens('2009-Obama') %>% head(10)

Or to find which speeches have the word “change” (we need to stem the word before looking it up):

dtm %>% get.token.occurrences(wordStem('change')) %>% head(10)

You can also get the number of speeches in which the word occurs (its document frequency) by:

dtm %>% get.doc.freq(wordStem('change'))   # gives 28

Python: shorttext

In Python, similar things can be done using the package shorttext, described in a previous post. It uses other packages such as pandas and stemming. Load all packages first:

import shorttext
import numpy as np
import pandas as pd
from stemming.porter import stem

import re

And define the preprocessing pipeline:

pipeline = [lambda s: re.sub(r'[^\w\s]', '', s),   # remove punctuation
            lambda s: re.sub(r'[\d]', '', s),      # remove digits
            lambda s: s.lower(),                   # make all letters lowercase
            lambda s: ' '.join(map(stem, shorttext.utils.tokenize(s)))   # Porter stemming
 ]
txtpreprocessor = shorttext.utils.text_preprocessor(pipeline)

The function txtpreprocessor above performs the same preprocessing steps as we did in R.

Load the dataset:

usprezdf = pd.read_csv('inaugural.csv')

The corpus needs to be preprocessed before being put into the DTM:

docids = list(usprezdf['yrprez'])    # defining document IDs
corpus = [txtpreprocessor(speech).split(' ') for speech in usprezdf['speech']]

Then create the DTM:

dtm = shorttext.utils.DocumentTermMatrix(corpus, docids=docids, tfidf=False)

Then we do the same things as above. To get the most common words in Obama’s 2009 speech, enter:

dtm.get_doc_tokens('2009-Obama')

Or we look up which speeches have the word “change”:

dtm.get_token_occurences(stem('change'))

Or to get the document frequency of the word:

dtm.get_doc_frequency(stem('change'))

The Python and R code give different document frequencies, probably because the two stemmers work slightly differently.

Application of Wasserstein GAN

Since it was proposed that GAN use the Wasserstein distance as the training metric, GAN has usually been seen as a transportation problem. It was mentioned in a previous post that GAN can be seen this way, and that because of this, some computation can be simplified by relating a kernel in the discriminator to the generator.

GAN can be used in the word translation problem too. In a recent preprint on arXiv (refer to arXiv:1710.04087), Wasserstein GAN has been used to train a word translation model, given that there are no parallel data between the word embeddings of the two languages. The translation mapping is seen as a generator, and the mapping is described using the Wasserstein distance. The training objective is cross-domain similarity local scaling (CSLS). Their work has been performed on English-Russian and English-Chinese mappings.
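
As far as I understand the preprint, CSLS scores a candidate translation pair by their cosine similarity, minus a penalty for vectors that are close to many vectors of the other language. The sketch below follows that reading and should be taken as my own illustration: the function name csls, the neighbourhood size k=2, and the toy vectors are assumptions, and it only scores vectors that have already been mapped into the target space.

```python
import numpy as np

def csls(mapped_src, tgt, k=2):
    """CSLS scores between mapped source vectors (rows) and target vectors (rows)."""
    # Normalize the rows so that dot products are cosine similarities.
    src = mapped_src / np.linalg.norm(mapped_src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    cos = src @ tgt.T                                   # shape (n_src, n_tgt)
    # Mean similarity of each source vector to its k nearest target neighbours.
    r_src = np.sort(cos, axis=1)[:, -k:].mean(axis=1)
    # Mean similarity of each target vector to its k nearest source neighbours.
    r_tgt = np.sort(cos, axis=0)[-k:, :].mean(axis=0)
    return 2 * cos - r_src[:, None] - r_tgt[None, :]

# Hypothetical toy embeddings: three target words and three mapped source words.
tgt = np.eye(3)
mapped_src = np.array([[0.9, 0.1, 0.0],
                       [0.1, 0.9, 0.1],
                       [0.0, 0.2, 0.8]])
scores = csls(mapped_src, tgt)
translations = scores.argmax(axis=1)    # index of the best target word for each source word
```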

It seems to work. Given that GAN sometimes does not work for unknown reasons, it is exciting that it works.
