Embedded Language Models

Sebastian Ruder recently wrote an article on The Gradient and asserted that the oracle of natural language processing is emerging. While I am not sure such confident statement is overstated, I do look forward to the moment that we will download pre-trained embedded language models and transfer to our use cases, just like we are using pre-trained word-embedding models such as Word2Vec and FastText.

I do not think one can really draw a parallelism between computer vision and natural language processing. Computer vision is challenging, but natural language processing is even more difficult because the tasks regarding linguistics are not limited to object or meaning recognition, but also human psychology, cultures, and linguistic diversities. The objectives are far from being identical.

However, the transferrable use of embedded language models is definitely a big step forward. Ruder quoted three articles, which I would summarize below in a few words.

  • Embeddings from Language Models (ELMo, arXiv:1802.05365): based on the successful bidirectional LSTM language models, the authors developed a deep contextualized embedded models by collapses all layers in the neural network architecture.
  • Universal Language Model Fine-Tuning for Text Classification (ULMFiT, arXiv:1801.06146): the authors proposed a type of architectures that learn representations for specific tasks, which involve three steps in training: a) LM pre-training: learning through unlabeled corpus with abundant data; b) LM fine-tuning: learning through labeled corpus; and c) classifier fine-tuning: transferred training for specific classification tasks.
  • OpenAI Transformer (article still in progress): the author proposed a simple generative language model with the three similar steps in ULMFit: a) unsupervised pre-training: training a language model that maximizes the likelihood of a sequence of tokens within a context window; b) supervised fine-tuning: a supervised classification training that maximizes the likelihood using the Bayesian approach; c) task-specific input transformations: training the classifiers on a specific task.

These three articles are intricately related to each other. Without abundant data and good hardware, it is almost impossible to produce the language models. As Ruder suggested, we will probably have a pre-trained model up to the second step of the ULMFit and OpenAI Transformer papers, but we train our own specific model for our use. We have been doing this for word-embedding models, and this approach has been common in computer vision too.

Continue reading “Embedded Language Models”


Summarizing Text Summarization

There are many tasks in natural language processing that are challenging. This blog entry is on text summarization, which briefly summarizes the survey article on this topic. (arXiv:1707.02268) The authors of the article defined the task to be

Automatic text summarization is the task of producing a concise and fluent summary while preserving key information content and overall meaning.

There are basically two approaches to this task:

  • extractive summarization: identifying important sections of the text, and extracting them; and
  • abstractive summarization: producing summary text in a new way.

Most algorithmic methods developed are of the extractive type, while most human writers summarize using abstractive approach. There are many methods in extractive approach, such as identifying given keywords, identifying sentences similar to the title, or wrangling the text at the beginning of the documents.

How do we instruct the machines to perform extractive summarization? The authors mentioned about two representations: topic and indicator. In topic representations, frequencies, tf-idf, latent semantic indexing (LSI), or topic models (such as latent Dirichlet allocation, LDA) are used. However, simply extracting these sentences out with these algorithms may not generate a readable summary. Employment of knowledge bases or considering contexts (from web search, e-mail conversation threads, scientific articles, author styles etc.) are useful.

In indicator representation, the authors mentioned the graph methods, inspired by PageRank. (see this) “Sentences form vertices of the graph and edges between the sentences indicate how similar the two sentences are.” And the key sentences are identified with ranking algorithms. Of course, machine learning methods can be used too.

Evaluation on the performance on text summarization is difficult. Human evaluation is unavoidable, but with manual approaches, some statistics can be calculated, such as ROUGE.

Continue reading “Summarizing Text Summarization”

Release of shorttext 0.5.4

The Python package for text mining shorttext has a new release: 0.5.4. It can be installed by typing in the command line:

pip install -U shorttext

For some people, you may need to install it from “root”, i.e., adding sudo in front of the command. Since the version 0.5 (including releases 0.5.1 and 0.5.4), there have been substantial addition of functionality, mostly about comparisons between short phrases without running a supervised or unsupervised machine learning algorithm, but calculating the “similarity” with various metrics, including:

  • soft Jaccard score (the same kind of fuzzy scores based on edit distance in SOCcer),
  • Word Mover’s distance (WMD, detailedly described in a previous post), and
  • Jaccard index due to word-embedding model.

For the soft Jaccard score due to edit distance, we can call it by:

>>> from shorttext.metrics.dynprog import soft_jaccard_score
>>> soft_jaccard_score(['book', 'seller'], ['blok', 'sellers'])     # gives 0.6716417910447762
>>> soft_jaccard_score(['police', 'station'], ['policeman'])        # gives 0.2857142857142858

The core of this code was written in C, and interfaced to Python using SWIG.

For the Word Mover’s Distance (WMD), while the source codes are the same as my previous post, it can now be called directly. First, load the modules and the word-embedding model:

>>> from shorttext.metrics.wasserstein import word_mover_distance
>>> from shorttext.utils import load_word2vec_model
>>> wvmodel = load_word2vec_model('/path/to/model_file.bin')

And compute the WMD with a single function:

>>> word_mover_distance(['police', 'station'], ['policeman'], wvmodel)                      # gives 3.060708999633789
>>> word_mover_distance(['physician', 'assistant'], ['doctor', 'assistants'], wvmodel)      # gives 2.276337146759033

And the Jaccard index due to cosine distance in Word-embedding model can be called like this:

>>> from shorttext.metrics.embedfuzzy import jaccardscore_sents
>>> jaccardscore_sents('doctor', 'physician', wvmodel)   # gives 0.6401538990056869
>>> jaccardscore_sents('chief executive', 'computer cluster', wvmodel)   # gives 0.0022515450768836143
>>> jaccardscore_sents('topological data', 'data of topology', wvmodel)   # gives 0.67588977344632573

Most new functions can be found in this tutorial.

And there are some minor bugs fixed.

Continue reading “Release of shorttext 0.5.4”

Release of shorttext 0.3.3

On November 21, 2016, the Python package `shorttext’ was published. Until today, more than seven versions have been published. There have been a drastic architecture change, but the overall purpose is still the same, as summarized in the first introduction entry:

This package `shorttext‘ was designed to tackle all these problems… It contains the following features:

  • example data provided (including subject keywords and NIH RePORT);
  • text preprocessing;
  • pre-trained word-embedding support;
  • gensim topic models (LDA, LSI, Random Projections) and autoencoder;
  • topic model representation supported for supervised learning using scikit-learn;
  • cosine distance classification; and
  • neural network classification (including ConvNet, and C-LSTM).

And since the first version, there have been updates, as summarized in the documention (News):

Version 0.3.3 (Apr 19, 2017)

  • Deleted CNNEmbedVecClassifier.
  • Added script ShortTextWord2VecSimilarity.

Version 0.3.2 (Mar 28, 2017)

  • Bug fixed for gensim model I/O;
  • Console scripts update;
  • Neural networks up to Keras 2 standard (refer to this).

Version 0.3.1 (Mar 14, 2017)

  • Compact model I/O: all models are in single files;
  • Implementation of stacked generalization using logistic regression.

Version 0.2.1 (Feb 23, 2017)

  • Removal attempts of loading GloVe model, as it can be run using gensim script;
  • Confirmed compatibility of the package with tensorflow;
  • Use of spacy for tokenization, instead of nltk;
  • Use of stemming for Porter stemmer, instead of nltk;
  • Removal of nltk dependencies;
  • Simplifying the directory and module structures;
  • Module packages updated.

Although there are still additions that I would love to add, but it would not change the overall architecture. I may add some more supervised learning algorithms, but under the same network. The upcoming big additions will be generative models or seq2seq models, but I do not see them coming in the short term. I will add corpuses.

I may add tutorials if I have time.

I am thankful that there is probably some external collaboration with other Python packages. Some people have already made some useful contributions. It will be updated if more things are confirmed.

Continue reading “Release of shorttext 0.3.3”

Natural Language Generation

I have worked a lot on text categorization in the past few months, and I started to get bored. I started to become more interested in generative models, and generating texts.

Generative models are not new. Topic models such as LDA, or STM are generative models. However, I have been using the topic vectors or other topic models such as LDA2Vec as the feature of another supervised algorithm. And it is basically the design of my shorttext package.

I attended a meetup event held by DC Data Science and Data Education DC. The speaker, Daewoo Chong, is a senior Data Scientist at Booz Allen Hamilton. He talked about chatbot, building on RNN models on characters. His talk was not exactly about generative models, but it is indeed about generating texts. With the sophistication of GANs (see my entry on GAN and WGAN), it will surely be my next focus of my toy projects.

Ran Chen wrote a blog on his company homepage about natural language generation in his system, Trulia.

And there are a few GAN applications on text:

  • “Generating Text via Adversarial Learning” [PDF]
  • Lantao Yu, Weinan Zhang, Jun Wang, Yong Yu, “SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient,” arXiv:1609.05473 [arXiv]
  • Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, Dan Jurafsky, “Adversarial Learning for Neural Dialogue Generation,” arXiv:1701.06547 [arXiv]
  • Matt J. Kusner, José Miguel Hernández-Lobato, “GANs for sequence of discrete elements with the Gumbel-softmax distribution,” arXiv:1611.04051 [arXiv]
  • David Pfau, Oriol Vinyals, “Connecting generative adversarial network and actor-critic methods,” arXiv:1610.01945 [arXiv]
  • Xuerong Xiao, “Text Generation usingGenerative Adversarial Training” [PDF]

Simple Literary Analytics on Presidential Candidates in the First 2016 Presidential Debate

The first presidential debate 2016 was held on September 26, 2016 in Hofstra University in New York. An interesting analysis will be the literacy level demonstrated by the two candidates using Flesch readability ease and Flesch-Kincaid grade level, demonstrated in my previous blog entry and my Github: stephenhky/PyReadability.

First, we need to get the transcript of the debate, which can be found in an article in New York Times. Copy and paste the text into a file called first_debate_transcript.txt. Then we want to extract out speech of each person. To do this, store the following Python code in first_debate_segment.py.

# Trump and Clinton 1st debate on Sept 26, 2016

from nltk import word_tokenize
from collections import defaultdict
import re

# adopted from http://stackoverflow.com/questions/21948019/python-untokenize-a-sentence
def untokenize(words):
    Untokenizing a text undoes the tokenizing operation, restoring
    punctuation and spaces to the places that people expect them to be.
    Ideally, `untokenize(tokenize(text))` should be identical to `text`,
    except for line breaks.
    text = ' '.join(words)
    step1 = text.replace("`` ", '"').replace(" ''", '"').replace('. . .',  '...')
    step2 = step1.replace(" ( ", " (").replace(" ) ", ") ")
    step3 = re.sub(r' ([.,:;?!%]+)([ \'"`])', r"\1\2", step2)
    step4 = re.sub(r' ([.,:;?!%]+)$', r"\1", step3)
    step5 = step4.replace(" '", "'").replace(" n't", "n't").replace(
         "can not", "cannot")
    step6 = step5.replace(" ` ", " '")
    return step6.strip()

ignored_phrases = ['(APPLAUSE)', '(CROSSTALK)']
persons = ['TRUMP', 'CLINTON', 'HOLT']
fin = open('first_debate_transcript.txt', 'rb')
lines = fin.readlines()

lines = filter(lambda s: len(s)>0, map(lambda s: s.strip(), lines))
speeches = defaultdict(lambda : '')
person = None

for line in lines:
    tokens = word_tokenize(line.strip())
    ignore_colon = False
    added_tokens = []
    for token in tokens:
        if token in ignored_phrases:
        elif token in persons:
            person = token
            ignore_colon = True
        elif token == ':':
            ignore_colon = False
            added_tokens += [token]
            speeches[person] += ' ' + untokenize(added_tokens)

for person in persons:
    fout = open('speeches_'+person+'.txt', 'wb')

There is an untokenize function adapted from a code in StackOverflow. This segmented the transcript into the individual speech of Lester Holt (the host of the debate), Donald Trump (GOP presidential candidate), and Hillary Clinton (DNC presidential candidate) in separate files. Then, on UNIX or Linux command line, run score_readability.py on each person’s script, by, for example, for Holt’s speech,

python score_readability.py speeches_HOLT.txt --utf8

Beware that it is encoded in UTF-8. For Lester Holt, we have

Word count = 1935
Sentence count = 157
Syllable count = 2732
Flesch readability ease = 74.8797052289
Flesch-Kincaid grade level = 5.87694629602

For Donald Trump,

Word count = 8184
Sentence count = 693
Syllable count = 10665
Flesch readability ease = 84.6016324536
Flesch-Kincaid grade level = 4.3929136992

And for Hillary Clinton,

Word count = 6179
Sentence count = 389
Syllable count = 8395
Flesch readability ease = 75.771973015
Flesch-Kincaid grade level = 6.63676650035

Apparently, compared to Donald Trump, Hillary Clinton has a higher literary level, but her speech is less easy to understand.

Recalling from my previous entry, for Shakespeare’s MacBeth, the Flesch readability ease is 112.278048591, and Flesch-Kincard grade level 0.657934056288; for King James Version Bible (KJV), they are 79.6417489428 and 9.0085275366 respectively.

This is just a simple text analytics. However, the content is not analyzed here. Augustine of Hippo wrote in his Book IV of On Christian Teaching (Latin: De doctrina christiana) about rhetoric and eloquence:

“… wisdom without eloquence is of little value to the society… eloquence without wisdom is… a great nuisance, and never beneficial.” — Augustine of Hippo, Book IV of On Christian Teaching


Continue reading “Simple Literary Analytics on Presidential Candidates in the First 2016 Presidential Debate”

SOCcer: Computerized Coding In Epidemiology

There are many tasks that involve coding, for example, putting kids into groups according to their age, labeling the webpages about their kinds, or putting students in Hogwarts into four colleges… And researchers or lawyers need to code people, according to their filled-in information, into occupations. Melissa Friesen, an investigator in Division of Cancer Epidemiology and Genetics (DCEG), National Cancer Institute (NCI), National Institutes of Health (NIH), saw the need of large-scale coding. Many researchers are dealing with big data concerning epidemiology. She led a research project, in collaboration with Office of Intramural Research (OIR), Center for Information Technology (CIT), National Institutes of Health (NIH), to develop an artificial intelligence system to cope with the problem. This leads to a publicly available tool called SOCcer, an acronym for “Standardized Occupation Coding for Computer-assisted Epidemiological Research.” (URL: http://soccer.nci.nih.gov/soccer/)

The system was initially developed in an attempt to find the correlation between the onset of cancers and other diseases and the occupation. “The application is not intended to replace expert coders, but rather to prioritize which job descriptions would benefit most from expert review,” said Friesen in an interview. She mainly works with Daniel Russ in CIT.

SOCcer takes job title, industry codes (in terms of SIC, Standard Industrial Classification), and job duties, and gives an occupational code called SOC 2010 (Standard Occupational Classification), used by U. S. federal government agencies. The data involves short text, often messy. There are 840 codes in SOC 2010 systems. Conventional natural language processing (NLP) methods may not apply. Friesen, Russ, and Kwan-Yuet (Stephen) Ho (also in OIR, CIT; a CSRA staff) use fuzzy logic, and maximum entropy (maxent) methods, with some feature engineering, to build various classifiers. These classifiers are aggregated together, as in stacked generalization (see my previous entry), using logistic regression, to give a final score.

SOCcer has a companion software, called SOCAssign, for expert coders to prioritize the codings. It was awarded with DCEG Informatics Tool Challenge 2015. SOCcer itself was awarded in 2016. And the SOCcer team was awarded for Scientific Award of Merit by CIT/OCIO in 2016 as well (see this). Their work was published in Occup. Environ. Med.


Continue reading “SOCcer: Computerized Coding In Epidemiology”

Book on NLP Annotation

It is really a pleasure to participate in DC NLP meetup group. And it was so nice to listen to Ari Chanen’s talk on his application of ensembles of Word2Vec, and Mark Lidd’s on the product of DataFission.

Before the talks, the co-organizers gave out sponsored books. I am so lucky to get James Pustejovsky’s and Amber Stubb’s Natual Language Annotation for Machine Learning. It is a book on annotating NLP corpuses, and its process cycle. I never annotate the corpus, but I know it is an important task, however tedious it is.


Continue reading “Book on NLP Annotation”

Probabilistic Theory of Word Embeddings: GloVe

The topic of word embedding algorithms has been one of the interests of this blog, as in this entry, with Word2Vec [Mikilov et. al. 2013] as one of the main examples. It is a great tool for text mining, (for example, see [Czerny 2015],) as it reduces the dimensions needed (compared to bag-of-words model). As an algorithm borrowed from computer vision, a lot of these algorithms use deep learning methods to train the model, while it was not exactly sure why it works. Despite that, there are many articles talking about how to train the model. [Goldberg & Levy 2014, Rong 2014 etc.] Addition and subtraction of the word vectors show amazing relationships that carry semantic values, as I have shown in my previous blog entry. [Ho 2015]

However, Tomas Mikolov is no longer working in Google, making the development of this algorithm discontinued. As a follow-up of their work, Stanford NLP group later proposed a model, called GloVe (Global Vectors), that embeds word vectors using probabilistic methods. [Pennington, Socher & Manning 2014] It can be implemented in the package glove-python in python, and text2vec in R (or see their CRAN post).  Their paper is neatly written, and a highly recommended read.

To explain the theory of GloVe, we start with some basic probabilistic picture in basic natural language processing (NLP). We suppose the relation between the words occur in certain text windows within a corpus, but the details are not important here. Assume that i, j, and k are three words, and the conditional probability P_{ik} is defined as

P_{ij} = P(j | i) = \frac{X_{ij}}{X_i},

where X‘s are the counts, and similarly for P_{jk}. And we are interested in the following ratio:

F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}.

The tilde means “context,” but we will later assume it is also a word. Citing the example from their paper, take i as ice, and j as steam. if k is solid, then the ratio is expected to be large; or if k is gas, then it is expected to be low. But if k is water, which are related to both, or fashion, which is related to none, then the ratio is expected to be approximately 1.

And the addition and subtraction in Word2Vec is similar to this. We want the ratio to be like the subtraction as in Word2Vec (and multiplication as in addition), then we should modify the function F such that,

F(w_i - w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}.

On the other hand, the input arguments of F are vectors, but the output is a scalar. We avoid the issue by making the input argument as a dot product,

F( (w_i - w_j)^T \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}.

In NLP, the word-word co-occurrence matrices are symmetric, and our function F should also be invariant under switching the labeling. We first require F is be a homomorphism,

F((w_i - w_j)^T \tilde{w}_k) = \frac{F(w_i^T \tilde{w}_k) }{ F(w_j^T \tilde{w}_k)},

where we define,

F(w_i^T \tilde{w}_k) = P_{ik} = \frac{X_{ik}}{X_i}.

It is clear that F is an exponential function, but to ensure symmetry, we further define:

w_i^T \tilde{w}_k + b_i + \tilde{b}_k = \log X_{ik}.

As a result of this equation, the authors defined the following cost function to optimize for GloVe model:

J = \sum_{i, j=1}^V f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ik} \right)^2,

where w_j, \tilde{w}_j, b_i, and \tilde{b}_j are parameters to learn. f(x) is a weighting function. (Refer the details to the paper.) [Pennington, Socher & Manning 2014]

As Radim Řehůřek said in his blog entry, [Řehůřek 2014] it is a neat paper, but their evaluation is crappy.

This theory explained why certain similar relations can be achieved, such as Paris – France is roughly equal to Beijing – China, as both can be transformed to the ratio in the definition of F above.

It is a neat paper, as it employs optimization theory and probability theory, without any dark box deep learning.

Continue reading “Probabilistic Theory of Word Embeddings: GloVe”

Create a free website or blog at WordPress.com.

Up ↑