Probabilistic Theory of Word Embeddings: GloVe

The topic of word embedding algorithms has been one of the interests of this blog, as in this entry, with Word2Vec [Mikolov et. al. 2013] as one of the main examples. It is a great tool for text mining (for example, see [Czerny 2015]), as it reduces the number of dimensions needed compared to the bag-of-words model. As with many algorithms borrowed from computer vision, deep learning methods are used to train the model, although it was not exactly clear why it works. Despite that, there are many articles about how to train the model. [Goldberg & Levy 2014, Rong 2014 etc.] Addition and subtraction of the word vectors reveal amazing relationships that carry semantic values, as I have shown in my previous blog entry. [Ho 2015]

However, Tomas Mikolov no longer works at Google, and the development of this algorithm has been discontinued. As a follow-up of their work, the Stanford NLP group later proposed a model, called GloVe (Global Vectors), that embeds word vectors using probabilistic methods. [Pennington, Socher & Manning 2014] It is implemented in the package glove-python in Python, and text2vec in R (or see their CRAN post). Their paper is neatly written and a highly recommended read.

To explain the theory of GloVe, we start with the basic probabilistic picture of elementary natural language processing (NLP). We suppose that the relations between words arise from their co-occurrence within certain text windows in a corpus, but the details are not important here. Assume that i, j, and k are three words, and the conditional probability P_{ij} is defined as

P_{ij} = P(j | i) = \frac{X_{ij}}{X_i},

where the X's are the co-occurrence counts, and similarly for P_{ik} and P_{jk}. We are interested in the following ratio:

F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}.

The tilde means “context,” but we will later assume it is also a word. Citing the example from their paper, take i as ice, and j as steam. If k is solid, then the ratio is expected to be large; if k is gas, then it is expected to be small. But if k is water, which is related to both, or fashion, which is related to neither, then the ratio is expected to be approximately 1.
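
As a quick illustration of this intuition, here is a minimal sketch (with made-up co-occurrence counts, purely for illustration) that computes such ratios from a toy co-occurrence matrix:

import numpy as np

# Toy co-occurrence counts (made-up numbers, for illustration only)
vocab = ['ice', 'steam', 'solid', 'gas', 'water', 'fashion']
X = np.array([
    [0, 10, 80, 2, 300, 1],    # counts of words co-occurring with "ice"
    [10, 0, 3, 60, 250, 1],    # counts of words co-occurring with "steam"
    [80, 3, 0, 5, 40, 1],
    [2, 60, 5, 0, 30, 1],
    [300, 250, 40, 30, 0, 2],
    [1, 1, 1, 1, 2, 0],
], dtype=float)

def cooccur_prob(i, j):
    """P(j | i) = X_ij / X_i, where X_i is the total count for word i."""
    return X[i, j] / X[i].sum()

i, j = vocab.index('ice'), vocab.index('steam')
for k_word in ['solid', 'gas', 'water', 'fashion']:
    k = vocab.index(k_word)
    ratio = cooccur_prob(i, k) / cooccur_prob(j, k)
    print(k_word, ratio)   # large for "solid", small for "gas", roughly 1 for "water" and "fashion"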

The addition and subtraction of word vectors in Word2Vec is similar to this. If we want the ratio to behave like subtraction in Word2Vec (and multiplication like addition), then we should modify the function F such that

F(w_i - w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}.

On the other hand, the input arguments of F are vectors, but the output is a scalar. We avoid the issue by making the input argument a dot product,

F( (w_i - w_j)^T \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}.

In NLP, the word-word co-occurrence matrices are symmetric, and our function F should also be invariant under switching the labeling. We first require F to be a homomorphism,

F((w_i - w_j)^T \tilde{w}_k) = \frac{F(w_i^T \tilde{w}_k) }{ F(w_j^T \tilde{w}_k)},

where we define,

F(w_i^T \tilde{w}_k) = P_{ik} = \frac{X_{ik}}{X_i}.

It is clear that F is the exponential function. Taking the logarithm, absorbing the \log X_i term into a bias b_i, and adding a bias \tilde{b}_k to restore symmetry, we further define:

w_i^T \tilde{w}_k + b_i + \tilde{b}_k = \log X_{ik}.

Based on this equation, the authors defined the following cost function to optimize for the GloVe model:

J = \sum_{i, j=1}^V f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,

where w_i, \tilde{w}_j, b_i, and \tilde{b}_j are parameters to learn, and f(x) is a weighting function. (Refer to the paper for details.) [Pennington, Socher & Manning 2014]
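
To make the objective concrete, below is a minimal NumPy sketch of the cost J together with the weighting function f(x) = (x/x_max)^α capped at 1, using the paper's suggested x_max = 100 and α = 3/4. The function names and the random toy inputs are my own, and the actual training loop is not shown:

import numpy as np

def glove_weight(x, xmax=100.0, alpha=0.75):
    """Weighting function f(x) from the GloVe paper: (x/xmax)^alpha, capped at 1."""
    return np.where(x < xmax, (x / xmax) ** alpha, 1.0)

def glove_cost(X, W, Wtilde, b, btilde):
    """Weighted least-squares cost J, summed over non-zero co-occurrence counts only
    (entries with X_ij = 0 contribute nothing since f(0) = 0)."""
    i_idx, j_idx = np.nonzero(X)                      # pairs with X_ij > 0
    dots = np.sum(W[i_idx] * Wtilde[j_idx], axis=1)   # w_i^T w~_j for each pair
    residuals = dots + b[i_idx] + btilde[j_idx] - np.log(X[i_idx, j_idx])
    return np.sum(glove_weight(X[i_idx, j_idx]) * residuals ** 2)

# Example with random toy inputs (in practice the parameters are learned by minimizing J)
V, d = 6, 10                       # vocabulary size and embedding dimension
rng = np.random.default_rng(0)
X = rng.integers(0, 50, size=(V, V)).astype(float)
W, Wtilde = rng.normal(size=(V, d)), rng.normal(size=(V, d))
b, btilde = rng.normal(size=V), rng.normal(size=V)
print(glove_cost(X, W, Wtilde, b, btilde))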

As Radim Řehůřek said in his blog entry, [Řehůřek 2014] it is a neat paper, but their evaluation is crappy.

This theory explains why certain similar relations can be achieved, such as Paris − France being roughly equal to Beijing − China, as both can be transformed into the ratio in the definition of F above.
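
For concreteness, here is a minimal sketch of how such a relation is typically checked once the vectors are trained (the embeddings dictionary is assumed to come from GloVe or Word2Vec; the helper names are mine):

import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(embeddings, a, b, c):
    """Return the word d whose vector is most similar to v_b - v_a + v_c, i.e. a : b :: c : d."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    candidates = {w: v for w, v in embeddings.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

# Usage (with trained vectors): analogy(embeddings, 'France', 'Paris', 'China')  # expect 'Beijing'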

It is a neat paper, as it employs optimization theory and probability theory, without any black-box deep learning.


Computational Folkloristics: Major Emotional Arcs for Good-Selling Fictions

The emotional flows of stories are important to engage readers. Skillful writers grasp this very well by natural instinct. There are theories about this in the field of folkloristics. However, is there a way to see the flows in a graph? Linear algebra and natural language processing (NLP) kick in.

Andrew Reagan at the Computational Story Lab, University of Vermont, together with his colleagues and collaborators, did a numerical study about this. [Reagan et. al., 2016] Their paper is now on the arXiv. He prepared a set of words with scores that quantitatively describe their sentiments, as in sentiment analysis, and then went through the text with a sliding window to measure the sentiments. For each book, there is then a vector containing a time series of these sentiment scores. For example, using this method, the plot of the emotional scores, or the emotional arc, of J. K. Rowling’s Harry Potter and the Deathly Hallows is as shown in the following plot: [Reagan et. al., 2016]

[Figure: the emotional arc of Harry Potter and the Deathly Hallows]
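
A rough sketch of this sliding-window scoring might look like the following (the miniature sentiment lexicon and the window parameters here are placeholders, not those of the actual study, which used a much larger list of scored words):

import numpy as np

# A made-up miniature sentiment lexicon (word -> happiness score), for illustration only
sentiment_scores = {'love': 8.0, 'happy': 7.5, 'death': 1.5, 'war': 2.0, 'friend': 7.0}

def emotional_arc(words, window_size=1000, step=100):
    """Slide a window over the tokenized text and average the sentiment of scored words."""
    arc = []
    for start in range(0, max(len(words) - window_size, 0) + 1, step):
        window = words[start:start + window_size]
        scores = [sentiment_scores[w] for w in window if w in sentiment_scores]
        arc.append(np.mean(scores) if scores else np.nan)
    return np.array(arc)

# Usage: tokens = open('novel.txt').read().lower().split(); arc = emotional_arc(tokens)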

They did the same thing with other English fictions in the Project Gutenberg corpus, giving a vector of these emotional scores for each fiction. They then performed a principal component analysis (PCA) on all these books (represented by a matrix containing all the vectors). PCA is a common dimensionality reduction technique, also used in information retrieval (IR) under the name latent semantic analysis (LSA). Reagan and his colleagues identified six major components of these emotional arcs, as shown below: [Reagan et. al., 2016]

[Figure: the six major emotional arcs identified by PCA]
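
Schematically, the decomposition step could be sketched as below (assuming every book's arc has been resampled to a common length; the function and variable names are mine):

import numpy as np

def principal_arcs(arcs, n_components=6):
    """PCA via SVD: each row of `arcs` is the emotional arc of one book,
    all resampled to the same number of points."""
    centered = arcs - arcs.mean(axis=0)                # remove the mean arc
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    components = Vt[:n_components]                     # the dominant arc shapes
    explained = (s ** 2) / np.sum(s ** 2)              # fraction of variance per component
    return components, explained[:n_components]

# Usage: components, explained = principal_arcs(arc_matrix)  # arc_matrix: (n_books, n_points)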

These computational studies of fictions further reinforce our common belief that (good-selling) fictions do have resonating themes that keep readers engaged.


Coffee Cooling: a Demo of Numerical ODEs in Python

A lot of Americans are addicted to coffee. I have one cup a day. I know someone who has a few cups every day. Weirdly enough, I know someone who needs coffee to go to bed.

Coffee is for sipping, but we do want to enjoy it as soon as possible at the right temperature. Whether we add sugar cubes at the beginning or at the end can make a lot of difference. In general, coffee cooling empirically follows Newton’s law of cooling:

\frac{dT}{dt} = -k (T-T_r),

where T_r is the room temperature, and k is the cooling parameter. It is a simple ordinary differential equation (ODE) that has an analytical solution, T(t) = T_r + (T_0 - T_r) e^{-kt}. However, let’s solve it numerically while we demonstrate the NumPy/SciPy capabilities. The function we are using is odeint. Let’s import the necessary packages:

import numpy as np
from scipy.integrate import odeint
from functools import partial

We will need the partial function from functools shortly. Let’s define the right-hand side of the ODE above as the cooling function, in terms of a lambda expression:

coolingfcn = lambda temp, t0, k, roomtemp: -k*(temp-roomtemp)

And we need a set of cooling parameters that makes sense from our daily experience:

cooling_parameters = {'init_temp': 94.0, # initial temperature of the coffee (degrees Celsius)
                      'cube_dectemp': 5.0, # temperature decrease due to the cube (degrees Celsius)
                      'roomtemp': 25.0, # room temperature (degrees Celsius)
                      'k': 1.0/15.0 # cooling parameter (inverse minutes)
                     }

The freshly brewed coffee is at 94 degrees Celsius, and the room temperature is 25 degrees Celsius. Whenever a whole cube is added to the coffee, its temperature drops by 5 degrees Celsius; this is not a good assumption once the coffee has cooled close to room temperature. The cooling parameter is k=\frac{1}{15} \text{min}^{-1}, meaning that in 15 minutes, the temperature difference between the coffee and the environment decays by a factor of e^{-1}.

From the SciPy documentation of odeint, the first three parameters are important:

scipy.integrate.odeint(func, y0, t, args=(), Dfun=None, col_deriv=0, full_output=0, ml=None, mu=None, rtol=None, atol=None, tcrit=None, h0=0.0, hmax=0.0, hmin=0.0, ixpr=0, mxstep=0, mxhnil=0, mxordn=12, mxords=5, printmessg=0)

The first parameter is the cooling function, the second is the initial value of the dependent variable (the temperature in this case), and the third is the ndarray of times at which the temperatures are to be calculated. We can easily write the functions that return the ndarray of temperatures for the cases of adding the sugar cube before and after the cooling process respectively:

# adding sugar cube before cooling
def cube_before_cooling(t, init_temp, cube_dectemp, roomtemp, k):
    y0 = init_temp - cube_dectemp   # the cube is dropped before any cooling
    temps = odeint(partial(coolingfcn, k=k, roomtemp=roomtemp), y0, t)
    return temps

# adding sugar cube after cooling
def cube_after_cooling(t, init_temp, cube_dectemp, roomtemp, k):
    temps = odeint(partial(coolingfcn, k=k, roomtemp=roomtemp), init_temp, t)
    temps[-1] -= cube_dectemp   # drop the cube at the final time point
    return temps
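
Before plotting, we can sanity-check the numerical integration against the closed-form solution of Newton’s law of cooling, T(t) = T_r + (T_0 - T_r) e^{-kt} (this check is an extra step, not required for the demo):

# Sanity check: compare odeint with the analytical solution, without any sugar cube
t_check = np.linspace(0, 20, 201)
numerical = odeint(partial(coolingfcn,
                           k=cooling_parameters['k'],
                           roomtemp=cooling_parameters['roomtemp']),
                   cooling_parameters['init_temp'], t_check).flatten()
analytical = (cooling_parameters['roomtemp']
              + (cooling_parameters['init_temp'] - cooling_parameters['roomtemp'])
                * np.exp(-cooling_parameters['k'] * t_check))
print(np.max(np.abs(numerical - analytical)))   # should be tiny (within odeint's tolerances)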

We can plot the temperature changes on graphs using matplotlib.

import matplotlib.pyplot as plt

And we are interested in the first 20 minutes:

times = np.linspace(0, 20, 201)

For adding a sugar cube before cooling,

temps_beforecooling = cube_before_cooling(times, **cooling_parameters)
plt.plot(times, temps_beforecooling)
plt.xlabel('Time (min)')
plt.ylabel('Coffee Temperature (degrees Celsius)')

(If you are typing in an IPython shell, you need to call plt.show() to show the plot.) This gives the following plot:

[Plot: coffee temperature over time, sugar cube added before cooling]

And for adding the sugar cube after cooling,

temps_aftercooling = cube_after_cooling(times, **cooling_parameters)
plt.plot(times, temps_aftercooling)
plt.xlabel('Time (min)')
plt.ylabel('Coffee Temperature (degrees Celsius)')

This gives the following plot:

[Plot: coffee temperature over time, sugar cube added after cooling]

Obviously, adding the sugar cube after cooling gives a lower final temperature, since keeping the coffee hotter during the 20 minutes means a larger temperature difference drives faster heat loss.

Advanced Gimmicks

How about adding the sugar continuously throughout the 20-minute time span? We need to adjust our differential equation to be:

\frac{dT}{dt} = -k (T-T_r) - \frac{T_{\text{drop}}}{\delta t},

where T_{\text{drop}} is the total temperature drop due to the sugar and \delta t is the time span over which it is added. This can be implemented by the following code:

# adding sugar continuously
def continuous_sugar_cooling(t, init_temp, cube_dectemp, roomtemp, k):
    timespan = t[-1] - t[0]
    newcoolingfcn = lambda temp, t0: coolingfcn(temp, t0, k, roomtemp) - cube_dectemp / timespan
    temps = odeint(newcoolingfcn, init_temp, t)
    return temps

Or we can divide the cube into portions and add them to the cup at a few specified times, which can be implemented by the following code that employs our previously defined functions:

# adding sugar at specified times (given as indices into the time array)
def sugar_specifiedtime_cooling(t, sugar_time_indices, init_temp, cube_dectemp, roomtemp, k):
    sorted_sugar_time_indices = np.sort(sugar_time_indices)
    num_portions = len(sorted_sugar_time_indices)

    temps = np.array([])
    temp = init_temp
    # split the time array at the cube-dropping indices
    t_segments = np.split(t, sorted_sugar_time_indices)
    num_segments = len(t_segments)
    for i in range(num_segments):
        # cool each segment, dropping one portion of the cube at its end
        temp_segment = cube_after_cooling(t_segments[i],
                                          temp,
                                          float(cube_dectemp)/num_portions,
                                          roomtemp,
                                          k)
        temps = np.append(temps, temp_segment)
        temp = temp_segment[-1]   # carry the post-drop temperature into the next segment
    # no portion is dropped at the final time, so undo the very last drop
    temps[-1] += float(cube_dectemp)/num_portions

    return temps

Let’s calculate all the temperatures (the indices 50, 100, and 150 correspond to the 5-, 10-, and 15-minute marks):

temps_cont = continuous_sugar_cooling(times, **cooling_parameters)
temps_n = sugar_specifiedtime_cooling(times, [50, 100, 150], **cooling_parameters)

And plot all the graphs together:

plt.plot(times, temps_beforecooling, label='cube before cooling')
plt.plot(times, temps_aftercooling, label='cube after cooling')
plt.plot(times, temps_cont, label='continuous')
plt.plot(times, temps_n, label='3 times of dropping cube')
plt.xlabel('Time (min)')
plt.ylabel('Coffee Temperature (degrees Celsius)')
plt.legend()

[Plot: coffee temperature over time for all four sugar-adding strategies]

The complete code demonstration can be found on my GitHub (GitHub/stephenhky/CoffeeCooling).

