Embedding has been hot in recent years partly due to the success of Word2Vec, (see demo in my previous entry) although the idea has been around in academia for more than a decade. The idea is to transform a vector of integers into continuous, or embedded, representations. Keras, a Python package that implements neural network models (including the ANN, RNN, CNN etc.) by wrapping Theano or TensorFlow, implemented it, as shown in the example below (which converts a vector of 200 features into a continuous vector of 10):
from keras.layers import Embedding
from keras.models import Sequential
# define and compile the embedding model
model = Sequential()
model.add(Embedding(200, 10, input_length=1))
model.compile('rmsprop', 'mse') # optimizer: rmsprop; loss function: mean-squared error
We can then convert any features from 0 to 199 into vectors of 20, as shown below:
import numpy as np
model.predict(np.array([10, 90, 151]))
Of course, one must not omit a similar algorithm called GloVe, developed by the Stanford NLP group. Their codes have been wrapped in both Python (package called glove) and R (library called text2vec).
Besides Word2Vec, there are other word embedding algorithms that try to complement Word2Vec, although many of them are more computationally costly. Previously, I introduced LDA2Vec in my previous entry, an algorithm that combines the locality of words and their global distribution in the corpus. And in fact, word embedding algorithms with a similar ideas are also invented by other scientists, as I have introduced in another entry.
However, there are word embedding algorithms coming out. Since most English words carry more than a single sense, different senses of a word might be best represented by different embedded vectors. Incorporating word sense disambiguation, a method called sense2vec has been introduced by Trask, Michalak, and Liu. (arXiv:1511.06388). Matthew Honnibal wrote a nice blog entry demonstrating its use.
There are also other related work, such as wang2vec that is more sensitive to word orders.
Big Bang Theory (Season 2, Episode 5): Euclid Alternative
DMV staff: Application? Sheldon: I’m actually more or a theorist.
Note: feature image taken from Big Bang Theory (CBS).
Word2Vec has hit the NLP world for a while, as it is a nice method for word embeddings or word representations. Its use of skip-gram model and deep learning made a big impact too. It has been my favorite toy indeed. However, even though the words do have a correlation across a small segment of text, it is still a local coherence. On the other hand, topic models such as latent Dirichlet allocation (LDA) capture the distribution of words within a topic, and that of topics within a document etc. And it provides a representation of a new document in terms of a topic.
In my previous blog entry, I introduced Chris Moody’s LDA2Vec algorithm (see: his SlideShare). Unfortunately, not many papers or blogs have covered this new algorithm too much despite its potential. The API is not completely well documented yet, although you can see its example from its source code on its Github. In its documentation, it gives an example of deriving topics from an array of random numbers, in its lda2vec/lda2vec.py code:
A more comprehensive example is in examples/twenty_newsgroup/lda.py .
Besides, LDA2Vec, there are some related research work on topical word embeddings too. A group of Australian and American scientists studied about the topic modeling with pre-trained Word2Vec (or GloVe) before performing LDA. (See: their paper and code) On the other hand, another group with Chinese and Singaporean scientists performs LDA, and then trains a Word2Vec model. (See: their paper and code) And LDA2Vec concatenates the Word2Vec and LDA representation, like an early fusion.
No matter what, representations with LDA models (or related topic modeling such as correlated topic models (CTM)) can be useful even outside NLP. I have found it useful at some intermediate layer of calculation lately.
Both LDA (latent Dirichlet allocation) and Word2Vec are two important algorithms in natural language processing (NLP). LDA is a widely used topic modeling algorithm, which seeks to find the topic distribution in a corpus, and the corresponding word distributions within each topic, with a prior Dirichlet distribution. Word2Vec is a vector-representation model, trained from RNN (recurrent neural network), to seek a continuous representation for words.
They are both very useful, but LDA deals with words and documents globally, and Word2Vec locally (depending on adjacent words in the training data). A LDA vector is so sparse that the users can interpret the topic easily, but it is inflexible. Word2Vec’s representation is not human-interpretable, but it is easy to use. In his slides, Chris Moody recently devises a topic modeling algorithm, called LDA2Vec, which is a hybrid of the two, to get the best out of the two algorithms.
Honestly, I never used this algorithm. I rarely talk about something I didn’t even try, but I want to raise awareness so that more people know about it when I come to use it. To me, it looks like concatenating two vectors with some hyperparameters, but the source codes rejects this claim. It is a topic model algorithm.
There are not many blogs or papers talking about LDA2Vec yet. I am looking forward to learning more about it when there are more awareness.
One fascinating application of deep learning is the training of a model that outputs vectors representing words. A project written in Google, named Word2Vec, is one of the best tools regarding this. The vector representation captures the word contexts and relationships among words. This tool has been changing the landscape of natural language processing (NLP).
Let’s have some demonstration. To use Word2Vec in Python, you need to have the package gensim installed. (Installation instruction: here) And you have to download a trained model (GoogleNews-vectors-negative300.bin.gz), which is 3.6 GB big!! When you get into a Python shell (e.g., IPython), type
from gensim.models.word2vec import Word2Vec
model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
This model enables the user to extract vector representation of length 300 of an English word. So what is so special about this vector representation from the traditional bag-of-words representation? First, the representation is standard. Once trained, we can use it in future training or testing dataset. Second, it captures the context of the word in a way that the algebraic operation of these vectors has meanings.
Here I give 5 examples.
A Juvenile Cat
What is a juvenile cat? We know that a juvenile dog is a puppy. Then we can get it by carry out the algebraic calculation by running
which indicates that “kitten” is the answer (correctly!) The numbers are similarities of these words with the vector representation in descending order. You can verify it by calculating the cosine distance:
from scipy.spatial import distance
print (1-distance.cosine(model['kitten'], model['puppy']+model['cat']-model['dog']))
which outputs 0.763498957413.
This demonstration shows that in the model, and are of similar semantic relations.
Capital of Taiwan
Where is the capital of Taiwan? We can find it if we know the capital of another country. For example, we know that Beijing is the capital of China. Then we can run the following:
However, this model does not always work. If it can find the capital of Taiwan, can it find those for any states in the United States? We know that the capital of California is Sacramento. How about Maryland? Let’s run:
Word2Vec was developed by Tomáš Mikolov. He previously worked for Microsoft Research. However, he switched to Google, and published a few influential works on Word2Vec. [Mikolov, Yih, Zweig 2013] [Mikolov, Sutskever, Chen, Corrado, Dean 2013] [Mikolov, Chen, Corrado, Dean 2013] Their conference paper in 2013 can be found on arXiv. He later published a follow-up work on a package called Doc2Vec that considers phrases. [Le, Mikolov 2014]
Earlier this year, I listened to a talk in DCNLP meetup spoken by Michael Czerny on his award-winning blog entry titled “Modern Methods for Sentiment Analysis.” He applied the vector representations of words by Word2Vec to perform sentiment analysis, assuming that similar sentiments cluster together in the vector space. (He took averages of the vectors in tweets to extract emotions.) [Czerny 2015] I highly recommend you to read his blog entry. On the other hand, Xin Rong wrote an explanation about how Word2Vec works too. [Rong 2014]
There seems to be no progress on the project Word2Vec anymore as Tomáš Mikolov no longer works in Google. However, the Stanford NLP Group recognized that Word2Vec captures the relations between words in their vector representation. They worked on a similar project, called GloVe (Global Vectors), which tackles the problem with matrix factorization. [Pennington, Socher, Manning 2014] Radim Řehůřek did some analysis comparing Word2Vec and GloVe. [Řehůřek 2014] GloVe can be implemented in Python too.
On October 14, 2015, I attended the regular meeting of the DCNLP meetup group, a group on natural language processing (NLP) in Washington, DC area. The talk was titled “Deep Learning for Question Answering“, spoken by Mr. Mohit Iyyer, a Ph.D. student in Department of Computer Science, University of Maryland (my alma mater!). He is a very good speaker.
I have no experience on deep learning at all although I did write a blog post remotely related. I even didn’t start training my first neural network until the next day after the talk. However, Mr. Iyyer explained what recurrent neural network (RNN), recursive neural network, and deep averaging network (DAN) are. This helped me a lot in order to understanding more about the principles of the famous word2vec model (which is something I am going to write about soon!). You can refer to his slides for more details. There are really a lot of talents in College Park, like another expert, Joe Yue Hei Ng, who is exploiting deep learning a lot as well.
The applications are awesome: with external knowledge to factual question answering, reasoning-based question answering, and visual question answering, with increasing order of challenging levels.
Mr. Iyyer and the participants discussed a lot about different packages. Mr. Iyyer uses Theano, a Python package for deep learning, which is good for model building and other analytical work. Some prefer Caffe. Some people, who are Java developers, also use deeplearning4j.
This meetup was a sacred one too, because it is the last time it was held in Stetsons Famous Bar & Grill at U Street, which is going to permanently close on Halloween this year. The group is eagerly looking for a new venue for the upcoming meetup. This meeting was a crowded one. I sincerely thank the organizers, Charlie Greenbacker and Liz Merkhofer, for hosting all these meetings, and Chris Phipps (a linguist from IBM Watson) for recording.
It is a fantasy for a lot of entrepreneurs, scientists and engineers to develop a software project that can automatically perform feature generation, training, and prediction automatically.
Of course it is a wishful thinking. There is no free lunch.
In big companies that have abundant resources (training data, brains, clusters), they can probably so something like deep learning to get the relevant features, and build classification models. It is almost automatic. It virtually takes no manual addition of human knowledge. Some scientists and engineers are enjoying the strength of word2vec, but it takes a lot of computer resources to even train a word2vec model.
If we do not have enough training data or computing resources, to get a good classifier, we ought to add human knowledge to generate features. We might even need to impose some rules to convert the raw data to sensible features. The rules might be regular expressions, or some calculations, or some filters, or it involves a knowledge database (like WordNet). Things might be simplified if the problem we are dealing with is in a specific domain, that reduces the amount of human knowledge we need to add.
However, you can see that they are pretty dumb when you try them out. I have a feeling that those conversational models are Markov models, and the chatbots basically forget what they previously said.
A few days ago, Google put a paper on arXiv preprint that described how they applied neural network conversational models in chatbots. [Vinyals & Le 2015] In short they look for more previous sentences in the conversations. It is certainly inspired by some previous natural language processing (NLP) work from Google, especially Word2Vec that employs skip-gram models. [Mikolov et. al. 2013]
Deep learning is what a lot of big guys are trying now.
Python is the basic programming languages if one wants to work on data nowadays. Its popularity comes with its intuitive syntax, its support of several programming paradigms, and the package numpy (Numerical Python). Yes, if you asked which package is a “must-have” outside the standard Python packages, I would certainly name numpy.
Let me list some useful packages that I have found useful:
numpy: Numerical Python. Its basic data type is ndarray, which acts like a vector with vectorized calculation support. It makes Python to perform matrix calculation efficiently like MATLAB and Octave. It supports a lot of commonly used linear algebraic algorithms, such as eigenvalue problems, SVD etc. It is the basic of a lot of other Python packages that perform heavy numerical computation. It is such an important package that, in some operating systems, numpy comes with Python as well.
scipy: Scientific Python. It needs numpy, but it supports also sparse matrices, special functions, statistics, numerical integration…
scikit-learn: machine learning library. It contains a number of supervised and unsupervised learning algorithms.
nltk: natural language processing. It provides not only basic tools like stemmers, lemmatizers, but also some algorithms like maximum entropy, tf-idf vectorizer etc. It provides a few corpuses, and supports WordNet dictionary.
gensim: another useful natural language processing package with an emphasis on topic modeling. It mainly supports Word2Vec, latent semantic indexing (LSI), and latent Dirichlet allocation (LDA). It is convenient to construct term-document matrices, and convert them to matrices in numpy or scipy.
networkx: a package that supports both undirected and directed graphs. It provides basic algorithms used in graphs.
pandas: it supports data frame handling like R. (I have not used this package as I am a heavy R user.)
Of course, if you are a numerical developer, to save you a good life, install Anaconda.
There are some other packages that are useful, such as PyCluster (clustering), xlrd (Excel files read/write), PyGame (writing games)… But since I have not used them, I would rather mention it in this last paragraph, not to endorse but avoid devaluing it.