ConvNet Seq2seq for Machine Translation

In these few days, Facebook published a new research paper, regarding the use of sequence to sequence (seq2seq) model for machine translation. What is special about this seq2seq model is that it uses convolutional neural networks (ConvNet, or CNN), instead of recurrent neural networks (RNN).

The original seq2seq model is implemented with Long Short-Term Memory (LSTM) model, published by Google.(see their paper) It is basically a character-based model that generates texts according to a sequence of input characters. And the same author constructed a neural conversational model, (see their paper) as mentioned in a previous blog post. Daewoo Chong, from Booz Allen Hamilton, presented its implementation using Tensorflow in DC Data Education Meetup on April 13, 2017. Johns Hopkins also published a spell correction algorithm implemented in seq2seq. (see their paper) The real advantage of RNN over CNN is that there is no limit about the size of the tokens input or output.

While the fixing of the size of vectors for CNN is obvious, using CNN serves the purpose of limiting the size of input vectors, and thus limiting the size of contexts. This limits the contents, and speeds up the training process. RNN is known to be trained slow. Facebook uses this CNN seq2seq model for their machine translation model. For more details, take a look at their paper and their Github repository.

v2-519320fbe0a7d45fb9102fc05970ec10_b

Continue reading “ConvNet Seq2seq for Machine Translation”

Release of shorttext 0.3.3

On November 21, 2016, the Python package `shorttext’ was published. Until today, more than seven versions have been published. There have been a drastic architecture change, but the overall purpose is still the same, as summarized in the first introduction entry:

This package `shorttext‘ was designed to tackle all these problems… It contains the following features:

  • example data provided (including subject keywords and NIH RePORT);
  • text preprocessing;
  • pre-trained word-embedding support;
  • gensim topic models (LDA, LSI, Random Projections) and autoencoder;
  • topic model representation supported for supervised learning using scikit-learn;
  • cosine distance classification; and
  • neural network classification (including ConvNet, and C-LSTM).

And since the first version, there have been updates, as summarized in the documention (News):

Version 0.3.3 (Apr 19, 2017)

  • Deleted CNNEmbedVecClassifier.
  • Added script ShortTextWord2VecSimilarity.

Version 0.3.2 (Mar 28, 2017)

  • Bug fixed for gensim model I/O;
  • Console scripts update;
  • Neural networks up to Keras 2 standard (refer to this).

Version 0.3.1 (Mar 14, 2017)

  • Compact model I/O: all models are in single files;
  • Implementation of stacked generalization using logistic regression.

Version 0.2.1 (Feb 23, 2017)

  • Removal attempts of loading GloVe model, as it can be run using gensim script;
  • Confirmed compatibility of the package with tensorflow;
  • Use of spacy for tokenization, instead of nltk;
  • Use of stemming for Porter stemmer, instead of nltk;
  • Removal of nltk dependencies;
  • Simplifying the directory and module structures;
  • Module packages updated.

Although there are still additions that I would love to add, but it would not change the overall architecture. I may add some more supervised learning algorithms, but under the same network. The upcoming big additions will be generative models or seq2seq models, but I do not see them coming in the short term. I will add corpuses.

I may add tutorials if I have time.

I am thankful that there is probably some external collaboration with other Python packages. Some people have already made some useful contributions. It will be updated if more things are confirmed.

Continue reading “Release of shorttext 0.3.3”

Python Package for Short Text Mining

There has been a lot of methods for natural language processing and text mining. However, in tweets, surveys, Facebook, or many online data, texts are short, lacking data to build enough information. Traditional bag-of-words (BOW) model gives sparse vector representation.

Semantic relations between words are important, because we usually do not have enough data to capture the similarity between words. We do not want “drive” and “drives,” or “driver” and “chauffeur” to be completely different.

The relation between or order of words become important as well. Or we want to capture the concepts that may be correlated in our training dataset.

We have to represent these texts in a special way and perform supervised learning with traditional machine learning algorithms or deep learning algorithms.

This package `shorttext‘ was designed to tackle all these problems. It is not a completely new invention, but putting everything known together. It contains the following features:

  • example data provided (including subject keywords and NIH RePORT);
  • text preprocessing;
  • pre-trained word-embedding support;
  • gensim topic models (LDA, LSI, Random Projections) and autoencoder;
  • topic model representation supported for supervised learning using scikit-learn;
  • cosine distance classification; and
  • neural network classification (including ConvNet, and C-LSTM).

Readers can refer this to the documentation.

Continue reading “Python Package for Short Text Mining”

LDA2Vec: a hybrid of LDA and Word2Vec

Both LDA (latent Dirichlet allocation) and Word2Vec are two important algorithms in natural language processing (NLP). LDA is a widely used topic modeling algorithm, which seeks to find the topic distribution in a corpus, and the corresponding word distributions within each topic, with a prior Dirichlet distribution. Word2Vec is a vector-representation model, trained from RNN (recurrent neural network), to seek a continuous representation for words.

They are both very useful, but LDA deals with words and documents globally, and Word2Vec locally (depending on adjacent words in the training data). A LDA vector is so sparse that the users can interpret the topic easily, but it is inflexible. Word2Vec’s representation is not human-interpretable, but it is easy to use. In his slides, Chris Moody recently devises a topic modeling algorithm, called LDA2Vec, which is a hybrid of the two, to get the best out of the two algorithms.

Honestly, I never used this algorithm. I rarely talk about something I didn’t even try, but I want to raise awareness so that more people know about it when I come to use it. To me, it looks like concatenating two vectors with some hyperparameters, but  the source codes rejects this claim. It is a topic model algorithm.

There are not many blogs or papers talking about LDA2Vec yet. I am looking forward to learning more about it when there are more awareness.

lda2vec
Jupyter Notebook for LDA2Vec Demonstration [link]
Continue reading “LDA2Vec: a hybrid of LDA and Word2Vec”

Talking Not So Deep About Deep Learning

rnn

On October 14, 2015, I attended the regular meeting of the DCNLP meetup group, a group on natural language processing (NLP) in Washington, DC area. The talk was titled “Deep Learning for Question Answering“, spoken by Mr. Mohit Iyyer, a Ph.D. student in Department of Computer Science, University of Maryland (my alma mater!). He is a very good speaker.

I have no experience on deep learning at all although I did write a blog post remotely related. I even didn’t start training my first neural network until the next day after the talk. However, Mr. Iyyer explained what recurrent neural network (RNN), recursive neural network, and deep averaging network (DAN) are. This helped me a lot in order to understanding more about the principles of the famous word2vec model (which is something I am going to write about soon!). You can refer to his slides for more details. There are really a lot of talents in College Park, like another expert, Joe Yue Hei Ng, who is exploiting deep learning a lot as well.

The applications are awesome: with external knowledge to factual question answering, reasoning-based question answering, and visual question answering, with increasing order of challenging levels.

Mr. Iyyer and the participants discussed a lot about different packages. Mr. Iyyer uses Theano, a Python package for deep learning, which is good for model building and other analytical work. Some prefer Caffe. Some people, who are Java developers, also use deeplearning4j.

Stetsons Famous Bar & Grill (photo from Yelp)

This meetup was a sacred one too, because it is the last time it was held in Stetsons Famous Bar & Grill at U Street, which is going to permanently close on Halloween this year. The group is eagerly looking for a new venue for the upcoming meetup. This meeting was a crowded one. I sincerely thank the organizers, Charlie Greenbacker and Liz Merkhofer, for hosting all these meetings, and Chris Phipps (a linguist from IBM Watson) for recording.

IMG_20151014_191336IMG_20151014_191306

Continue reading “Talking Not So Deep About Deep Learning”

Lyrics Generation

Eminem (taken from web)

When I saw the “standardized” style of writing written in the classic book “Elements of Style” written by William Strunk, I have been wondering if the style of writing can be programmable. And now, with artificial intelligence, people can write automated codes that generate lyrics. In a paper written by Eric Malmi and his collaborators [Malmi, Takala, Toivonen, Raiko, Gionis 2015], the system DopeLearning, which generates rap lyrics to certain complexities, was introduced. It applies two machine learning techniques, namely the RankSVM and the deep neural network. It is fascinating that automated codes can be creative to produce complex artistic things, as the abstract says:

Writing rap lyrics requires both creativity, to construct a meaningful and an interesting story, and lyrical skills, to produce complex rhyme patterns, which are the cornerstone of a good flow.

What does DopeLearning produce? See the example the paper gives:

For a chance at romance I would love to enhance (Big Daddy Kane – The Day You’re Mine)
But everything I love has turned to a tedious task (Jedi Mind Tricks – Black Winter Day)
One day we gonna have to leave our love in the past (Lil Wayne – Marvin’s Room)
I love my fans but no one ever puts a grasp (Eminem – Say Goodbye Hollywood)
I love you momma I love my momma – I love you momma (Snoop Dogg – I Love My Momma)
And I would love to have a thing like you on my team you take care (Ghostface Killah – Paragraphs Of Love)
I love it when it’s sunny Sonny girl you could be my Cher (Common – Make My Day)
I’m in a love affair I can’t share it ain’t fair (Snoop Dogg – Show Me Love)
Haha I’m just playin’ ladies you know I love you. (Eminem – Kill You)
I know my love is true and I know you love me too (Everlast – On The Edge)
Girl I’m down for whatever cause my love is true (Lil Wayne – Sean Kingston I’m At War)
This one goes to my man old dirty one love we be swigging brew (Big Daddy Kane – Entaprizin)
My brother I love you Be encouraged man And just know (Tech N9ne – Need More Angels)
When you done let me know cause my love make you be like WHOA (Missy Elliot – Dog In Heat)
If I can’t do it for the love then do it I won’t (KRS One – Take It To God)
All I know is I love you too much to walk away though (Eminem – Love The Way You Lie)

There are similar work for Chinese Mandopop, using RNN. Chinese readers can refer to this blog post: http://phunters.lofter.com/post/86d56_732209b.

Continue reading “Lyrics Generation”

Create a free website or blog at WordPress.com.

Up ↑