Data Representation in Machine Learning

In implementing most of the machine learning algorithms, we represent each data point with a feature vector as the input. A vector is basically an array of numerics, or in physics, an object with magnitude and direction. How do we represent our business data in terms of a vector?

Primitive Feature Vector

Whether the data are measured observations, or images (pixels), free text, factors, or shapes, they can be categorized into four following types:

  1. Categorical data
  2. Binary data
  3. Numerical data
  4. Graphical data

The most primitive representation of a feature vector looks like this:

Screen Shot 2019-09-15 at 3.58.09 PM
A typical feature vector. (Source: https://www.researchgate.net/publication/318740904_Chat_Detection_in_an_Intelligent_Assistant_Combining_Task-oriented_and_Non-task-oriented_Spoken_Dialogue_Systems/figures?lo=1)

Numerical Data

Numerical data can be represented as individual elements above (like Tweet GRU, Query GRU), and I am not going to talk too much about it.

Categorical Data

However, for categorical data, how do we represent them? The first basic way is to use one-hot encoding:

Screen Shot 2019-09-15 at 4.02.51 PM
One-hot encoding of categorical data (Source: https://developers.google.com/machine-learning/data-prep/transform/transform-categorical)

For each type of categorical data, each category has an integer code. In the figure above, each color has a code (0 for red, 1 for orange etc.) and they will eventually be transformed to the feature vector on the right, with vector length being the total number of categories found in the data, and the element will be filled with 1 if it is of that category. This allows a natural way of dealing with missing data (with all elements 0) and multi-category (with multiple non-zeros).

In natural language processing, the bag-of-words model is often used to represent free-text data, which is the one-hot encoding above with words as the categories. It is a good way as long as the order of the words does not matter.

Binary Data

For binary data, it can be easily represented by one element, either 1 or 0.

Graphical Data

Graphical data are best represented in terms of graph Laplacian and adjacency matrix. Refer to a previous blog article for more information.

Shortcomings

A feature vector can be a concatenation of various features in terms of all these types except graphical data.

However, such representation that concatenates all the categorical, binary, and numerical fields has a lot of shortcomings:

  1. Data with different categories are often seen as orthogonal, i.e., perfectly dissimilar.  It ignores the correlation between different variables. However, it is a very big assumption.
  2. The weights of different fields are not considered.
  3. Sometimes if the numerical values are very large, it outweighs other categorical data in terms of influence in computation.
  4. Data are very sparse, costing a lot of memory waste and computing time.
  5. It is unknown whether some of the data are irrelevant.

Modifying Feature Vectors

In light of the shortcomings, to modify the feature factors, there are three main ways of dealing with this:

  1. Rescaling: rescaling all of some of the elements, or reweighing, to adjust the influence from different variables.
  2. Embedding: condensing the information into vectors of smaller lengths.
  3. Sparse coding: deliberately extend the vectors to a larger length.

Rescaling

Rescaling means rescaling all or some of the elements in the vectors. Usually there are two ways:

  1. Normalization: normalizing all the categories of one feature to having the sum of 1.
  2. Term frequency-inverse document frequency (tf-idf): weighing the elements so that the weights are heavier if the frequency is higher and it appears in relatively few documents or class labels.

Embedding

Embedding means condensing a sparse vector to a smaller vector. Many sparse elements disappear and information is encoded inside the elements. There are rich amount of work on this.

  1. Topic models: finding the topic models (latent Dirichlet allocation (LDA),  structural topic models (STM) etc.) and encode the vectors with topics instead;
  2. Global dimensionality reduction algorithms: reducing the dimensions by retaining the principal components of the vectors of all the data, e.g., principal component analysis (PCA), independent component analysis (ICA), multi-dimensional scaling (MDS) etc;
  3. Local dimensionality reduction algorithms: same as the global, but these are good for finding local patterns, where examples include t-Distributed Stochastic Neighbor Embedding (tSNE) and Uniform Manifold Approximation and Projection (UMAP);
  4. Representation learned from deep neural networks: embeddings learned from encoding using neural networks, such as auto-encoders, Word2Vec, FastText, BERT etc.
  5. Mixture Models: Gaussian mixture models (GMM), Dirichlet multinomial mixture (DMM) etc.
  6. Others: Tensor decomposition (Schmidt decomposition, Jennrich algorithm etc.), GloVe etc.

Sparse Coding

Sparse coding is good for finding basis vectors for dense vectors.

Continue reading “Data Representation in Machine Learning”

R or Python on Text Mining

rpythontextmining

I have seen more than enough debates about R or Python. While I do have a preference towards Python, I am happy with using R as well. I am not agnostic about languages, but we choose tools according to needs. The needs may be about effectiveness, efficiency, availability of tools, nature of problems, collaborations, etc. Yes, in a nutshell, it depends.

When dealing with text mining, although I still prefer Python, I have to fairly say that both languages have their own strengths and weaknesses. What do you do in text mining? Let me casually list the usual steps:

  1. Removing special characters,
  2. Removing numerals,
  3. Converting all alphabets to lower cases,
  4. Removing stop words, and
  5. Stemming the words (using Porter stemmer).

They are standard steps. But of course, sometimes we perform lemmatization instead of stemming. Sometimes we keep numerals. Or whatever. It is okay.

How do u do that in Python? Suppose you have a list of text documents stored in the variable texts, which is defined by

texts = ['I love Python.',
         'R is good for analytics.',
         'Mathematics is fun.']

. Then

# import all necessary libraries
from nltk.stem import PorterStemmer
from nltk.tokenize import SpaceTokenizer
from nltk.corpus import stopwords
from functools import partial
from gensim import corpora
from gensim.models import TfidfModel
import re

# initialize the instances for various NLP tools
tokenizer = SpaceTokenizer()
stemmer = PorterStemmer()

# define each steps
pipeline = [lambda s: re.sub('[^\w\s]', '', s),
            lambda s: re.sub('[\d]', '', s),
            lambda s: s.lower(),
            lambda s: ' '.join(filter(lambda s: not (s in stopwords.words()), tokenizer.tokenize(s))),
            lambda s: ' '.join(map(lambda t: stemmer.stem(t), tokenizer.tokenize(s)))
           ]

# function that carries out the pipeline step-by-step
def preprocess_text(text, pipeline):
    if len(pipeline)==0:
        return text
    else:
        return preprocess_text(pipeline[0](text), pipeline[1:])

# preprocessing
preprocessed_texts = map(partial(preprocess_text, pipeline=pipeline), texts)

# converting to feature vectors
documents = map(lambda s: tokenizer.tokenize(s), texts)
corpus = [dictionary.doc2bow(document) for document in documents]
tfidfmodel = TfidfModel(corpus)

We can train a classifier with the feature vectors output by tfidfmodel. To do the prediction, we can get the feature vector for a new text by calling:

bow = dictionary.doc2bow(tokenizer.tokenize(preprocess_text(text, pipeline)))

How about in R? To perform the preprocessing steps and extract the feature vectors, run:

library(RTextTools)
library(tm)

origmatrix<-create_matrix(textColumns = texts, language = 'english',
                          removeNumbers = TRUE, toLower = TRUE,
                          removeStopwords = 'TRUE', stemWords = TRUE,
                          weighting=tm::weightTfIdf, originalMatrix=NULL)

After we have a trained classifier, and we have a new text to preprocess, then we run:

matrix<-create_matrix(textColumns = newtexts, language = 'english',
                      removeNumbers = TRUE, toLower = TRUE,
                      removeStopwords = 'TRUE', stemWords = TRUE,
                      weighting=tm::weightTfIdf, originalMatrix=origmatrix)

Actually, from this illustration, a strength for R stands out: brevity. However, very often we want to preprocess in other ways, Python allows more flexibility without making it complicated. And Python syntax itself is intuitive enough.

And there are more natural language processing libraries in Python available, such as nltk and gensim, that are associated with its other libraries perfectly such as numpy, scipy and scikit-learn. But R is not far away in terms of this actually, as it has libraries such as tm and RTextTools, while R does not have numpy-like libraries because R itself is designed to perform calculations like this.

Python can be used to develop larger software projects by making the codes reusable, and it is obviously a weakness for R.

However, do perform analysis, R makes the task very efficient if we do not require something unconventional.

In the area of text mining, R or Python? My answer is: it depends.

Continue reading “R or Python on Text Mining”

Blog at WordPress.com.

Up ↑