In text mining, it is important to create the document-term matrix (DTM) of the corpus we are interested in. A DTM is basically a matrix, with documents designated by rows and words by columns, that the elements are the counts or the weights (usually by tf-idf). Subsequent analysis is usually based creatively on DTM.

Exploring with DTM therefore becomes an important issues with a good text-mining tool. How do we perform exploratory data analysis on DTM using R and Python? We will demonstrate it using the data set of U. S. Presidents’ Inaugural Address, preprocessed, and can be downloaded here.

# R: textmineR

In R, we can use the package textmineR, which has been in introduced in a previous post. Together with other packages such as dplyr (for tidy data analysis) and snowBall (for stemming), load all of them at the beginning:

library(dplyr)
library(textmineR)
library(SnowballC)


usprez.df<- read.csv('inaugural.csv', stringsAsFactors = FALSE)


Then we create the DTM, while we remove all digits and punctuations, make all letters lowercase, and stem all words using Porter stemmer.

dtm<- CreateDtm(usprez.df$speech, doc_names = usprez.df$yrprez,
ngram_window = c(1, 1),
lower = TRUE,
remove_punctuation = TRUE,
remove_numbers = TRUE,
stem_lemma_function = wordStem)


Then defining a set of functions:

get.doc.tokens<- function(dtm, docid)
dtm[docid, ] %>% as.data.frame() %>% rename(count=".") %>%
mutate(token=row.names(.)) %>% arrange(-count)

get.token.occurrences<- function(dtm, token)
dtm[, token] %>% as.data.frame() %>% rename(count=".") %>%
mutate(token=row.names(.)) %>% arrange(-count)

get.total.freq<- function(dtm, token) dtm[, token] %>% sum

get.doc.freq<- function(dtm, token)
dtm[, token] %>% as.data.frame() %>% rename(count=".") %>%
filter(count>0) %>% pull(count) %>% length


Then we can happily extract information. For example, if we want to get the top-most common words in 2009’s Obama’s speech, enter:

dtm %>% get.doc.tokens('2009-Obama') %>% head(10)


Or which speeches have the word “change”: (but need to stem the word before extraction)

dtm %>% get.token.occurrences(wordStem('change')) %>% head(10)


You can also get the total number of occurrence of the words by:

dtm %>% get.doc.freq(wordStem('change'))   # gives 28


# Python: shorttext

In Python, similar things can be done using the package shorttext, described in a previous post. It uses other packages such as pandas and stemming. Load all packages first:

import shorttext
import numpy as np
import pandas as pd
from stemming.porter import stem

import re


And define the preprocessing pipelines:

pipeline = [lambda s: re.sub('[^\w\s]', '', s),
lambda s: re.sub('[\d]', '', s),
lambda s: s.lower(),
lambda s: ' '.join(map(stem, shorttext.utils.tokenize(s)))
]
txtpreproceesor = shorttext.utils.text_preprocessor(pipeline)


The function <code>txtpreprocessor</code> above perform the functions we talked about in R.

usprezdf = pd.read_csv('inaugural.csv')


The corpus needs to be preprocessed before putting into the DTM:

docids = list(usprezdf['yrprez'])    # defining document IDs
corpus = [txtpreproceesor(speech).split(' ') for speech in usprezdf['speech']]


Then create the DTM:

dtm = shorttext.utils.DocumentTermMatrix(corpus, docids=docids, tfidf=False)


Then we do the same thing as we have done above. To get the top-most common words in 2009’s Obama’s speech, enter:

dtm.get_doc_tokens('2009-Obama')


Or we look up which speeches have the word “change”:

dtm.get_token_occurences(stem('change'))


Or to get the document frequency of the word:

dtm.get_doc_frequency(stem('change'))


They Python and R codes give different document frequencies probably because the two stemmers work slightly differently.

Recently, gensim, a Python package for topic modeling, released a new version of its package which includes the implementation of author-topic models.

The most famous topic model is undoubtedly latent Dirichlet allocation (LDA), as proposed by David Blei and his colleagues. Such a topic model is a generative model, described by the following directed graphical models:

In the graph, $\alpha$ and $\beta$ are hyperparameters. $\theta$ is the topic distribution of a document, $z$ is the topic for each word in each document, $\phi$ is the word distributions for each topic, and $w$ is the generated word for a place in a document.

There are models similar to LDA, such as correlated topic models (CTM), where $\phi$ is generated by not only $\beta$ but also a covariance matrix $\Sigma$.

There exists an author model, which is a simpler topic model. The difference is that the words in the document are generated from the author for each document, as in the following graphical model. $x$ is the author of a given word in the document.

Combining these two, it gives the author-topic model as a hybrid, as shown below:

The new release of Python package, gensim, supported the author-topic model, as demonstrated in this Jupyter Notebook.

P.S.:

• I am also aware that there is another topic model called structural topic model (STM), developed for the field of social science. However, there is no Python package supporting this, but an R package, called stm, is available for it. You can refer to their homepage too.
• I may consider including author-topic model and STM in the next release of the Python package shorttext.

Today is the presidential election.

Regardless of the dirty things, we can do some simple simulation about the election. With the electoral college data, and the poll results from various sources, simple simulation can be performed.

Look at this sophisticated model in R: http://blog.yhat.com/posts/predicting-the-presidential-election.html

(If I have time after the election, I will do the simulation too…)

While performing exploratory analysis, R is a good tool, although we sometimes want to invoke some stable Java tools. It is what the R Package rJava is for. To install it, simply enter on the R Console:

install.packages('rJava')


library(rJava)


As a simple demonstration, we find the length of a strength. Start the JVM, enter:

.jinit('.')


Then we create an instance of a Java string, and find its length as follow:

s <- .jnew('java/lang/String', 'Hello World!')
.jcall(s, 'I', 'length')


The first line, with the function .jnew, create a Java string instance. It is safe to put the full package path of the class. The second line, with the function .jcall, call the method length() for String. The second parameter, ‘I’, indicates it returns an integer. The type has to follow the JNI notation for native types. If it is an integer double array, type ‘I[[‘. If it is not a native class like String, use its total package path.

# Example: Peter Norvig’s Spell Corrector Written in Scala

What should we do if we already have a .jar file we want to wrap? I would start with a simple one. Two years ago, I implemented Peter Norvig’s spell corrector (see his article) in Scala (which is a language for Java Virtual Machine (JVM) as well, see this entry), and posted on my Github repository: stephenhky/SpellCorrector. You may check out to your Eclipse or IntelliJ IDEA, and build a .jar file. (Or you can download the .jar file here.) For the program to run, do not forget to download his corpus named big.txt. The project has a class called SpellCorrector, which only the necessary codes are listed below:

package home.kwyho.spellcheck

/*
Reference: http://norvig.com/spell-correct.html
*/

import java.io.File
import scala.io.Source
import scala.collection.mutable.Map

class SpellCorrector {
var wordCounts : Map[String, Int] = Map()
val alphabets = ('a' to 'z').toSet

def train(trainFile : File) = {
val lines = Source.fromFile(trainFile) mkString
val wordREPattern = "[A-Za-z]+"
wordREPattern.r.findAllIn(lines).foreach( txtWord => {
val word = txtWord.toLowerCase
if (wordCounts.keySet contains(word)) {
wordCounts(word) = wordCounts(word)+1
} else {
wordCounts += (word -> 1)
}
})
}

// other codes here ....

def correct(wrongSpelling: String) : String = {
val edit0words = Set(wrongSpelling) intersect wordCounts.keySet
if (edit0words.size>0) return edit0words.maxBy( s => wordCounts(s))
val edit1words = getEditOneSpellings(wrongSpelling)
if (edit1words.size>0) return edit1words.maxBy( s => wordCounts(s))
val edit2words = getEditTwoSpellings(wrongSpelling)
edit2words.maxBy( s => wordCounts(s))
}
}


Putting the .jar file and big.txt into the same folder. Then initialize the JVM, and add the .jar file into the classpath:

.jinit('.')


Create an instance for SpellChecker, and train the corpus big.txt. Remember to put the whole package path as the class:

corrector <- .jnew('home/kwyho/spellcheck/SpellCorrector')
bigfile <- .jnew('java/io/File', 'big.txt')
.jcall(corrector, 'V', 'train', bigfile)


The first line create a SpellChecker instance, the second line create a File instance for big.txt, and the third line call the train() method. The JNI notation ‘V’ denotes ‘void.’ Entering ‘corrector’ will give a string indicates it is a Java object:

[1] "Java-Object{home.kwyho.spellcheck.SpellCorrector@5812f9ee}"


Then we can do spell correction by designing the following function:

correct<-function(word) {
javaStrtext <- .jnew('java/lang/String', word)
.jcall(corrector, 'Ljava/lang/String;', 'correct', javaStrtext)
}


Then you can easily perform spell correction as follow:

Some people put .class file instead of .jar file. In that case, you need to put the compiled Java class into the working directory. You can refer to an entry in Darren Wilkinson’s research blog for more details.

# Building an R Package

It is another matter to build an R package that wraps a .jar file. In Hilary Parker’s entry and my previous entry, there are details about building an R package with roxygen2. There is also a documentation written by Tobias Verbeke.

So to start building it, in RStudio, start a project by clicking on the button “Project: (None)” on the top right corner of RStudio, choose “New Directory,” and then “R Package.” Type in the name (“RSpellCorrection” here), and specify a directory. Then click “Create Project.” A new RStudio window will show up. From the menu bar, choose “Build” > “Configure Build Tools”. Then click on “Configure…” button. There is a dialog box coming out. Check everything, and click “OK”.

The instructions above are rather detailed. But starting from now, I will skip the procedural details. Then start a file named, say, onLoad.R under the subfolder R/, and put the following codes there:

.onLoad <- function(libname, pkgname) {
.jpackage(pkgname, lib.loc=libname)
}


This is a hook function that R will call when this package is being loaded. You must include it. Then in the file named DESCRIPTION, put in the relevant information:

Package: RSpellCorrection
Type: Package
Title: Spell Correction, Scala implementation run in R
Version: 0.1.0
Author: Kwan-Yuet Ho, Ph.D.
Maintainer: Kwan-Yuet Ho, Ph.D. <stephenhky@yahoo.com.hk>
Description: Implementation of Peter Norvig's spell corrector in Scala, wrapped in R
LazyData: TRUE
RoxygenNote: 5.0.1
Depends: R(>= 2.7.0), rJava (>= 0.5-0)

Note the last line (“Depends…”), which you have to include because R will parse this line, and load rJava automatically. Remember there is a space between “>=” and the version number. Do not use library function in your code.

First, create a subfolder inst/java, and put the .jar file there.

Then start a file, called correct.R under subfolder R/, and write a function:

#' Retrieve a Java instance of SpellCorrector.
#'
#' Retrieve a Java instance of SpellCorrector, with the training file
#' specified. Language model is trained before the instance is returned.
#' The spell corrector is adapted from Peter Norvig's demonstration.
#'
#' @param filepath Path of the corpus.
#' @return a Java instance of SpellCorrector
#' @export
getcorrector<-function(filepath='big.txt') {
corrector<- .jnew('home/kwyho/spellcheck/SpellCorrector')
bigfile<- .jnew('java/io/File', filepath)
.jcall(corrector, 'V', 'train', bigfile)
return(corrector)
}


This return a Java instance of SpellCorrector as in previous section. There is a large block of text above the function, and they are for producing manual using roxygen2. The tag “@export” is important to tell roxygen2 to make this function visible to the users.

#' Correct spelling.
#'
#' Given an instance of SpellCorrector, return the most probably
#' corrected spelling of the given word.
#'
#' @param word A token.
#' @param corrector A Java instance of SpellCorrector, given by \code{getcorrector}.
#' @return Corrected spelling
#' @export
correct<-function(word, corrector) {
javaStrtext <- .jnew('java/lang/String', word)
.jcall(corrector, 'Ljava/lang/String;', 'correct', javaStrtext)
}


Then click “Build & Reload” button on the “Build” Tab:

Then the package will be built, and reloaded. The manual documents (*.Rd) will be produced as well. You can then play with the spell corrector again like this:

Assuming you put this into the Github repository like I did (link here), you can install the new R package like this:

library(devtools)
install_github('stephenhky/RSpellCorrection')


Then the R package will be downloaded, and installed for use. Or another option is that if you wish to install from your local directory, just enter:

install.packages('<path-to>/RSpellCorrection', repos = NULL, type = 'source')


A complete version of this R package can be found in my Github repository: stephenhky/RSpellCorrection. You may want to add a README.md into the repository, which you need to know the Markdown language by referring to Lei Feng’s blog entry.

Because of work, I developed two R packages to host the functions that I used a lot. It did bring me a lot of convenience, such as that I don’t have to start my data analysis in a particular folder and switch later on.

To do that, you need to use RStudio. Then you have to install devtools package by calling in the R console:

install.packages('devtools')


and load it by simply call:

library(devtools)


And then you have to install the roxygen2 package by calling:

install_github("klutometis/roxygen")
library(roxygen2)


There are a lot of good tutorials about writing an R package. I especially like this Youtube video clip about building an R package with RStudio and roxygen2:

And Hilary Parker’s blog entry is useful as well.

On the other hand, if you are publishing your R package onto your Github repository, it would be nice to include a README file introducing your work. You need to know the Markdown language to write the file named README.md, and put it onto the root folder of your repository. My friend, Qianli Deng, showed me this Lei Feng’s blog entry, which I found extremely useful. Markdown is remarkably simpler than LaTeX.

The topic of word embedding algorithms has been one of the interests of this blog, as in this entry, with Word2Vec [Mikilov et. al. 2013] as one of the main examples. It is a great tool for text mining, (for example, see [Czerny 2015],) as it reduces the dimensions needed (compared to bag-of-words model). As an algorithm borrowed from computer vision, a lot of these algorithms use deep learning methods to train the model, while it was not exactly sure why it works. Despite that, there are many articles talking about how to train the model. [Goldberg & Levy 2014, Rong 2014 etc.] Addition and subtraction of the word vectors show amazing relationships that carry semantic values, as I have shown in my previous blog entry. [Ho 2015]

However, Tomas Mikolov is no longer working in Google, making the development of this algorithm discontinued. As a follow-up of their work, Stanford NLP group later proposed a model, called GloVe (Global Vectors), that embeds word vectors using probabilistic methods. [Pennington, Socher & Manning 2014] It can be implemented in the package glove-python in python, and text2vec in R (or see their CRAN post).  Their paper is neatly written, and a highly recommended read.

To explain the theory of GloVe, we start with some basic probabilistic picture in basic natural language processing (NLP). We suppose the relation between the words occur in certain text windows within a corpus, but the details are not important here. Assume that $i$, $j$, and $k$ are three words, and the conditional probability $P_{ik}$ is defined as

$P_{ij} = P(j | i) = \frac{X_{ij}}{X_i}$,

where $X$‘s are the counts, and similarly for $P_{jk}$. And we are interested in the following ratio:

$F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$.

The tilde means “context,” but we will later assume it is also a word. Citing the example from their paper, take $i$ as ice, and $j$ as steam. if $k$ is solid, then the ratio is expected to be large; or if $k$ is gas, then it is expected to be low. But if $k$ is water, which are related to both, or fashion, which is related to none, then the ratio is expected to be approximately 1.

And the addition and subtraction in Word2Vec is similar to this. We want the ratio to be like the subtraction as in Word2Vec (and multiplication as in addition), then we should modify the function $F$ such that,

$F(w_i - w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$.

On the other hand, the input arguments of $F$ are vectors, but the output is a scalar. We avoid the issue by making the input argument as a dot product,

$F( (w_i - w_j)^T \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$.

In NLP, the word-word co-occurrence matrices are symmetric, and our function $F$ should also be invariant under switching the labeling. We first require $F$ is be a homomorphism,

$F((w_i - w_j)^T \tilde{w}_k) = \frac{F(w_i^T \tilde{w}_k) }{ F(w_j^T \tilde{w}_k)}$,

where we define,

$F(w_i^T \tilde{w}_k) = P_{ik} = \frac{X_{ik}}{X_i}$.

It is clear that $F$ is an exponential function, but to ensure symmetry, we further define:

$w_i^T \tilde{w}_k + b_i + \tilde{b}_k = \log X_{ik}$.

As a result of this equation, the authors defined the following cost function to optimize for GloVe model:

$J = \sum_{i, j=1}^V f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ik} \right)^2$,

where $w_j$, $\tilde{w}_j$, $b_i$, and $\tilde{b}_j$ are parameters to learn. $f(x)$ is a weighting function. (Refer the details to the paper.) [Pennington, Socher & Manning 2014]

As Radim Řehůřek said in his blog entry, [Řehůřek 2014] it is a neat paper, but their evaluation is crappy.

This theory explained why certain similar relations can be achieved, such as Paris – France is roughly equal to Beijing – China, as both can be transformed to the ratio in the definition of $F$ above.

It is a neat paper, as it employs optimization theory and probability theory, without any dark box deep learning.

Previously, I wrote an entry on text mining on R and Python, and did a comparison. However, the text mining package employed was tm for R. But it has some problems:

1. The syntax is not natural for an experienced R users.
2. tm uses simple_triplet_matrix from the slam library for document-term matrix (DTM) and term-occurrence matrix (TCM), which is not as widely used as dgCMatrix from the Matrix library.

Tommy Jones, a Ph.D. student in George Mason University, and a data scientist at Impact Research, developed an alternative text mining package called textmineR. He presented in a Stat Prog DC Meetup on April 27, 2016. It employed a better syntax, and dgCMatrix. All in all, it is a wrapper for a lot of existing R packages to facilitate the text mining process, like creating DTM matrices with stopwords or appropriate stemming/lemmatizing functions. Here is a sample code to create a DTM with the example from the previous entry:

library(tm)
library(textmineR)

texts <- c('I love Python.',
'R is good for analytics.',
'Mathematics is fun.')

dtm<-CreateDtm(texts,
doc_names = c(1:length(texts)),
ngram_window = c(1, 1),
stopword_vec = c(tm::stopwords('english'), tm::stopwords('SMART')),
lower = TRUE,
remove_punctuation = TRUE,
remove_numbers = TRUE
)


The DTM is a sparse matrix:

3 x 6 sparse Matrix of class &amp;quot;dgCMatrix&amp;quot;
analytics fun mathematics good python love
1         .   .           .    .      1    1
2         1   .           .    1      .    .
3         .   1           1    .      .    .


On the other hand, it wraps text2vec, an R package that wraps the word-embedding algorithm named gloVe. And it wraps a number of topic modeling algorithms, such as latent Dirichlet allocation (LDA) and correlated topic models (CTM).

In addition, it contains a parallel computing loop function called TmParallelApply, analogous to the original R parallel loop function mclapply, but TmParallelApply works on Windows as well.

textmineR is an open-source project, with source code available on github, which contains his example codes.

Embedding has been hot in recent years partly due to the success of Word2Vec, (see demo in my previous entry) although the idea has been around in academia for more than a decade. The idea is to transform a vector of integers into continuous, or embedded, representations. Keras, a Python package that implements neural network models (including the ANN, RNN, CNN etc.) by wrapping Theano or TensorFlow, implemented it, as shown in the example below (which converts a vector of 200 features into a continuous vector of 10):

from keras.layers import Embedding
from keras.models import Sequential

# define and compile the embedding model
model = Sequential()
model.compile('rmsprop', 'mse')  # optimizer: rmsprop; loss function: mean-squared error


We can then convert any features from 0 to 199 into vectors of 20, as shown below:

import numpy as np

model.predict(np.array([10, 90, 151]))


It outputs:

array([[[ 0.02915354,  0.03084954, -0.04160764, -0.01752155, -0.00056815,
-0.02512387, -0.02073313, -0.01154278, -0.00389587, -0.04596512]],

[[ 0.02981793, -0.02618774,  0.04137352, -0.04249889,  0.00456919,
0.04393572,  0.04139435,  0.04415271,  0.02636364, -0.04997493]],

[[ 0.00947296, -0.01643104, -0.03241419, -0.01145032,  0.03437041,
0.00386361, -0.03124221, -0.03837727, -0.04804075, -0.01442516]]])


Of course, one must not omit a similar algorithm called GloVe, developed by the Stanford NLP group. Their codes have been wrapped in both Python (package called glove) and R (library called text2vec).

Besides Word2Vec, there are other word embedding algorithms that try to complement Word2Vec, although many of them are more computationally costly. Previously, I introduced LDA2Vec in my previous entry, an algorithm that combines the locality of words and their global distribution in the corpus. And in fact, word embedding algorithms with a similar ideas are also invented by other scientists, as I have introduced in another entry.

However, there are word embedding algorithms coming out. Since most English words carry more than a single sense, different senses of a word might be best represented by different embedded vectors. Incorporating word sense disambiguation, a method called sense2vec has been introduced by Trask, Michalak, and Liu. (arXiv:1511.06388). Matthew Honnibal wrote a nice blog entry demonstrating its use.

There are also other related work, such as wang2vec that is more sensitive to word orders.

Big Bang Theory (Season 2, Episode 5): Euclid Alternative

DMV staff: Application?
Sheldon: I’m actually more or a theorist.

Note: feature image taken from Big Bang Theory (CBS).

Previously, I have went through heuristically the description of topology using homology groups in this entry. [Ho 2015] This is the essence of algebraic topology. We describe the topology using Betti numbers, the rank of the homolog groups. What they mean can be summarized as: [Bubenik 2015]

“… homology in degree 0 describes the connectedness of the data; homology in degree 1 detects holes and tunnels; homology in degree 2 captures voids; and so on.

## Concept of Persistence

However, in computational problems, it is the discrete points that we are dealing with. We formulate their connectedness through constructing complexes, as described by my another blog entry. [Ho 2015] From the Wolfram Demonstration that I quoted previously, connectedness depends on some parameters, such as the radii of points that are considered connected. Whether it is Čech Complex, RP complex, or Alpha complex, the idea is similar. With discrete data, therefore, there is no definite answer how the connectedness among the points are, as it depends on the parameters.

Therefore, the concept of persistence has been developed to tackle this problem. This is the core concept for computational topology. There are a lot of papers about persistence, but the most famous work is done by Zomorodian and Carlsson, who algebraically studied it. [Zomorodian & Carlsson 2005] The idea is that as one increases the radii of points, the complexes change, and so do the homology groups. By varying the radii, we can observe which topology persists.

From the diagram above, we can see that as the radii ε increase, the diagram becomes more connected. To understand the changes of homologies, there are a few ways. In the diagram above, barcodes represent the “life span” of a connected component as ε increases. The Betti numbers of a certain degree (0, 1, or 2 in this example) at a certain value of ε is the number of barcodes at that degree. For example, look at the left most vertical dashed line, $\beta_0=10$, as there are 10 barcodes existing for $H_0$. Note there are indeed 10 separate connected components. For the second leftmost vertical dashed line, $\beta_0=6$ (6 connected components), and $\beta_1=2$ (2 holes).

Another way is using the persistence diagram, basically plotting the “birth” and “death” times of all the barcodes above. For an explanation of persistence diagram, please refer to this blog entry by Sebastien Bubeck, [Bubeck 2013] or the paper by Fasy et. al. [Fasy et. al. 2014] Another way to describe persistent topology is the persistence landscape. [Bubenik 2015]

## TDA Package in R

There are a lot of tools to perform topological data analysis. Ayasdi Core is a famous one. There are open sources C++ libraries such as Dionysus, or PHAT. There is a Python binding for Dionysus too.

There is a package in R that wraps Dionysus and PHAT, called TDA. To install it, simply open an R session, and enter

install.package('TDA')


library(TDA)


We know that for a circle, $\beta_0=\beta_1=1$, as it has on connected components, and a hole. Prepare the circle and store it in X by the function circleUnif:

X<- circleUnif(n=1000, r=1)
plot(X)


Then we can see a 2-dimensional circle like this:

To calculate the persistent homology, use the function gridDiag:

diag.info<- gridDiag(X=X, FUN=kde, h=0.3, lim=cbind(c(-1, 1), c(-1, 1)), by=0.01, sublevel = FALSE, library = 'PHAT', printProgress=FALSE)


To plot the barcodes and persistence diagram, enter:

par(mfrow=c(2,1))
plot(diag.info$diagram) plot(diag.info$diagram, barcode=TRUE)


In the plots, black refers to degree 0, and red refers to degree 1.

We can play the same game by adding a horizontal line to cut the circle into two halves:

X<- circleUnif(n=1000, r=1)
hl<- matrix(c(seq(-1, 1, 0.02), rep(0, 201)), ncol=2)
X<- rbind(X, hl)
plot(X)


And the barcodes and persistence diagram are:

We can try this with three-dimensional objects like sphere, or torus, but I never finished the calculation in reasonable speeds.

There is no doubt that everyone who are in the so-called big data industry must know some statistics. However, statistics means differently to different peoples.

Statistics is an old field that was developed in the 18th century. In those times, people were urged to make conclusions out of a vast amount of data which were virtually not available, or were very costly to obtain. For example, someone wanted to know the average salary of the whole population, which required the census staff to survey the information from everyone in the population. It was something expensive to do in the old days. Therefore, sampling techniques were devised, and the wanted quantities can be estimated using an appropriate statistic.

Or when the scientists performed an experiment, even one data point costs a few million dollars. The experiments had to be designed in a way that the scientists extract the wanted information by looking at a few data points.

Or in testing some hypotheses, one needs to know only how to accept or reject a hypothesis using the statistical information available.

Hence, the traditional statistics is a body of knowledge that deduce the information of a whole population from a limited amount of data from a sample.

Theoretical Statistical Physics

There is a branch in physics called statistical physics, which originated from the 19th century. Later it became useful since Albert Einstein published its paper on Brownian motion in 1905. And now the methods in statistical physics is not only applied in solid state physics or condensed matter physics, but also in biophysics (e.g., diffusion), econophysics (e.g., the fairness and wealth distribution, see this previous blog post), and quantitative finance (e.g., binomial model, and its relation with Black-Scholes equation).

The techniques involved in statistical physics includes is the knowledge of probability theory and stochastic calculus (such as Ito calculus). Of course, it is how entropy, a concept from thermodynamics, entered probability theory and information theory. Extracted quantity are mostly expectation values and correlations, which are of interest to theorists.

This is very different from traditional statistics. When people know that I am a statistical physicist, they expect me to be familiar with t-test, which is not really the case. (Very often I have to look up every time I used them.)

Statistics in the Computing World

Unlike in traditional statistics or statistical physics, nowadays, we often get the statistical information directly from a vast amount of available data, thanks to the advance of technology and the reducing cost to access the technology. You can easily calculate the average salary of a population by a single command line on R or Python. Hence, statistics is no longer about extracting information from a limited amount of data, but a vast amount of data.

On the other hand, mathematical modeling is still important, but in a different sense. Models in statistical physics describes the world, but in information retrieval, models are built according to what we need.

P.S.: Philipp Janert wrote something similar in his Chapter 10 (“What You Really Need to Know About Classical Statistics”) in his “Data Analysis Using Open Source Tools“:

The basic statistical methods that we know today were developed in the late 19th and early 20th centuries, mostly in Great Britain, by a very small group of people. Of those, one worked for the Guinness brewing company and another—the most influential one of them—worked at an agricultural research lab (trying to increase crop yields and the like). This bit of historical context tells us something about their working conditions and primary challenges.

No computational capabilities All computations had to be performed with paper and pencil.

No graphing capabilities, either All graphs had to be generated with pencil, paper, and a ruler. (And complicated graphs—such as those requiring prior transformations or calculations using the data—were especially cumbersome.)

Very small and very expensive data sets Data sets were small (often not more than four to five points) and could be obtained only with great difficulty. (When it always takes a full growing season to generate a new data set, you try very hard to make do with the data you already have!)

In other words, their situation was almost entirely the opposite of our situation today:

• Computational power that is essentially free (within reason)
• Interactive graphing and visualization capabilities on every desktop
• Often huge amounts of data

It should therefore come as no surprise that the methods developed by those early researchers seem so out of place to us: they spent a great amount of effort and ingenuity solving problems we simply no longer have! This realization goes a long way toward explaining why classical statistics is the way it is and why it often seems so strange to us today.

P.S.: The graph at the beginning of this blog entry was plotted in Mathematica, by running the following:

Plot[Evaluate@Table[PDF[MaxwellDistribution[σ], x], {σ, {1, 2, 3}}], {x, 0, 10}, Filling -> Axis]