In text mining, it is important to create the document-term matrix (DTM) of the corpus we are interested in. A DTM is basically a matrix, with documents designated by rows and words by columns, whose elements are the counts or the weights (usually by tf-idf) of the words in the documents. Subsequent analysis is usually based on the DTM.
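To make the structure concrete, here is a toy DTM built in plain Python with raw counts (the two "documents" are made up for illustration):

```python
from collections import Counter

# Two tiny "documents"
docs = {"doc1": "the people of the nation",
        "doc2": "the nation endures"}

# Count tokens per document
counts = {docid: Counter(text.split()) for docid, text in docs.items()}

# Rows are documents, columns are words, elements are raw counts
vocab = sorted(set(word for c in counts.values() for word in c))
dtm = [[counts[docid][word] for word in vocab] for docid in sorted(docs)]

# vocab -> ['endures', 'nation', 'of', 'people', 'the']
# dtm   -> [[0, 1, 1, 1, 2],   (doc1)
#           [1, 1, 0, 0, 1]]   (doc2)
```

A real DTM would typically be stored as a sparse matrix and weighted by tf-idf, but the row/column layout is the same.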
Exploring the DTM therefore becomes an important issue for a good text-mining tool. How do we perform exploratory data analysis on a DTM using R and Python? We will demonstrate it using the dataset of U.S. Presidents' inaugural addresses, which has been preprocessed and can be downloaded here.
In R, we can use the package textmineR, which has been introduced in a previous post. Together with other packages such as dplyr (for tidy data analysis) and SnowballC (for stemming), load all of them at the beginning:
library(dplyr)
library(textmineR)
library(SnowballC)
Load the datasets:
usprez.df <- read.csv('inaugural.csv', stringsAsFactors = FALSE)
Then we create the DTM, removing all digits and punctuation, converting all letters to lowercase, and stemming all words with the Porter stemmer:
dtm <- CreateDtm(usprez.df$speech,
                 doc_names = usprez.df$yrprez,
                 ngram_window = c(1, 1),
                 lower = TRUE,
                 remove_punctuation = TRUE,
                 remove_numbers = TRUE,
                 stem_lemma_function = wordStem)
Then we define a set of helper functions:
get.doc.tokens <- function(dtm, docid)
  dtm[docid, ] %>% as.data.frame() %>% rename(count = ".") %>%
  mutate(token = row.names(.)) %>% arrange(-count)

get.token.occurrences <- function(dtm, token)
  dtm[, token] %>% as.data.frame() %>% rename(count = ".") %>%
  mutate(token = row.names(.)) %>% arrange(-count)

get.total.freq <- function(dtm, token) dtm[, token] %>% sum

get.doc.freq <- function(dtm, token)
  dtm[, token] %>% as.data.frame() %>% rename(count = ".") %>%
  filter(count > 0) %>% pull(count) %>% length
Then we can happily extract information. For example, to get the most common words in Obama's 2009 speech, enter:
dtm %>% get.doc.tokens('2009-Obama') %>% head(10)
Or to find which speeches contain the word "change" (the word needs to be stemmed before extraction):
dtm %>% get.token.occurrences(wordStem('change')) %>% head(10)
You can also get the document frequency of a word, i.e., the number of speeches that contain it:
dtm %>% get.doc.freq(wordStem('change')) # gives 28
In Python, we can use the package shorttext. Import it together with the other packages needed:

import re

import numpy as np
import pandas as pd
import shorttext
from stemming.porter import stem
And define the preprocessing pipelines:
pipeline = [lambda s: re.sub(r'[^\w\s]', '', s),
            lambda s: re.sub(r'[\d]', '', s),
            lambda s: s.lower(),
            lambda s: ' '.join(map(stem, shorttext.utils.tokenize(s)))]
txtpreprocessor = shorttext.utils.text_preprocessor(pipeline)
The function <code>txtpreprocessor</code> above performs the same preprocessing steps that we did in R.
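Conceptually, the preprocessor simply chains the pipeline functions in order. A minimal sketch of the idea in plain Python (the <code>make_preprocessor</code> name is ours, and stemming is left out for simplicity):

```python
import re

def make_preprocessor(pipeline):
    # Apply each function in the pipeline, in order, to the input string
    def preprocessor(s):
        for func in pipeline:
            s = func(s)
        return s
    return preprocessor

pipeline = [lambda s: re.sub(r'[^\w\s]', '', s),   # strip punctuation
            lambda s: re.sub(r'[\d]', '', s),      # strip digits
            lambda s: s.lower()]                   # lowercase
preprocess = make_preprocessor(pipeline)

preprocess("Hello, World 2009!")   # -> 'hello world '
```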
Load the dataset:
usprezdf = pd.read_csv('inaugural.csv')
The corpus needs to be preprocessed before putting into the DTM:
docids = list(usprezdf['yrprez'])    # defining document IDs
corpus = [txtpreprocessor(speech).split(' ') for speech in usprezdf['speech']]
Then create the DTM:
dtm = shorttext.utils.DocumentTermMatrix(corpus, docids=docids, tfidf=False)
Then we can do the same things as we did in R above. To get the most common words in Obama's 2009 speech, enter:
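A minimal call, assuming shorttext's <code>DocumentTermMatrix</code> exposes a <code>get_doc_tokens</code> method as in its documented API (check your version of the package if it differs):

```python
# token counts for Obama's 2009 speech, keyed by token
dtm.get_doc_tokens('2009-Obama')
```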
Or we look up which speeches have the word “change”:
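Again the method name below is assumed from shorttext's <code>DocumentTermMatrix</code> API; verify it against the package documentation for your version:

```python
# speeches containing the stemmed word "change", keyed by document ID
dtm.get_token_occurences(stem('change'))
```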
Or to get the document frequency of the word:
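Assuming shorttext's <code>DocumentTermMatrix</code> provides a <code>get_doc_frequency</code> method (an assumption based on its documented API), the document frequency is:

```python
# number of speeches containing the stemmed word "change"
dtm.get_doc_frequency(stem('change'))
```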
The Python and R codes give different document frequencies, probably because the two stemmers work slightly differently.