Essential Python Packages

Almost three years ago, I wrote a blog entry titled Useful Python Packages, which listed the essential packages that I deemed important. How has the list been changed over the past three years?

First of all, three years ago, most people were still writing Python 2.7. But now there is a trend to switch to Python 3. I admitted that I still have not started the switch yet, but in the short term, I will have no choice and I will.

What are some of the essential packages?
Numerical Packages

  • numpy: numerical Python, containing most basic numerical routines such as matrix manipulation, linear algebra, random sampling, numerical integration etc. There is a built-in wrapper for Fortran as well. Actually, numpy is so important that some Linux system includes it with Python.
  • scipy: scientific Python, containing some functions useful for scientific computing, such as sparse matrices, numerical differential equations, advanced linear algebra, special functions etc.
  • networkx: package that handles various types of networks
  • PuLP: linear programming
  • cvxopt: convex optimization

Data Visualization

  • matplotlib: basic plotting.
  • ggplot2: the ggplot2 counterpart in Python for producing quality publication plots.

Data Manipulation

  • pandas: data manipulation, working with data frames in Python, and save/load of various formats such as CSV and Excel

Machine Learning

  • scikit-learn: machine-learning library in Python, containing classes and functions for supervised and unsupervised learning

Probabilistic Programming

  • PyMC: Metropolis-Hasting algorithm
  • Edward: deep probabilistic programing

Deep Learning Frameworks

  • TensorFlow: because of Google’s marketing effort, TensorFlow is now the industrial standard for building deep learning networks, with rich source of mathematical functions, esp. for neural network cells, with GPU capability
  • Keras: containing routines of high-level layers for deep learning neural networks, with TensorFlow, Theano, or CNTK as the backbone
  • PyTorch: a rivalry against TensorFlow

Natural Language Processing

  • nltk: natural language processing toolkit for Python, containing bag-of-words model, tokenizer, stemmers, chunker, lemmatizers, part-of-speech taggers etc.
  • gensim: a useful natural language processing package useful for topic modeling, word-embedding, latent semantic indexing etc., running in a fast fashion
  • shorttext: text mining package good for handling short sentences, that provide high-level routines for training neural network classifiers, or generating feature represented by topic models or autoencodings.
  • spacy: industrial standard for natural language processing common tools

GUI

I can probably list more, but I think I covered most of them. If you do not find something useful, it is probably time for you to write a brand new package.

Coffee Cooling: a Demo of Numerical ODEs in Python

A lot of Americans are addicted to coffee. I have one cup a day. I know someone has a few cups every day. Weird enough, I know someone who needs coffee to go to bed.

Coffee is for sipping, but we do want to enjoy it as soon as possible at a right temperature. Whether we add sugar cubes at the beginning or the end can make a lot of difference. In general, coffee cooling follows empirically the Newton’s law of cooling:

\frac{dT}{dt} = -k (T-T_r),

where T_r is the room temperature, and k is the cooling parameter. It is a simple ordinary differential equation (ODE) that bears analytical solution. However, let’s solve it numerically while we demonstrate the NumPy/SciPy capabilities. The function we are using is odeint. Let’s import necessary packages:

import numpy as np
from scipy.integrate import odeint
from functools import partial

We will need the partial function from functools shortly. Let’s define the right-hand-side of the ODE above as the cooling function in terms of a lambda expression:

coolingfcn = lambda temp, t0, k, roomtemp: -k*(temp-roomtemp)

And we need a set of cooling parameters that makes sense from our daily experience:

cooling_parameters = {'init_temp': 94.0, # initial temperature of the coffee (degrees Celcius)
                      'cube_dectemp': 5.0, # temperature decrease due to the cube
                      'roomtemp': 25.0, # room temperature
                      'k': 1.0/15.0 # cooling parameter (inverse degrees Celcius)
                     }

The fresh-brewed coffee is of 94 degrees Celcius. The room temperature is 25 degrees Celcius. Whenever a whole cube is added to the coffee, its temperature drops by 5 degrees Celcius, which is not a good assumption when the coffee is totally cooled down to room temperature. The cooling parameter k=\frac{1}{15} \text{min}^{-1}, meaning in 15 minutes, the coffee was cooled by e^{-1} of its temperature difference from the environment.

From the SciPy documentation of odeint, the first three parameters are important:

scipy.integrate.odeint(func, y0, t, args=(), Dfun=None, col_deriv=0, full_output=0, ml=None, mu=None, rtol=None,atol=None, tcrit=None, h0=0.0, hmax=0.0, hmin=0.0, ixpr=0, mxstep=0, mxhnil=0, mxordn=12, mxords=5,printmessg=0)

The first parameter is the cooling function, and the second is the initial value of the dependent function (temperature in this case), and the third is the ndarray of times of the temperatures to be calculated. We can easily write the functions that return the ndarray of the temperatures for the cases of adding sugar cube before and after the cooling process respectively:

# adding sugar cube before cooling
def cube_before_cooling(t, init_temp, cube_dectemp, roomtemp, k):
    y0 = init_temp - cube_dectemp
    temps = odeint(partial(coolingfcn, k=k, roomtemp=roomtemp), y0, t)
    return temps

# adding sugar cube after cooling
def cube_after_cooling(t, init_temp, cube_dectemp, roomtemp, k):
    temps = odeint(partial(coolingfcn, k=k, roomtemp=roomtemp), init_temp, t)
    temps[-1] -= cube_dectemp
    return temps

We can plot the temperature changes on graphs using matplotlib.

import matplotlib.pyplot as plt

And we are interested in the first 20 minutes:

times = np.linspace(0, 20, 201)

For adding a sugar cube before cooling,

temps_beforecooling = cube_before_cooling(times, **cooling_parameters)
plt.plot(times, temps_beforecooling)
plt.xlabel('Time (min)')
plt.ylabel('Coffee Temperature (degrees Celcius)')

(If you are typing in an iPython shell, you need to put plt.show() to show the plot.) This gives the following plot:

download (2)

And for adding the sugar cube after cooling,

temps_aftercooling = cube_after_cooling(times, **cooling_parameters)
plt.plot(times, temps_aftercooling)
plt.xlabel('Time (min)')
plt.ylabel('Coffee Temperature (degrees Celcius)')

This gives the following plot:

download (3)

Obviously, adding the sugar cube after cooling gives a lower temperature.

Advanced Gimmicks

How about we add sugar continuously throughout the 20-minute time span?  We need to adjust our differential equation to be:

\frac{dT}{dt} = -k (T-T_r) - \frac{T_{\text{drop}}}{\delta t},

which can be implemented by the following code:

# adding sugar continuously
def continuous_sugar_cooling(t, init_temp, cube_dectemp, roomtemp, k):
    timespan = t[-1] - t[0]
    newcoolingfcn = lambda temp, t0: coolingfcn(temp, t0, k, roomtemp) - cube_dectemp / timespan
    temps = odeint(newcoolingfcn, init_temp, t)
    return temps

or we can divide the cube to be added to the cup at a few time spots, which can be implemented by the following code that employs our previously defined functions:

# adding sugar at a specified time(s)
def sugar_specifiedtime_cooling(t, sugar_time_indices, init_temp, cube_dectemp, roomtemp, k):
    sorted_sugar_time_indices = np.sort(sugar_time_indices)
    num_portions = len(sorted_sugar_time_indices)

    temps = np.array([])
    temp = init_temp
    t_segments = np.split(t, sorted_sugar_time_indices)
    num_segments = len(t_segments)
    for i in range(num_segments):
        temp_segment = cube_after_cooling(t_segments[i],
                                          temp,
                                          float(cube_dectemp)/num_portions,
                                          roomtemp,
                                          k)
        temps = np.append(temps, temp_segment)
        temp = temp_segment[-1]
    temps[-1] += float(cube_dectemp)/num_portions

    return temps

Let’s calculate all the temperatures:

temps_cont = continuous_sugar_cooling(times, **cooling_parameters)
temps_n = sugar_specifiedtime_cooling(times, [50, 100, 150], **cooling_parameters)

And plot all the graphs together:

plt.plot(times, temps_beforecooling, label='cube before cooling')
plt.plot(times, temps_aftercooling, label='cube after cooling')
plt.plot(times, temps_cont, label='continuous')
plt.plot(times, temps_n, label='3 times of dropping cube')
plt.xlabel('Time (min)')
plt.ylabel('Coffee Temperature (degrees Celcius)')
plt.legend()

download (4)

Complete code demonstration can be found in my github (GitHub/stephenhky/CoffeeCooling)

Continue reading “Coffee Cooling: a Demo of Numerical ODEs in Python”

Useful Python Packages

python
(Taken from http://latticeqcd.org/pythonorg/static/images/antigravity.png, adapted from http://xkcd.com/353/)

Python is the basic programming languages if one wants to work on data nowadays. Its popularity comes with its intuitive syntax, its support of several programming paradigms, and the package numpy (Numerical Python). Yes, if you asked which package is a “must-have” outside the standard Python packages, I would certainly name numpy.

Let me list some useful packages that I have found useful:

  1. numpy: Numerical Python. Its basic data type is ndarray, which acts like a vector with vectorized calculation support. It makes Python to perform matrix calculation efficiently like MATLAB and Octave. It supports a lot of commonly used linear algebraic algorithms, such as eigenvalue problems, SVD etc. It is the basic of a lot of other Python packages that perform heavy numerical computation. It is such an important package that, in some operating systems, numpy comes with Python as well.
  2. scipy: Scientific Python. It needs numpy, but it supports also sparse matrices, special functions, statistics, numerical integration…
  3. matplotlib: Graph plotting.
  4. scikit-learn: machine learning library. It contains a number of supervised and unsupervised learning algorithms.
  5. nltk: natural language processing. It provides not only basic tools like stemmers, lemmatizers, but also some algorithms like maximum entropy, tf-idf vectorizer etc. It provides a few corpuses, and supports WordNet dictionary.
  6. gensim: another useful natural language processing package with an emphasis on topic modeling. It mainly supports Word2Vec, latent semantic indexing (LSI), and latent Dirichlet allocation (LDA). It is convenient to construct term-document matrices, and convert them to matrices in numpy or scipy.
  7. networkx: a package that supports both undirected and directed graphs. It provides basic algorithms used in graphs.
  8. sympy: Symbolic Python. I am not good at this package, but I know mathics and SageMath are both based on it.
  9. pandas: it supports data frame handling like R. (I have not used this package as I am a heavy R user.)

Of course, if you are a numerical developer, to save you a good life, install Anaconda.

There are some other packages that are useful, such as PyCluster (clustering), xlrd (Excel files read/write), PyGame (writing games)… But since I have not used them, I would rather mention it in this last paragraph, not to endorse but avoid devaluing it.

Don’t forget to type in your IPython Notebook:

import antigravity

Continue reading “Useful Python Packages”

Ranking Everything: an Overview of Link Analysis Using PageRank Algorithm

This is an age of quantification, meaning that we want to give everything, even qualitative, a number. In schools, teachers measure how good their students master mathematics by grading, or scoring their homework. The funding agencies measure how good a scientist is by counting the number of his publications, the citations, and the impact factors. We measure how successful a person is by his annual income. We can question all these approaches of measurement. Yet however good or bad the measures are, we look for a metric to measure.

Original PageRank Algorithm

We measure webpages too. In the early ages of Internet, people performed searching on sites such as Yahoo or AltaVista. The keywords they entered are the main information for the browser to do the searching. However, a big problem was that a large number of low quality or irrelevant webpages showed up in search results. Some were due to malicious manipulation of keyword tricks. Therefore, it gave rise a need to rank the webpages. Larry Page and Sergey Brin, the founders of Google, tackled this problem as a thesis topic in Stanford University. But this got commercialized, and Brin never received his Ph.D. They published their algorithm, called PageRank, named after Larry Page, at the Seventh International World Wide Web Conference (WWW7) in April 1998. [Brin & Page 1998] This algorithm is regarded as one of the top ten algorithms in data mining by a survey paper published in the IEEE International Conference on Data Mining (ICDM) in December 2006. [Wu et. al. 2008]

Google-s-Larry-Page-and-Sergey-Brin-Are-3-2-19-Billion-Richer-in-One-Day-392729-2
Larry Page and Sergey Brin (source)

The idea of the PageRank algorithm is very simple. It regards each webpage as a node, and each link in the webpage as a directional edge from the source to the target webpage. This forms a network, or a directed graph, of webpages connected by their links. A link is seen as a vote to the target homepage, and if the source homepage ranks high, it enhances the target homepage’s ranking as well. Mathematically it involves solving a large matrix using Newton-Raphson’s method. (Technologies involving handling the large matrix led to the MapReduce programming paradigm, another big data trend nowadays.)

figure_1_webnet
Example (made by Python with packages networkx and matplotlib)

Let’s have an intuition through an example. In the network, we can easily see that “Big Data 1” has the highest rank because it has the most edges pointing to it. However, there are pages such as “Big Data Fake 1,” which looks like a big data page, but in fact it points to “Porn 1.” After running the PageRank algorithm, it does not have a high rank. The sample of the output is:

[('Big Data 1', 0.00038399273501500979),
('Artificial Intelligence', 0.00034612564364377323),
('Deep Learning 1', 0.00034221161094691966),
('Machine Learning 1', 0.00034177713235138173),
('Porn 1', 0.00033859136614724074),
('Big Data 2', 0.00033182629176238337),
('Spark', 0.0003305912073357307),
('Hadoop', 0.00032928389859040422),
('Dow-Jones 1', 0.00032368956852396916),
('Big Data 3', 0.00030969537721207128),
('Porn 2', 0.00030969537721207128),
('Big Data Fake 1', 0.00030735245262038724),
('Dow-Jones 2', 0.00030461420169420618),
('Machine Learning 2', 0.0003011838672138951),
('Deep Learning 2', 0.00029899313444392865),
('Econophysics', 0.00029810944592071552),
('Big Data Fake 2', 0.00029248837867043803),
('Wall Street', 0.00029248837867043803),
('Deep Learning 3', 0.00029248837867043803)]

You can see those pornographic webpages that pretend to be big data webpages do not have rank as high as those authentic ones. PageRank fights against spam and irrelevant webpages. Google later further improved the algorithm to combat more advanced tricks of spam pages.

You can refer other details in various sources and textbooks. [Rajaraman and Ullman 2011, Wu et. al. 2008]

Use in Social Media and Forums

Mathematically, the PageRank algorithm deals with a directional graph. As one can imagine, any systems that can be modeled as directional graph allow rooms for applying the PageRank algorithm. One extension of PageRank is ExpertiseRank.

Jun Zhang, Mark Ackerman and Lada Adamic published a conference paper in the International World Wide Web (WWW7) in May 2007. [Zhang, Ackerman & Adamic 2007] They investigated into a Java forum, by connecting users to posts and anyone replying to it as a directional graph. With an algorithm closely resembled PageRank, they found the experts and influential people in the forum.

expertiserank
Graphs in ExpertiseRank (take from [Zhang, Ackerman & Adamic 2007])

There are other algorithms like HITS (Hypertext induced topic selection) that does similar things. And social media such as Quora (and its Chinese counterpart, Zhihu) applied a link analysis algorithm (probabilistic topic network, see this.) to perform topic network building. Similar ideas are also applied to identify high-quality content in Yahoo! Answers. [Agichtein, Castillo, Donato, Gionis & Mishne 2008]

Use in Finance and Econophysics

PageRank algorithm is also applied outside information technology fields. Financial engineers and econophysicists applied an algorithm, called DebtRank, which is very similar to PageRank, to determine the systemically important financial institutions in a financial network. This work is published in Nature Scientific Reports. [Battiston, Puliga, Kaushik, Tasca & Caldarelli 2012] In their study, each node represents a financial institution, and a directional edge means the estimated potential impact of an institution to another one. Using DebtRank, we are able to identify the centrally important institutions that potentially impacted other institutions in the network once a financial crisis occurs.

debtrank
D
ebtRank network, taken from [Battiston, Puliga, Kaushik, Tasca & Caldarelli 2012])

Continue reading “Ranking Everything: an Overview of Link Analysis Using PageRank Algorithm”

Blog at WordPress.com.

Up ↑