Ethics and Political Correctness in Algorithms

Recently I read an article about ethics in data science. The ethics in question is not about plagiarism, disclosure of confidential data, or dishonesty, but about the ethical considerations that go into designing a model. It set me thinking, though I have reached no conclusions.

A lot of countries have a long and painful history of racism. In America, not even to mention the history of slavery, a recent verdict against a Chinese-American police officer triggered a nationwide Asian-American campaign, given the history of the Chinese Exclusion Act. Recruitment nowadays technically cannot be based on race, but we all know that racism still effectively exists in the job market. When, as in the article, a public policy is enacted with the help of an algorithm, a racial bias in that algorithm can be a problem. For some algorithms, people might never know that race is included in the model unless someone is monitoring it. Data scientists can slip it in quietly at no cost. But is it ethical?

Or the data may be so historical that it carries a racially biased past, even though we know race is not a factor in the particular situation. We may simply drop race from the model; or, even worse, we may need a “counter-term” to combat this dark history in the data in order to build a useful predictive model.

Sometimes it might even be favorable to put race in the model so that underprivileged groups are also happy. For example, suppose that instead of public policy I am building a dating website. Race, gender, and sexual orientation are important there, besides personality types, age difference, and so on.

Because a lot of algorithms, such as SVMs or neural networks, work like black boxes, we do not immediately see the biased effect. And if the bias turns out not to be obvious, or people are simply happy, it seems not to matter. But does it?

Or are we over-thinking this? People might not care as much as you think, yet the scientists may still be held liable. Political correctness can be a killer. Maybe that is why there are so many headline stories in the presidential primary campaign right now.


Continue reading “Ethics and Political Correctness in Algorithms”


Computer Science Eclipsing Funding for Statisticians

The current trend in data science treats a collection of algorithms, known as machine learning, as the “golden key” to all numerical problems. I used quotation marks because I know it cannot do everything.

A lot of these algorithms are optimization problems, and many of them are related to statistics: for example, hidden Markov models, Bayesian networks, and conditional random fields. However, all of these are very different from classical statistics.

Norman Matloff, a professor of computer science at UC Davis and a statistician, expressed concern in an article posted on AMSTAT News. He thinks that, because of the engineering research model of computer science, effort is spent on publishing good results instead of understanding the science behind them. Very often computer scientists are reinventing the wheel, publishing things that statisticians did a long time ago, because they just do not have the room to look up what has been done.

I think what Matloff said is fair. However, I think statistics and computer science are working on problems with different focuses. Classical statistics often deals with data sampled from a large population and works out what it implies about that population; machine learning algorithms often deal with pragmatic predictive analytics based on a model derived from all the real data available. Classical statistics studies what a small sample means for a large population; computer science often extracts information from the large population itself.

Philipp Janert wrote in his Data Analysis Using Open Source Tools that:

  “It should therefore come as no surprise that the methods developed by those early researchers seem so out of place to us: they spent a great amount of effort and ingenuity solving problems we simply no longer have! This realization goes a long way toward explaining why classical statistics is the way it is and why it often seems so strange to us today. By contrast, modern statistics is very different. It places greater emphasis on nonparametric methods and Bayesian reasoning, and it leverages current computational capabilities through simulation and resampling methods. The book by Larry Wasserman (see the recommended reading at the end of this chapter) provides an overview of a more contemporary point of view.”

It does not mean classical statisticians and computer scientists cannot live together. Matloff suggested curricula in which students majoring in computer science and in statistics share some core courses.

Although computer science works on statistics in a quite different way, classical statistics still plays a major role in scientific research, because researchers often do not have abundant data: each data point is expensive.

Continue reading “Computer Science Eclipsing Funding for Statisticians”

Computational Journalism

We “sensed” what the current hot issues were in the past (and we still often do today). The methods of “sensing,” or “detecting,” are more sophisticated now that computational technologies are more advanced. The methods involved can be collected into a field called “computational journalism.”

Recently there was a blog post by Jeiran about understanding the public impression of Iran using computational methods. She divided the question into temporal and topical perspectives. The temporal perspective looks at time-varying patterns in the number of related news articles; the topical perspective looks at the distribution of topics, using latent Dirichlet allocation (LDA) and Bayes’ theorem. The blog post is worth reading.
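As a rough illustration of the topical side, here is a minimal sketch of LDA topic modeling with scikit-learn (not Jeiran’s actual pipeline, and assuming a recent scikit-learn); the article snippets and the number of topics are made up for illustration.

# Minimal LDA sketch with scikit-learn; "articles" is a hypothetical list of news texts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

articles = [
    "sanctions nuclear deal negotiations tehran",
    "oil exports economy sanctions relief",
    "election president reform voters turnout",
]

# Convert the raw text into a bag-of-words matrix.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(articles)

# Fit an LDA model with an arbitrary number of topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)   # per-document topic distribution

# Print the top words of each topic.
words = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [words[i] for i in topic.argsort()[-5:][::-1]]
    print("Topic", k, ":", ", ".join(top))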

In February last year, a video clip went online of Daeil Kim, a data scientist at The New York Times, speaking at the NYC Data Science Meetup. Honestly, I still have not watched it (though I think I should have). His work is also about computational journalism, his algorithm, and LDA.

Of course, computational journalism is the application of natural language processing and machine learning to news articles… However, just as a computational physicist has to know physics, a computational journalist has to know journalism. A data scientist has to be someone who knows both the technology and the subject matter.

Continue reading “Computational Journalism”

Core Competencies of Data Science Education

What should a data scientist know? What are the core skills of a data scientist? I have not seen another job title so vague and ambiguous, one that arouses so many debates and discussions. The BD2K (Big Data to Knowledge) Centers of the NIH (National Institutes of Health) [Ohno-Machado 2014] have issued funding to a few tertiary institutions in the United States to develop data science curricula, which keeps such discussions going.

This is an interdisciplinary field. Around 15 years ago, I was still a matriculation student in Hong Kong. The University of Hong Kong (HKU) started a major called bioinformatics. People were puzzled about what it really was, because it looked like a melting pot of several unrelated disciplines (and a lot of freshmen actually complained that they did not understand the purpose of the undergraduate program). But we now understand how important it is.

So what should the students learn? One suggestion is given in the following figure:

Core Competencies in Big Data (taken from [Sainani 2015])
You can see that the core competencies include statistics, machine learning, software engineering, reproducible research, and data visualization. Some of them are mathematics and computing, some are science, and some are art. And of course, individual data scientist jobs require the corresponding business knowledge.

Honestly, I do not excel at all of them. I have a physics background, which makes it easy for me to learn machine learning and to do research. Software engineering is not hard to pick up. But statistics still feels like an alien theory to me, and visualization requires an artistic sense that I don’t possess.

Anyway, a lot to learn. Stay humble.

Continue reading “Core Competencies of Data Science Education”

Statistics Nowadays

Probability density functions of the Maxwell distribution for σ = 1, 2, 3 (see the Mathematica code at the end of this post)

There is no doubt that everyone who is in the so-called big data industry must know some statistics. However, statistics means different things to different people.

Traditional Statistics

Statistics is an old field that was developed in the 18th century. In those times, people needed to draw conclusions from a vast amount of data that was virtually unavailable, or very costly to obtain. For example, someone who wanted to know the average salary of the whole population would have required the census staff to survey everyone in that population, something expensive to do in the old days. Therefore, sampling techniques were devised, and the desired quantities could be estimated using an appropriate statistic.

Or, when scientists performed an experiment, even a single data point could cost a few million dollars. The experiments had to be designed in a way that let the scientists extract the desired information from just a few data points.

Or, in testing hypotheses, one only needed to know how to accept or reject a hypothesis using the statistical information available.

Hence, traditional statistics is a body of knowledge for deducing information about a whole population from the limited amount of data in a sample.
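As a minimal sketch of that small-sample workflow (with made-up numbers, and assuming NumPy and SciPy are available): estimate a population mean from a handful of observations, attach a confidence interval, and test a hypothesis about it.

# Small-sample inference sketch (made-up data): estimate a population mean
# from a few observations and test a hypothesis about it.
import numpy as np
from scipy import stats

sample = np.array([52.0, 48.5, 50.2, 49.1, 51.3])   # e.g. five costly measurements

mean = sample.mean()
sem = stats.sem(sample)                              # standard error of the mean

# 95% confidence interval for the population mean, using the t-distribution.
ci = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)

# One-sample t-test of the null hypothesis "population mean = 50".
t_stat, p_value = stats.ttest_1samp(sample, popmean=50.0)

print("sample mean:", mean)
print("95% CI:", ci)
print("t =", t_stat, ", p =", p_value)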

Theoretical Statistical Physics

There is a branch of physics called statistical physics, which originated in the 19th century. It became more widely useful after Albert Einstein published his paper on Brownian motion in 1905. The methods of statistical physics are now applied not only in solid-state physics and condensed matter physics, but also in biophysics (e.g., diffusion), econophysics (e.g., fairness and wealth distribution; see this previous blog post), and quantitative finance (e.g., the binomial model and its relation to the Black-Scholes equation).

The techniques involved in statistical physics include probability theory and stochastic calculus (such as Ito calculus). This is also how entropy, a concept from thermodynamics, entered probability theory and information theory. The extracted quantities are mostly expectation values and correlations, which are of interest to theorists.
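As a trivial numerical illustration (not tied to any particular physical model), such expectation values and correlations can be estimated directly from simulated samples:

# Toy estimate of an expectation value and a correlation from simulated samples.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100000)            # samples of a random variable X
y = 0.5 * x + rng.normal(size=100000)  # Y is partially correlated with X

print("E[X] ~", x.mean())              # expectation value
print("E[X^2] ~", (x**2).mean())       # second moment
print("corr(X, Y) ~", np.corrcoef(x, y)[0, 1])  # correlation coefficient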

This is very different from traditional statistics. When people learn that I am a statistical physicist, they expect me to be familiar with the t-test, which is not really the case. (Very often I have to look it up every time I use it.)

Statistics in the Computing World

Unlike in traditional statistics or statistical physics, nowadays we often get statistical information directly from a vast amount of available data, thanks to the advance of technology and the falling cost of accessing it. You can easily calculate the average salary of a population with a single line of R or Python. Hence, statistics is no longer about extracting information from a limited amount of data, but from a vast amount of data.
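For instance, assuming the salaries sit in a CSV file (the file and column names here are hypothetical), the whole-data-set average really is a one-liner in pandas:

# Average salary over the full data set (hypothetical file and column name).
import pandas as pd

print(pd.read_csv("salaries.csv")["salary"].mean())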

On the other hand, mathematical modeling is still important, but in a different sense. Models in statistical physics describe the world; in information retrieval, models are built according to what we need.

P.S.: Philipp Janert wrote something similar in Chapter 10 (“What You Really Need to Know About Classical Statistics”) of his “Data Analysis Using Open Source Tools”:

The basic statistical methods that we know today were developed in the late 19th and early 20th centuries, mostly in Great Britain, by a very small group of people. Of those, one worked for the Guinness brewing company and another—the most influential one of them—worked at an agricultural research lab (trying to increase crop yields and the like). This bit of historical context tells us something about their working conditions and primary challenges.

No computational capabilities: All computations had to be performed with paper and pencil.

No graphing capabilities, either: All graphs had to be generated with pencil, paper, and a ruler. (And complicated graphs—such as those requiring prior transformations or calculations using the data—were especially cumbersome.)

Very small and very expensive data sets: Data sets were small (often not more than four to five points) and could be obtained only with great difficulty. (When it always takes a full growing season to generate a new data set, you try very hard to make do with the data you already have!)

In other words, their situation was almost entirely the opposite of our situation today:

  • Computational power that is essentially free (within reason)
  • Interactive graphing and visualization capabilities on every desktop
  • Often huge amounts of data

It should therefore come as no surprise that the methods developed by those early researchers seem so out of place to us: they spent a great amount of effort and ingenuity solving problems we simply no longer have! This realization goes a long way toward explaining why classical statistics is the way it is and why it often seems so strange to us today.

P.P.S.: The graph at the beginning of this blog entry was plotted in Mathematica by running the following:

Plot[Evaluate@Table[PDF[MaxwellDistribution[σ], x], {σ, {1, 2, 3}}], {x, 0, 10}, Filling -> Axis]

Continue reading “Statistics Nowadays”

Scientific Models in the Computing Era


In my very first class of introductory college physics, I was told that a good scientific model has general descriptive power. While that is true, I found it to be an oversimplified statement after I was exposed to computational science and other fields outside physics.

Descriptive power is important, as in the Schelling model in economics; but to many scientists and engineers, a model is devised for its predictive power, which can be seen as one aspect of its descriptive power. Predictive power is a useful feature. All physics and engineering models have predictive power in a quantitative sense.

Physics models have to be descriptive in the sense that they describe physical things; but in the big data era, a lot of machine learning models are like black boxes, meaning that we care only about the meaning of the inputs and outputs, while the contents of the models do not necessarily carry any meaning. SVMs and deep learning are good examples. (This in fact bothers some people.) Of course, in physics there are also plenty of phenomenological models that sit between descriptive models and black boxes, such as the Ginzburg-Landau-Wilson (GLW) model widely used for magnets, superfluids, helimagnets, superconductors, liquid crystals, and so on. And, to be fair, some machine learning models are quite descriptive, such as clustering, Gaussian mixtures, and MaxEnt.
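Here is a toy contrast on made-up two-dimensional data, assuming scikit-learn: the SVM below predicts well but its internals carry no direct meaning, while the Gaussian mixture exposes means and covariances that can be read off and interpreted.

# Black box vs. descriptive model on made-up 2-D data (scikit-learn assumed).
import numpy as np
from sklearn.svm import SVC
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)), rng.normal(3, 1, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Black box: an RBF-kernel SVM predicts well, but its support vectors and
# dual coefficients carry no direct physical meaning.
svm = SVC(kernel="rbf").fit(X, y)
print("SVM predictions for two test points:", svm.predict([[0, 0], [3, 3]]))

# Descriptive: the Gaussian mixture's parameters are directly interpretable
# as cluster centres and spreads.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print("mixture means:\n", gmm.means_)
print("mixture covariances:\n", gmm.covariances_)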

A lot of traditional physics models are equation-based, but computational models are not. The reasons are evident. And of course, traditional physics models are mostly continuous, while computational models are sometimes discrete. If a computational model is continuous and equation-based, it has to be translated into a discrete version for computing machines to handle correctly.
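For example, a continuous, equation-based model such as exponential decay, dx/dt = -kx, has to be discretized, say with a simple Euler step, before a machine can integrate it. A minimal sketch:

# Euler discretization of the continuous model dx/dt = -k*x.
import math

k, dt = 0.5, 0.01     # decay rate and time step
x, t = 1.0, 0.0       # initial condition

while t < 5.0:
    x += dt * (-k * x)   # discrete update replacing the continuous derivative
    t += dt

print("x(5) ~", x, " vs exact", math.exp(-k * 5.0))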

However, whichever form the models take, we human beings are essential to give them meaning so that they are useful.

Continue reading “Scientific Models in the Computing Era”

Dream of Automation

It is a fantasy for a lot of entrepreneurs, scientists, and engineers to develop a software project that can automatically perform feature generation, training, and prediction.

Of course it is wishful thinking. There is no free lunch.

In big companies that have abundant resources (training data, brains, clusters), people can probably do something like deep learning to get the relevant features and build classification models. It is almost automatic, and it takes virtually no manual addition of human knowledge. Some scientists and engineers are enjoying the strength of word2vec, but it takes a lot of computing resources even to train a word2vec model.
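A minimal sketch of the word2vec workflow, assuming gensim 4.x and a toy corpus that is far too small to be meaningful:

# Toy word2vec training with gensim 4.x; the corpus here is far too small
# to learn anything useful and is only meant to show the workflow.
from gensim.models import Word2Vec

sentences = [
    ["data", "science", "needs", "statistics"],
    ["deep", "learning", "needs", "data"],
    ["statistics", "and", "machine", "learning"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, workers=1)
print(model.wv["data"][:5])            # first few vector components
print(model.wv.most_similar("data"))   # nearest words in the toy vector space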

If we do not have enough training data or computing resources, then to get a good classifier we ought to add human knowledge to generate features. We might even need to impose rules to convert the raw data into sensible features. The rules might be regular expressions, calculations, filters, or something that involves a knowledge base (like WordNet). Things might be simplified if the problem we are dealing with belongs to a specific domain, which reduces the amount of human knowledge we need to add.
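A minimal sketch of what such hand-written rules might look like for raw text; the regular expressions and keyword filters below are invented purely for illustration:

# Hand-crafted feature generation from raw text using simple rules and regexes.
# The specific rules are invented for illustration only.
import re

def extract_features(text):
    features = {}
    features["num_tokens"] = len(text.split())
    features["has_dollar_amount"] = bool(re.search(r"\$\d[\d,]*(\.\d+)?", text))
    features["has_date"] = bool(re.search(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b", text))
    features["num_uppercase_words"] = sum(1 for w in text.split() if w.isupper())
    features["mentions_payment"] = any(k in text.lower() for k in ("invoice", "payment", "refund"))
    return features

print(extract_features("Invoice #42: please send $1,250.00 by 09/30/2015 ASAP"))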

Stochastics and Sentiment Analysis in Wall Street

Wall Street is not only a place that facilitates the flow of money, but also a playground for scientists.

When I was young, I saw one of my uncles plotting prices for stocks to perform technical analysis. When I was in college, my friends often talked about investing in a few financial futures and options. When I was doing my graduate degree in physics, we studied John Hull’s famous textbook [Hull 2011] on quantitative finance to learn about financial modeling. A few of my classmates went to Wall Street to become quantitative analysts or financial software developers. There are ups and downs in the financial markets. But as long as we are in a capitalist society, finance is a subject we never ignore. However, scientists have not come up with a consensus about the nature of a financial market.

Agent-Based Models

Economists believe that individuals in a market are rational beings who always aim at maximizing their profits. They often apply agent-based models, which employ complex-systems theory or game theory.

Random Processes and Statistical Physics

However, a lot of mathematicians on Wall Street (including quantitative analysts and econophysicists) see stock prices as undergoing Brownian motion. [Hull 2011, Baaquie 2007] They employ tools from statistical physics and stochastic processes to study the pricing of various financial derivatives. Therefore, the random-process and econophysical approaches say little about stock price prediction (despite the fact that they do need a “return rate” in their models). Random processes are unpredictable.
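As a small illustration of the Brownian-motion view, here is a sketch that simulates a stock price path as geometric Brownian motion; the drift and volatility values are arbitrary, not calibrated to any real market.

# Simulate one path of geometric Brownian motion, dS = mu*S*dt + sigma*S*dW.
# The parameters are arbitrary and not calibrated to any real market.
import numpy as np

mu, sigma = 0.05, 0.2        # annual drift ("return rate") and volatility
S0, T, n_steps = 100.0, 1.0, 252
dt = T / n_steps

rng = np.random.default_rng(42)
increments = rng.normal(0.0, np.sqrt(dt), size=n_steps)          # Brownian increments dW
log_returns = (mu - 0.5 * sigma**2) * dt + sigma * increments    # exact GBM log-returns
prices = S0 * np.exp(np.cumsum(log_returns))

print("simulated year-end price:", prices[-1])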

However, some kinds of prediction carry great value. For example, when there is overhype or a bubble in the market, we want to know when it will burst. There are models that predict defaults and bubble bursts in a market using the log-periodic power law (LPPL). [Wosnitza, Denz 2013] In addition, there has been research showing the leverage effect in stock markets in developed countries such as Germany (cf. the fluctuation-dissipation theorem in statistical physics), and an anti-leverage effect in China (Shanghai and Shenzhen). [Qiu, Zhen, Ren, Trimper 2006]

Reconciling Intelligence and Randomness

There is some value in both views. It is hard to believe that stock prices are completely random, as the economic environment and public opinion must affect them. People can be neither completely rational nor completely random.

There has been some work on reconciling game theory and random processes, in an attempt to bring economists and mathematicians together. In this theoretical framework, financial systems are still taken to attain maximum entropy (randomness), but the “particles” in the system behave intelligently. [Venkatasubramanian, Luo, Sethuraman 2015] (See another blog entry of mine: MathAnalytics (1) – Beautiful Mind, Physical Nature and Economic Inequality.) We are not sure how successful this attempt will be at this point.

Sentiment Analysis

As people have been talking about big data in recent years, there have been attempts to apply machine learning algorithms in finance. However, scientists tend not to do pricing with machine learning algorithms, because these algorithms mostly perform classification. There are, however, attempts to predict stock prices with natural language processing (NLP) techniques by detecting public emotions (or sentiments) in social media such as Twitter. [Bollen, Mao, Zeng 2010] It has been found that measuring the public mood along a few dimensions (including Calm, Alert, Sure, Vital, Kind, and Happy) allows scientists to accurately predict the trend of the Dow Jones Industrial Average (DJIA). However, some hackers take advantage of sentiment analysis on Twitter. In 2013, a rumor spread on Twitter saying that the White House had been bombed. Trading computers responded instantly and automatically, causing the stock market to fall immediately. But the market recovered quickly after it was discovered that the news was fake. (Fig. 1)

Fig. 1: DJIA fell because of a rumor of the White House being bombed, but recovered when discovered the news was fake (taken from http://www.rt.com/news/syrian-electronic-army-ap-twitter-349/)
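As a toy sketch of the sentiment-analysis idea, here is a lexicon-based mood scorer; the word lists and tweets are invented, a crude stand-in for the mood dimensions measured in [Bollen, Mao, Zeng 2010].

# Toy lexicon-based sentiment scoring of tweets; the word lists and tweets
# are invented, standing in for the mood dimensions in Bollen et al. (2010).
POSITIVE = {"calm", "happy", "sure", "gain", "rally"}
NEGATIVE = {"fear", "crash", "panic", "loss", "bomb"}

def mood_score(tweet):
    words = tweet.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

tweets = [
    "markets rally as investors stay calm",
    "panic selling after crash rumors",
    "happy with today's gain",
]

# Aggregate the per-tweet scores into a daily mood signal.
daily_mood = sum(mood_score(t) for t in tweets) / len(tweets)
print("aggregate mood for the day:", daily_mood)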

P.S.: While I was writing this, I saw an interesting statement in the paper about the leverage effect. [Qiu, Zhen, Ren, Trimper 2006] The authors said that:

Why do the German and Chinese markets exhibit different return-volatility correlations? Germany is a developed country. To some extent, people show risk aversion, and therefore, may be nervous in trading as the stock price is falling. This induces a higher volatility. When the price is rising, people feel safe and are inactive in trading. Thus, the stock price tends to be stable. This should be the social origin of the leverage effect. However, China just experiences the first stage of capitalism, and people are somewhat excessive speculative in the financial markets. Therefore, people rush for trading as the stock price increases. When the price drops, people stay inactive in trading and wait for rising up of the stock price. That explains the antileverage effect.

Does this paragraph, written in 2006, give a hint of what is happening in China in 2015? (Fig. 2)

Fig. 2: The fall of Chinese stock market in 2015 (taken from http://www.economicpolicyjournal.com/2015/06/breaking-biggest-chinese-stock-market.html)

Continue reading “Stochastics and Sentiment Analysis in Wall Street”

Choices of Tools

When dealing with data analytics, what kind of things do we usually spend most of our time on?

I would say data cleaning and modeling.

Therefore, it is not merely software development. While we sometimes spend a lot of time on software architecture (which is important), before doing that we have to explore what we want. Very often data come in various formats, or we need to clean them manually. And very often we do not know which algorithms to use. We need to explore different ways of performing the experiments before deciding what goes into the software project.

That is why interactive programming comes into play for analytics projects. R and MATLAB are examples. However, they provide poor support for modularizing code. Python is a good tool that supports both modularization and interactive programming, but it requires an environment to run, which is very often a pain. Given that a lot of good libraries are written in Java, and that we need to do both software development and data analytics, Scala, a JVM language that supports interactive programming, may well be the next generation of programming language for this work.

