Ethics and Political Correctness in Algorithms

I recently read an article on ethics in data science. The ethics in question is not about plagiarism, disclosure of confidential data, or dishonesty, but about the ethical choices made when designing a model. It set me thinking, though I have not reached any conclusions.

Many countries have a long and painful history of racism. In America, not to mention the history of slavery, a recent verdict against a Chinese-American police officer sparked a nationwide Asian-American campaign, given the history of the Chinese Exclusion Act. Recruitment nowadays must officially not be based on race, but we all know that racism still exists in the job market in practice. When, as in the article, a public policy is enacted with the help of an algorithm, a racial bias in that algorithm can be problematic. For some algorithms, people might not know that race is used in the model unless someone is monitoring it. The data scientists can quietly put it in at no cost. But is it ethical?

Or it may be that the historical data carries a race-biased history, even though we know that race is not a relevant factor in the situation at hand. We may simply drop race from the model; or, even worse, we may need a “counter-term” to offset this dark history in the data in order to build a useful predictive model.

Sometimes it might be desirable to include race in the model so that underprivileged groups are happy too. For example, suppose that instead of public policy, I am building a dating website. Race, gender, and sexual orientation matter there too, alongside personality type, age difference, and so on.

Because many algorithms, such as SVMs or neural networks, work like black boxes, we do not immediately see their biased effects. If the bias is not obvious, or people are simply happy with the results, it seems not to matter. But does it?

Or are we overthinking this? People might not care as much as we think, but the scientists may still be held liable. Political correctness can be a killer. Maybe that is why there are so many headline stories from the presidential primary campaign right now.



LDA2Vec: a hybrid of LDA and Word2Vec

LDA (latent Dirichlet allocation) and Word2Vec are two important algorithms in natural language processing (NLP). LDA is a widely used topic modeling algorithm that infers the topic distribution of each document in a corpus, and the word distribution within each topic, under Dirichlet priors. Word2Vec is a vector-representation model, trained with a shallow neural network (the CBOW or skip-gram architecture, not a recurrent neural network), that learns a continuous representation for words.
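
To make the LDA description concrete, here is a minimal sketch of its generative story using only the Python standard library. The vocabulary, the two topic-word distributions, and the value of `alpha` are toy assumptions of mine, and the topic-word distributions are fixed by hand rather than drawn from their own Dirichlet prior. It only generates a document; it does not fit a model.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, alpha, n_words, rng):
    """Generate one document following the LDA generative story."""
    # Per-document topic mixture, drawn from the Dirichlet prior.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # Pick a topic for this word position according to theta.
        z = rng.choices(range(len(theta)), weights=theta)[0]
        # Pick a word from the chosen topic's word distribution.
        words.append(rng.choices(vocab, weights=topic_word[z])[0])
    return theta, words

rng = random.Random(0)
vocab = ["gene", "cell", "protein", "ball", "team", "score"]
# Toy topic-word distributions (assumed): a "biology" and a "sports" topic.
topic_word = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],
]
theta, doc = generate_document(topic_word, vocab, alpha=[0.5, 0.5], n_words=8, rng=rng)
print("topic mixture:", theta)
print("document:", doc)
```

A small `alpha` tends to give each document a mixture dominated by one topic, which is where the sparsity of LDA vectors comes from.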

Both are very useful, but LDA treats words and documents globally, whereas Word2Vec works locally (depending on adjacent words in the training data). An LDA vector is so sparse that users can interpret the topics easily, but it is inflexible. Word2Vec’s representation is not human-interpretable, but it is easy to use. In his slides, Chris Moody recently devised a topic modeling algorithm called LDA2Vec, a hybrid of the two, to get the best of both.
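
The global versus local distinction can be sketched in a few lines. The helper names `skipgram_pairs` and `bag_of_words` are hypothetical and written only for illustration: the first produces the (center, context) pairs a skip-gram style model would train on, and the second produces the document-level counts a topic model like LDA starts from, where word order is discarded.

```python
from collections import Counter

def skipgram_pairs(tokens, window=2):
    """Local view: pair each word with its neighbors inside the window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def bag_of_words(tokens):
    """Global view: word counts for the whole document, order ignored."""
    return Counter(tokens)

tokens = "the cat sat on the mat".split()
pairs = skipgram_pairs(tokens, window=1)
counts = bag_of_words(tokens)

print(pairs[:3])   # first few (center, context) pairs
print(counts)      # document-level counts
```

With `window=1`, "cat" is never paired with "mat" because they are not adjacent, while the bag of words treats every token in the document alike.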

Honestly, I have never used this algorithm. I rarely talk about something I have not even tried, but I want to raise awareness so that more people know about it by the time I come to use it. To me, it looked like concatenating two vectors with some hyperparameters, but the source code contradicts that impression. It is a topic modeling algorithm.
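
A rough sketch of the difference, based on my reading of Moody's description rather than on his actual code: the document vector appears to be a softmax-weighted mixture of topic vectors, and it is added to the word vector to form a context vector, not concatenated with it. All names and numbers below are hypothetical toy values.

```python
import math

def softmax(weights):
    """Turn unnormalized document weights into topic proportions."""
    m = max(weights)
    exps = [math.exp(w - m) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]

def document_vector(doc_weights, topic_vectors):
    """Mix topic vectors by the document's topic proportions."""
    proportions = softmax(doc_weights)
    dim = len(topic_vectors[0])
    return [sum(p * t[d] for p, t in zip(proportions, topic_vectors))
            for d in range(dim)]

def context_vector(word_vector, doc_weights, topic_vectors):
    """Additive combination: word vector plus document vector."""
    doc_vec = document_vector(doc_weights, topic_vectors)
    return [w + d for w, d in zip(word_vector, doc_vec)]

word_vector = [0.2, -0.1, 0.5]                       # toy 3-d word embedding
topic_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # two toy topic vectors
doc_weights = [0.0, 0.0]                             # equal weights, so proportions are 0.5 each

ctx = context_vector(word_vector, doc_weights, topic_vectors)
concatenated = word_vector + softmax(doc_weights)    # what I first imagined

print(len(ctx), len(concatenated))
```

The additive version stays in the word-embedding space (3 dimensions here), whereas plain concatenation would grow the vector by the number of topics (5 dimensions here).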

There are not many blog posts or papers about LDA2Vec yet. I look forward to learning more about it as awareness grows.

Jupyter Notebook for LDA2Vec Demonstration [link]
