The 2016 US presidential election ended in a surprise: Mr. Donald Trump won, despite the overwhelming prediction of a Clinton victory. Many studies have since challenged the theories behind traditional political forecasting.

Some took a statistical approach. Several studies concluded that many election forecasting models did not take into account the correlations between the predictions for individual states. However, classical computation limits the kind of model that connects all individual states (a fully-connected model). Hence, a group from QxBranch and Standard Cognition resorted to adiabatic quantum computation. (See: arXiv:1802.00069.)

D-Wave computers are adiabatic quantum computers that perform quantum annealing. A D-Wave 2X has 1152 qubits, and can naturally describe a Boltzmann machine (BM) model, which is equivalent to the Ising model in statistical physics. The energy function is:

$E[\mathbf{s}] = -\sum_{\mathbf{s}_i \in \mathbf{S}} b_i s_i - \sum_{\mathbf{s}_i, \mathbf{s}_j \in \mathbf{S}} W_{ij} s_i s_j$ ,

where $\mathbf{s}$ are the values of all qubits (0, 1, or their superpositions). The field strengths $b_i$ and coupling constants $W_{ij}$ can be tuned. Classical models can handle the first term, which is linear, but the correlations described by the second term can be computationally costly for classical computers. Hence, the authors used a D-Wave quantum computer to train the election models every two weeks from June 30, 2016 to November 11, 2016, and retrieved the correlations between individual states. They correctly simulated that Mr. Trump would win the election.
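To make the energy function concrete, here is a minimal classical sketch in Python. It is not the authors' code: the three-variable system, the values of `b` and `W`, and the convention of counting each pair once (i < j) are all assumptions made for illustration. The brute-force search at the end visits every one of the 2^n configurations, which hints at why the coupling term becomes expensive as the number of states grows.

```python
from itertools import combinations, product

def ising_energy(s, b, W):
    """E[s] = -sum_i b_i s_i - sum_{i<j} W_ij s_i s_j (each pair counted once)."""
    field_term = sum(b_i * s_i for b_i, s_i in zip(b, s))
    coupling_term = sum(W[i][j] * s[i] * s[j]
                        for i, j in combinations(range(len(s)), 2))
    return -field_term - coupling_term

# Hypothetical toy system with three binary variables (made-up numbers).
b = [0.5, -0.2, 0.1]
W = [[0.0, 1.0, 0.0],
     [1.0, 0.0, -0.5],
     [0.0, -0.5, 0.0]]

energy = ising_energy([1, 1, 0], b, W)

# Exhaustive search over all 2**n configurations for the lowest energy.
best = min(product([0, 1], repeat=len(b)),
           key=lambda s: ising_energy(s, b, W))
```

With only three variables the search is trivial, but the number of configurations doubles with every variable added.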

This Ising model of the election was devised after the election, so it is open to the suspicion that it was fitted to the known results. However, the work demonstrated the power of a quantum computer: it can solve some political modeling problems that may be too complicated for classical computers.

People have been upset about the prevalence of fake news since last year's election season. It has been about a year since the election, but fake news is still around because society is still politically charged. Some tech companies vowed to fight fake news but, as is easy to imagine, this is a tough task.

On Aug 9, 2017, Data Science DC held an event titled “Fake News as a Data Science Challenge,” with a talk by Professor Jen Golbeck from the University of Maryland. It was an interesting talk.

Fake news itself is a big problem. It has philosophical, social, political, and psychological aspects, but Prof. Golbeck focused on its data science aspect. To make it a computational problem, we need a clear and succinct definition of “fake news,” and that is already challenging. Some “fake news” is intended as a pun, sarcasm, or a joke (like The Onion). Some misinformation is shared through Twitter or Facebook without any intention to deceive. So it is difficult to draw a line. What is beyond doubt is that we want to fight news with malicious intent.

To fight fake news, as Prof. Golbeck has pointed out, there are three main tasks:

1. detecting the content;
2. detecting the source; and
3. modifying the intent.

Statistical tools can be exploited too. She talked about Benford’s law, which states that, in many naturally occurring collections of numbers, the first digits are not evenly distributed: smaller leading digits appear more often. An anomaly in the first-digit distribution of some news can serve as a first step in fraud detection. (Read her paper.)
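As a rough illustration of the idea (not Prof. Golbeck's code), the sketch below compares observed first-digit frequencies with the Benford expectation, log10(1 + 1/d). The helper names and the sample data, the first hundred powers of 2, are my own choices for demonstration; a real check would use numbers collected from the accounts or articles under study.

```python
import math
from collections import Counter

def benford_expected(d):
    """Probability that the leading digit is d under Benford's law."""
    return math.log10(1 + 1 / d)

def first_digit(x):
    """Leading nonzero digit of a nonzero number, read off scientific notation."""
    return int(f"{abs(x):e}"[0])

def first_digit_frequencies(numbers):
    """Observed frequency of each leading digit 1..9, skipping zeros."""
    digits = [first_digit(x) for x in numbers if x != 0]
    counts = Counter(digits)
    return {d: counts[d] / len(digits) for d in range(1, 10)}

# Stand-in data set: powers of 2 are known to follow Benford's law closely.
sample = [2 ** k for k in range(1, 101)]
observed = first_digit_frequencies(sample)

for d in range(1, 10):
    print(d, round(observed[d], 3), round(benford_expected(d), 3))
```

A large gap between the two columns would be a reason to look closer, not proof of fraud by itself.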

There are also efforts, the Fake News Challenge for example, to build corpora of fake news for training machine learning models.

However, I am not sure fighting fake news is enough. Many Americans are concerned not only by the prevalence of fake news, but also by how the news is narrated, because of our own ideological biases. Sometimes we are dissatisfied because we think the news is not “neutral” enough, or because it does not fit our worldview.

The slides can be found here, and the video of the talk can be found here.

Today is the presidential election.

Setting the dirty politics aside, we can do some simple simulation of the election, using the electoral college data and the poll results from various sources.
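A minimal sketch of such a simulation in Python might look like the following. Everything in it is hypothetical: the state names, electoral votes, and win probabilities are placeholders rather than real poll data, and each state is drawn independently, which ignores the correlations between states.

```python
import random

# Placeholder inputs: state -> (electoral votes, probability candidate A wins).
# These are made-up numbers, not real polls.
states = {
    "State A": (10, 0.80),
    "State B": (20, 0.45),
    "State C": (15, 0.55),
    "State D": (5, 0.30),
}

def simulate_once(states, rng):
    """One simulated election: total electoral votes won by candidate A."""
    return sum(votes for votes, p_win in states.values() if rng.random() < p_win)

def win_probability(states, votes_needed, n_trials=10000, seed=0):
    """Fraction of simulated elections in which A reaches votes_needed."""
    rng = random.Random(seed)
    wins = sum(simulate_once(states, rng) >= votes_needed
               for _ in range(n_trials))
    return wins / n_trials

total_votes = sum(votes for votes, _ in states.values())
p_a = win_probability(states, votes_needed=total_votes // 2 + 1)
print("Estimated probability that A wins:", p_a)
```

In a real run, the win probabilities would be derived from poll averages and their uncertainties, and `votes_needed` would be 270.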

Look at this sophisticated model in R: http://blog.yhat.com/posts/predicting-the-presidential-election.html

(If I have time after the election, I will do the simulation too…)

The first presidential debate of 2016 was held on September 26, 2016 at Hofstra University in New York. An interesting analysis is the literacy level demonstrated by the two candidates, measured by Flesch reading ease and Flesch-Kincaid grade level, as demonstrated in my previous blog entry and my Github: stephenhky/PyReadability.

First, we need the transcript of the debate, which can be found in an article in the New York Times. Copy and paste the text into a file called first_debate_transcript.txt. Then we want to extract the speech of each person. To do this, store the following Python code in first_debate_segment.py.

# Trump and Clinton 1st debate on Sept 26, 2016

from nltk import word_tokenize
from collections import defaultdict
import re

def untokenize(words):
    """
    Untokenizing a text undoes the tokenizing operation, restoring
    punctuation and spaces to the places that people expect them to be.
    Ideally, untokenize(tokenize(text)) should be identical to text,
    except for line breaks.
    """
    text = ' '.join(words)
    # NLTK writes opening quotes as `` and closing quotes as ''
    step1 = text.replace("`` ", '"').replace(" ''", '"').replace('. . .', '...')
    step2 = step1.replace(" ( ", " (").replace(" ) ", ") ")
    # re-attach punctuation to the preceding word
    step3 = re.sub(r' ([.,:;?!%]+)([ \'"`])', r"\1\2", step2)
    step4 = re.sub(r' ([.,:;?!%]+)$', r"\1", step3)
    step5 = step4.replace(" '", "'").replace(" n't", "n't").replace(
        "can not", "cannot")
    step6 = step5.replace(" ` ", " '")
    return step6.strip()

ignored_phrases = ['(APPLAUSE)', '(CROSSTALK)']
persons = ['TRUMP', 'CLINTON', 'HOLT']

# read the transcript (UTF-8 text) and drop the empty lines
with open('first_debate_transcript.txt', encoding='utf-8') as fin:
    lines = [line.strip() for line in fin if line.strip()]

speeches = defaultdict(str)
person = None

for line in lines:
    tokens = word_tokenize(line)
    ignore_colon = False
    added_tokens = []    # tokens on this line spoken by the current person
    for token in tokens:
        if token in ignored_phrases:
            pass
        elif token in persons:
            # a speaker label switches the current person
            person = token
            ignore_colon = True
        elif token == ':':
            ignore_colon = False
        else:
            added_tokens.append(token)
    speeches[person] += ' ' + untokenize(added_tokens)

# write one file per person
for person in persons:
    with open('speeches_' + person + '.txt', 'w', encoding='utf-8') as fout:
        fout.write(speeches[person])


There is an untokenize function adapted from code on StackOverflow. This script segments the transcript into the individual speeches of Lester Holt (the host of the debate), Donald Trump (the GOP presidential candidate), and Hillary Clinton (the Democratic presidential candidate) in separate files. Then, on the UNIX or Linux command line, run score_readability.py on each person’s speech. For example, for Holt’s speech:

python score_readability.py speeches_HOLT.txt --utf8

Beware that it is encoded in UTF-8. For Lester Holt, we have

Word count = 1935
Sentence count = 157
Syllable count = 2732
Flesch-Kincaid grade level = 5.87694629602

For Donald Trump,

Word count = 8184
Sentence count = 693
Syllable count = 10665
Flesch-Kincaid grade level = 4.3929136992

And for Hillary Clinton,

Word count = 6179
Sentence count = 389
Syllable count = 8395
Flesch-Kincaid grade level = 6.63676650035
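The grade levels above are consistent with the standard Flesch-Kincaid formula, 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59. As a quick check, the sketch below recomputes them from the word, sentence, and syllable counts reported above. The function names are mine, not necessarily those used in PyReadability, and the reading-ease function is included only for reference.

```python
def flesch_kincaid_grade(words, sentences, syllables):
    """Standard Flesch-Kincaid grade level formula."""
    return 0.39 * words / sentences + 11.8 * syllables / words - 15.59

def flesch_reading_ease(words, sentences, syllables):
    """Standard Flesch reading ease formula (higher means easier to read)."""
    return 206.835 - 1.015 * words / sentences - 84.6 * syllables / words

# (word count, sentence count, syllable count) as reported above
counts = {
    "HOLT": (1935, 157, 2732),
    "TRUMP": (8184, 693, 10665),
    "CLINTON": (6179, 389, 8395),
}

grades = {name: flesch_kincaid_grade(*c) for name, c in counts.items()}
for name, grade in grades.items():
    print(name, grade)
```

The formula rewards short sentences and short words, so a lower grade level reflects simpler sentence structure rather than the content of what was said.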