Kendra Osburn | 11-2-19 | IST 736
At the nexus of machines and humans is the hard-to-grasp, even-harder-to-quantify blanket term “artificial intelligence.” Once a Hollywood blockbuster starring Haley Joel Osment, artificial intelligence is now a Silicon Valley buzzword, like Bitcoin or blockchain, used to excite stakeholders and increase valuations.
In reality, artificial intelligence is considerably less glamorous. Artificial intelligence refers to the application of computing power to a wide variety of tasks that are too tedious for humans, susceptible to human error, or both. For example, let’s imagine that we want to know how the country feels about the President of the United States. In the olden days, before innovations like mass communication, computers and the internet, we’d have to walk door to door, ring the doorbell, interview the inhabitants, take notes, and return to our university, where we would manually sift through notes to pull out words that might seem more “positive” or “negative” in nature. While this might be manageable across a city block or housing subdivision, on a larger scale, it’s nearly impossible.
Even if we could magically snap our fingers and receive one sentence about the President from each person in the United States, we would have over 300 million sentences to review. Moreover, even if we could review and categorize each sentence in under a second, it would take us over 9 years of around-the-clock work to complete this task — and by then, we’d have a different president!
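Just to sanity-check that figure, a rough back-of-the-envelope calculation (the numbers are only the assumptions from the paragraph above):
# One sentence per person, one second per sentence, working around the clock
sentences = 300_000_000                # roughly the U.S. population
seconds_per_year = 60 * 60 * 24 * 365
print(sentences / seconds_per_year)    # comes out to about 9.5 years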
Computers, on the other hand, are much better at these kinds of menial tasks — especially those that involve counting. Computers are also very good at performing mathematical equations quickly and efficiently, with numbers too large even for our confusingly expensive Texas Instruments calculators. By leveraging these machine skills in service of a more nuanced or complex objective — for instance, assessing people’s feelings — artificial intelligence can train computers to do even more amazing things.
What happens when we come across a task a human can still perform better than a machine? What happens when this task involves detecting lies or identifying sarcasm, where our reasoning is difficult to articulate or quantify beyond a “gut feeling”? How do we measure “gut feeling,” and how can we train a computer on something so nebulous?
Enter Amazon and its Mechanical Turk program. Touted as “artificial” artificial intelligence, Amazon Mechanical Turk (AMT) farms out tasks that involve “gut feeling” to hundreds of thousands of human workers (called “turkers”) for a small sum per task. Amazon’s objective is to collect turkers’ data with the goal of automating them out of existence. Until that day arrives, however, the turkers at AMT are here to help those of us unfortunate enough to conduct a research project with unlabeled data.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Load the AMT results for the negative and positive review batches
neg = pd.read_csv('AMT_neg.csv')
pos = pd.read_csv('AMT_pos.csv')
from tabulate import tabulate
df = neg.copy()
df = df[['WorkerId', 'Answer.sentiment.label']]
print(tabulate(df[:5], tablefmt="rst", headers=df.columns))
def get_unique(df, column):
    # Return the number of unique values in a column, the raw np.unique output, and a small counts DataFrame
    unique = np.unique(df[column], return_counts=True)
    counts_df = pd.DataFrame(zip(unique[0], unique[1]))
    return len(unique[0]), unique, counts_df
num_neg, unique_neg, u_neg_df = get_unique(neg, 'WorkerId')
num_pos, unique_pos, u_pos_df = get_unique(pos, 'WorkerId')
print(num_neg, 'Turkers worked on NEG batch')
print(num_pos, 'Turkers worked on POS batch')
u_neg_df.plot(kind='bar',x=0,y=1)
u_pos_df.plot(kind='bar',x=0,y=1)
Max and min HITs per unique turker
print('For {}, the min was: {} and the max was: {}'.format('neg', unique_neg[1].min(), unique_neg[1].max()))
print('For {}, the min was: {} and the max was: {}'.format('pos', unique_pos[1].min(), unique_pos[1].max()))
import seaborn as sns
import matplotlib.pyplot as plt
sns.catplot(x="Answer.sentiment.label",
y="WorkTimeInSeconds",
kind="bar",
order=['Negative', 'Neutral', 'Positive'],
data=neg);
plt.title('Negative')
sns.catplot(x="Answer.sentiment.label",
y="WorkTimeInSeconds",
kind="bar",
order=['Negative', 'Neutral', 'Positive'],
data=pos)
plt.title('Positive')
# Split the NEG batch on a 10-second threshold: anything faster is suspiciously quick
response_time = neg[neg['WorkTimeInSeconds'] < 10]
response_time_check = neg[neg['WorkTimeInSeconds'] > 10]
len(response_time)
len(response_time_check)
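Just to put those two counts in context, a small (hypothetical) sanity check of what share of the NEG batch came back in under 10 seconds:
# What fraction of NEG HITs were answered in under 10 seconds?
fast_share = len(response_time) / len(neg)
print('{:.1%} of NEG HITs took less than 10 seconds'.format(fast_share))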
# HITs completed and average work time per worker in the POS batch
count = pos.groupby(['WorkerId'])['HITId'].count()
work_time = pos.groupby(['WorkerId'])['WorkTimeInSeconds'].mean()
new_df = pd.DataFrame([work_time, count]).T
new_df.reset_index(inplace=True)
df = new_df.copy()
df = df[['WorkerId', 'WorkTimeInSeconds', 'HITId']]
print(tabulate(df[:5], tablefmt="rst", headers=df.columns))
new_df['WorkTimeInMin'] = new_df['WorkTimeInSeconds']/60
df = new_df.copy()
df = df.sort_values(by='WorkTimeInMin', ascending=False)
df = df[['WorkerId', 'WorkTimeInMin', 'HITId']]
print(tabulate(df[:5], tablefmt="rst", headers=df.columns))
# Label counts broken out by worker and label in the POS batch
count = pd.DataFrame(pos.groupby(['WorkerId', 'Answer.sentiment.label'])['HITId'].count())
df = count.copy()
print(tabulate(df[:10], tablefmt="rst", headers=df.columns))
# One column per label: how many Neutral / Positive / Negative answers each worker gave, plus a total
pnn = pd.DataFrame()
pnn['Neutral'] = pos.groupby('WorkerId')['Answer.sentiment.label'].apply(lambda x: (x=='Neutral').sum())
pnn['Positive'] = pos.groupby('WorkerId')['Answer.sentiment.label'].apply(lambda x: (x=='Positive').sum())
pnn['Negative'] = pos.groupby('WorkerId')['Answer.sentiment.label'].apply(lambda x: (x=='Negative').sum())
pnn['Total'] = pos.groupby('WorkerId')['Answer.sentiment.label'].apply(lambda x: x.count())
df = pnn.copy()
print(tabulate(df[:10], tablefmt="rst", headers=df.columns))
top = pnn.sort_values(by=['Total'], ascending=False)
df = top.copy()
print(tabulate(df[:10], tablefmt="rst", headers=df.columns))
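(In hindsight, a possibly tidier way to build the same per-worker label counts is pd.crosstab; a sketch, assuming the same column names as above:)
# One call: rows are workers, columns are the three labels, cells are counts
pnn_ct = pd.crosstab(pos['WorkerId'], pos['Answer.sentiment.label'])
pnn_ct['Total'] = pnn_ct.sum(axis=1)
print(tabulate(pnn_ct.sort_values(by='Total', ascending=False)[:10], tablefmt="rst", headers=pnn_ct.columns))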
Interesting!! Looking at this, we have three workers who ONLY chose positive.
Let's look at their response times to see if we can determine whether they are bots!!
# Per-worker response-time stats: mean in seconds and minutes, plus min and max in minutes
top['Avg_WorkTimeInSeconds'] = pos.groupby('WorkerId')['WorkTimeInSeconds'].apply(lambda x: x.mean())
top['Avg_WorkTimeInMin'] = pos.groupby('WorkerId')['WorkTimeInSeconds'].apply(lambda x: x.mean()/60)
top['Min_WorkTimeInMin'] = pos.groupby('WorkerId')['WorkTimeInSeconds'].apply(lambda x: x.min()/60)
top['Max_WorkTimeInMin'] = pos.groupby('WorkerId')['WorkTimeInSeconds'].apply(lambda x: x.max()/60)
df = top.copy()
df.reset_index(inplace=True)
df = df[['WorkerId', 'Neutral', 'Positive','Negative','Avg_WorkTimeInMin']]
print(tabulate(df[:10], tablefmt="rst", headers=df.columns))
Even more interesting! These two don't appear to be bots, based on our current metric, which is time variability. HOWEVER, worker A681XM15AN28F appears to spend an average of only 13 seconds per review, which doesn't seem like enough time to read and judge a review...
TOO MANY REVIEWERS!
Here is when we realized that computing a kappa score with over 30 individual reviewers would be tricky, so we resubmitted to AMT and required the turkers to be 'Masters' in the hopes that this additional barrier to entry would reduce the number of turkers working on the project.
v2 = pd.read_csv('HW5_amt_v2.csv')
v2[:5]
len(v2)
This time, I didn't separate the df into pos and neg before submitting to AMT, so we have to reimport the labels.
labels = pd.read_csv('all_JK_extremes_labeled.csv')
len(labels)
Oops! That's right, we replicated each review * 3 so three separate people could look at each review
labels2 = labels.append([labels] * 2, ignore_index=True)
len(labels2)
turker = pd.read_csv('HW5_amt_294.csv')
df = turker.copy()
df.reset_index(inplace=True)
df = df[['WorkerId', 'Answer.sentiment.label']]
print(tabulate(df[:10], tablefmt="rst", headers=df.columns))
# Getting labels...
labels = pd.read_csv('all_JK_extremes_labeled.csv')
# X3
labels = labels.append([labels] * 2, ignore_index=True)
print(len(labels))
df = labels.copy()
df['short'] = df.apply(lambda x: x['0'].split(' ')[:5], axis=1)
df = df[['PoN', 'short']]
print(tabulate(df[:10], tablefmt="rst", headers=df.columns))
sorted_labels = labels.sort_values(by=['0'])
sorted_turker = turker.sort_values(by=['Input.text'])
# sorted_turker['Input.text'][:5]
OMG HOORAY HOORAY HOORAY!!
NOTE: FUN FACT!! I can type here and then hit the esc
key to turn this cell into markdown!!
# YUCK THIS IS SO AGGRAVATING!! The line below doesn't work because it still uses indexes,
# so the P and N didn't match up
# sorted_turker['PoN'] = sorted_labels['PoN']
sorted_turker['PoN'] = sorted_labels['PoN'].tolist()
df = sorted_turker[sorted_turker.columns[-5:]][:10]
df['short'] = df.apply(lambda x: x['Input.text'].split(' ')[1:3], axis=1)
df = df[['short', 'Answer.sentiment.label', 'PoN']]
print(tabulate(df[:10], tablefmt="rst", headers=df.columns))
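(A sketch of an alternative that would sidestep the index headache entirely: merge the labels onto the turker answers by the review text itself, so row order never matters. This assumes the label file's text column is named '0', as above, and that the review text matches exactly between the two files.)
# De-duplicate the tripled labels, then join on the raw review text
label_map = labels[['0', 'PoN']].drop_duplicates()
merged = turker.merge(label_map, left_on='Input.text', right_on='0', how='left')
df = merged[['WorkerId', 'Answer.sentiment.label', 'PoN']]
print(tabulate(df[:10], tablefmt="rst", headers=df.columns))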
First, let's clean ALL the things
all_df = sorted_turker[['Input.text', 'WorkerId', 'Answer.sentiment.label', 'PoN']]
df = all_df.copy()
df = df[['WorkerId', 'Answer.sentiment.label', 'PoN']]
print(tabulate(df[:10], tablefmt="rst", headers=df.columns))
all_df_all = all_df.copy()
# APoN = first letter of the turker's answer (note: 'Negative' and 'Neutral' both map to 'N')
all_df_all['APoN'] = all_df_all.apply(lambda x: x['Answer.sentiment.label'][0], axis=1)
# agree = True when the turker's letter matches our original PoN label
all_df_all['agree'] = all_df_all.apply(lambda x: x['PoN'] == x['APoN'], axis=1)
df = all_df_all[-10:].copy()
df = df[['WorkerId', 'PoN', 'APoN', 'agree']]
print(tabulate(df[:10], tablefmt="rst", headers=df.columns))
agree_df = pd.DataFrame(all_df_all.groupby(['Input.text','PoN'])['agree'].mean())
agree_df = agree_df.reset_index()
df = agree_df.copy()
df = df[['PoN', 'agree']]
print(tabulate(df[:10], tablefmt="rst", headers=df.columns))
OK so this actually gave us something we want... BUT PLEASE TELL ME THE BETTER WAY!!
def return_agreement(num):
    # num is the share of turkers whose label matched ours: 1 = all matched,
    # 0 = none matched, anything in between = the turkers were split
    if num == 0:
        return 'agree_wrong'
    if num == 1:
        return 'agree'
    return 'disparity'
agree_df['agree_factor'] = agree_df.apply(lambda x: return_agreement(x['agree']), axis=1)
agree_df
df = agree_df.copy()
df = df[['PoN', 'agree', 'agree_factor']]
print(tabulate(df[:10], tablefmt="rst", headers=df.columns))
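(One possibly tidier alternative to the apply + return_agreement combination above, sketched with np.select so the three cases sit in a single vectorised expression:)
conditions = [agree_df['agree'] == 1, agree_df['agree'] == 0]
choices = ['agree', 'agree_wrong']
agree_df['agree_factor_v2'] = np.select(conditions, choices, default='disparity')
print(agree_df['agree_factor_v2'].value_counts())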
df1 = agree_df.groupby(['agree_factor']).count()
df1.reset_index(inplace=True)
df = df1.copy()
df = df[['agree_factor','Input.text','PoN', 'agree']]
print(tabulate(df[:10], tablefmt="rst", headers=df.columns))
sns.barplot(x=['Agreed', 'Disagreed'],
y= [64,34],
data = df1);
plt.title('How many turkers agreed on sentiment?')
sns.barplot(x="agree_factor", y="agree", data=df1);
plt.title('How many turkers agreed on sentiment, but were wrong?')
df2 = agree_df.groupby(['agree_factor', 'PoN']).count()
df2.reset_index(inplace=True)
sns.barplot(x="agree_factor",
y="agree",
hue="PoN",
data=df2);
plt.title("What was the pos/neg split for the turkers?")
# Example code
from sklearn.metrics import cohen_kappa_score
y1 = [0,1,2,3,4,0,1,2,3,4,0,1,2,3,4]
y2 = [0,1,2,2,4,1,2,3,0,0,0,2,2,4,4]
cohen_kappa_score(y1,y2)
FIRST PASS: Oh boy! This will be super fun. First, I'm going to brainstorm "out loud" about how I'm going to do this, since AMT doesn't require that the same N turkers complete the task, which makes inter-rater reliability extremely hard to track when one turker has done 46/98 reviews and another has done 2/98.
Let's look at our top turkers
turker_clean = turker[['HITId', 'WorkerId', 'Answer.sentiment.label', 'Input.text']]
turker_clean
df = turker_clean.copy()
df = df[['HITId','WorkerId', 'Answer.sentiment.label']]
print(tabulate(df[:10], tablefmt="rst", headers=df.columns))
And let's see how many turkers turked
turker_counts = pd.DataFrame(turker_clean.WorkerId.value_counts())
df = turker_counts.copy()
print(tabulate(df[:10], tablefmt="rst", headers=df.columns))
OK let's make this easy on ourselves and just use the top 5 turkers for our first test
turker1 = turker_clean[turker_clean['WorkerId'] == 'ARLGZWN6W91WD']
turker2 = turker_clean[turker_clean['WorkerId'] == 'A681XM15AN28F']
turker3 = turker_clean[turker_clean['WorkerId'] == 'A1T79J0XQXDDGC']
turker4 = turker_clean[turker_clean['WorkerId'] == 'A2XFO0X6RCS98M']
turker5 = turker_clean[turker_clean['WorkerId'] == 'A3EZ0H07TSDAPW']
turker1.reset_index(drop=True, inplace=True)
turker2.reset_index(drop=True, inplace=True)
turker3.reset_index(drop=True, inplace=True)
turker4.reset_index(drop=True, inplace=True)
turker5.reset_index(drop=True, inplace=True)
merged_df = pd.concat([turker1, turker2, turker3, turker4, turker5], axis=0, sort=False)
merged_df.reset_index(drop=True, inplace=True)
df = merged_df.sort_values(by='WorkerId')
df = df[['WorkerId', 'Answer.sentiment.label']]
print(tabulate(df[:20], tablefmt="rst", headers=df.columns))
merged_df2 = pd.concat([turker1, turker2], axis=0, sort=False)
# Reshape so each row is a turker and each column one of their sentiment answers
# (agg('sum') on the HITId strings simply concatenates them into one long string)
df = pd.DataFrame({'Turker': merged_df['WorkerId'].tolist(),
                   'SENTIMENT': merged_df['Answer.sentiment.label'].tolist(),
                   'REVIEW': merged_df['HITId'].tolist() })
grouped = df.groupby('Turker')
values = grouped['REVIEW'].agg('sum')
id_df = grouped['SENTIMENT'].apply(lambda x: pd.Series(x.values)).unstack()
id_df = id_df.rename(columns={i: 'SENTIMENT{}'.format(i + 1) for i in range(id_df.shape[1])})
result = pd.concat([id_df, values], axis=1)
result_df = pd.DataFrame(result)
df = result_df.T.copy()
df = df[df.columns[1:4]]
print(tabulate(df[:10], tablefmt="rst", headers=df.columns))
t1 = result_df.T['A3EZ0H07TSDAPW'].tolist()
t2 = result_df.T['A2XFO0X6RCS98M'].tolist()
t3 = result_df.T['A681XM15AN28F'].tolist()
t4 = result_df.T['ARLGZWN6W91WD'].tolist()
t1[:-1][:5]
t2[:-1][:5]
t3[:5]
OK after all that work, we can finally calculate the kappa score between our first and second "most prolific" turkers
from sklearn.metrics import cohen_kappa_score
y1 = t1[:-1]
y2 = t2[:-1]
cohen_kappa_score(y1,y2)
annnnnd just to make sure, let's calculate the same score between third and fourth "most prolific" turkers
y3 = t3[:-1]
y4 = t4[:-1]
cohen_kappa_score(y3,y4)
Pretty sure a negative number isn't what we want... oh well. Can't worry about that, because that's when the existential dread sinks in... like, why am I doing this right now? Why do I care so much? Why am I trying to calculate inter-rater reliability THIS way when this won't even be a measure I will use if/when I use turkers in the future? In the future, I will use the sample size itself to determine "reliability" -- e.g., if all N turkers agree on X, then it goes into the "good" pile; if not, then it goes back into the AMT pile until we have N turkers agreeing... Because of the way AMT is set up right now, we won't be able to reliably calculate kappa when the number of HITs per turker is so varied. In order to get something truly accurate, I'd have to remove all the data that was completed by M or fewer turkers, hope that the prolific turkers worked on the same ones, and then compare those (which is exactly what I did below, but seriously WHY WHY WHY.)
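(For what it's worth, a statistic that seems better suited to this setup is Fleiss' kappa, which treats each review as a subject rated by any three raters rather than requiring the same raters throughout. A sketch only, not something I ran above; it assumes statsmodels is installed and that every review in the turker frame received exactly three ratings, as set up earlier.)
from statsmodels.stats.inter_rater import fleiss_kappa
# Rows = reviews, columns = label categories, cells = how many of the 3 raters chose that label
counts = pd.crosstab(turker['HITId'], turker['Answer.sentiment.label'])
print(fleiss_kappa(counts.values))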
# NOTE: turker_clean_test is built in a cell not shown here; it's assumed to be a copy of
# turker_clean with a numeric ReviewID column added (one integer per unique review).
# factorize gives every worker a compact numeric ID, which we prefix with 'T_'.
new_turker_ids = pd.factorize(turker_clean_test['WorkerId'].tolist())
t_ids = ['T_' + str(id) for id in new_turker_ids[0]]
t_ids[:5]
turker_clean_test['T_ID'] = t_ids
# turker_clean_test[:5]
# Keep just the first letter of each answer (careful: 'Negative' and 'Neutral' both become 'N')
turker_clean_test['sentiment'] = turker_clean_test.apply(lambda x: x['Answer.sentiment.label'][0], axis=1)
# turker_clean_test[:5]
Annnnd here we are... small and clean. This DID actually help my brain a bit... Noted for next time.
even_cleaner_df = turker_clean_test[['ReviewID', 'T_ID', 'sentiment']]
df = even_cleaner_df[:5]
print(tabulate(df[:10], tablefmt="rst", headers=df.columns))
# Same reshape as before, now keyed on the short T_ IDs and the numeric ReviewIDs
df = pd.DataFrame({'Turker': even_cleaner_df['T_ID'].tolist(),
                   'SENTIMENT': even_cleaner_df['sentiment'].tolist(),
                   'REVIEW': even_cleaner_df['ReviewID'].tolist() })
grouped = df.groupby('Turker')
values = grouped['REVIEW'].agg('sum')
id_df = grouped['SENTIMENT'].apply(lambda x: pd.Series(x.values)).unstack()
id_df = id_df.rename(columns={i: 'REVIEW{}'.format(i + 1) for i in range(id_df.shape[1])})
result = pd.concat([id_df, values], axis=1)
result_df = pd.DataFrame(result)
df = result_df.T[:5]
df = df[df.columns[1:8]]
print(tabulate(df[:10], tablefmt="rst", headers=df.columns))
And turn it into a data frame cuz why not?!
# Note the swap: this time SENTIMENT holds the ReviewIDs and REVIEW holds the sentiments,
# so the unstacked columns show which reviews each turker completed
df = pd.DataFrame({'Turker': even_cleaner_df['T_ID'].tolist(),
                   'SENTIMENT': even_cleaner_df['ReviewID'].tolist(),
                   'REVIEW': even_cleaner_df['sentiment'].tolist() })
grouped = df.groupby('Turker')
values = grouped['REVIEW'].agg('sum')
id_df = grouped['SENTIMENT'].apply(lambda x: pd.Series(x.values)).unstack()
id_df = id_df.rename(columns={i: 'REVIEW{}'.format(i + 1) for i in range(id_df.shape[1])})
result = pd.concat([id_df, values], axis=1)
result_df = pd.DataFrame(result)
# print(result_df.T[:5])
df = pd.DataFrame(result_df.T)
# df[:5]
I want every review on the left side and I want all 46 turkers on the top
df = pd.DataFrame({ 'review': even_cleaner_df['ReviewID']})
def get_array_of_reviews(turker, df):
    # Build a 98-slot array (one slot per review) holding this turker's sentiment letter
    # where they rated the review and the placeholder string 'nan' where they didn't
    a = ['nan'] * 98
    t_df = df[df['T_ID'] == turker]
    t_reviews = t_df['ReviewID'].tolist()
    t_sentiment = t_df['sentiment'].tolist()
    for index, review in enumerate(t_reviews):
        a[review] = t_sentiment[index]
    return a
# Attach each row's full 98-slot sentiment array for that turker
sparse_df = even_cleaner_df.copy()
sparse_df['big_array'] = sparse_df.apply(lambda x: get_array_of_reviews(x['T_ID'], even_cleaner_df), axis=1)
t0 = even_cleaner_df[even_cleaner_df['T_ID'] == 'T_0']
df = t0
print(tabulate(df[:10], tablefmt="rst", headers=df.columns))
sparse_df['big_array_sm'] = sparse_df.apply(lambda x: x['big_array'][:5], axis=1)
df = sparse_df[['ReviewID', 'T_ID','sentiment', 'big_array_sm']]
print(tabulate(df[:10], tablefmt="rst", headers=df.columns))
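(Side note: the whole review-by-turker grid can apparently be built in one step with pivot, since each turker rates a given review at most once. A sketch with the same columns:)
# ReviewID down the side, one column per turker, sentiment letters in the cells (NaN where unrated)
wide = even_cleaner_df.pivot(index='ReviewID', columns='T_ID', values='sentiment')
wide.iloc[:5, :5]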
t0 = sparse_df[sparse_df['T_ID'] == 'T_0']
sparse_df['big_array'][sparse_df['T_ID'] == 'T_2'].tolist()[0][:5]
y1 = sparse_df['big_array'][sparse_df['T_ID'] == 'T_0'].tolist()[0]
y2 = sparse_df['big_array'][sparse_df['T_ID'] == 'T_1'].tolist()[0]
cohen_kappa_score(y1,y2)
def calculate_kappa(num):
    # Cohen's kappa between adjacent turker IDs (T_num vs T_num+1) over their sparse arrays
    y1 = sparse_df['big_array'][sparse_df['T_ID'] == 'T_' + str(num)].tolist()[0]
    y2 = sparse_df['big_array'][sparse_df['T_ID'] == 'T_' + str(num + 1)].tolist()[0]
    return cohen_kappa_score(y1, y2)
kappas = [calculate_kappa(num) for num in range(16)]
kappas
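(A sketch of the "only compare the overlap" idea from the rant above: for each pair of turkers, keep just the reviews both of them actually rated, and skip pairs with too little overlap to mean anything. The threshold of 10 shared reviews is arbitrary.)
from itertools import combinations
wide = even_cleaner_df.pivot(index='ReviewID', columns='T_ID', values='sentiment')
pair_kappas = {}
for a, b in combinations(wide.columns, 2):
    both = wide[[a, b]].dropna()           # reviews rated by both turkers
    if len(both) >= 10:
        pair_kappas[(a, b)] = cohen_kappa_score(both[a], both[b])
pair_kappas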
TL;DR: Calculating kappa and inter-rater reliability when there are multiple reviewers is challenging and deserves more deliberate time and study.
While computers have advanced in leaps and bounds over the past several decades, it’s clear that there are tasks that humans still perform better than machines. We know, for instance, that horseradish doesn’t belong in brownie recipes. We can tell if a tweet is sarcastic, or identify whether a photo depicts a chihuahua or a muffin. Some might say that machines can’t perform these tasks reliably because they aren’t “smart enough” yet. If intelligence is defined as the sum total of everything we’ve ever learned, then this assessment is accurate.
However, this does not mean that machines will never be able to perform tasks like these. In reality, computers simply haven't been given enough data to determine that the blueberries in that muffin are not, in fact, chihuahua eyeballs. Just as a small child labels every four-legged creature a “doggie” until she has lived long enough to collect more data (“This four-legged creature is always bigger than a dog and makes a totally different noise! I’ve also noticed that the grownups refer to it as a ‘horse’”), the computer is simply at a data disadvantage.
The solution, then, is to expose the computer to more data, just like the child. This is exactly what Amazon Mechanical Turk is doing. Thanks to the “artificial” artificial intelligence of turkers, computers can process massive amounts of “gut feeling” data that will eventually enable them to distinguish between a chihuahua and a muffin as well as (or better than) humans.