Sentiment Analysis

TextBlob + VADER + NLTK Naive Bayes

via this tutorial | 10-6-19

In [1]:
from textblob import TextBlob
from IPython.display import display, HTML
import os
import pandas as pd
import numpy as np
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()
In [2]:
def get_data_from_files(path):
    # Read every file in the directory into a list of strings
    results = []
    for file in os.listdir(path):
        with open(os.path.join(path, file)) as f:
            results.append(f.read())
    return results

# HW 1
neg_k = get_data_from_files('AI_NEG/')
pos_k = get_data_from_files('AI_POS/')
neg_a = get_data_from_files('NEG/')
pos_a = get_data_from_files('POS/')

# HW2
neg_cornell = get_data_from_files('neg_cornell/')
pos_cornell = get_data_from_files('pos_cornell/')

# HW3
neg_dirty = get_data_from_files('NEG_dirty/')
pos_dirty = get_data_from_files('POS_dirty/')
neg_joker = get_data_from_files('NEG_JK/')
pos_joker = get_data_from_files('POS_JK/')

# HW4 [Sentiment]
neg_hw4 = get_data_from_files('neg_hw4/')
pos_hw4 = get_data_from_files('pos_hw4/')

# HW4 [Deception]
false_lie_hw4 = get_data_from_files('hw4_lie_false/')
true_lie_hw4 = get_data_from_files('hw4_lie_true/')
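A slightly more defensive reader (a sketch, not what this notebook ran; the name `get_data_from_files_v2` is hypothetical) sorts the filenames so document order is deterministic, skips subdirectories, and tolerates decoding errors:

```python
from pathlib import Path

def get_data_from_files_v2(path):
    # Sorted for deterministic order; directories skipped; bad bytes ignored
    return [p.read_text(errors='ignore')
            for p in sorted(Path(path).iterdir()) if p.is_file()]
```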

TextBlob

In [3]:
def get_pn(num):
    # Polarity < 0 -> 'neg'; note that exactly 0 counts as 'pos'
    return 'neg' if num < 0 else 'pos'

def get_sentiment(array, label):
    blobs = [(TextBlob(text), text) for text in array]
    return [{'label': label,
             'prediction': get_pn(blob.sentiment.polarity),
             'sentiment': blob.sentiment.polarity,
             'length': len(text),
             'excerpt': text[:50]} for blob, text in blobs]

CASE STUDY 1: Kendra's Data

In [46]:
df_n = pd.DataFrame(get_sentiment(neg_k, 'neg'))
df_p = pd.DataFrame(get_sentiment(pos_k, 'pos'))

df_n['accurate'] = np.where(df_n['label'] == df_n['prediction'], 'yes', 'no')
df_p['accurate'] = np.where(df_p['label'] == df_p['prediction'], 'yes', 'no')

display(df_n)
display(df_p)

sum_correct_n = (df_n['accurate']=='yes').sum()
sum_correct_p = (df_p['accurate']=='yes').sum()

print('CORRECT PREDICT FALSE:', sum_correct_n, 'out of', len(df_n), sum_correct_n/len(df_n))
print('CORRECT PREDICT TRUE:', sum_correct_p, 'out of', len(df_p), sum_correct_p/len(df_p))
label prediction sentiment length excerpt accurate
0 neg neg -0.157143 76 WHERE ARE THE JOBS?! OH THAT'S RIGHT. ARTIFICI... yes
1 neg neg -0.750000 96 How can we trust Artificial Intelligence to dr... yes
2 neg neg -0.775000 31 I hate artificial intelligence! yes
3 neg neg -0.750000 47 My dog is terrified by artificial intelligence! yes
4 neg neg -0.750000 68 Artificial intelligence is going to melt the b... yes
label prediction sentiment length excerpt accurate
0 pos neg -0.112500 65 My dog is excited by the advancements in artif... no
1 pos neg -0.075000 133 I'm excited for my child to grow up and have t... no
2 pos neg -0.125000 31 I love artificial intelligence! no
3 pos neg -0.300000 121 Order my groceries, pay my taxes, take my kids... no
4 pos neg -0.133333 116 I'm grateful every day that my child will like... no
CORRECT PREDICT FALSE: 5 out of 5 1.0
CORRECT PREDICT TRUE: 0 out of 5 0.0
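The accuracy bookkeeping above is repeated verbatim in every case study that follows; a small helper (hypothetical, not part of the original notebook) condenses it while keeping the same 'X out of N' print format:

```python
def report_accuracy(tag, labels, predictions):
    # Count matching label/prediction pairs and print the running tally
    correct = sum(l == p for l, p in zip(labels, predictions))
    print('CORRECT PREDICT', tag + ':', correct, 'out of', len(labels),
          correct / len(labels))
    return correct / len(labels)
```

Called as, e.g., `report_accuracy('FALSE', df_n['label'], df_n['prediction'])`, it reproduces the two print lines above.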

CASE STUDY 2: Ami's Data

In [45]:
df_n = pd.DataFrame(get_sentiment(neg_a, 'neg'))
df_p = pd.DataFrame(get_sentiment(pos_a, 'pos'))

df_n['accurate'] = np.where(df_n['label'] == df_n['prediction'], 'yes', 'no')
df_p['accurate'] = np.where(df_p['label'] == df_p['prediction'], 'yes', 'no')

display(df_n)
display(df_p)

sum_correct_n = (df_n['accurate']=='yes').sum()
sum_correct_p = (df_p['accurate']=='yes').sum()

print('CORRECT PREDICT FALSE:', sum_correct_n, 'out of', len(df_n), sum_correct_n/len(df_n))
print('CORRECT PREDICT TRUE:', sum_correct_p, 'out of', len(df_p), sum_correct_p/len(df_p))
label prediction sentiment length excerpt accurate
0 neg neg -0.054577 3554 that's exactly how long the movie felt to me .... yes
1 neg pos 0.025467 2929 " quest for camelot " is warner bros . ' firs... no
2 neg pos 0.003334 3365 so ask yourself what " 8mm " ( " eight millime... no
3 neg pos 0.022925 4418 synopsis : a mentally unstable man undergoing ... no
4 neg pos 0.043234 3911 capsule : in 2176 on the planet mars police ta... no
label prediction sentiment length excerpt accurate
0 pos pos 0.023663 4227 films adapted from comic books have had plenty... yes
1 pos pos 0.131092 2421 you've got mail works alot better than it dese... yes
2 pos pos 0.110626 6092 " jaws " is a rare film that grabs your atten... yes
3 pos pos 0.103847 4096 every now and then a movie comes along from a ... yes
4 pos neg -0.070151 3898 moviemaking is a lot like being the general ma... no
CORRECT PREDICT FALSE: 1 out of 5 0.2
CORRECT PREDICT TRUE: 4 out of 5 0.8

CASE STUDY 3: Cornell Data

In [44]:
df_n = pd.DataFrame(get_sentiment(neg_cornell, 'neg'))
df_p = pd.DataFrame(get_sentiment(pos_cornell, 'pos'))

df_n['accurate'] = np.where(df_n['label'] == df_n['prediction'], 'yes', 'no')
df_p['accurate'] = np.where(df_p['label'] == df_p['prediction'], 'yes', 'no')

display(df_n[:5])
display(df_p[:5])

sum_correct_n = (df_n['accurate']=='yes').sum()
sum_correct_p = (df_p['accurate']=='yes').sum()

print('CORRECT PREDICT FALSE:', sum_correct_n, 'out of', len(df_n), sum_correct_n/len(df_n))
print('CORRECT PREDICT TRUE:', sum_correct_p, 'out of', len(df_p), sum_correct_p/len(df_p))
label prediction sentiment length excerpt accurate
0 neg pos 0.026240 5953 bad . bad . \nbad . \nthat one word seems to p... no
1 neg pos 0.076040 3396 isn't it the ultimate sign of a movie's cinema... no
2 neg neg -0.128733 2762 " gordy " is not a movie , it is a 90-minute-... yes
3 neg neg -0.000485 3840 disconnect the phone line . \ndon't accept the... yes
4 neg pos 0.122770 2270 when robert forster found himself famous again... no
label prediction sentiment length excerpt accurate
0 pos pos 0.221173 4662 assume nothing . \nthe phrase is perhaps one o... yes
1 pos pos 0.089736 3839 plot : derek zoolander is a male model . \nhe ... yes
2 pos pos 0.206743 9380 i actually am a fan of the original 1961 or so... yes
3 pos pos 0.141905 2407 a movie that's been as highly built up as the ... yes
4 pos pos 0.176332 1840 " good will hunting " is two movies in one : ... yes
CORRECT PREDICT FALSE: 229 out of 1000 0.229
CORRECT PREDICT TRUE: 971 out of 1000 0.971

CASE STUDY 4: Dirty Data

In [43]:
df_n = pd.DataFrame(get_sentiment(neg_dirty, 'neg'))
df_p = pd.DataFrame(get_sentiment(pos_dirty, 'pos'))

df_n['accurate'] = np.where(df_n['label'] == df_n['prediction'], 'yes', 'no')
df_p['accurate'] = np.where(df_p['label'] == df_p['prediction'], 'yes', 'no')

display(df_n[:5])
display(df_p[:5])

sum_correct_n = (df_n['accurate']=='yes').sum()
sum_correct_p = (df_p['accurate']=='yes').sum()

print('CORRECT PREDICT FALSE:', sum_correct_n, 'out of', len(df_n), sum_correct_n/len(df_n))
print('CORRECT PREDICT TRUE:', sum_correct_p, 'out of', len(df_p), sum_correct_p/len(df_p))
label prediction sentiment length excerpt accurate
0 neg neg -0.004665 3777 by starring in amy heckerlings clueless two ... yes
1 neg pos 0.119184 3639 i have little against remakes and updates of o... no
2 neg pos 0.100886 4247 i cant recall a previous film experience where... no
3 neg pos 0.097526 4308 the tagline for this film is : some houses ar... no
4 neg pos 0.048745 5175 warner brothers ; rated pg-13 ( mild violence ... no
label prediction sentiment length excerpt accurate
0 pos pos 0.134641 4584 for the first reel of girls town , you just ca... yes
1 pos pos 0.137134 3102 field of dreams almost defies description . al... yes
2 pos pos 0.181355 3521 meet joe black is your classic boy-meets-girl ... yes
3 pos pos 0.104101 2192 an indian runner was more than a courier . he ... yes
4 pos pos 0.204967 4955 every once in a while , when an exceptional fa... yes
CORRECT PREDICT FALSE: 227 out of 1000 0.227
CORRECT PREDICT TRUE: 972 out of 1000 0.972

CASE STUDY 5: Joker Review Data

In [42]:
df_n = pd.DataFrame(get_sentiment(neg_joker, 'neg'))
df_p = pd.DataFrame(get_sentiment(pos_joker, 'pos'))

df_n['accurate'] = np.where(df_n['label'] == df_n['prediction'], 'yes', 'no')
df_p['accurate'] = np.where(df_p['label'] == df_p['prediction'], 'yes', 'no')

display(df_n[:5])
display(df_p[:5])

sum_correct_n = (df_n['accurate']=='yes').sum()
sum_correct_p = (df_p['accurate']=='yes').sum()

print('CORRECT PREDICT FALSE:', sum_correct_n, 'out of', len(df_n), sum_correct_n/len(df_n))
print('CORRECT PREDICT TRUE:', sum_correct_p, 'out of', len(df_p), sum_correct_p/len(df_p))
label prediction sentiment length excerpt accurate
0 neg pos 0.152083 1734 Missed Opportunity\nI had been very excited t... no
1 neg neg -0.001852 328 5/5 for Phoenix's acting..\nI don't think the... yes
2 neg pos 0.200000 145 Everyone praised an overrated movie.\nOverrat... no
3 neg neg -0.038095 350 What idiotic FIlm\nI can say that Phoenix is ... yes
4 neg pos 0.126398 711 Terrible\nThe only thing good about this movi... no
label prediction sentiment length excerpt accurate
0 pos pos 0.107162 5554 funny like a clown\nGreetings again from the ... yes
1 pos pos 0.014881 473 Only certain people can relate\nThis is a mov... yes
2 pos pos 0.008294 2509 "That's Life."\nIn an era of cinema so satura... yes
3 pos pos 0.036939 4022 Best DC movie since The Dark Knight Rises\nDC... yes
4 pos neg -0.017162 1430 unbelievable, unrelatable, a bit boring to be... no
CORRECT PREDICT FALSE: 64 out of 123 0.5203252032520326
CORRECT PREDICT TRUE: 114 out of 123 0.926829268292683

CASE STUDY 6: HW4 [Sentiment]

In [41]:
df_n = pd.DataFrame(get_sentiment(neg_hw4, 'neg'))
df_p = pd.DataFrame(get_sentiment(pos_hw4, 'pos'))

df_n['accurate'] = np.where(df_n['label'] == df_n['prediction'], 'yes', 'no')
df_p['accurate'] = np.where(df_p['label'] == df_p['prediction'], 'yes', 'no')

display(df_n[:5])
display(df_p[:5])

sum_correct_n = (df_n['accurate']=='yes').sum()
sum_correct_p = (df_p['accurate']=='yes').sum()

print('CORRECT PREDICT FALSE:', sum_correct_n, 'out of', len(df_n), sum_correct_n/len(df_n))
print('CORRECT PREDICT TRUE:', sum_correct_p, 'out of', len(df_p), sum_correct_p/len(df_p))
label prediction sentiment length excerpt accurate
0 neg neg -0.273958 251 I went to XYZ restaurant last week and I was v... yes
1 neg pos 0.083333 359 In each of the diner dish there are at least o... no
2 neg neg -0.134722 748 This is the last place you would want to dine ... yes
3 neg neg -0.166667 378 I went to this restaurant where I had ordered ... yes
4 neg pos 0.152455 381 I went there with two friends at 6pm. Long que... no
label prediction sentiment length excerpt accurate
0 pos pos 0.626786 132 This restaurant ROCKS! I mean the food is grea... yes
1 pos pos 0.500000 441 Stronghearts cafe is the BEST! The owners have... yes
2 pos pos 0.480208 485 I went to cruise dinner in NYC with Spirit Cru... yes
3 pos pos 0.240278 404 Halos is home. I have been here numerous times... yes
4 pos pos 0.552083 324 The best restaurant I have ever been was a sma... yes
CORRECT PREDICT FALSE: 26 out of 46 0.5652173913043478
CORRECT PREDICT TRUE: 46 out of 46 1.0

CASE STUDY 7: HW4 [Deception]

In [40]:
df_n = pd.DataFrame(get_sentiment(false_lie_hw4, 'neg'))
df_p = pd.DataFrame(get_sentiment(true_lie_hw4, 'pos'))

df_n['accurate'] = np.where(df_n['label'] == df_n['prediction'], 'yes', 'no')
df_p['accurate'] = np.where(df_p['label'] == df_p['prediction'], 'yes', 'no')

display(df_n[:5])
display(df_p[:5])

sum_correct_n = (df_n['accurate']=='yes').sum()
sum_correct_p = (df_p['accurate']=='yes').sum()

print('CORRECT PREDICT FALSE:', sum_correct_n, 'out of', len(df_n), sum_correct_n/len(df_n))
print('CORRECT PREDICT TRUE:', sum_correct_p, 'out of', len(df_p), sum_correct_p/len(df_p))
label prediction sentiment length excerpt accurate
0 neg pos 0.442752 386 Gannon’s Isle Ice Cream served the best ice cr... no
1 neg pos 0.197500 340 Hibachi the grill is one of my favorite restau... no
2 neg pos 0.353912 790 RIM KAAP One of the best Thai restaurants in t... no
3 neg pos 0.578788 391 It is a France restaurant which has Michelin t... no
4 neg pos 0.331373 710 Its hard to pick a favorite dining experience ... no
label prediction sentiment length excerpt accurate
0 pos pos 0.000000 1 ? yes
1 pos pos 0.236833 289 Twin Trees Cicero NY HUGE salad bar and high q... yes
2 pos neg -0.249762 519 The worst restaurant that I have ever eaten in... no
3 pos pos 0.000000 1 ? yes
4 pos pos 0.019481 234 I have been to a Asian restaurant in New York ... yes
CORRECT PREDICT FALSE: 14 out of 46 0.30434782608695654
CORRECT PREDICT TRUE: 34 out of 46 0.7391304347826086

VADER

In [13]:
def get_pn(num):
    # Compound score < 0 -> 'neg'; 0 or above -> 'pos'
    return 'neg' if num < 0 else 'pos'

def get_vader_scores(array, label):
    vader_array = []
    for sentence in array:
        ss = sid.polarity_scores(sentence)
        vader_array.append({'label': label,
                            'prediction': get_pn(ss['compound']),
                            'compound': ss['compound'], 
                            'excerpt': sentence[:50]})
    return vader_array
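`get_pn` treats any compound score of 0 or above as positive. The VADER authors' conventional cutoffs instead reserve a neutral band around zero; a sketch of that three-way rule (hypothetical here, since this notebook only uses two classes):

```python
def classify_compound(compound, pos_cut=0.05, neg_cut=-0.05):
    # Conventional VADER thresholds: >= 0.05 positive,
    # <= -0.05 negative, anything in between neutral
    if compound >= pos_cut:
        return 'pos'
    if compound <= neg_cut:
        return 'neg'
    return 'neu'
```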
In [14]:
neg_k
Out[14]:
["WHERE ARE THE JOBS?! OH THAT'S RIGHT. ARTIFICIAL INTELLIGENCE TOOK OUR JOBS.",
 "How can we trust Artificial Intelligence to drive our cars when they can't even hack a captcha?!",
 'I hate artificial intelligence!',
 'My dog is terrified by artificial intelligence!',
 'Artificial intelligence is going to melt the brains of our children!']

CASE STUDY 1: Kendra's Data

In [38]:
df_n = pd.DataFrame(get_vader_scores(neg_k, 'neg'))
df_p = pd.DataFrame(get_vader_scores(pos_k, 'pos'))

df_n['accurate'] = np.where(df_n['label'] == df_n['prediction'], 'yes', 'no')
df_p['accurate'] = np.where(df_p['label'] == df_p['prediction'], 'yes', 'no')

display(df_n)
display(df_p)

sum_correct_n = (df_n['accurate']=='yes').sum()
sum_correct_p = (df_p['accurate']=='yes').sum()

print('CORRECT PREDICT FALSE:', sum_correct_n, 'out of', len(df_n), sum_correct_n/len(df_n))
print('CORRECT PREDICT TRUE:', sum_correct_p, 'out of', len(df_p), sum_correct_p/len(df_p))
label prediction compound excerpt accurate
0 neg pos 0.5255 WHERE ARE THE JOBS?! OH THAT'S RIGHT. ARTIFICI... no
1 neg pos 0.7712 How can we trust Artificial Intelligence to dr... no
2 neg neg -0.2244 I hate artificial intelligence! yes
3 neg neg -0.2942 My dog is terrified by artificial intelligence! yes
4 neg pos 0.5255 Artificial intelligence is going to melt the b... no
label prediction compound excerpt accurate
0 pos pos 0.6705 My dog is excited by the advancements in artif... yes
1 pos pos 0.8271 I'm excited for my child to grow up and have t... yes
2 pos pos 0.8221 I love artificial intelligence! yes
3 pos pos 0.8213 Order my groceries, pay my taxes, take my kids... yes
4 pos pos 0.8402 I'm grateful every day that my child will like... yes
CORRECT PREDICT FALSE: 2 out of 5 0.4
CORRECT PREDICT TRUE: 5 out of 5 1.0

CASE STUDY 2: Ami's Data

In [37]:
df_n = pd.DataFrame(get_vader_scores(neg_a, 'neg'))
df_p = pd.DataFrame(get_vader_scores(pos_a, 'pos'))

df_n['accurate'] = np.where(df_n['label'] == df_n['prediction'], 'yes', 'no')
df_p['accurate'] = np.where(df_p['label'] == df_p['prediction'], 'yes', 'no')

display(df_n)
display(df_p)

sum_correct_n = (df_n['accurate']=='yes').sum()
sum_correct_p = (df_p['accurate']=='yes').sum()

print('CORRECT PREDICT FALSE:', sum_correct_n, 'out of', len(df_n), sum_correct_n/len(df_n))
print('CORRECT PREDICT TRUE:', sum_correct_p, 'out of', len(df_p), sum_correct_p/len(df_p))
label prediction compound excerpt accurate
0 neg pos 0.7836 that's exactly how long the movie felt to me .... no
1 neg neg -0.8481 " quest for camelot " is warner bros . ' firs... yes
2 neg neg -0.9753 so ask yourself what " 8mm " ( " eight millime... yes
3 neg pos 0.6824 synopsis : a mentally unstable man undergoing ... no
4 neg neg -0.9879 capsule : in 2176 on the planet mars police ta... yes
label prediction compound excerpt accurate
0 pos neg -0.5887 films adapted from comic books have had plenty... no
1 pos pos 0.9964 you've got mail works alot better than it dese... yes
2 pos pos 0.9868 " jaws " is a rare film that grabs your atten... yes
3 pos pos 0.8825 every now and then a movie comes along from a ... yes
4 pos neg -0.3525 moviemaking is a lot like being the general ma... no
CORRECT PREDICT FALSE: 3 out of 5 0.6
CORRECT PREDICT TRUE: 3 out of 5 0.6

CASE STUDY 3: Cornell Data

In [47]:
df_n = pd.DataFrame(get_vader_scores(neg_cornell, 'neg'))
df_p = pd.DataFrame(get_vader_scores(pos_cornell, 'pos'))

df_n['accurate'] = np.where(df_n['label'] == df_n['prediction'], 'yes', 'no')
df_p['accurate'] = np.where(df_p['label'] == df_p['prediction'], 'yes', 'no')

display(df_n[:5])
display(df_p[:5])

sum_correct_n = (df_n['accurate']=='yes').sum()
sum_correct_p = (df_p['accurate']=='yes').sum()

print('CORRECT PREDICT FALSE:', sum_correct_n, 'out of', len(df_n), sum_correct_n/len(df_n))
print('CORRECT PREDICT TRUE:', sum_correct_p, 'out of', len(df_p), sum_correct_p/len(df_p))
label prediction compound excerpt accurate
0 neg pos 0.9695 bad . bad . \nbad . \nthat one word seems to p... no
1 neg pos 0.1722 isn't it the ultimate sign of a movie's cinema... no
2 neg neg -0.9970 " gordy " is not a movie , it is a 90-minute-... yes
3 neg pos 0.9861 disconnect the phone line . \ndon't accept the... no
4 neg pos 0.7445 when robert forster found himself famous again... no
label prediction compound excerpt accurate
0 pos pos 0.9985 assume nothing . \nthe phrase is perhaps one o... yes
1 pos pos 0.9853 plot : derek zoolander is a male model . \nhe ... yes
2 pos pos 0.9998 i actually am a fan of the original 1961 or so... yes
3 pos pos 0.9671 a movie that's been as highly built up as the ... yes
4 pos pos 0.9300 " good will hunting " is two movies in one : ... yes
CORRECT PREDICT FALSE: 445 out of 1000 0.445
CORRECT PREDICT TRUE: 828 out of 1000 0.828

CASE STUDY 4: Dirty Data

In [48]:
df_n = pd.DataFrame(get_vader_scores(neg_dirty, 'neg'))
df_p = pd.DataFrame(get_vader_scores(pos_dirty, 'pos'))

df_n['accurate'] = np.where(df_n['label'] == df_n['prediction'], 'yes', 'no')
df_p['accurate'] = np.where(df_p['label'] == df_p['prediction'], 'yes', 'no')

display(df_n[:5])
display(df_p[:5])

sum_correct_n = (df_n['accurate']=='yes').sum()
sum_correct_p = (df_p['accurate']=='yes').sum()

print('CORRECT PREDICT FALSE:', sum_correct_n, 'out of', len(df_n), sum_correct_n/len(df_n))
print('CORRECT PREDICT TRUE:', sum_correct_p, 'out of', len(df_p), sum_correct_p/len(df_p))
label prediction compound excerpt accurate
0 neg neg -0.9326 by starring in amy heckerlings clueless two ... yes
1 neg pos 0.8326 i have little against remakes and updates of o... no
2 neg pos 0.9491 i cant recall a previous film experience where... no
3 neg pos 0.9854 the tagline for this film is : some houses ar... no
4 neg neg -0.8077 warner brothers ; rated pg-13 ( mild violence ... yes
label prediction compound excerpt accurate
0 pos neg -0.9888 for the first reel of girls town , you just ca... no
1 pos pos 0.9885 field of dreams almost defies description . al... yes
2 pos pos 0.9806 meet joe black is your classic boy-meets-girl ... yes
3 pos neg -0.9614 an indian runner was more than a courier . he ... no
4 pos pos 0.9992 every once in a while , when an exceptional fa... yes
CORRECT PREDICT FALSE: 454 out of 1000 0.454
CORRECT PREDICT TRUE: 824 out of 1000 0.824

CASE STUDY 5: Joker Review Data

In [49]:
df_n = pd.DataFrame(get_vader_scores(neg_joker, 'neg'))
df_p = pd.DataFrame(get_vader_scores(pos_joker, 'pos'))

df_n['accurate'] = np.where(df_n['label'] == df_n['prediction'], 'yes', 'no')
df_p['accurate'] = np.where(df_p['label'] == df_p['prediction'], 'yes', 'no')

display(df_n[:5])
display(df_p[:5])

sum_correct_n = (df_n['accurate']=='yes').sum()
sum_correct_p = (df_p['accurate']=='yes').sum()

print('CORRECT PREDICT FALSE:', sum_correct_n, 'out of', len(df_n), sum_correct_n/len(df_n))
print('CORRECT PREDICT TRUE:', sum_correct_p, 'out of', len(df_p), sum_correct_p/len(df_p))
label prediction compound excerpt accurate
0 neg pos 0.7501 Missed Opportunity\nI had been very excited t... no
1 neg pos 0.7184 5/5 for Phoenix's acting..\nI don't think the... no
2 neg pos 0.7269 Everyone praised an overrated movie.\nOverrat... no
3 neg neg -0.6698 What idiotic FIlm\nI can say that Phoenix is ... yes
4 neg pos 0.7184 Terrible\nThe only thing good about this movi... no
label prediction compound excerpt accurate
0 pos pos 0.9976 funny like a clown\nGreetings again from the ... yes
1 pos pos 0.9231 Only certain people can relate\nThis is a mov... yes
2 pos pos 0.9796 "That's Life."\nIn an era of cinema so satura... yes
3 pos neg -0.9586 Best DC movie since The Dark Knight Rises\nDC... no
4 pos neg -0.8813 unbelievable, unrelatable, a bit boring to be... no
CORRECT PREDICT FALSE: 68 out of 123 0.5528455284552846
CORRECT PREDICT TRUE: 94 out of 123 0.7642276422764228

CASE STUDY 6: HW4 [Sentiment]

In [34]:
df_n = pd.DataFrame(get_vader_scores(neg_hw4, 'neg'))
df_p = pd.DataFrame(get_vader_scores(pos_hw4, 'pos'))

df_n['accurate'] = np.where(df_n['label'] == df_n['prediction'], 'yes', 'no')
df_p['accurate'] = np.where(df_p['label'] == df_p['prediction'], 'yes', 'no')

display(df_n[:5])
display(df_p[:5])

sum_correct_n = (df_n['accurate']=='yes').sum()
sum_correct_p = (df_p['accurate']=='yes').sum()

print('CORRECT PREDICT FALSE:', sum_correct_n, 'out of', len(df_n), sum_correct_n/len(df_n))
print('CORRECT PREDICT TRUE:', sum_correct_p, 'out of', len(df_p), sum_correct_p/len(df_p))
label prediction compound excerpt accurate
0 neg neg -0.6807 I went to XYZ restaurant last week and I was v... yes
1 neg neg -0.6329 In each of the diner dish there are at least o... yes
2 neg pos 0.5161 This is the last place you would want to dine ... no
3 neg neg -0.5423 I went to this restaurant where I had ordered ... yes
4 neg pos 0.8842 I went there with two friends at 6pm. Long que... no
label prediction compound excerpt accurate
0 pos pos 0.9840 This restaurant ROCKS! I mean the food is grea... yes
1 pos pos 0.9702 Stronghearts cafe is the BEST! The owners have... yes
2 pos pos 0.9106 I went to cruise dinner in NYC with Spirit Cru... yes
3 pos pos 0.9349 Halos is home. I have been here numerous times... yes
4 pos pos 0.9686 The best restaurant I have ever been was a sma... yes
CORRECT PREDICT FALSE: 26 out of 46 0.5652173913043478
CORRECT PREDICT TRUE: 45 out of 46 0.9782608695652174

CASE STUDY 7: HW4 [Deception]

In [33]:
df_n = pd.DataFrame(get_vader_scores(false_lie_hw4, 'neg'))
df_p = pd.DataFrame(get_vader_scores(true_lie_hw4, 'pos'))

df_n['accurate'] = np.where(df_n['label'] == df_n['prediction'], 'yes', 'no')
df_p['accurate'] = np.where(df_p['label'] == df_p['prediction'], 'yes', 'no')

display(df_n[:5])
display(df_p[:5])

sum_correct_n = (df_n['accurate']=='yes').sum()
sum_correct_p = (df_p['accurate']=='yes').sum()

print('CORRECT PREDICT FALSE:', sum_correct_n, 'out of', len(df_n), sum_correct_n/len(df_n))
print('CORRECT PREDICT TRUE:', sum_correct_p, 'out of', len(df_p), sum_correct_p/len(df_p))
label prediction compound excerpt accurate
0 neg pos 0.9328 Gannon’s Isle Ice Cream served the best ice cr... no
1 neg pos 0.8885 Hibachi the grill is one of my favorite restau... no
2 neg pos 0.9915 RIM KAAP One of the best Thai restaurants in t... no
3 neg pos 0.8625 It is a France restaurant which has Michelin t... no
4 neg pos 0.9360 Its hard to pick a favorite dining experience ... no
label prediction compound excerpt accurate
0 pos pos 0.0000 ? yes
1 pos pos 0.8321 Twin Trees Cicero NY HUGE salad bar and high q... yes
2 pos neg -0.8641 The worst restaurant that I have ever eaten in... no
3 pos pos 0.0000 ? yes
4 pos pos 0.5267 I have been to a Asian restaurant in New York ... yes
CORRECT PREDICT FALSE: 13 out of 46 0.2826086956521739
CORRECT PREDICT TRUE: 32 out of 46 0.6956521739130435

NLTK with NaiveBayes

In [26]:
from nltk.classify import NaiveBayesClassifier
from nltk.tokenize import word_tokenize
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.util import *

def get_tokens(sentence):
    # Tokenize, then keep only alphabetic tokens, lowercased
    tokens = word_tokenize(sentence)
    return [word.lower() for word in tokens if word.isalpha()]

def get_nltk_train_test(array, label, num_train):
    tokens = [get_tokens(sentence) for sentence in array]
    docs = [(sent, label) for sent in tokens]
    train_docs = docs[:num_train]
    test_docs = docs[num_train:]
    return [train_docs, test_docs]


def get_nltk_NB(NEG_DATA, POS_DATA, num_train):
    train_neg, test_neg = get_nltk_train_test(NEG_DATA, 'neg', num_train)
    train_pos, test_pos = get_nltk_train_test(POS_DATA, 'pos', num_train)

    training_docs = train_neg + train_pos
    testing_docs = test_neg + test_pos

    sentim_analyzer = SentimentAnalyzer()
    all_words_neg = sentim_analyzer.all_words([mark_negation(doc) for doc in training_docs])
    unigram_feats = sentim_analyzer.unigram_word_feats(all_words_neg)
    sentim_analyzer.add_feat_extractor(extract_unigram_feats, unigrams=unigram_feats)
    training_set = sentim_analyzer.apply_features(training_docs)
    test_set = sentim_analyzer.apply_features(testing_docs)

    trainer = NaiveBayesClassifier.train
    classifier = sentim_analyzer.train(trainer, training_set)
    
    for key, value in sorted(sentim_analyzer.evaluate(test_set).items()):
        print('{0}: {1}'.format(key, value))
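`get_nltk_train_test` takes the first `num_train` documents in directory order, so any ordering in the corpus leaks straight into the train/test split. A shuffled split (a sketch with a hypothetical fixed seed, not what the notebook ran) avoids that; plain `str.split` stands in for `get_tokens` to keep it self-contained:

```python
import random

def get_shuffled_train_test(array, label, num_train, seed=42):
    # Same interface as get_nltk_train_test, but shuffles before splitting
    docs = [(sentence.split(), label) for sentence in array]
    random.Random(seed).shuffle(docs)
    return docs[:num_train], docs[num_train:]
```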

CASE STUDY 1: Kendra's Data

In [50]:
get_nltk_NB(neg_k, pos_k, 4)
Training classifier
Evaluating NaiveBayesClassifier results...
Accuracy: 1.0
F-measure [neg]: 1.0
F-measure [pos]: 1.0
Precision [neg]: 1.0
Precision [pos]: 1.0
Recall [neg]: 1.0
Recall [pos]: 1.0

CASE STUDY 2: Ami's Data

In [51]:
get_nltk_NB(neg_a, pos_a, 4)
Training classifier
Evaluating NaiveBayesClassifier results...
Accuracy: 0.5
F-measure [neg]: 0.6666666666666666
F-measure [pos]: None
Precision [neg]: 0.5
Precision [pos]: None
Recall [neg]: 1.0
Recall [pos]: 0.0

CASE STUDY 3: Cornell Data

In [52]:
get_nltk_NB(neg_cornell, pos_cornell, 800)
Training classifier
Evaluating NaiveBayesClassifier results...
Accuracy: 0.8125
F-measure [neg]: 0.8259860788863109
F-measure [pos]: 0.7967479674796748
Precision [neg]: 0.7705627705627706
Precision [pos]: 0.8698224852071006
Recall [neg]: 0.89
Recall [pos]: 0.735

CASE STUDY 4: Dirty Data

In [22]:
get_nltk_NB(neg_dirty, pos_dirty, 800)
Training classifier
Evaluating NaiveBayesClassifier results...
Accuracy: 0.7775
F-measure [neg]: 0.7944572748267898
F-measure [pos]: 0.757493188010899
Precision [neg]: 0.7381974248927039
Precision [pos]: 0.8323353293413174
Recall [neg]: 0.86
Recall [pos]: 0.695

CASE STUDY 5: Joker Review Data

In [16]:
get_nltk_NB(neg_joker, pos_joker, 86)
Training classifier
Evaluating NaiveBayesClassifier results...
Accuracy: 0.581081081081081
F-measure [neg]: 0.6593406593406593
F-measure [pos]: 0.456140350877193
Precision [neg]: 0.5555555555555556
Precision [pos]: 0.65
Recall [neg]: 0.8108108108108109
Recall [pos]: 0.35135135135135137

CASE STUDY 6: HW4 [Sentiment]

In [19]:
get_nltk_NB(neg_hw4, pos_hw4, 32)
Training classifier
Evaluating NaiveBayesClassifier results...
Accuracy: 0.75
F-measure [neg]: 0.6956521739130435
F-measure [pos]: 0.787878787878788
Precision [neg]: 0.8888888888888888
Precision [pos]: 0.6842105263157895
Recall [neg]: 0.5714285714285714
Recall [pos]: 0.9285714285714286

CASE STUDY 7: HW4 [Deception]

In [27]:
get_nltk_NB(false_lie_hw4, true_lie_hw4, 32)
Training classifier
Evaluating NaiveBayesClassifier results...
Accuracy: 0.5714285714285714
F-measure [neg]: 0.5714285714285714
F-measure [pos]: 0.5714285714285714
Precision [neg]: 0.5714285714285714
Precision [pos]: 0.5714285714285714
Recall [neg]: 0.5714285714285714
Recall [pos]: 0.5714285714285714
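Identical precision, recall, and F-measure across both classes is what symmetric confusion counts produce. As a sanity check (hypothetical helper; the counts below are inferred from the printed metrics, not output by the notebook), the Case Study 7 figures are consistent with 8 true positives, 6 false positives, and 6 false negatives per class:

```python
def prf(tp, fp, fn):
    # Precision, recall, and F1 from raw confusion-matrix counts
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```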
In [50]:
type(false_lie_hw4)
Out[50]:
list
In [51]:
false_lie_hw4[0]
Out[51]:
'Gannon’s Isle Ice Cream served the best ice cream and you better believe it! The place is ideally situated and it is easy to get too. The ice cream is delicious the best I had. There were so many varieties that I had trouble choosing it. I had the chocolate and raspberry. A weird combination but the smooth sweet chocolate combined with the sharp taste of raspberry was devine! Try it!'