HW2: VECTORIZATION (Pandas style!)

STEP 1: Import ALL the things

Import libraries

In [2]:
##########################################
# NOTE: I'm toying with the idea of importing each library right above
# the cell that first uses it, so the import makes more sense in context
##########################################
# import os
# import pandas as pd
# from nltk.tokenize import word_tokenize, sent_tokenize
# from nltk.sentiment import SentimentAnalyzer
# from nltk.sentiment.util import *
# from nltk.probability import FreqDist
# from nltk.sentiment.vader import SentimentIntensityAnalyzer
# sid = SentimentIntensityAnalyzer()

Import data from files

In [1]:
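import os

def get_data_from_files(path):
    # Read every file in `path` and return the contents as a list of strings
    results = []
    for file in os.listdir(path):
        # os.path.join works whether or not `path` ends with a slash,
        # and the with-block closes the file even if read() fails
        with open(os.path.join(path, file)) as f:
            results.append(f.read())
    return results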

neg = get_data_from_files('../neg_cornell/')
pos = get_data_from_files('../pos_cornell/')

# neg = get_data_from_files('../neg_hw4/')
# pos = get_data_from_files('../pos_hw4/')
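A variant of the loader, sketched with pathlib. It assumes the reviews are UTF-8 `.txt` files; the notebook does not state the suffix or the encoding, so `pattern` and `encoding` are assumptions, and `read_text_files` is a hypothetical name. Sorting the paths makes the row order repeatable across machines, though it may differ from the order `os.listdir` produced for the outputs shown in this notebook.

```python
import tempfile
from pathlib import Path

def read_text_files(folder, pattern="*.txt", encoding="utf-8"):
    """Return the text of every file in `folder` matching `pattern`, sorted by name."""
    return [p.read_text(encoding=encoding) for p in sorted(Path(folder).glob(pattern))]

# Self-contained demo on a throwaway directory (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("bad . bad .", encoding="utf-8")
    Path(tmp, "b.txt").write_text("a fine film .", encoding="utf-8")
    Path(tmp, "notes.md").write_text("not a review", encoding="utf-8")
    demo_reviews = read_text_files(tmp)

print(demo_reviews)  # the .md file is skipped by the pattern
```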

STEP 2: Prep Data

STEP 2a: Turn that fresh text into a pandas DF

In [2]:
import pandas as pd
neg_df = pd.DataFrame(neg)
pos_df = pd.DataFrame(pos)

STEP 2b: Label it

In [3]:
pos_df['PoN'] = 'P'
neg_df['PoN'] = 'N'

STEP 2c: Combine the dfs

In [4]:
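# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result.
# The original row labels are kept, so 0-999 appear twice in all_df.
all_df = pd.concat([neg_df, pos_df])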
In [5]:
all_df
Out[5]:
     0                                                  PoN
0    bad . bad . \nbad . \nthat one word seems to p...  N
1    isn't it the ultimate sign of a movie's cinema...  N
2    " gordy " is not a movie , it is a 90-minute-...   N
3    disconnect the phone line . \ndon't accept the...  N
4    when robert forster found himself famous again...  N
...  ...                                                ...
995  one of the funniest carry on movies and the th...  P
996  i remember making a pact , right after `patch ...  P
997  barely scrapping by playing at a nyc piano bar...  P
998  if the current trends of hollywood filmmaking ...  P
999  capsule : the director of cure brings a weird ...  P

2000 rows × 2 columns

STEP 3: TOKENIZE (and clean)!!

In [6]:
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.util import *
In [7]:
## Came back and added sentence tokenization for the "Summary experiment"
def get_sentence_tokens(review):
    return sent_tokenize(review)
    
all_df['sentences'] = all_df.apply(lambda x: get_sentence_tokens(x[0]), axis=1)
all_df['num_sentences'] = all_df.apply(lambda x: len(x['sentences']), axis=1)
In [8]:
def get_tokens(sentence):
    tokens = word_tokenize(sentence)
    clean_tokens = [word.lower() for word in tokens if word.isalpha()]
    return clean_tokens

all_df['tokens'] = all_df.apply(lambda x: get_tokens(x[0]), axis=1)
all_df['num_tokens'] = all_df.apply(lambda x: len(x['tokens']), axis=1)
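One side effect of the `isalpha()` filter is worth seeing in isolation. `word_tokenize` splits a contraction such as `don't` into `do` and `n't`, and `n't` is not purely alphabetic, so the negation is dropped; the tokens for review 3 in this notebook start `[disconnect, the, phone, line, do, accept, ...]`. The sketch below works on a hand-written token list (an assumption about what the tokenizer returns for that phrase), and `clean_tokens_keep_negation` is a hypothetical variant, not something the notebook uses.

```python
# Hand-written tokens, assumed to match word_tokenize("don't accept the charges .")
raw_tokens = ["do", "n't", "accept", "the", "charges", "."]

def clean_tokens(tokens):
    """Same filter as get_tokens: lowercase and keep purely alphabetic tokens."""
    return [word.lower() for word in tokens if word.isalpha()]

def clean_tokens_keep_negation(tokens):
    """Hypothetical variant: map the contraction piece "n't" to "not" before filtering."""
    expanded = ["not" if word.lower() == "n't" else word for word in tokens]
    return [word.lower() for word in expanded if word.isalpha()]

plain = clean_tokens(raw_tokens)
with_negation = clean_tokens_keep_negation(raw_tokens)
print(plain)          # ['do', 'accept', 'the', 'charges']
print(with_negation)  # ['do', 'not', 'accept', 'the', 'charges']
```

Whether keeping the negation helps depends on the downstream tool; VADER is run on the raw text here, so only the token-based features are affected.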
In [9]:
all_df
Out[9]:
0 PoN sentences num_sentences tokens num_tokens
0 bad . bad . \nbad . \nthat one word seems to p... N [bad ., bad ., bad ., that one word seems to p... 67 [bad, bad, bad, that, one, word, seems, to, pr... 1071
1 isn't it the ultimate sign of a movie's cinema... N [isn't it the ultimate sign of a movie's cinem... 32 [is, it, the, ultimate, sign, of, a, movie, ci... 553
2 " gordy " is not a movie , it is a 90-minute-... N [ " gordy " is not a movie , it is a 90-minute... 23 [gordy, is, not, a, movie, it, is, a, sesame, ... 478
3 disconnect the phone line . \ndon't accept the... N [disconnect the phone line ., don't accept the... 37 [disconnect, the, phone, line, do, accept, the... 604
4 when robert forster found himself famous again... N [when robert forster found himself famous agai... 29 [when, robert, forster, found, himself, famous... 386
... ... ... ... ... ... ...
995 one of the funniest carry on movies and the th... P [one of the funniest carry on movies and the t... 25 [one, of, the, funniest, carry, on, movies, an... 434
996 i remember making a pact , right after `patch ... P [i remember making a pact , right after `patch... 40 [i, remember, making, a, pact, right, after, p... 652
997 barely scrapping by playing at a nyc piano bar... P [barely scrapping by playing at a nyc piano ba... 23 [barely, scrapping, by, playing, at, a, nyc, p... 345
998 if the current trends of hollywood filmmaking ... P [if the current trends of hollywood filmmaking... 34 [if, the, current, trends, of, hollywood, film... 730
999 capsule : the director of cure brings a weird ... P [capsule : the director of cure brings a weird... 45 [capsule, the, director, of, cure, brings, a, ... 641

2000 rows × 6 columns

STEP 4: Remove Stopwords

In [10]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))
def remove_stopwords(sentence):
    # Keep only the tokens that are not in the NLTK English stopword set
    return [word for word in sentence if word not in stop_words]
all_df['no_sw'] = all_df.apply(lambda x: remove_stopwords(x['tokens']),axis=1)
all_df['num_no_sw'] = all_df.apply(lambda x: len(x['no_sw']),axis=1)
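The NLTK English stopword list contains negation words such as "not", so the filtered tokens lose them: review 2 goes from `[gordy, is, not, a, movie, ...]` to `[gordy, movie, ...]`. A minimal sketch of a filter with a keep-list follows. The tiny `demo_stop_words` set stands in for the real list, and `negation_words` and `remove_stopwords_keep_negation` are hypothetical names, not part of the notebook.

```python
# Tiny stand-in for stopwords.words("english"); the real list is much longer.
demo_stop_words = {"is", "not", "a", "it", "the"}
negation_words = {"not", "no", "nor", "never"}  # hypothetical keep-list

def remove_stopwords_keep_negation(tokens, stop_words, keep=frozenset()):
    """Drop stop words except those explicitly listed in `keep`."""
    return [word for word in tokens if word not in stop_words or word in keep]

tokens = ["gordy", "is", "not", "a", "movie"]
dropped = remove_stopwords_keep_negation(tokens, demo_stop_words)
kept = remove_stopwords_keep_negation(tokens, demo_stop_words, keep=negation_words)
print(dropped)  # ['gordy', 'movie']
print(kept)     # ['gordy', 'not', 'movie']
```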
In [11]:
all_df
Out[11]:
0 PoN sentences num_sentences tokens num_tokens no_sw num_no_sw
0 bad . bad . \nbad . \nthat one word seems to p... N [bad ., bad ., bad ., that one word seems to p... 67 [bad, bad, bad, that, one, word, seems, to, pr... 1071 [bad, bad, bad, one, word, seems, pretty, much... 515
1 isn't it the ultimate sign of a movie's cinema... N [isn't it the ultimate sign of a movie's cinem... 32 [is, it, the, ultimate, sign, of, a, movie, ci... 553 [ultimate, sign, movie, cinematic, ineptitude,... 297
2 " gordy " is not a movie , it is a 90-minute-... N [ " gordy " is not a movie , it is a 90-minute... 23 [gordy, is, not, a, movie, it, is, a, sesame, ... 478 [gordy, movie, sesame, street, skit, bad, one,... 239
3 disconnect the phone line . \ndon't accept the... N [disconnect the phone line ., don't accept the... 37 [disconnect, the, phone, line, do, accept, the... 604 [disconnect, phone, line, accept, charges, any... 323
4 when robert forster found himself famous again... N [when robert forster found himself famous agai... 29 [when, robert, forster, found, himself, famous... 386 [robert, forster, found, famous, appearing, ja... 185
... ... ... ... ... ... ... ... ...
995 one of the funniest carry on movies and the th... P [one of the funniest carry on movies and the t... 25 [one, of, the, funniest, carry, on, movies, an... 434 [one, funniest, carry, movies, third, medical,... 241
996 i remember making a pact , right after `patch ... P [i remember making a pact , right after `patch... 40 [i, remember, making, a, pact, right, after, p... 652 [remember, making, pact, right, patch, adams, ... 361
997 barely scrapping by playing at a nyc piano bar... P [barely scrapping by playing at a nyc piano ba... 23 [barely, scrapping, by, playing, at, a, nyc, p... 345 [barely, scrapping, playing, nyc, piano, bar, ... 177
998 if the current trends of hollywood filmmaking ... P [if the current trends of hollywood filmmaking... 34 [if, the, current, trends, of, hollywood, film... 730 [current, trends, hollywood, filmmaking, conti... 428
999 capsule : the director of cure brings a weird ... P [capsule : the director of cure brings a weird... 45 [capsule, the, director, of, cure, brings, a, ... 641 [capsule, director, cure, brings, weird, compl... 340

2000 rows × 8 columns

STEP 5: Create a Frequency Distribution

In [12]:
from nltk.probability import FreqDist
def get_most_common(tokens):
    fdist = FreqDist(tokens)
    return fdist.most_common(12)
all_df['topwords_unfil'] = all_df.apply(lambda x: get_most_common(x['tokens']),axis=1)
In [13]:
# get_most_common is already defined in the previous cell; reuse it on the filtered tokens
all_df['topwords_fil'] = all_df.apply(lambda x: get_most_common(x['no_sw']),axis=1)
In [14]:
def get_fdist(tokens):
    return (FreqDist(tokens))
    
all_df['freq_dist'] = all_df.apply(lambda x: get_fdist(x['no_sw']),axis=1)
all_df['freq_dist_unfil'] = all_df.apply(lambda x: get_fdist(x['tokens']),axis=1)
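`FreqDist` is a subclass of `collections.Counter`, so the counting and `most_common` behaviour used above can be tried without NLTK. The token list below is made up for illustration.

```python
from collections import Counter

# Made-up tokens; FreqDist(tokens) would count them the same way.
tokens = ["bad", "bad", "bad", "that", "one", "word", "one"]
counts = Counter(tokens)
top_two = counts.most_common(2)
print(top_two)  # [('bad', 3), ('one', 2)]
```

Because the full distribution is stored in `freq_dist`, the top-12 lists can always be regenerated from it with `most_common(12)`.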
In [15]:
all_df
Out[15]:
0 PoN sentences num_sentences tokens num_tokens no_sw num_no_sw topwords_unfil topwords_fil freq_dist freq_dist_unfil
0 bad . bad . \nbad . \nthat one word seems to p... N [bad ., bad ., bad ., that one word seems to p... 67 [bad, bad, bad, that, one, word, seems, to, pr... 1071 [bad, bad, bad, one, word, seems, pretty, much... 515 [(the, 60), (a, 35), (to, 34), (of, 24), (this... [(movie, 17), (bad, 8), (one, 7), (meyer, 6), ... {'bad': 8, 'one': 7, 'word': 1, 'seems': 1, 'p... {'bad': 8, 'that': 19, 'one': 7, 'word': 1, 's...
1 isn't it the ultimate sign of a movie's cinema... N [isn't it the ultimate sign of a movie's cinem... 32 [is, it, the, ultimate, sign, of, a, movie, ci... 553 [ultimate, sign, movie, cinematic, ineptitude,... 297 [(the, 28), (a, 18), (of, 16), (to, 14), (i, 1... [(movie, 7), (one, 6), (first, 5), (much, 4), ... {'ultimate': 1, 'sign': 1, 'movie': 7, 'cinema... {'is': 11, 'it': 11, 'the': 28, 'ultimate': 1,...
2 " gordy " is not a movie , it is a 90-minute-... N [ " gordy " is not a movie , it is a 90-minute... 23 [gordy, is, not, a, movie, it, is, a, sesame, ... 478 [gordy, movie, sesame, street, skit, bad, one,... 239 [(the, 25), (and, 21), (to, 18), (is, 17), (a,... [(gordy, 8), (movie, 5), (one, 4), (stupid, 4)... {'gordy': 8, 'movie': 5, 'sesame': 1, 'street'... {'gordy': 8, 'is': 17, 'not': 3, 'a': 17, 'mov...
3 disconnect the phone line . \ndon't accept the... N [disconnect the phone line ., don't accept the... 37 [disconnect, the, phone, line, do, accept, the... 604 [disconnect, phone, line, accept, charges, any... 323 [(the, 41), (of, 17), (a, 17), (to, 16), (and,... [(hanging, 9), (sisters, 5), (ryan, 4), (time,... {'disconnect': 1, 'phone': 2, 'line': 1, 'acce... {'disconnect': 1, 'the': 41, 'phone': 2, 'line...
4 when robert forster found himself famous again... N [when robert forster found himself famous agai... 29 [when, robert, forster, found, himself, famous... 386 [robert, forster, found, famous, appearing, ja... 185 [(the, 21), (it, 11), (i, 10), (to, 10), (of, ... [(film, 5), (movie, 5), (american, 4), (perfek... {'robert': 2, 'forster': 3, 'found': 1, 'famou... {'when': 2, 'robert': 2, 'forster': 3, 'found'...
... ... ... ... ... ... ... ... ... ... ... ... ...
995 one of the funniest carry on movies and the th... P [one of the funniest carry on movies and the t... 25 [one, of, the, funniest, carry, on, movies, an... 434 [one, funniest, carry, movies, third, medical,... 241 [(the, 26), (and, 21), (of, 11), (a, 10), (is,... [(nookey, 9), (hawtrey, 5), (carry, 4), (dr, 4... {'one': 1, 'funniest': 1, 'carry': 4, 'movies'... {'one': 1, 'of': 11, 'the': 26, 'funniest': 1,...
996 i remember making a pact , right after `patch ... P [i remember making a pact , right after `patch... 40 [i, remember, making, a, pact, right, after, p... 652 [remember, making, pact, right, patch, adams, ... 361 [(the, 44), (of, 29), (and, 19), (a, 15), (it,... [(music, 8), (heart, 7), (craven, 6), (movie, ... {'remember': 1, 'making': 1, 'pact': 1, 'right... {'i': 1, 'remember': 1, 'making': 1, 'a': 15, ...
997 barely scrapping by playing at a nyc piano bar... P [barely scrapping by playing at a nyc piano ba... 23 [barely, scrapping, by, playing, at, a, nyc, p... 345 [barely, scrapping, playing, nyc, piano, bar, ... 177 [(a, 23), (is, 16), (the, 13), (and, 10), (of,... [(like, 4), (hutton, 3), (old, 3), (high, 2), ... {'barely': 1, 'scrapping': 1, 'playing': 1, 'n... {'barely': 1, 'scrapping': 1, 'by': 2, 'playin...
998 if the current trends of hollywood filmmaking ... P [if the current trends of hollywood filmmaking... 34 [if, the, current, trends, of, hollywood, film... 730 [current, trends, hollywood, filmmaking, conti... 428 [(the, 49), (of, 31), (and, 19), (in, 18), (to... [(one, 7), (like, 5), (l, 5), (hollywood, 4), ... {'current': 1, 'trends': 1, 'hollywood': 4, 'f... {'if': 1, 'the': 49, 'current': 1, 'trends': 1...
999 capsule : the director of cure brings a weird ... P [capsule : the director of cure brings a weird... 45 [capsule, the, director, of, cure, brings, a, ... 641 [capsule, director, cure, brings, weird, compl... 340 [(the, 33), (to, 28), (and, 21), (a, 18), (of,... [(computer, 11), (kurosawa, 8), (one, 5), (see... {'capsule': 1, 'director': 1, 'cure': 3, 'brin... {'capsule': 1, 'the': 33, 'director': 1, 'of':...

2000 rows × 12 columns

STEP 6: Try Different Sentiment Analysis Tools

VADER

In [16]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()
def get_vader_score(review):
    return sid.polarity_scores(review)

all_df['vader_all'] = all_df.apply(lambda x: get_vader_score(x[0]),axis=1)
In [17]:
def separate_vader_score(vader_score, key):
    return vader_score[key]

all_df['v_compound'] = all_df.apply(lambda x: separate_vader_score(x['vader_all'], 'compound'),axis=1)
all_df['v_neg'] = all_df.apply(lambda x: separate_vader_score(x['vader_all'], 'neg'),axis=1)
all_df['v_neu'] = all_df.apply(lambda x: separate_vader_score(x['vader_all'], 'neu'),axis=1)
all_df['v_pos'] = all_df.apply(lambda x: separate_vader_score(x['vader_all'], 'pos'),axis=1)
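The compound score can also be used on its own as a rule-based classifier, which gives a baseline to compare against the models in Step 7. The sketch below assumes the commonly used cut-offs of plus or minus 0.05; the notebook itself applies no threshold, so both the cut-off and the `label_from_compound` helper are assumptions. The score dicts are illustrative values in the shape that `polarity_scores` returns.

```python
def label_from_compound(scores, threshold=0.05):
    """Map a VADER-style score dict to 'P', 'N', or 'neutral' using the compound value."""
    compound = scores["compound"]
    if compound >= threshold:
        return "P"
    if compound <= -threshold:
        return "N"
    return "neutral"

# Illustrative score dicts, not computed here.
demo_scores = [
    {"neg": 0.231, "neu": 0.769, "pos": 0.0, "compound": -0.9413},
    {"neg": 0.0, "neu": 0.869, "pos": 0.131, "compound": 0.7876},
    {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0},
]
demo_labels = [label_from_compound(s) for s in demo_scores]
print(demo_labels)  # ['N', 'P', 'neutral']
```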

DIY SUMMARY

In [18]:
all_df[0][17]
Out[18]:
17    about an hour or so into " the jackal , " a ch...
17    meet joe black ( reviewed on nov . 27/98 ) \ns...
Name: 0, dtype: object
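Two rows come back for label 17 because the negative and positive frames each carried their own 0-999 index into the combined frame. A small sketch of the effect, and of one way to avoid it with `ignore_index=True`, on made-up two-row frames (the names `neg_demo`, `pos_demo`, `stacked`, and `renumbered` are illustrative):

```python
import pandas as pd

neg_demo = pd.DataFrame(["bad film", "worse film"])
pos_demo = pd.DataFrame(["good film", "great film"])

# Stacking keeps each frame's own labels: 0, 1, 0, 1
stacked = pd.concat([neg_demo, pos_demo])
# ignore_index=True renumbers the rows: 0, 1, 2, 3
renumbered = pd.concat([neg_demo, pos_demo], ignore_index=True)

print(len(stacked.loc[1]))   # 2: one negative and one positive row share label 1
print(renumbered.loc[1, 0])  # 'worse film': the label is unique again
```

With a renumbered index, a lookup such as `all_df[0][17]` would return a single review.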
In [19]:
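def get_weighted_freq_dist(review, freq_dist):
    # Scale every count by the largest count so the weights fall in (0, 1].
    # NOTE: this overwrites the FreqDist object in place, so the 'freq_dist'
    # column holds the weighted values from this point on. `review` is unused.
    try:
        max_freq = max(freq_dist.values())
    except ValueError:  # empty distribution: no tokens survived filtering
        return 'nope'
    for word in freq_dist.keys():
        freq_dist[word] = freq_dist[word] / max_freq
    return freq_dist

all_df['weighted_freq_dist'] = all_df.apply(lambda x: get_weighted_freq_dist(x['sentences'], x['freq_dist']),axis=1)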
In [20]:
def get_sentence_score(review, freq_dist):
    sentence_scores = {}
    for sent in review:
        for word in word_tokenize(sent.lower()):  # use the imported function; `nltk` is never imported by name
            if word in freq_dist.keys():
                if len(sent.split(' ')) < 30:
                    if sent not in sentence_scores.keys():
                        sentence_scores[sent] = freq_dist[word]
                    else:
                        sentence_scores[sent] += freq_dist[word]
    return sentence_scores

all_df['sentence_scores'] = all_df.apply(lambda x: get_sentence_score(x['sentences'], x['freq_dist']),axis=1)
In [21]:
def get_summary_sentences(sentence_scores):
    sorted_sentences = sorted(sentence_scores.items(), key=lambda kv: kv[1], reverse=True)
    return ''.join(sent[0] for sent in sorted_sentences[:5])

all_df['summary_sentences'] = all_df.apply(lambda x: get_summary_sentences(x['sentence_scores']), axis=1)
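The three cells above amount to a frequency-based extractive summary: weight each word by its count divided by the largest count, score each sentence shorter than 30 words by summing the weights of its words, and keep the top-scoring sentences. A compact, dependency-free sketch of that idea follows. It splits on whitespace instead of calling `word_tokenize` (an assumption that fits this pre-tokenized corpus), and `summarize` and its parameters are hypothetical names.

```python
from collections import Counter

def summarize(sentences, stop_words, top_n=2, max_words=30):
    """Score each sentence by the normalized frequency of its words; return the top ones."""
    words = [w for s in sentences for w in s.lower().split()
             if w.isalpha() and w not in stop_words]
    counts = Counter(words)
    if not counts:
        return []
    max_freq = max(counts.values())
    weights = {w: c / max_freq for w, c in counts.items()}

    scores = {}
    for sent in sentences:
        sent_words = sent.lower().split()
        if len(sent_words) >= max_words:
            continue  # skip very long sentences, as the notebook does
        score = sum(weights.get(w, 0.0) for w in sent_words)
        if score > 0:
            scores[sent] = score
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [sent for sent, _ in ranked[:top_n]]

demo_sentences = [
    "the plot is bad .",
    "the acting is bad and the plot is dull .",
    "i left early .",
]
demo_summary = summarize(demo_sentences, stop_words={"the", "is", "and", "i"})
print(demo_summary)
```

The sketch returns a list of sentences, which leaves the choice of separator to the caller when the pieces are joined into one string.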
In [22]:
summaries = all_df['summary_sentences'].tolist()
In [23]:
summaries[3]
Out[23]:
"cell-phones ring every five minutes , and everyone hurriedly rushes along , leaving marginal time for the frustrated viewer to relate to the sisters' issues and problems .i figured i needed to get in touch with my feminine side , and `hanging up' seemed like an ideal opportunity to do so .ryan's convincing performance and diverting cuteness are two of the more agreeable aspects of `hanging up' .it's certainly a far cry from what one would label as a rewarding experience , but `hanging up' should have at least been enjoyable .maddy ( kudrow ) , the soap opera actress , spends time either contemplating her possible path to stardom or nursing her dog ."

Doing VADER on the Summary Section

In [24]:
all_df['vader_sum_all'] = all_df.apply(lambda x: get_vader_score(x['summary_sentences']),axis=1)
In [25]:
all_df['v_compound_sum'] = all_df.apply(lambda x: separate_vader_score(x['vader_sum_all'], 'compound'),axis=1)
all_df['v_neg_sum'] = all_df.apply(lambda x: separate_vader_score(x['vader_sum_all'], 'neg'),axis=1)
all_df['v_neu_sum'] = all_df.apply(lambda x: separate_vader_score(x['vader_sum_all'], 'neu'),axis=1)
all_df['v_pos_sum'] = all_df.apply(lambda x: separate_vader_score(x['vader_sum_all'], 'pos'),axis=1)

Doing VADER on the Most Frequent Words

In [26]:
def get_freq_words(freq_dist):
    sorted_words = sorted(freq_dist.items(), key=lambda kv: kv[1], reverse=True)
    return ' '.join(word[0] for word in sorted_words[:50])

all_df['v_freq_words'] = all_df.apply(lambda x: get_freq_words(x['freq_dist']), axis=1)

all_df['vader_fq_all'] = all_df.apply(lambda x: get_vader_score(x['v_freq_words']),axis=1)
all_df['v_compound_fd'] = all_df.apply(lambda x: separate_vader_score(x['vader_fq_all'], 'compound'),axis=1)
all_df['v_neg_fd'] = all_df.apply(lambda x: separate_vader_score(x['vader_fq_all'], 'neg'),axis=1)
all_df['v_neu_fd'] = all_df.apply(lambda x: separate_vader_score(x['vader_fq_all'], 'neu'),axis=1)
all_df['v_pos_fd'] = all_df.apply(lambda x: separate_vader_score(x['vader_fq_all'], 'pos'),axis=1)

STEP 7: Test Step 6 with Machine Learning!!

Naive Bayes

In [27]:
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

def get_NB(small_df, labels):
    # Hold out 30% of the reviews for testing; the fixed seed keeps the split repeatable
    x_train, x_test, y_train, y_test = train_test_split(small_df.values, labels, test_size=0.3, random_state = 109)

    gnb = GaussianNB()
    gnb.fit(x_train, y_train)
    y_pred = gnb.predict(x_test)
    print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
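For readers curious what Gaussian Naive Bayes does with numeric features like the VADER scores, here is a from-scratch sketch of the idea: estimate a prior plus a per-feature mean and variance for each class, then pick the class with the highest log posterior. This is a simplified illustration, not scikit-learn's implementation; the function names, the `var_floor` smoothing value, and the toy compound scores are all assumptions.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(rows, labels, var_floor=1e-9):
    """Estimate a prior plus per-feature mean and variance for each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        variances = [sum((v - m) ** 2 for v in col) / n + var_floor
                     for col, m in zip(columns, means)]
        model[label] = (n / len(rows), means, variances)
    return model

def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior for one feature row."""
    def log_posterior(label):
        prior, means, variances = model[label]
        total = math.log(prior)
        for x, m, v in zip(row, means, variances):
            total += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        return total
    return max(model, key=log_posterior)

# Toy data: one feature per review, standing in for a compound score.
toy_rows = [[0.9], [0.8], [0.7], [-0.6], [-0.8], [-0.9]]
toy_labels = ["P", "P", "P", "N", "N", "N"]
toy_model = fit_gaussian_nb(toy_rows, toy_labels)
toy_predictions = [predict_gaussian_nb(toy_model, r) for r in [[0.75], [-0.7]]]
print(toy_predictions)  # ['P', 'N']
```

Since the corpus has 1000 reviews per class, always guessing one label would score about 0.5, which is the reference point for the accuracies reported in the tests that follow.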

TEST 1: Vader Scores (Original)

In [28]:
small_df = all_df.filter(['v_compound','v_pos', 'v_neg', 'v_neu']) # 0.645
get_NB(small_df, all_df['PoN'])
Accuracy: 0.645

TEST 2: Vader Scores (from Summary)

In [29]:
small_df = all_df.filter(['v_compound_sum','v_pos_sum', 'v_neg_sum', 'v_neu_sum']) # 0.59
get_NB(small_df, all_df['PoN'])
Accuracy: 0.59

TEST 3: Vader Scores (original) AND Vader Scores (summary)

In [30]:
small_df = all_df.filter(['v_compound_sum','v_pos_sum', 'v_neg_sum', 'v_neu_sum', 
                          'v_compound','v_pos', 'v_neg', 'v_neu']) # 0.618
get_NB(small_df, all_df['PoN'])
Accuracy: 0.6183333333333333

TEST 4: Vader Scores (50 most frequent -- filtered -- words)

In [31]:
small_df = all_df.filter(['v_compound_fd','v_pos_fd', 'v_neu_fd', 'v_neg_fd']) # 0.598
get_NB(small_df, all_df['PoN'])
Accuracy: 0.5983333333333334

TEST 5: All compound Vader Scores

In [32]:
small_df = all_df.filter(['v_compound_fd','v_compound_sum', 'v_compound']) # 0.615
get_NB(small_df, all_df['PoN'])
Accuracy: 0.615

TEST 6: ALL THE NUMBERS!!

In [33]:
small_df = all_df.filter(['v_compound_sum','v_pos_sum', 'v_neg_sum', 'v_neu_sum', 
                          'v_compound_fd','v_pos_fd', 'v_neg_fd', 'v_neu_fd', 
                          'v_compound','v_pos', 'v_neg', 'v_neu']) # 0.613
get_NB(small_df, all_df['PoN'])
Accuracy: 0.6133333333333333

TEST 7: Test UNFILTERED most frequent words

In [34]:
# get_freq_words (defined under "Doing VADER on the Most Frequent Words") is reused here on the unfiltered counts

all_df['v_freq_words_unfil'] = all_df.apply(lambda x: get_freq_words(x['freq_dist_unfil']), axis=1)

all_df['vader_fd_all_unfil'] = all_df.apply(lambda x: get_vader_score(x['v_freq_words_unfil']),axis=1)

all_df['v_compound_fd_uf'] = all_df.apply(lambda x: separate_vader_score(x['vader_fd_all_unfil'], 'compound'),axis=1)
all_df['v_neg_fd_uf'] = all_df.apply(lambda x: separate_vader_score(x['vader_fd_all_unfil'], 'neg'),axis=1)
all_df['v_neu_fd_uf'] = all_df.apply(lambda x: separate_vader_score(x['vader_fd_all_unfil'], 'neu'),axis=1)
all_df['v_pos_fd_uf'] = all_df.apply(lambda x: separate_vader_score(x['vader_fd_all_unfil'], 'pos'),axis=1)
In [35]:
small_df = all_df.filter(['v_compound_sum','v_pos_sum', 'v_neg_sum', 'v_neu_sum', 
                          'v_compound_fd','v_pos_fd', 'v_neg_fd', 'v_neu_fd', 
                          'v_compound_fd_uf','v_pos_fd_uf', 'v_neg_fd_uf', 'v_neu_fd_uf',
                          'v_compound','v_pos', 'v_neg', 'v_neu']) # 0.62
get_NB(small_df, all_df['PoN'])
Accuracy: 0.62
In [36]:
small_df = all_df.filter(['v_compound_fd_uf','v_pos_fd_uf', 'v_neg_fd_uf', 'v_neu_fd_uf']) # 0.603
get_NB(small_df, all_df['PoN'])
Accuracy: 0.6033333333333334
In [37]:
summaries_pos = all_df[all_df['PoN'] == 'P']
summaries_neg = all_df[all_df['PoN'] == 'N']
In [38]:
summaries_pos_list = summaries_pos['summary_sentences'].tolist()
summaries_neg_list = summaries_neg['summary_sentences'].tolist()
In [39]:
summaries_pos_list[:1]
Out[39]:
['charles walks in on amy and oscar having a drink one night , as oscar and amy have become great friends , but he doesn\'t seem to mind .neve is delightful as her conflicted character , who feels love for oscar , but knows , based on rumors , that he is gay .the bottom line : three to tango is a light , sharp , snappy romantic comedy with a superb ending , and great stars .well , another popular phrase of the 90\'s is " all good things must come to an end , " and this stays true for oscar as well .oscar gladly takes the job , and meets amy at an art show of hers , and sparks fly between the two from the get go .']
In [40]:
summaries_neg_list[:1]
Out[40]:
["but the wretched dialogue goes along well with the wretched quality of everything else in this movie .i don't know , but all the big words in the world wouldn't be able to disguise the bad writing and even worse acting .hey , it a sexist movie , so i'm writing a sexist review .this goes along with the rest of the idiotic thinking in the movie .there are a couple of other idiotic subplots thrown in for good measure , but the fame is the one that pretty much sums up this thing ."]
In [42]:
### VERSION 1
#     all_words_neg = sentim_analyzer.all_words([mark_negation(doc) for doc in training_docs])
#     unigram_feats = sentim_analyzer.unigram_word_feats(all_words_neg)
#     sentim_analyzer.add_feat_extractor(extract_unigram_feats, unigrams=unigram_feats)
#     training_set = sentim_analyzer.apply_features(training_docs)
#     test_set = sentim_analyzer.apply_features(testing_docs)
sentim_analyzer = SentimentAnalyzer()

def get_nltk_negs(tokens):
    all_words_neg = sentim_analyzer.all_words([mark_negation(tokens)])
    return all_words_neg

def get_unigram_feats(neg_tokens):
    unigram_feats = sentim_analyzer.unigram_word_feats(neg_tokens)
    return unigram_feats
    
all_df['nltk_negs'] = all_df.apply(lambda x: get_nltk_negs(x['tokens']), axis=1)
all_df['unigram_feats'] = all_df.apply(lambda x: get_unigram_feats(x['nltk_negs']), axis=1)
# all_df['nltk_unfil'] = all_df.apply(lambda x: get_nltk_data(x['tokens']), axis=1)
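A note on why so many tokens end up tagged `_NEG` in `nltk_negs`. As far as I understand it, `mark_negation` tags every token after a negation word until it reaches clause-level punctuation. The `tokens` column had all punctuation removed by the `isalpha()` filter, so once a negation appears the scope never closes. The sketch below is a simplified stand-in for that behaviour, not NLTK's code; `NEGATION_WORDS` is a small hypothetical subset and `mark_negation_sketch` is an illustrative name.

```python
import re

NEGATION_WORDS = {"not", "no", "never"}   # small hypothetical subset
CLAUSE_PUNCT = re.compile(r"^[.:;!?]$")   # punctuation that ends a negation scope

def mark_negation_sketch(tokens):
    """Append _NEG to tokens after a negation word until clause punctuation appears."""
    marked, in_scope = [], False
    for token in tokens:
        if token in NEGATION_WORDS and not in_scope:
            in_scope = True
            marked.append(token)
        elif in_scope and CLAUSE_PUNCT.match(token):
            in_scope = False
            marked.append(token)
        elif in_scope:
            marked.append(token + "_NEG")
        else:
            marked.append(token)
    return marked

with_punct = ["gordy", "is", "not", "a", "movie", ".", "it", "is", "a", "skit"]
no_punct = [t for t in with_punct if t.isalpha()]

marked_with_punct = mark_negation_sketch(with_punct)
marked_no_punct = mark_negation_sketch(no_punct)
print(marked_with_punct)  # scope ends at the "."
print(marked_no_punct)    # scope runs to the end of the list
```

Running the negation marking on tokens that still contain punctuation would keep the scope local to each clause.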
In [43]:
### VERSION 2
#     all_words_neg = sentim_analyzer.all_words([mark_negation(doc) for doc in training_docs])
#     unigram_feats = sentim_analyzer.unigram_word_feats(all_words_neg)
#     sentim_analyzer.add_feat_extractor(extract_unigram_feats, unigrams=unigram_feats)
#     training_set = sentim_analyzer.apply_features(training_docs)
#     test_set = sentim_analyzer.apply_features(testing_docs)

def get_nltk_data(tokens):
    # Build a fresh analyzer per review; reusing one shared analyzer would
    # register another feature extractor on every row.
    analyzer = SentimentAnalyzer()
    marked_tokens = mark_negation(tokens)
    unigram_feats = analyzer.unigram_word_feats(analyzer.all_words([marked_tokens]))
    analyzer.add_feat_extractor(extract_unigram_feats, unigrams=unigram_feats)
    # extract_features takes one tokenized document and returns its feature dict
    return analyzer.extract_features(marked_tokens)

nltk_df = pd.DataFrame()
nltk_df['nltk_data'] = all_df.apply(lambda x: get_nltk_data(x['tokens']), axis=1)
In [44]:
all_df['nltk_all'] = 0
In [45]:
all_df['nltk_all']
Out[45]:
0      0
1      0
2      0
3      0
4      0
      ..
995    0
996    0
997    0
998    0
999    0
Name: nltk_all, Length: 2000, dtype: int64
In [46]:
all_df
Out[46]:
0 PoN sentences num_sentences tokens num_tokens no_sw num_no_sw topwords_unfil topwords_fil ... v_pos_fd v_freq_words_unfil vader_fd_all_unfil v_compound_fd_uf v_neg_fd_uf v_neu_fd_uf v_pos_fd_uf nltk_negs unigram_feats nltk_all
0 bad . bad . \nbad . \nthat one word seems to p... N [bad ., bad ., bad ., that one word seems to p... 67 [bad, bad, bad, that, one, word, seems, to, pr... 1071 [bad, bad, bad, one, word, seems, pretty, much... 515 [(the, 60), (a, 35), (to, 34), (of, 24), (this... [(movie, 17), (bad, 8), (one, 7), (meyer, 6), ... ... 0.219 the a to of this that i in is movie it and you... {'neg': 0.046, 'neu': 0.954, 'pos': 0.0, 'comp... -0.3071 0.046 0.954 0.000 [bad, bad, bad, that, one, word, seems, to, pr... [the_NEG, to_NEG, a_NEG, of_NEG, this_NEG, i_N... 0
1 isn't it the ultimate sign of a movie's cinema... N [isn't it the ultimate sign of a movie's cinem... 32 [is, it, the, ultimate, sign, of, a, movie, ci... 553 [ultimate, sign, movie, cinematic, ineptitude,... 297 [(the, 28), (a, 18), (of, 16), (to, 14), (i, 1... [(movie, 7), (one, 6), (first, 5), (much, 4), ... ... 0.173 the a of to i is it and movie this in some one... {'neg': 0.1, 'neu': 0.9, 'pos': 0.0, 'compound... -0.6262 0.100 0.900 0.000 [is, it, the, ultimate, sign, of, a, movie, ci... [the_NEG, a_NEG, of_NEG, i_NEG, to_NEG, is_NEG... 0
2 " gordy " is not a movie , it is a 90-minute-... N [ " gordy " is not a movie , it is a 90-minute... 23 [gordy, is, not, a, movie, it, is, a, sesame, ... 478 [gordy, movie, sesame, street, skit, bad, one,... 239 [(the, 25), (and, 21), (to, 18), (is, 17), (a,... [(gordy, 8), (movie, 5), (one, 4), (stupid, 4)... ... 0.103 the and to is a it of this gordy that but on m... {'neg': 0.231, 'neu': 0.769, 'pos': 0.0, 'comp... -0.9413 0.231 0.769 0.000 [gordy, is, not, a_NEG, movie_NEG, it_NEG, is_... [the_NEG, and_NEG, to_NEG, a_NEG, is_NEG, it_N... 0
3 disconnect the phone line . \ndon't accept the... N [disconnect the phone line ., don't accept the... 37 [disconnect, the, phone, line, do, accept, the... 604 [disconnect, phone, line, accept, charges, any... 323 [(the, 41), (of, 17), (a, 17), (to, 16), (and,... [(hanging, 9), (sisters, 5), (ryan, 4), (time,... ... 0.248 the of a to and is up hanging in as for an tha... {'neg': 0.0, 'neu': 0.869, 'pos': 0.131, 'comp... 0.7876 0.000 0.869 0.131 [disconnect, the, phone, line, do, accept, the... [the, the_NEG, a_NEG, is_NEG, and, of_NEG, to,... 0
4 when robert forster found himself famous again... N [when robert forster found himself famous agai... 29 [when, robert, forster, found, himself, famous... 386 [robert, forster, found, famous, appearing, ja... 185 [(the, 21), (it, 11), (i, 10), (to, 10), (of, ... [(film, 5), (movie, 5), (american, 4), (perfek... ... 0.000 the it i to of and a was is you for film this ... {'neg': 0.056, 'neu': 0.944, 'pos': 0.0, 'comp... -0.4215 0.056 0.944 0.000 [when, robert, forster, found, himself, famous... [the_NEG, it_NEG, of_NEG, and_NEG, i_NEG, to_N... 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 one of the funniest carry on movies and the th... P [one of the funniest carry on movies and the t... 25 [one, of, the, funniest, carry, on, movies, an... 434 [one, funniest, carry, movies, third, medical,... 241 [(the, 26), (and, 21), (of, 11), (a, 10), (is,... [(nookey, 9), (hawtrey, 5), (carry, 4), (dr, 4... ... 0.266 the and of a is to on nookey as in who from hi... {'neg': 0.041, 'neu': 0.862, 'pos': 0.097, 'co... 0.4576 0.041 0.862 0.097 [one, of, the, funniest, carry, on, movies, an... [the, and, the_NEG, to, nookey, and_NEG, of, a... 0
996 i remember making a pact , right after `patch ... P [i remember making a pact , right after `patch... 40 [i, remember, making, a, pact, right, after, p... 652 [remember, making, pact, right, patch, adams, ... 361 [(the, 44), (of, 29), (and, 19), (a, 15), (it,... [(music, 8), (heart, 7), (craven, 6), (movie, ... ... 0.236 the of and a it to is with in but her music he... {'neg': 0.0, 'neu': 0.866, 'pos': 0.134, 'comp... 0.8047 0.000 0.866 0.134 [i, remember, making, a, pact, right, after, p... [the_NEG, of_NEG, and_NEG, it_NEG, a_NEG, is_N... 0
997 barely scrapping by playing at a nyc piano bar... P [barely scrapping by playing at a nyc piano ba... 23 [barely, scrapping, by, playing, at, a, nyc, p... 345 [barely, scrapping, playing, nyc, piano, bar, ... 177 [(a, 23), (is, 16), (the, 13), (and, 10), (of,... [(like, 4), (hutton, 3), (old, 3), (high, 2), ... ... 0.196 a is the and of with his for in to like she it... {'neg': 0.056, 'neu': 0.783, 'pos': 0.162, 'co... 0.7273 0.056 0.783 0.162 [barely, scrapping, by, playing, at, a, nyc, p... [a_NEG, is_NEG, a, the, with_NEG, the_NEG, for... 0
998 if the current trends of hollywood filmmaking ... P [if the current trends of hollywood filmmaking... 34 [if, the, current, trends, of, hollywood, film... 730 [current, trends, hollywood, filmmaking, conti... 428 [(the, 49), (of, 31), (and, 19), (in, 18), (to... [(one, 7), (like, 5), (l, 5), (hollywood, 4), ... ... 0.166 the of and in to that a is his by one as for l... {'neg': 0.0, 'neu': 0.859, 'pos': 0.141, 'comp... 0.7506 0.000 0.859 0.141 [if, the, current, trends, of, hollywood, film... [the, the_NEG, of_NEG, of, and_NEG, to, in_NEG... 0
999 capsule : the director of cure brings a weird ... P [capsule : the director of cure brings a weird... 45 [capsule, the, director, of, cure, brings, a, ... 641 [capsule, director, cure, brings, weird, compl... 340 [(the, 33), (to, 28), (and, 21), (a, 18), (of,... [(computer, 11), (kurosawa, 8), (one, 5), (see... ... 0.136 the to and a of is his computer are with on no... {'neg': 0.082, 'neu': 0.828, 'pos': 0.09, 'com... 0.3497 0.082 0.828 0.090 [capsule, the, director, of, cure, brings, a, ... [the_NEG, to_NEG, and_NEG, a_NEG, of_NEG, is_N... 0

2000 rows × 40 columns

In [47]:
from nltk.tokenize import casual_tokenize
from collections import Counter
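# NOTE: despite the column name, nothing is filtered here; the counts include stopwords and punctuation
all_df['bow_nosw'] = all_df.apply(lambda x: Counter(casual_tokenize(x[0])), axis=1)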
In [48]:
all_df
Out[48]:
0 PoN sentences num_sentences tokens num_tokens no_sw num_no_sw topwords_unfil topwords_fil ... v_freq_words_unfil vader_fd_all_unfil v_compound_fd_uf v_neg_fd_uf v_neu_fd_uf v_pos_fd_uf nltk_negs unigram_feats nltk_all bow_nosw
0 bad . bad . \nbad . \nthat one word seems to p... N [bad ., bad ., bad ., that one word seems to p... 67 [bad, bad, bad, that, one, word, seems, to, pr... 1071 [bad, bad, bad, one, word, seems, pretty, much... 515 [(the, 60), (a, 35), (to, 34), (of, 24), (this... [(movie, 17), (bad, 8), (one, 7), (meyer, 6), ... ... the a to of this that i in is movie it and you... {'neg': 0.046, 'neu': 0.954, 'pos': 0.0, 'comp... -0.3071 0.046 0.954 0.000 [bad, bad, bad, that, one, word, seems, to, pr... [the_NEG, to_NEG, a_NEG, of_NEG, this_NEG, i_N... 0 {'bad': 8, '.': 62, 'that': 19, 'one': 7, 'wor...
1 isn't it the ultimate sign of a movie's cinema... N [isn't it the ultimate sign of a movie's cinem... 32 [is, it, the, ultimate, sign, of, a, movie, ci... 553 [ultimate, sign, movie, cinematic, ineptitude,... 297 [(the, 28), (a, 18), (of, 16), (to, 14), (i, 1... [(movie, 7), (one, 6), (first, 5), (much, 4), ... ... the a of to i is it and movie this in some one... {'neg': 0.1, 'neu': 0.9, 'pos': 0.0, 'compound... -0.6262 0.100 0.900 0.000 [is, it, the, ultimate, sign, of, a, movie, ci... [the_NEG, a_NEG, of_NEG, i_NEG, to_NEG, is_NEG... 0 {'isn't': 2, 'it': 9, 'the': 28, 'ultimate': 1...
2 " gordy " is not a movie , it is a 90-minute-... N [ " gordy " is not a movie , it is a 90-minute... 23 [gordy, is, not, a, movie, it, is, a, sesame, ... 478 [gordy, movie, sesame, street, skit, bad, one,... 239 [(the, 25), (and, 21), (to, 18), (is, 17), (a,... [(gordy, 8), (movie, 5), (one, 4), (stupid, 4)... ... the and to is a it of this gordy that but on m... {'neg': 0.231, 'neu': 0.769, 'pos': 0.0, 'comp... -0.9413 0.231 0.769 0.000 [gordy, is, not, a_NEG, movie_NEG, it_NEG, is_... [the_NEG, and_NEG, to_NEG, a_NEG, is_NEG, it_N... 0 {'"': 12, 'gordy': 8, 'is': 16, 'not': 3, 'a':...
3 disconnect the phone line . \ndon't accept the... N [disconnect the phone line ., don't accept the... 37 [disconnect, the, phone, line, do, accept, the... 604 [disconnect, phone, line, accept, charges, any... 323 [(the, 41), (of, 17), (a, 17), (to, 16), (and,... [(hanging, 9), (sisters, 5), (ryan, 4), (time,... ... the of a to and is up hanging in as for an tha... {'neg': 0.0, 'neu': 0.869, 'pos': 0.131, 'comp... 0.7876 0.000 0.869 0.131 [disconnect, the, phone, line, do, accept, the... [the, the_NEG, a_NEG, is_NEG, and, of_NEG, to,... 0 {'disconnect': 1, 'the': 41, 'phone': 2, 'line...
4 when robert forster found himself famous again... N [when robert forster found himself famous agai... 29 [when, robert, forster, found, himself, famous... 386 [robert, forster, found, famous, appearing, ja... 185 [(the, 21), (it, 11), (i, 10), (to, 10), (of, ... [(film, 5), (movie, 5), (american, 4), (perfek... ... the it i to of and a was is you for film this ... {'neg': 0.056, 'neu': 0.944, 'pos': 0.0, 'comp... -0.4215 0.056 0.944 0.000 [when, robert, forster, found, himself, famous... [the_NEG, it_NEG, of_NEG, and_NEG, i_NEG, to_N... 0 {'when': 2, 'robert': 2, 'forster': 3, 'found'...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 one of the funniest carry on movies and the th... P [one of the funniest carry on movies and the t... 25 [one, of, the, funniest, carry, on, movies, an... 434 [one, funniest, carry, movies, third, medical,... 241 [(the, 26), (and, 21), (of, 11), (a, 10), (is,... [(nookey, 9), (hawtrey, 5), (carry, 4), (dr, 4... ... the and of a is to on nookey as in who from hi... {'neg': 0.041, 'neu': 0.862, 'pos': 0.097, 'co... 0.4576 0.041 0.862 0.097 [one, of, the, funniest, carry, on, movies, an... [the, and, the_NEG, to, nookey, and_NEG, of, a... 0 {'one': 1, 'of': 11, 'the': 26, 'funniest': 1,...
996 i remember making a pact , right after `patch ... P [i remember making a pact , right after `patch... 40 [i, remember, making, a, pact, right, after, p... 652 [remember, making, pact, right, patch, adams, ... 361 [(the, 44), (of, 29), (and, 19), (a, 15), (it,... [(music, 8), (heart, 7), (craven, 6), (movie, ... ... the of and a it to is with in but her music he... {'neg': 0.0, 'neu': 0.866, 'pos': 0.134, 'comp... 0.8047 0.000 0.866 0.134 [i, remember, making, a, pact, right, after, p... [the_NEG, of_NEG, and_NEG, it_NEG, a_NEG, is_N... 0 {'i': 1, 'remember': 1, 'making': 1, 'a': 15, ...
997 barely scrapping by playing at a nyc piano bar... P [barely scrapping by playing at a nyc piano ba... 23 [barely, scrapping, by, playing, at, a, nyc, p... 345 [barely, scrapping, playing, nyc, piano, bar, ... 177 [(a, 23), (is, 16), (the, 13), (and, 10), (of,... [(like, 4), (hutton, 3), (old, 3), (high, 2), ... ... a is the and of with his for in to like she it... {'neg': 0.056, 'neu': 0.783, 'pos': 0.162, 'co... 0.7273 0.056 0.783 0.162 [barely, scrapping, by, playing, at, a, nyc, p... [a_NEG, is_NEG, a, the, with_NEG, the_NEG, for... 0 {'barely': 1, 'scrapping': 1, 'by': 2, 'playin...
998 if the current trends of hollywood filmmaking ... P [if the current trends of hollywood filmmaking... 34 [if, the, current, trends, of, hollywood, film... 730 [current, trends, hollywood, filmmaking, conti... 428 [(the, 49), (of, 31), (and, 19), (in, 18), (to... [(one, 7), (like, 5), (l, 5), (hollywood, 4), ... ... the of and in to that a is his by one as for l... {'neg': 0.0, 'neu': 0.859, 'pos': 0.141, 'comp... 0.7506 0.000 0.859 0.141 [if, the, current, trends, of, hollywood, film... [the, the_NEG, of_NEG, of, and_NEG, to, in_NEG... 0 {'if': 1, 'the': 49, 'current': 1, 'trends': 1...
999 capsule : the director of cure brings a weird ... P [capsule : the director of cure brings a weird... 45 [capsule, the, director, of, cure, brings, a, ... 641 [capsule, director, cure, brings, weird, compl... 340 [(the, 33), (to, 28), (and, 21), (a, 18), (of,... [(computer, 11), (kurosawa, 8), (one, 5), (see... ... the to and a of is his computer are with on no... {'neg': 0.082, 'neu': 0.828, 'pos': 0.09, 'com... 0.3497 0.082 0.828 0.090 [capsule, the, director, of, cure, brings, a, ... [the_NEG, to_NEG, and_NEG, a_NEG, of_NEG, is_N... 0 {'capsule': 1, ':': 1, 'the': 33, 'director': ...

2000 rows × 41 columns

In [ ]:
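# Scratch cell, left commented out: builds a per-review bag of words by
# counting the casual_tokenize tokens of the raw text in column 0.
# Note that nothing here filters stopwords, despite the 'bow_nosw' name.
# from nltk.tokenize import casual_tokenize
# from collections import Counter
# all_df['bow_nosw'] = all_df.apply(lambda x: Counter(casual_tokenize(x[0])), axis=1)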
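
The bag-of-words idea in that commented-out cell can be tried on a single string without loading the corpus. This is only a minimal sketch: the helper name `bag_of_words` is made up for illustration, and a plain whitespace split stands in for nltk's `casual_tokenize` (an assumption that suits these reviews only because they are already space-separated, punctuation included).

```python
from collections import Counter


def bag_of_words(text):
    """Count how often each token occurs in one review.

    Assumption: tokens are separated by whitespace. This is a stand-in
    for nltk's casual_tokenize, not a replacement for it.
    """
    return Counter(text.split())


# The opening of the first negative review, as shown in the table output.
sample = "bad . bad . \nbad . \nthat one word"
bow = bag_of_words(sample)

# Punctuation is counted as a token too, just like in the bow_nosw column.
print(bow)
```

To fill a whole column, the same helper could presumably be passed to something like `all_df[0].apply(bag_of_words)`; that call is not run here.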