SENTIMENT ANALYSIS (PANDAS STYLE!)

STEP 1: Import ALL the things!

Libraries and paths and files

I'm sure there is a cleaner way to do this; please let me know via email!

In [181]:
import os
import pandas as pd
negative = os.listdir('NEG/')
positive = os.listdir('POS/')
In [189]:
# Read every positive review into a list of strings
positive_alltext = []
for file in positive:
    with open('POS/'+file) as f:
        positive_alltext.append(f.read())

# Read every negative review into a list of strings
negative_alltext = []
for file in negative:
    with open('NEG/'+file) as f:
        negative_alltext.append(f.read())
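
One possible cleaner way to do the loading, as a minimal sketch rather than the notebook's own method: wrap the loop in a helper built on pathlib. The name read_reviews and the throwaway demo folder are assumptions made up for illustration. Sorting the filenames is a deliberate choice here, since os.listdir returns files in arbitrary order.

```python
from pathlib import Path
import tempfile

def read_reviews(folder):
    """Return the text of every regular file in `folder`, in sorted filename order."""
    texts = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            texts.append(path.read_text())
    return texts

# Demo on a temporary folder (hypothetical files, not the real POS/ and NEG/ data)
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "b.txt").write_text("a dull film")
    (Path(tmp) / "a.txt").write_text("a great film")
    demo_texts = read_reviews(tmp)

print(demo_texts)  # ['a great film', 'a dull film']
```

With the real data this would be called as read_reviews('POS/') and read_reviews('NEG/').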

STEP 2: Turn that fresh text into a pandas DF and add a column to mark it as either positive or negative

In [183]:
positive_df = pd.DataFrame(positive_alltext)
negative_df = pd.DataFrame(negative_alltext)
In [184]:
positive_df['PoN'] = 'P'
negative_df['PoN'] = 'N'
In [185]:
# Combine the pos and neg dfs
# (DataFrame.append was removed in pandas 2.0, so use pd.concat instead)
all_df = pd.concat([positive_df, negative_df])
In [186]:
# Our results!
all_df
Out[186]:
   0                                                  PoN
0  films adapted from comic books have had plenty...  P
1  you've got mail works alot better than it dese...  P
2  " jaws " is a rare film that grabs your atten...   P
3  every now and then a movie comes along from a ...  P
4  moviemaking is a lot like being the general ma...  P
0  that's exactly how long the movie felt to me ....  N
1  " quest for camelot " is warner bros . ' firs...   N
2  so ask yourself what " 8mm " ( " eight millime...  N
3  synopsis : a mentally unstable man undergoing ...  N
4  capsule : in 2176 on the planet mars police ta...  N
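
Notice that the index in the output above runs 0 to 4 twice, so a lookup like all_df.loc[0] would return two rows. If that gets in the way, one option is to build the labeled frames with a named text column and pass ignore_index=True when combining. This is only a sketch: the tiny pos_texts and neg_texts lists and the column name "text" are made-up stand-ins, not the notebook's data.

```python
import pandas as pd

# Hypothetical stand-ins for positive_alltext and negative_alltext
pos_texts = ["a great film", "loved it"]
neg_texts = ["a dull film"]

labeled = pd.concat(
    [
        pd.DataFrame({"text": pos_texts, "PoN": "P"}),  # scalar label is broadcast to every row
        pd.DataFrame({"text": neg_texts, "PoN": "N"}),
    ],
    ignore_index=True,  # renumber rows 0..n-1 instead of repeating each frame's index
)
print(labeled)
```

The rest of this notebook keeps the original column name 0 and the repeated index, so nothing below depends on this variant.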

STEP 3: TOKENIZE (and clean)!!

In [187]:
''' 
clean_tokens = [word.lower() for word in tokens if word.isalpha()]
IN ENGLISH: for every word in this list of tokens, keep it only if it passes "isalpha()", and lowercase the ones we keep
"isalpha()" meaning "made up of letters only", so numbers and punctuation get dropped
'''

from nltk.tokenize import word_tokenize
def get_tokens(sentence):
    tokens = word_tokenize(sentence)
    clean_tokens = [word.lower() for word in tokens if word.isalpha()]
    return clean_tokens

# The review text lives in the column labeled 0, so apply the helpers to that column directly
all_df['tokenized'] = all_df[0].apply(get_tokens)
all_df['tokenized_count'] = all_df['tokenized'].apply(len)
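
To see what the isalpha() filter actually throws away, here is a tiny sketch on a hand-written token list. The list is an assumption meant to mimic how word_tokenize splits contractions and punctuation into separate tokens; no NLTK call is made, so it runs without any downloads.

```python
# Hand-written tokens (an assumption, roughly what word_tokenize would produce)
sample_tokens = ["You", "'ve", "Got", "Mail", "(", "1998", ")", "is", "n't", "8mm", "."]

# Same comprehension as in get_tokens: keep letters-only tokens, lowercased
clean = [word.lower() for word in sample_tokens if word.isalpha()]

# Everything that got filtered out
dropped = [word for word in sample_tokens if not word.isalpha()]

print(clean)    # ['you', 'got', 'mail', 'is']
print(dropped)  # ["'ve", '(', '1998', ')', "n't", '8mm', '.']
```

Note the side effect: contraction pieces like "'ve" and "n't" vanish, and so do mixed tokens like "8mm". Losing "n't" drops negation, which may matter for sentiment later on.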

STEP 4: Remove Stopwords

In [172]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))
def remove_stopwords(sentence):
    # `sentence` is a list of tokens; keep only the tokens that are not stopwords
    filtered_text = []
    for word in sentence:
        if word not in stop_words:
            filtered_text.append(word)
    return filtered_text
all_df['no_stopwords'] = all_df['tokenized'].apply(remove_stopwords)
all_df['no_stopwords_count'] = all_df['no_stopwords'].apply(len)
In [173]:
all_df
Out[173]:
   0                                                  PoN  tokenized                                          tokenized_count  no_stopwords                                       no_stopwords_count
0  films adapted from comic books have had plenty...  P    [films, adapted, from, comic, books, have, had...  673              [films, adapted, comic, books, plenty, success...  387
1  you've got mail works alot better than it dese...  P    [you, got, mail, works, alot, better, than, it...  412              [got, mail, works, alot, better, deserves, ord...  203
2  " jaws " is a rare film that grabs your atten...   P    [jaws, is, a, rare, film, that, grabs, your, a...  993              [jaws, rare, film, grabs, attention, shows, si...  552
3  every now and then a movie comes along from a ...  P    [every, now, and, then, a, movie, comes, along...  628              [every, movie, comes, along, suspect, studio, ...  326
4  moviemaking is a lot like being the general ma...  P    [moviemaking, is, a, lot, like, being, the, ge...  630              [moviemaking, lot, like, general, manager, nfl...  345
0  that's exactly how long the movie felt to me ....  N    [that, exactly, how, long, the, movie, felt, t...  550              [exactly, long, movie, felt, even, nine, laugh...  308
1  " quest for camelot " is warner bros . ' firs...   N    [quest, for, camelot, is, warner, bros, first,...  444              [quest, camelot, warner, bros, first, attempt,...  247
2  so ask yourself what " 8mm " ( " eight millime...  N    [so, ask, yourself, what, eight, millimeter, i...  527              [ask, eight, millimeter, really, wholesome, su...  283
3  synopsis : a mentally unstable man undergoing ...  N    [synopsis, a, mentally, unstable, man, undergo...  706              [synopsis, mentally, unstable, man, undergoing...  371
4  capsule : in 2176 on the planet mars police ta...  N    [capsule, in, on, the, planet, mars, police, t...  649              [capsule, planet, mars, police, taking, custod...  355
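
The stopword filter can also be written as a one-line list comprehension. Here is a minimal sketch on a few tokens from the " jaws " review; the tiny demo_stop_words set is a made-up stand-in for the real NLTK list, so this runs without downloading the stopwords corpus.

```python
# Hypothetical mini stopword set (the real one comes from stopwords.words("english"))
demo_stop_words = {"is", "a", "that", "your", "the"}

tokens = ["jaws", "is", "a", "rare", "film", "that", "grabs", "your", "attention"]

# Same logic as remove_stopwords, written as a comprehension
filtered = [word for word in tokens if word not in demo_stop_words]

# Share of tokens that survive the filter
kept_share = len(filtered) / len(tokens)

print(filtered)  # ['jaws', 'rare', 'film', 'grabs', 'attention']
```

Keeping the stopwords in a set (as the notebook does with set(stopwords.words("english"))) makes each "not in" check fast, which adds up over thousands of tokens.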

STEP 5: Create a Frequency Distribution

In [174]:
from nltk.probability import FreqDist
def get_most_common(tokens):
    # FreqDist counts how often each token occurs; most_common(1) returns the top (word, count) pair in a list
    fdist = FreqDist(tokens)
    return fdist.most_common(1)
all_df['most_common_unfiltered_word'] = all_df['tokenized'].apply(get_most_common)
In [175]:
from nltk.probability import FreqDist
# Same helper as in the previous cell, redefined with an `n` parameter so it returns the top 5 words
def get_most_common(tokens, n=5):
    fdist = FreqDist(tokens)
    return fdist.most_common(n)
all_df['most_common_filtered_word'] = all_df['no_stopwords'].apply(get_most_common)
In [176]:
all_df
Out[176]:
   0                                                  PoN  tokenized                                          tokenized_count  no_stopwords                                       no_stopwords_count  most_common_unfiltered_word  most_common_filtered_word
0  films adapted from comic books have had plenty...  P    [films, adapted, from, comic, books, have, had...  673              [films, adapted, comic, books, plenty, success...  387                 [(the, 46)]                  [(comic, 5), (hell, 5), (film, 5), (like, 4), ...
1  you've got mail works alot better than it dese...  P    [you, got, mail, works, alot, better, than, it...  412              [got, mail, works, alot, better, deserves, ord...  203                 [(the, 33)]                  [(two, 3), (shop, 3), (much, 3), (fox, 3), (go...
2  " jaws " is a rare film that grabs your atten...   P    [jaws, is, a, rare, film, that, grabs, your, a...  993              [jaws, rare, film, grabs, attention, shows, si...  552                 [(the, 63)]                  [(shark, 16), (jaws, 8), (film, 7), (spielberg...
3  every now and then a movie comes along from a ...  P    [every, now, and, then, a, movie, comes, along...  628              [every, movie, comes, along, suspect, studio, ...  326                 [(the, 35)]                  [(even, 6), (gets, 6), (film, 5), (school, 5),...
4  moviemaking is a lot like being the general ma...  P    [moviemaking, is, a, lot, like, being, the, ge...  630              [moviemaking, lot, like, general, manager, nfl...  345                 [(the, 41)]                  [(jackie, 10), (like, 9), (chan, 8), (got, 4),...
0  that's exactly how long the movie felt to me ....  N    [that, exactly, how, long, the, movie, felt, t...  550              [exactly, long, movie, felt, even, nine, laugh...  308                 [(the, 31)]                  [(grant, 12), (movie, 7), (nine, 5), (hugh, 5)...
1  " quest for camelot " is warner bros . ' firs...   N    [quest, for, camelot, is, warner, bros, first,...  444              [quest, camelot, warner, bros, first, attempt,...  247                 [(the, 21)]                  [(quest, 5), (camelot, 4), (kayley, 4), (disne...
2  so ask yourself what " 8mm " ( " eight millime...  N    [so, ask, yourself, what, eight, millimeter, i...  527              [ask, eight, millimeter, really, wholesome, su...  283                 [(of, 21)]                   [(like, 4), (schumacher, 4), (film, 4), (welle...
3  synopsis : a mentally unstable man undergoing ...  N    [synopsis, a, mentally, unstable, man, undergo...  706              [synopsis, mentally, unstable, man, undergoing...  371                 [(the, 48)]                  [(stalked, 12), (daryl, 7), (stalker, 6), (bro...
4  capsule : in 2176 on the planet mars police ta...  N    [capsule, in, on, the, planet, mars, police, t...  649              [capsule, planet, mars, police, taking, custod...  355                 [(the, 30)]                  [(mars, 14), (ghosts, 10), (carpenter, 8), (fi...
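
The per-review top words above are mostly about each film's subject (shark, jackie, mars), so a possible next step is to pool the counts by label instead of by review. The sketch below uses collections.Counter from the standard library, which offers the same most_common method that FreqDist has (FreqDist is built on Counter). The rows list is made-up sample data and counts_by_label is a hypothetical name, not something from the notebook.

```python
from collections import Counter

# Hypothetical (label, tokens) pairs standing in for the PoN and no_stopwords columns
rows = [
    ("P", ["shark", "jaws", "shark"]),
    ("N", ["mars", "ghosts", "mars"]),
    ("P", ["film", "shark"]),
]

# One running Counter per label, updated with each review's tokens
counts_by_label = {}
for label, tokens in rows:
    counts_by_label.setdefault(label, Counter()).update(tokens)

top_positive = counts_by_label["P"].most_common(1)
top_negative = counts_by_label["N"].most_common(2)

print(top_positive)  # [('shark', 3)]
print(top_negative)  # [('mars', 2), ('ghosts', 1)]
```

On the real data the pairs could come from zip(all_df['PoN'], all_df['no_stopwords']).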