HW3 -- DEALING WITH DIRTY DATA

STEP 1: Import the dirty data

In [47]:
import pandas as pd
dirtyFile = pd.read_csv('moviereviewRAW.csv')
dirtyFile[:2]
Out[47]:
text reviewclass Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 ... Unnamed: 168 Unnamed: 169 Unnamed: 170 Unnamed: 171 Unnamed: 172 Unnamed: 173 Unnamed: 174 Unnamed: 175 Unnamed: 176 Unnamed: 177
0 'plot : two teen couples go to a church party drink and then drive . \nthey get into an acc... but his girlfriend continues to see him in he... and has nightmares . \nwhat\'s the deal ? \nw... but presents it in a very bad package . \nwhi... since i generally applaud films which attempt... mess with your head and such ( lost highway &... but there are good and bad ways of making all... and these folks just didn\'t snag this one co... but executed it terribly . \nso what are the ... ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 'the happy bastard\'s quick movie review \ndam... virus still feels very empty like a movie going for all flash and no subst... we don\'t know the origin of what took over t... and of course we don\'t know why donald sutherland is stumb... it\'s just \" hey let\'s chase these people around with some ro... even from the likes of curtis . \nyou\'re mor... ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

2 rows × 178 columns
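
The reviews themselves contain commas, so read_csv splits each one across dozens of "Unnamed" columns; the 178-column frame above is that damage, not real structure. A minimal alternative sketch, assuming the raw file holds one review per physical line after the header, is to skip the CSV parser and read the lines directly (the raw_df name and 'raw' column are illustrative assumptions):

with open('moviereviewRAW.csv', encoding='utf-8') as f:
    next(f)  # skip the header line ("text,reviewclass,...")
    raw_df = pd.DataFrame({'raw': [line.rstrip('\n') for line in f]})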

STEP 2: Join each row's split columns back together

In [48]:
df = pd.DataFrame()
# Re-join every non-null column in each row with commas, restoring each review as one string.
df['all'] = dirtyFile.apply(
    lambda x: ','.join(x.dropna().astype(str)),
    axis=1)
df[:4]
Out[48]:
all
0 'plot : two teen couples go to a church party ...
1 'the happy bastard\'s quick movie review \ndam...
2 'it is movies like these that make a jaded mov...
3 ' \" quest for camelot \" is warner bros . \' ...

STEP 3: Get the label

In [49]:
# The third-to-last character of each joined row is the class: 'n' for neg, 'p' for pos.
df['label'] = df.apply(lambda x: x['all'][-3], axis=1)
df[:4]
Out[49]:
all label
0 'plot : two teen couples go to a church party ... n
1 'the happy bastard\'s quick movie review \ndam... n
2 'it is movies like these that make a jaded mov... n
3 ' \" quest for camelot \" is warner bros . \' ... n

STEP 4: Clean the data

In [50]:
def clean_rogue_characters(text):
    # Drop literal '\n' sequences first, then any stray backslash, apostrophe, or double-quote characters.
    exclude = ['\\', "'", '"']
    text = ''.join(text.split('\\n'))
    text = ''.join(ch for ch in text if ch not in exclude)
    return text

df['all'] = df['all'].apply(clean_rogue_characters)
df['all'][0]
Out[50]:
'plot : two teen couples go to a church party , drink and then drive . they get into an accident . one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . whats the deal ? watch the movie and  sorta  find out . . . critique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . which is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didnt snag this one correctly . they seem to have taken this pretty neat concept , but executed it terribly . so what are the problems with the movie ? well , its main problem is that its simply too jumbled . it starts off  normal  but then downshifts into this  fantasy  world in which you , as an audience member , have no idea whats going on . there are dreams , there are characters coming back from the dead , there are others who look like the dead , there are strange apparitions , there are disappearances , there are a looooot of chase scenes , there are tons of weird things that happen , and most of it is simply not explained . now i personally dont mind trying to unravel a film every now and then , but when all it does is give me the same clue over and over again , i get kind of fed up after a while , which is this films biggest problem . its obviously got this big secret to hide , but it seems to want to hide it completely until its final five minutes . and do they make things entertaining , thrilling or even engaging , in the meantime ? not really . the sad part is that the arrow and i both dig on flicks like this , so we actually figured most of it out by the half-way point , so all of the strangeness after that did start to make a little bit of sense , but it still didnt the make the film all that more entertaining . i guess the bottom line with movies like this is that you should always make sure that the audience is  into it  even before they are given the secret password to enter your world of understanding . i mean , showing melissa sagemiller running away from visions for about 20 minutes throughout the movie is just plain lazy ! ! okay , we get it . . . there are people chasing her and we dont know who they are . do we really need to see it over and over again ? how about giving us different scenes offering further insight into all of the strangeness going down in the movie ? apparently , the studio took this film away from its director and chopped it up themselves , and it shows . there mightve been a pretty decent teen mind-fuck movie in here somewhere , but i guess  the suits  decided that turning it into a music video with little edge , would make more sense . the actors are pretty good for the most part , although wes bentley just seemed to be playing the exact same character that he did in american beauty , only in a new neighborhood . but my biggest kudos go out to sagemiller , who holds her own throughout the entire film , and actually has you feeling her characters unraveling . overall , the film doesnt stick because it doesnt entertain , its confusing , it rarely excites and it feels pretty redundant for most of its runtime , despite a pretty cool ending and explanation to all of the craziness that came before it . oh , and by the way , this is not a horror or teen slasher flick . . . 
its just packaged to look that way because someone is apparently assuming that the genre is still hot with the kids . it also wrapped production two years ago and has been sitting on the shelves ever since . whatever . . . skip it ! wheres joblo coming from ? a nightmare of elm street 3 ( 7/10 ) - blair witch 2 ( 7/10 ) - the crow ( 9/10 ) - the crow : salvation ( 4/10 ) - lost highway ( 10/10 ) - memento ( 10/10 ) - the others ( 9/10 ) - stir of echoes ( 8/10 ) ,neg'
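
For reference, a sketch of an equivalent cleaner written as a chain of replacements; it should behave the same as clean_rogue_characters, provided the '\n' pairs are removed before the remaining lone backslashes:

def clean_rogue_characters_v2(text):
    # Hypothetical equivalent of clean_rogue_characters; order matters so backslash-n
    # pairs disappear as a unit before the leftover backslashes and quote marks.
    for token in ('\\n', '\\', "'", '"'):
        text = text.replace(token, '')
    return text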

STEP 5: Export to a corpus of text files for analysis

In [51]:
def print_to_file(rating, review, num):
    # Write one review per file, named <label>_dirty_<num>.txt (e.g. n_dirty_0.txt).
    output_filename = str(rating) + '_dirty_' + str(num) + '.txt'
    with open(output_filename, 'w') as outfile:
        outfile.write(review)

for num, row in enumerate(df['all']):
    # row[-3] is the label character ('n'/'p'); row[:-3] is the review text without it.
    print_to_file(row[-3], row[:-3], num)
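
A variant sketch that writes the files into a separate folder and uses the full class name in the filename. The corpus/ folder and the 'neg'/'pos' prefixes are assumptions for illustration, not part of the original export:

import os

os.makedirs('corpus', exist_ok=True)
for num, row in enumerate(df['all']):
    label = 'neg' if row[-3] == 'n' else 'pos'   # expand the single-character label
    path = os.path.join('corpus', label + '_dirty_' + str(num) + '.txt')
    with open(path, 'w') as outfile:
        outfile.write(row[:-3])                  # review text without the trailing label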

STEP 6: Start the party

6a. Tokenize & Create Bag of Words

In [52]:
from nltk.tokenize import casual_tokenize
from collections import Counter
# Bag of words: map each review to a Counter of token -> count, using NLTK's casual tokenizer.
df['bow'] = df.apply(lambda x: Counter(casual_tokenize(x['all'])), axis=1)
In [64]:
freq_df = pd.DataFrame(df['bow'].tolist())
freq_df = freq_df.fillna(0).astype(int)
# Per-document token total (used for normalization below) and the class label.
freq_df['DF_total'] = freq_df.sum(axis=1)
freq_df['DF_label'] = df['label']
# freq_df = freq_df.append(df['label'])
In [65]:
freq_df[:5]
Out[65]:
plot : two teen couples go to a church party ... snoots obstructions obscuring tangerine timbre powaqqatsi keyboardist capitalized DF_total DF_label
0 1 3 2 4 1 2 16 14 1 1 ... 0 0 0 0 0 0 0 0 810 n
1 0 0 0 0 0 0 2 13 0 0 ... 0 0 0 0 0 0 0 0 276 n
2 2 0 1 0 0 2 6 10 0 0 ... 0 0 0 0 0 0 0 0 549 n
3 0 0 0 0 0 0 14 11 0 0 ... 0 0 0 0 0 0 0 0 541 n
4 1 6 1 0 0 0 25 20 0 0 ... 0 0 0 0 0 0 0 0 840 n

5 rows × 47841 columns
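
The same document-term matrix can also be built directly with scikit-learn. A sketch for comparison only; CountVectorizer's default tokenizer differs from casual_tokenize, so the vocabulary and counts will not match exactly, and this matrix is not used downstream:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(df['all'])   # sparse 2000 x vocabulary matrix
X_counts.shape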

NORMALIZING

Do I want to normalize per document, over the whole corpus, or over just the positive corpus? The cells below normalize each document by its own token count (DF_total).

In [69]:
normalized_df = freq_df.copy()
# normalized_df = normalized_df[:10]
# normalized_df.reset_index()
# Split the label off so only numeric term counts go into the feature matrix.
normalized_df_label = normalized_df['DF_label']
normalized_df_no_label = normalized_df.drop('DF_label', axis=1)
normalized_df_no_label
Out[69]:
plot : two teen couples go to a church party ... cage-world snoots obstructions obscuring tangerine timbre powaqqatsi keyboardist capitalized DF_total
0 1 3 2 4 1 2 16 14 1 1 ... 0 0 0 0 0 0 0 0 0 810
1 0 0 0 0 0 0 2 13 0 0 ... 0 0 0 0 0 0 0 0 0 276
2 2 0 1 0 0 2 6 10 0 0 ... 0 0 0 0 0 0 0 0 0 549
3 0 0 0 0 0 0 14 11 0 0 ... 0 0 0 0 0 0 0 0 0 541
4 1 6 1 0 0 0 25 20 0 0 ... 0 0 0 0 0 0 0 0 0 840
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1995 0 2 0 0 0 1 16 25 0 0 ... 0 0 0 0 0 0 0 0 0 853
1996 0 1 0 0 0 1 5 8 0 0 ... 0 0 0 0 0 0 0 0 0 358
1997 0 0 0 0 0 1 32 31 0 0 ... 0 0 0 0 0 0 0 0 0 1190
1998 1 0 0 0 0 1 20 7 0 0 ... 0 0 0 0 0 0 0 0 0 678
1999 1 0 2 0 0 0 26 20 0 0 ... 1 1 1 1 1 1 1 1 1 1113

2000 rows × 47840 columns

In [70]:
# Divide every count in a row by that row's DF_total (the DF_total column itself becomes 1.0).
normalized_df_no_label = normalized_df_no_label.apply(lambda row: row/row['DF_total'], axis=1)
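
The row-wise apply is slow on a 2000 x ~48,000 frame. A vectorized sketch of the same normalization (as a replacement for the apply above, not in addition to it):

# Divide each row by its own DF_total in one shot; axis=0 aligns the divisor with the row index.
normalized_df_no_label = normalized_df_no_label.div(normalized_df_no_label['DF_total'], axis=0)
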
In [71]:
normalized_df_label
Out[71]:
0       n
1       n
2       n
3       n
4       n
       ..
1995    p
1996    p
1997    p
1998    p
1999    p
Name: DF_label, Length: 2000, dtype: object
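
Before fitting a classifier it is worth confirming how the two classes are balanced:

# Count of 'n' vs 'p' labels across the 2000 reviews.
normalized_df_label.value_counts()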

STEP 7: Naive Bayes

In [31]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics

def get_NB(small_df, labels):
    # 70/30 train/test split with a fixed seed so the run is reproducible.
    x_train, x_test, y_train, y_test = train_test_split(
        small_df.values, labels, test_size=0.3, random_state=109)

    # Gaussian Naive Bayes on the normalized term frequencies.
    gnb = GaussianNB()
    gnb.fit(x_train, y_train)
    y_pred = gnb.predict(x_test)
    print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
In [40]:
# normalized_df_label = normalized_df['label']
# normalized_df_no_label = normalized_df.drop('label', axis=1)
In [72]:
get_NB(normalized_df_no_label, normalized_df_label)
Accuracy: 0.7033333333333334
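
GaussianNB assumes continuous, roughly Gaussian features; for word-frequency data MultinomialNB is the more conventional choice. A sketch reusing the same split parameters, offered as an alternative rather than the method used above (its accuracy is not reported here and will differ from the 0.70 result):

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

x_train, x_test, y_train, y_test = train_test_split(
    normalized_df_no_label.values, normalized_df_label, test_size=0.3, random_state=109)
mnb = MultinomialNB().fit(x_train, y_train)
print("Accuracy:", metrics.accuracy_score(y_test, mnb.predict(x_test)))
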
In [43]:
normalized_df_no_label
Out[43]:
plot : two teen couples go to a church party ... cage-world snoots obstructions obscuring tangerine timbre powaqqatsi keyboardist capitalized freq_df_total
label
n 0.001235 0.003704 0.002469 0.004938 0.001235 0.002469 0.019753 0.017284 0.001235 0.001235 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.0
n 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.007246 0.047101 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.0
n 0.003643 0.000000 0.001821 0.000000 0.000000 0.003643 0.010929 0.018215 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.0
n 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.025878 0.020333 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.0
n 0.001190 0.007143 0.001190 0.000000 0.000000 0.000000 0.029762 0.023810 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
p 0.000000 0.002345 0.000000 0.000000 0.000000 0.001172 0.018757 0.029308 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.0
p 0.000000 0.002793 0.000000 0.000000 0.000000 0.002793 0.013966 0.022346 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.0
p 0.000000 0.000000 0.000000 0.000000 0.000000 0.000840 0.026891 0.026050 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.0
p 0.001475 0.000000 0.000000 0.000000 0.000000 0.001475 0.029499 0.010324 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.0
p 0.000898 0.000000 0.001797 0.000000 0.000000 0.000000 0.023360 0.017969 0.000000 0.000000 ... 0.000898 0.000898 0.000898 0.000898 0.000898 0.000898 0.000898 0.000898 0.000898 1.0

2000 rows × 47839 columns
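
Accuracy alone hides how the errors split between the two classes. A sketch of a fuller per-class report on the same split (output not reproduced here):

from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

x_train, x_test, y_train, y_test = train_test_split(
    normalized_df_no_label.values, normalized_df_label, test_size=0.3, random_state=109)
gnb = GaussianNB().fit(x_train, y_train)
print(classification_report(y_test, gnb.predict(x_test)))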
