HW3 -- DEALING WITH DIRTY DATA

STEP 1: Import the dirty data

In [47]:
import pandas as pd
dirtyFile = pd.read_csv('moviereviewRAW.csv')
dirtyFile[:2]
Out[47]:
text reviewclass Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 ... Unnamed: 168 Unnamed: 169 Unnamed: 170 Unnamed: 171 Unnamed: 172 Unnamed: 173 Unnamed: 174 Unnamed: 175 Unnamed: 176 Unnamed: 177
0 'plot : two teen couples go to a church party drink and then drive . \nthey get into an acc... but his girlfriend continues to see him in he... and has nightmares . \nwhat\'s the deal ? \nw... but presents it in a very bad package . \nwhi... since i generally applaud films which attempt... mess with your head and such ( lost highway &... but there are good and bad ways of making all... and these folks just didn\'t snag this one co... but executed it terribly . \nso what are the ... ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 'the happy bastard\'s quick movie review \ndam... virus still feels very empty like a movie going for all flash and no subst... we don\'t know the origin of what took over t... and of course we don\'t know why donald sutherland is stumb... it\'s just \" hey let\'s chase these people around with some ro... even from the likes of curtis . \nyou\'re mor... ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

2 rows × 178 columns
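
The reviews themselves contain commas, so read_csv splits each one across dozens of "Unnamed" columns; the 178-column frame above is that damage, not real structure. A minimal alternative sketch, assuming the raw file holds one review per physical line after the header, is to skip the CSV parser and read the lines directly (the raw_df name and 'raw' column are illustrative assumptions):

with open('moviereviewRAW.csv', encoding='utf-8') as f:
    next(f)  # skip the header line ("text,reviewclass,...")
    raw_df = pd.DataFrame({'raw': [line.rstrip('\n') for line in f]})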

STEP 2: Join each row's split columns back together

In [48]:
df = pd.DataFrame()
# Re-join every non-null column in each row with commas, restoring each review as one string.
df['all'] = dirtyFile.apply(
    lambda x: ','.join(x.dropna().astype(str)),
    axis=1)
df[:4]
Out[48]:
all
0 'plot : two teen couples go to a church party ...
1 'the happy bastard\'s quick movie review \ndam...
2 'it is movies like these that make a jaded mov...
3 ' \" quest for camelot \" is warner bros . \' ...

STEP 3: Get the label

In [49]:
# The third-to-last character of each joined row is the class: 'n' for neg, 'p' for pos.
df['label'] = df.apply(lambda x: x['all'][-3], axis=1)
df[:4]
Out[49]:
all label
0 'plot : two teen couples go to a church party ... n
1 'the happy bastard\'s quick movie review \ndam... n
2 'it is movies like these that make a jaded mov... n
3 ' \" quest for camelot \" is warner bros . \' ... n

STEP 4: Clean the data

In [50]:
def clean_rogue_characters(text):
    # Drop literal '\n' sequences first, then any stray backslash, apostrophe, or double-quote characters.
    exclude = ['\\', "'", '"']
    text = ''.join(text.split('\\n'))
    text = ''.join(ch for ch in text if ch not in exclude)
    return text

df['all'] = df['all'].apply(clean_rogue_characters)
df['all'][0]
Out[50]:
'plot : two teen couples go to a church party , drink and then drive . they get into an accident . one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . whats the deal ? watch the movie and  sorta  find out . . . critique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . which is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didnt snag this one correctly . they seem to have taken this pretty neat concept , but executed it terribly . so what are the problems with the movie ? well , its main problem is that its simply too jumbled . it starts off  normal  but then downshifts into this  fantasy  world in which you , as an audience member , have no idea whats going on . there are dreams , there are characters coming back from the dead , there are others who look like the dead , there are strange apparitions , there are disappearances , there are a looooot of chase scenes , there are tons of weird things that happen , and most of it is simply not explained . now i personally dont mind trying to unravel a film every now and then , but when all it does is give me the same clue over and over again , i get kind of fed up after a while , which is this films biggest problem . its obviously got this big secret to hide , but it seems to want to hide it completely until its final five minutes . and do they make things entertaining , thrilling or even engaging , in the meantime ? not really . the sad part is that the arrow and i both dig on flicks like this , so we actually figured most of it out by the half-way point , so all of the strangeness after that did start to make a little bit of sense , but it still didnt the make the film all that more entertaining . i guess the bottom line with movies like this is that you should always make sure that the audience is  into it  even before they are given the secret password to enter your world of understanding . i mean , showing melissa sagemiller running away from visions for about 20 minutes throughout the movie is just plain lazy ! ! okay , we get it . . . there are people chasing her and we dont know who they are . do we really need to see it over and over again ? how about giving us different scenes offering further insight into all of the strangeness going down in the movie ? apparently , the studio took this film away from its director and chopped it up themselves , and it shows . there mightve been a pretty decent teen mind-fuck movie in here somewhere , but i guess  the suits  decided that turning it into a music video with little edge , would make more sense . the actors are pretty good for the most part , although wes bentley just seemed to be playing the exact same character that he did in american beauty , only in a new neighborhood . but my biggest kudos go out to sagemiller , who holds her own throughout the entire film , and actually has you feeling her characters unraveling . overall , the film doesnt stick because it doesnt entertain , its confusing , it rarely excites and it feels pretty redundant for most of its runtime , despite a pretty cool ending and explanation to all of the craziness that came before it . oh , and by the way , this is not a horror or teen slasher flick . . . 
its just packaged to look that way because someone is apparently assuming that the genre is still hot with the kids . it also wrapped production two years ago and has been sitting on the shelves ever since . whatever . . . skip it ! wheres joblo coming from ? a nightmare of elm street 3 ( 7/10 ) - blair witch 2 ( 7/10 ) - the crow ( 9/10 ) - the crow : salvation ( 4/10 ) - lost highway ( 10/10 ) - memento ( 10/10 ) - the others ( 9/10 ) - stir of echoes ( 8/10 ) ,neg'
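
For reference, a sketch of an equivalent cleaner written as a chain of replacements; it should behave the same as clean_rogue_characters, provided the '\n' pairs are removed before the remaining lone backslashes:

def clean_rogue_characters_v2(text):
    # Hypothetical equivalent of clean_rogue_characters; order matters so backslash-n
    # pairs disappear as a unit before the leftover backslashes and quote marks.
    for token in ('\\n', '\\', "'", '"'):
        text = text.replace(token, '')
    return text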

STEP 5: Export to a corpus of text files for analysis

In [51]:
def print_to_file(rating, review, num):
    # Write one review per file, named <label>_dirty_<num>.txt (e.g. n_dirty_0.txt).
    output_filename = str(rating) + '_dirty_' + str(num) + '.txt'
    with open(output_filename, 'w') as outfile:
        outfile.write(review)

for num, row in enumerate(df['all']):
    # row[-3] is the label character ('n'/'p'); row[:-3] is the review text without it.
    print_to_file(row[-3], row[:-3], num)
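
A variant sketch that writes the files into a separate folder and uses the full class name in the filename. The corpus/ folder and the 'neg'/'pos' prefixes are assumptions for illustration, not part of the original export:

import os

os.makedirs('corpus', exist_ok=True)
for num, row in enumerate(df['all']):
    label = 'neg' if row[-3] == 'n' else 'pos'   # expand the single-character label
    path = os.path.join('corpus', label + '_dirty_' + str(num) + '.txt')
    with open(path, 'w') as outfile:
        outfile.write(row[:-3])                  # review text without the trailing label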

STEP 6: Start the party

6a. Tokenize & Create Bag of Words

In [52]:
from nltk.tokenize import casual_tokenize
from collections import Counter
# Bag of words: map each review to a Counter of token -> count, using NLTK's casual tokenizer.
df['bow'] = df.apply(lambda x: Counter(casual_tokenize(x['all'])), axis=1)
In [64]:
freq_df = pd.DataFrame(df['bow'].tolist())
freq_df = freq_df.fillna(0).astype(int)
# Per-document token total (used for normalization below) and the class label.
freq_df['DF_total'] = freq_df.sum(axis=1)
freq_df['DF_label'] = df['label']
# freq_df = freq_df.append(df['label'])
In [65]:
freq_df[:5]
Out[65]:
plot : two teen couples go to a church party ... snoots obstructions obscuring tangerine timbre powaqqatsi keyboardist capitalized DF_total DF_label
0 1 3 2 4 1 2 16 14 1 1 ... 0 0 0 0 0 0 0 0 810 n
1 0 0 0 0 0 0 2 13 0 0 ... 0 0 0 0 0 0 0 0 276 n
2 2 0 1 0 0 2 6 10 0 0 ... 0 0 0 0 0 0 0 0 549 n
3 0 0 0 0 0 0 14 11 0 0 ... 0 0 0 0 0 0 0 0 541 n
4 1 6 1 0 0 0 25 20 0 0 ... 0 0 0 0 0 0 0 0 840 n

5 rows × 47841 columns
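
The same document-term matrix can also be built directly with scikit-learn. A sketch for comparison only; CountVectorizer's default tokenizer differs from casual_tokenize, so the vocabulary and counts will not match exactly, and this matrix is not used downstream:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(df['all'])   # sparse 2000 x vocabulary matrix
X_counts.shape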

NORMALIZING

Do I want to normalize per document, over the whole corpus, or over just the positive corpus? The cells below normalize each document by its own token count (DF_total).

In [69]:
normalized_df = freq_df.copy()
# normalized_df = normalized_df[:10]
# normalized_df.reset_index()
# Split the label off so only numeric term counts go into the feature matrix.
normalized_df_label = normalized_df['DF_label']
normalized_df_no_label = normalized_df.drop('DF_label', axis=1)
normalized_df_no_label
Out[69]:
plot : two teen couples go to a church party ... cage-world snoots obstructions obscuring tangerine timbre powaqqatsi keyboardist capitalized DF_total
0 1 3 2 4 1 2 16 14 1 1 ... 0 0 0 0 0 0 0 0 0 810
1 0 0 0 0 0 0 2 13 0 0 ... 0 0 0 0 0 0 0 0 0 276
2 2 0 1 0 0 2 6 10 0 0 ... 0 0 0 0 0 0 0 0 0 549
3 0 0 0 0 0 0 14 11 0 0 ... 0 0 0 0 0 0 0 0 0 541
4 1 6 1 0 0 0 25 20 0 0 ... 0 0 0 0 0 0 0 0 0 840
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1995 0 2 0 0 0 1 16 25 0 0 ... 0 0 0 0 0 0 0 0 0 853
1996 0 1 0 0 0 1 5 8 0 0 ... 0 0 0 0 0 0 0 0 0 358
1997 0 0 0 0 0 1 32 31 0 0 ... 0 0 0 0 0 0 0 0 0 1190
1998 1 0 0 0 0 1 20 7 0 0 ... 0 0 0 0 0 0 0 0 0 678
1999 1 0 2 0 0 0 26 20 0 0 ... 1 1 1 1 1 1 1 1 1 1113

2000 rows × 47840 columns

In [70]:
# Divide every count in a row by that row's DF_total (the DF_total column itself becomes 1.0).
normalized_df_no_label = normalized_df_no_label.apply(lambda row: row/row['DF_total'], axis=1)
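
The row-wise apply is slow on a 2000 x ~48,000 frame. A vectorized sketch of the same normalization (as a replacement for the apply above, not in addition to it):

# Divide each row by its own DF_total in one shot; axis=0 aligns the divisor with the row index.
normalized_df_no_label = normalized_df_no_label.div(normalized_df_no_label['DF_total'], axis=0)
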
In [71]:
normalized_df_label
Out[71]:
0       n
1       n
2       n
3       n
4       n
       ..
1995    p
1996    p
1997    p
1998    p
1999    p
Name: DF_label, Length: 2000, dtype: object
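
Before fitting a classifier it is worth confirming how the two classes are balanced:

# Count of 'n' vs 'p' labels across the 2000 reviews.
normalized_df_label.value_counts()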

STEP 7: Naive Bayes

In [31]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics

def get_NB(small_df, labels):
    # 70/30 train/test split with a fixed seed so the run is reproducible.
    x_train, x_test, y_train, y_test = train_test_split(
        small_df.values, labels, test_size=0.3, random_state=109)

    # Gaussian Naive Bayes on the normalized term frequencies.
    gnb = GaussianNB()
    gnb.fit(x_train, y_train)
    y_pred = gnb.predict(x_test)
    print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
In [40]:
# normalized_df_label = normalized_df['label']
# normalized_df_no_label = normalized_df.drop('label', axis=1)
In [72]:
get_NB(normalized_df_no_label, normalized_df_label)
Accuracy: 0.7033333333333334
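
GaussianNB assumes continuous, roughly Gaussian features; for word-frequency data MultinomialNB is the more conventional choice. A sketch reusing the same split parameters, offered as an alternative rather than the method used above (its accuracy is not reported here and will differ from the 0.70 result):

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

x_train, x_test, y_train, y_test = train_test_split(
    normalized_df_no_label.values, normalized_df_label, test_size=0.3, random_state=109)
mnb = MultinomialNB().fit(x_train, y_train)
print("Accuracy:", metrics.accuracy_score(y_test, mnb.predict(x_test)))
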
In [43]:
normalized_df_no_label
Out[43]:
plot : two teen couples go to a church party ... cage-world snoots obstructions obscuring tangerine timbre powaqqatsi keyboardist capitalized freq_df_total
label
n 0.001235 0.003704 0.002469 0.004938 0.001235 0.002469 0.019753 0.017284 0.001235 0.001235 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.0
n 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.007246 0.047101 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.0
n 0.003643 0.000000 0.001821 0.000000 0.000000 0.003643 0.010929 0.018215 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.0
n 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.025878 0.020333 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.0
n 0.001190 0.007143 0.001190 0.000000 0.000000 0.000000 0.029762 0.023810 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
p 0.000000 0.002345 0.000000 0.000000 0.000000 0.001172 0.018757 0.029308 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.0
p 0.000000 0.002793 0.000000 0.000000 0.000000 0.002793 0.013966 0.022346 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.0
p 0.000000 0.000000 0.000000 0.000000 0.000000 0.000840 0.026891 0.026050 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.0
p 0.001475 0.000000 0.000000 0.000000 0.000000 0.001475 0.029499 0.010324 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.0
p 0.000898 0.000000 0.001797 0.000000 0.000000 0.000000 0.023360 0.017969 0.000000 0.000000 ... 0.000898 0.000898 0.000898 0.000898 0.000898 0.000898 0.000898 0.000898 0.000898 1.0

2000 rows × 47839 columns
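
Accuracy alone hides how the errors split between the two classes. A sketch of a fuller per-class report on the same split (output not reproduced here):

from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

x_train, x_test, y_train, y_test = train_test_split(
    normalized_df_no_label.values, normalized_df_label, test_size=0.3, random_state=109)
gnb = GaussianNB().fit(x_train, y_train)
print(classification_report(y_test, gnb.predict(x_test)))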
