{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# HW3 -- DEALING WITH DIRTY DATA" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## STEP 1: Import the dirty data" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
|   | text | reviewclass | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Unnamed: 9 | ... | Unnamed: 168 | Unnamed: 169 | Unnamed: 170 | Unnamed: 171 | Unnamed: 172 | Unnamed: 173 | Unnamed: 174 | Unnamed: 175 | Unnamed: 176 | Unnamed: 177 |
| 0 | 'plot : two teen couples go to a church party | drink and then drive . \\nthey get into an acc... | but his girlfriend continues to see him in he... | and has nightmares . \\nwhat\\'s the deal ? \\nw... | but presents it in a very bad package . \\nwhi... | since i generally applaud films which attempt... | mess with your head and such ( lost highway &... | but there are good and bad ways of making all... | and these folks just didn\\'t snag this one co... | but executed it terribly . \\nso what are the ... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 'the happy bastard\\'s quick movie review \\ndam... | virus still feels very empty | like a movie going for all flash and no subst... | we don\\'t know the origin of what took over t... | and | of course | we don\\'t know why donald sutherland is stumb... | it\\'s just \\\" hey | let\\'s chase these people around with some ro... | even from the likes of curtis . \\nyou\\'re mor... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
\n", "

2 rows × 178 columns

\n", "
" ], "text/plain": [ " text \\\n", "0 'plot : two teen couples go to a church party \n", "1 'the happy bastard\\'s quick movie review \\ndam... \n", "\n", " reviewclass \\\n", "0 drink and then drive . \\nthey get into an acc... \n", "1 virus still feels very empty \n", "\n", " Unnamed: 2 \\\n", "0 but his girlfriend continues to see him in he... \n", "1 like a movie going for all flash and no subst... \n", "\n", " Unnamed: 3 \\\n", "0 and has nightmares . \\nwhat\\'s the deal ? \\nw... \n", "1 we don\\'t know the origin of what took over t... \n", "\n", " Unnamed: 4 \\\n", "0 but presents it in a very bad package . \\nwhi... \n", "1 and \n", "\n", " Unnamed: 5 \\\n", "0 since i generally applaud films which attempt... \n", "1 of course \n", "\n", " Unnamed: 6 \\\n", "0 mess with your head and such ( lost highway &... \n", "1 we don\\'t know why donald sutherland is stumb... \n", "\n", " Unnamed: 7 \\\n", "0 but there are good and bad ways of making all... \n", "1 it\\'s just \\\" hey \n", "\n", " Unnamed: 8 \\\n", "0 and these folks just didn\\'t snag this one co... \n", "1 let\\'s chase these people around with some ro... \n", "\n", " Unnamed: 9 ... Unnamed: 168 \\\n", "0 but executed it terribly . \\nso what are the ... ... NaN \n", "1 even from the likes of curtis . \\nyou\\'re mor... ... NaN \n", "\n", " Unnamed: 169 Unnamed: 170 Unnamed: 171 Unnamed: 172 Unnamed: 173 \\\n", "0 NaN NaN NaN NaN NaN \n", "1 NaN NaN NaN NaN NaN \n", "\n", " Unnamed: 174 Unnamed: 175 Unnamed: 176 Unnamed: 177 \n", "0 NaN NaN NaN NaN \n", "1 NaN NaN NaN NaN \n", "\n", "[2 rows x 178 columns]" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "dirtyFile = pd.read_csv('moviereviewRAW.csv')\n", "dirtyFile[:2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## STEP 2: Join all the rows together" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
|   | all |
| 0 | 'plot : two teen couples go to a church party ... |
| 1 | 'the happy bastard\\'s quick movie review \\ndam... |
| 2 | 'it is movies like these that make a jaded mov... |
| 3 | ' \\\" quest for camelot \\\" is warner bros . \\' ... |
\n", "
" ], "text/plain": [ " all\n", "0 'plot : two teen couples go to a church party ...\n", "1 'the happy bastard\\'s quick movie review \\ndam...\n", "2 'it is movies like these that make a jaded mov...\n", "3 ' \\\" quest for camelot \\\" is warner bros . \\' ..." ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame()\n", "# Re-join every column of a row into one comma-separated string, skipping the NaN padding\n", "df['all'] = dirtyFile.apply(\n", " lambda x: ','.join(x.dropna().astype(str)),\n", " axis=1)\n", "df[:4]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## STEP 3: Get the label" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
|   | all | label |
| 0 | 'plot : two teen couples go to a church party ... | n |
| 1 | 'the happy bastard\\'s quick movie review \\ndam... | n |
| 2 | 'it is movies like these that make a jaded mov... | n |
| 3 | ' \\\" quest for camelot \\\" is warner bros . \\' ... | n |
\n", "
" ], "text/plain": [ " all label\n", "0 'plot : two teen couples go to a church party ... n\n", "1 'the happy bastard\\'s quick movie review \\ndam... n\n", "2 'it is movies like these that make a jaded mov... n\n", "3 ' \\\" quest for camelot \\\" is warner bros . \\' ... n" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['label'] = df.apply(lambda x: x['all'][-3], axis=1)\n", "df[:4]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## STEP 4: Clean the data" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'plot : two teen couples go to a church party , drink and then drive . they get into an accident . one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . whats the deal ? watch the movie and sorta find out . . . critique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . which is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didnt snag this one correctly . they seem to have taken this pretty neat concept , but executed it terribly . so what are the problems with the movie ? well , its main problem is that its simply too jumbled . it starts off normal but then downshifts into this fantasy world in which you , as an audience member , have no idea whats going on . there are dreams , there are characters coming back from the dead , there are others who look like the dead , there are strange apparitions , there are disappearances , there are a looooot of chase scenes , there are tons of weird things that happen , and most of it is simply not explained . 
now i personally dont mind trying to unravel a film every now and then , but when all it does is give me the same clue over and over again , i get kind of fed up after a while , which is this films biggest problem . its obviously got this big secret to hide , but it seems to want to hide it completely until its final five minutes . and do they make things entertaining , thrilling or even engaging , in the meantime ? not really . the sad part is that the arrow and i both dig on flicks like this , so we actually figured most of it out by the half-way point , so all of the strangeness after that did start to make a little bit of sense , but it still didnt the make the film all that more entertaining . i guess the bottom line with movies like this is that you should always make sure that the audience is into it even before they are given the secret password to enter your world of understanding . i mean , showing melissa sagemiller running away from visions for about 20 minutes throughout the movie is just plain lazy ! ! okay , we get it . . . there are people chasing her and we dont know who they are . do we really need to see it over and over again ? how about giving us different scenes offering further insight into all of the strangeness going down in the movie ? apparently , the studio took this film away from its director and chopped it up themselves , and it shows . there mightve been a pretty decent teen mind-fuck movie in here somewhere , but i guess the suits decided that turning it into a music video with little edge , would make more sense . the actors are pretty good for the most part , although wes bentley just seemed to be playing the exact same character that he did in american beauty , only in a new neighborhood . but my biggest kudos go out to sagemiller , who holds her own throughout the entire film , and actually has you feeling her characters unraveling . 
overall , the film doesnt stick because it doesnt entertain , its confusing , it rarely excites and it feels pretty redundant for most of its runtime , despite a pretty cool ending and explanation to all of the craziness that came before it . oh , and by the way , this is not a horror or teen slasher flick . . . its just packaged to look that way because someone is apparently assuming that the genre is still hot with the kids . it also wrapped production two years ago and has been sitting on the shelves ever since . whatever . . . skip it ! wheres joblo coming from ? a nightmare of elm street 3 ( 7/10 ) - blair witch 2 ( 7/10 ) - the crow ( 9/10 ) - the crow : salvation ( 4/10 ) - lost highway ( 10/10 ) - memento ( 10/10 ) - the others ( 9/10 ) - stir of echoes ( 8/10 ) ,neg'" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def clean_rogue_characters(string):\n", "    # Remove literal \\n sequences first, then any stray backslashes and quotes\n", "    exclude = ['\\\\',\"\\'\",'\"']\n", "    string = ''.join(string.split('\\\\n'))\n", "    string = ''.join(ch for ch in string if ch not in exclude)\n", "    return string\n", "\n", "df['all'] = df['all'].apply(clean_rogue_characters)\n", "df['all'][0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## STEP 5: Export to Corpus for analysis" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [], "source": [ "def print_to_file(rating, review, num):\n", "    # One file per review, named <label letter>_dirty_<row number>.txt\n", "    output_filename = str(rating) + '_dirty_' + str(num) + '.txt'\n", "    with open(output_filename, 'w') as outfile:\n", "        outfile.write(review)\n", "\n", "# row[-3] is the first letter of the trailing 'neg'/'pos' label; row[:-3] drops that label\n", "for num, row in enumerate(df['all']):\n", "    print_to_file(row[-3], row[:-3], num)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## STEP 6: Start the party" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 6a. 
Tokenize & Create Bag of Words" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [], "source": [ "from nltk.tokenize import casual_tokenize\n", "from collections import Counter\n", "# One Counter of token frequencies per review\n", "# NOTE: 'all' still ends with the ',neg'/',pos' label, so the label text is tokenized along with the review\n", "df['bow'] = df['all'].apply(lambda text: Counter(casual_tokenize(text)))" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [], "source": [ "# Document-term matrix: one row per review, one column per token\n", "freq_df = pd.DataFrame(df['bow'].tolist())\n", "freq_df = freq_df.fillna(0).astype(int)\n", "freq_df['DF_total'] = freq_df.sum(axis=1)\n", "freq_df['DF_label'] = df['label']" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
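|   | plot | : | two | teen | couples | go | to | a | church | party | ... | snoots | obstructions | obscuring | tangerine | timbre | powaqqatsi | keyboardist | capitalized | DF_total | DF_label |
| 0 | 1 | 3 | 2 | 4 | 1 | 2 | 16 | 14 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 810 | n |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 13 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 276 | n |
| 2 | 2 | 0 | 1 | 0 | 0 | 2 | 6 | 10 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 549 | n |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 14 | 11 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 541 | n |
| 4 | 1 | 6 | 1 | 0 | 0 | 0 | 25 | 20 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 840 | n |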
\n", "

5 rows × 47841 columns

\n", "
" ], "text/plain": [ " plot : two teen couples go to a church party ... snoots \\\n", "0 1 3 2 4 1 2 16 14 1 1 ... 0 \n", "1 0 0 0 0 0 0 2 13 0 0 ... 0 \n", "2 2 0 1 0 0 2 6 10 0 0 ... 0 \n", "3 0 0 0 0 0 0 14 11 0 0 ... 0 \n", "4 1 6 1 0 0 0 25 20 0 0 ... 0 \n", "\n", " obstructions obscuring tangerine timbre powaqqatsi keyboardist \\\n", "0 0 0 0 0 0 0 \n", "1 0 0 0 0 0 0 \n", "2 0 0 0 0 0 0 \n", "3 0 0 0 0 0 0 \n", "4 0 0 0 0 0 0 \n", "\n", " capitalized DF_total DF_label \n", "0 0 810 n \n", "1 0 276 n \n", "2 0 549 n \n", "3 0 541 n \n", "4 0 840 n \n", "\n", "[5 rows x 47841 columns]" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "freq_df[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### NORMALIZING\n", "Do I want to normalize on document?\n", "On corpus?\n", "On positive corpus? " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
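|   | plot | : | two | teen | couples | go | to | a | church | party | ... | cage-world | snoots | obstructions | obscuring | tangerine | timbre | powaqqatsi | keyboardist | capitalized | DF_total |
| 0 | 1 | 3 | 2 | 4 | 1 | 2 | 16 | 14 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 810 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 13 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 276 |
| 2 | 2 | 0 | 1 | 0 | 0 | 2 | 6 | 10 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 549 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 14 | 11 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 541 |
| 4 | 1 | 6 | 1 | 0 | 0 | 0 | 25 | 20 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 840 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1995 | 0 | 2 | 0 | 0 | 0 | 1 | 16 | 25 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 853 |
| 1996 | 0 | 1 | 0 | 0 | 0 | 1 | 5 | 8 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 358 |
| 1997 | 0 | 0 | 0 | 0 | 0 | 1 | 32 | 31 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1190 |
| 1998 | 1 | 0 | 0 | 0 | 0 | 1 | 20 | 7 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 678 |
| 1999 | 1 | 0 | 2 | 0 | 0 | 0 | 26 | 20 | 0 | 0 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1113 |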
\n", "

2000 rows × 47840 columns

\n", "
" ], "text/plain": [ " plot : two teen couples go to a church party ... cage-world \\\n", "0 1 3 2 4 1 2 16 14 1 1 ... 0 \n", "1 0 0 0 0 0 0 2 13 0 0 ... 0 \n", "2 2 0 1 0 0 2 6 10 0 0 ... 0 \n", "3 0 0 0 0 0 0 14 11 0 0 ... 0 \n", "4 1 6 1 0 0 0 25 20 0 0 ... 0 \n", "... ... .. ... ... ... .. .. .. ... ... ... ... \n", "1995 0 2 0 0 0 1 16 25 0 0 ... 0 \n", "1996 0 1 0 0 0 1 5 8 0 0 ... 0 \n", "1997 0 0 0 0 0 1 32 31 0 0 ... 0 \n", "1998 1 0 0 0 0 1 20 7 0 0 ... 0 \n", "1999 1 0 2 0 0 0 26 20 0 0 ... 1 \n", "\n", " snoots obstructions obscuring tangerine timbre powaqqatsi \\\n", "0 0 0 0 0 0 0 \n", "1 0 0 0 0 0 0 \n", "2 0 0 0 0 0 0 \n", "3 0 0 0 0 0 0 \n", "4 0 0 0 0 0 0 \n", "... ... ... ... ... ... ... \n", "1995 0 0 0 0 0 0 \n", "1996 0 0 0 0 0 0 \n", "1997 0 0 0 0 0 0 \n", "1998 0 0 0 0 0 0 \n", "1999 1 1 1 1 1 1 \n", "\n", " keyboardist capitalized DF_total \n", "0 0 0 810 \n", "1 0 0 276 \n", "2 0 0 549 \n", "3 0 0 541 \n", "4 0 0 840 \n", "... ... ... ... \n", "1995 0 0 853 \n", "1996 0 0 358 \n", "1997 0 0 1190 \n", "1998 0 0 678 \n", "1999 1 1 1113 \n", "\n", "[2000 rows x 47840 columns]" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalized_df = freq_df.copy()\n", "# normalized_df = normalized_df[:10]\n", "# normalized_df.reset_index()\n", "normalized_df_label = normalized_df['DF_label']\n", "normalized_df_no_label = normalized_df.drop('DF_label', axis = 1)\n", "normalized_df_no_label" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [], "source": [ "normalized_df_no_label = normalized_df_no_label.apply(lambda row: row/row['DF_total'] , axis=1)" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 n\n", "1 n\n", "2 n\n", "3 n\n", "4 n\n", " ..\n", "1995 p\n", "1996 p\n", "1997 p\n", "1998 p\n", "1999 p\n", "Name: DF_label, Length: 2000, dtype: object" ] }, "execution_count": 71, "metadata": {}, "output_type": 
"execute_result" } ], "source": [ "normalized_df_label" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## STEP 7: Naive Bayes" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "from sklearn.naive_bayes import GaussianNB\n", "\n", "def get_NB(small_df, labels):\n", " x_train, x_test, y_train, y_test = train_test_split(small_df.values, labels, test_size=0.3, random_state = 109)\n", "\n", " gnb = GaussianNB()\n", " gnb.fit(x_train, y_train)\n", " y_pred = gnb.predict(x_test)\n", " from sklearn import metrics\n", " print(\"Accuracy:\", metrics.accuracy_score(y_test, y_pred))" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "# normalized_df_label = normalized_df['label']\n", "# normalized_df_no_label = normalized_df.drop('label', axis=1)" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy: 0.7033333333333334\n" ] } ], "source": [ "get_NB(normalized_df_no_label, normalized_df_label)" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
| label | plot | : | two | teen | couples | go | to | a | church | party | ... | cage-world | snoots | obstructions | obscuring | tangerine | timbre | powaqqatsi | keyboardist | capitalized | freq_df_total |
| n | 0.001235 | 0.003704 | 0.002469 | 0.004938 | 0.001235 | 0.002469 | 0.019753 | 0.017284 | 0.001235 | 0.001235 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 |
| n | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.007246 | 0.047101 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 |
| n | 0.003643 | 0.000000 | 0.001821 | 0.000000 | 0.000000 | 0.003643 | 0.010929 | 0.018215 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 |
| n | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.025878 | 0.020333 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 |
| n | 0.001190 | 0.007143 | 0.001190 | 0.000000 | 0.000000 | 0.000000 | 0.029762 | 0.023810 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| p | 0.000000 | 0.002345 | 0.000000 | 0.000000 | 0.000000 | 0.001172 | 0.018757 | 0.029308 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 |
| p | 0.000000 | 0.002793 | 0.000000 | 0.000000 | 0.000000 | 0.002793 | 0.013966 | 0.022346 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 |
| p | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000840 | 0.026891 | 0.026050 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 |
| p | 0.001475 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.001475 | 0.029499 | 0.010324 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 |
| p | 0.000898 | 0.000000 | 0.001797 | 0.000000 | 0.000000 | 0.000000 | 0.023360 | 0.017969 | 0.000000 | 0.000000 | ... | 0.000898 | 0.000898 | 0.000898 | 0.000898 | 0.000898 | 0.000898 | 0.000898 | 0.000898 | 0.000898 | 1.0 |
\n", "

2000 rows × 47839 columns

\n", "
" ], "text/plain": [ " plot : two teen couples go to \\\n", "label \n", "n 0.001235 0.003704 0.002469 0.004938 0.001235 0.002469 0.019753 \n", "n 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.007246 \n", "n 0.003643 0.000000 0.001821 0.000000 0.000000 0.003643 0.010929 \n", "n 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.025878 \n", "n 0.001190 0.007143 0.001190 0.000000 0.000000 0.000000 0.029762 \n", "... ... ... ... ... ... ... ... \n", "p 0.000000 0.002345 0.000000 0.000000 0.000000 0.001172 0.018757 \n", "p 0.000000 0.002793 0.000000 0.000000 0.000000 0.002793 0.013966 \n", "p 0.000000 0.000000 0.000000 0.000000 0.000000 0.000840 0.026891 \n", "p 0.001475 0.000000 0.000000 0.000000 0.000000 0.001475 0.029499 \n", "p 0.000898 0.000000 0.001797 0.000000 0.000000 0.000000 0.023360 \n", "\n", " a church party ... cage-world snoots obstructions \\\n", "label ... \n", "n 0.017284 0.001235 0.001235 ... 0.000000 0.000000 0.000000 \n", "n 0.047101 0.000000 0.000000 ... 0.000000 0.000000 0.000000 \n", "n 0.018215 0.000000 0.000000 ... 0.000000 0.000000 0.000000 \n", "n 0.020333 0.000000 0.000000 ... 0.000000 0.000000 0.000000 \n", "n 0.023810 0.000000 0.000000 ... 0.000000 0.000000 0.000000 \n", "... ... ... ... ... ... ... ... \n", "p 0.029308 0.000000 0.000000 ... 0.000000 0.000000 0.000000 \n", "p 0.022346 0.000000 0.000000 ... 0.000000 0.000000 0.000000 \n", "p 0.026050 0.000000 0.000000 ... 0.000000 0.000000 0.000000 \n", "p 0.010324 0.000000 0.000000 ... 0.000000 0.000000 0.000000 \n", "p 0.017969 0.000000 0.000000 ... 
0.000898 0.000898 0.000898 \n", "\n", " obscuring tangerine timbre powaqqatsi keyboardist capitalized \\\n", "label \n", "n 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 \n", "n 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 \n", "n 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 \n", "n 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 \n", "n 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 \n", "... ... ... ... ... ... ... \n", "p 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 \n", "p 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 \n", "p 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 \n", "p 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 \n", "p 0.000898 0.000898 0.000898 0.000898 0.000898 0.000898 \n", "\n", " freq_df_total \n", "label \n", "n 1.0 \n", "n 1.0 \n", "n 1.0 \n", "n 1.0 \n", "n 1.0 \n", "... ... \n", "p 1.0 \n", "p 1.0 \n", "p 1.0 \n", "p 1.0 \n", "p 1.0 \n", "\n", "[2000 rows x 47839 columns]" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalized_df_no_label" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }