{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# HW3 -- DEALING WITH DIRTY DATA" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## STEP 1: Import the dirty data" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

| | text | reviewclass | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Unnamed: 9 | ... | Unnamed: 168 | Unnamed: 169 | Unnamed: 170 | Unnamed: 171 | Unnamed: 172 | Unnamed: 173 | Unnamed: 174 | Unnamed: 175 | Unnamed: 176 | Unnamed: 177 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 'plot : two teen couples go to a church party | drink and then drive . \nthey get into an acc... | but his girlfriend continues to see him in he... | and has nightmares . \nwhat\'s the deal ? \nw... | but presents it in a very bad package . \nwhi... | since i generally applaud films which attempt... | mess with your head and such ( lost highway &... | but there are good and bad ways of making all... | and these folks just didn\'t snag this one co... | but executed it terribly . \nso what are the ... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 'the happy bastard\'s quick movie review \ndam... | virus still feels very empty | like a movie going for all flash and no subst... | we don\'t know the origin of what took over t... | and | of course | we don\'t know why donald sutherland is stumb... | it\'s just \" hey | let\'s chase these people around with some ro... | even from the likes of curtis . \nyou\'re mor... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

2 rows × 178 columns
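
The 178 columns above appear because every comma inside a review is treated as a field delimiter, so one review is spread across many `Unnamed` columns. The following is a minimal, dependency-free sketch of one way to undo that split by rejoining the fields of each row. The sample lines and the helper name `merge_split_fields` are hypothetical, and the notebook's own cleanup code may differ.

```python
import csv
import io

# Hypothetical sample: two review lines whose text contains unquoted commas.
dirty = (
    "plot : two teen couples go to a church party , drink and then drive .\n"
    "virus still feels very empty , like a movie going for all flash\n"
)

def merge_split_fields(text):
    """Parse comma-delimited lines, then rejoin each row into one string."""
    # QUOTE_NONE makes the reader treat quote characters as ordinary text,
    # which matters for reviews that contain stray quotation marks.
    reader = csv.reader(io.StringIO(text), quoting=csv.QUOTE_NONE)
    rows = list(reader)
    merged = [",".join(fields) for fields in rows]
    return rows, merged

rows, merged = merge_split_fields(dirty)
print([len(r) for r in rows])  # number of fields each line was split into
print(merged[0])               # the first review, restored as one string
```

Rejoining with the same delimiter restores each original line, which is the role the single `all` column plays in the next output.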

| | all |
|---|---|
| 0 | 'plot : two teen couples go to a church party ... |
| 1 | 'the happy bastard\'s quick movie review \ndam... |
| 2 | 'it is movies like these that make a jaded mov... |
| 3 | ' \" quest for camelot \" is warner bros . \' ... |

| | all | label |
|---|---|---|
| 0 | 'plot : two teen couples go to a church party ... | n |
| 1 | 'the happy bastard\'s quick movie review \ndam... | n |
| 2 | 'it is movies like these that make a jaded mov... | n |
| 3 | ' \" quest for camelot \" is warner bros . \' ... | n |

| | plot | : | two | teen | couples | go | to | a | church | party | ... | snoots | obstructions | obscuring | tangerine | timbre | powaqqatsi | keyboardist | capitalized | DF_total | DF_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | 2 | 4 | 1 | 2 | 16 | 14 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 810 | n |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 13 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 276 | n |
| 2 | 2 | 0 | 1 | 0 | 0 | 2 | 6 | 10 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 549 | n |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 14 | 11 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 541 | n |
| 4 | 1 | 6 | 1 | 0 | 0 | 0 | 25 | 20 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 840 | n |

5 rows × 47841 columns
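
The column names above (including `:` as its own column) suggest the reviews were split on whitespace and each token counted per review. Here is a small sketch of that idea using only the standard library. The mini-corpus, the helper `count_tokens`, and the reading of `DF_total` as the per-review token total are assumptions for illustration, not the notebook's actual code.

```python
from collections import Counter

def count_tokens(review):
    """Split on whitespace and count each token; also return the total."""
    counts = Counter(review.split())
    return counts, sum(counts.values())

# Hypothetical mini-corpus; the labels mirror the n/p values in the table.
reviews = [
    ("plot : two teen couples go to a church party", "n"),
    ("a fun party , a fun plot", "p"),
]

vocab = {}  # token -> column position, in order of first appearance
rows = []
for text, label in reviews:
    counts, total = count_tokens(text)
    for tok in counts:
        vocab.setdefault(tok, len(vocab))
    rows.append((counts, total, label))

columns = list(vocab)
# One row per review: a count for every vocabulary token, then total and label.
# Counter returns 0 for tokens a review does not contain.
matrix = [[counts[tok] for tok in columns] + [total, label]
          for counts, total, label in rows]

print(columns + ["DF_total", "DF_label"])
for row in matrix:
    print(row)
```

Because every review gets a column for every token seen anywhere in the corpus, the matrix is wide and mostly zeros, which matches the shape of the output above.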

| | plot | : | two | teen | couples | go | to | a | church | party | ... | cage-world | snoots | obstructions | obscuring | tangerine | timbre | powaqqatsi | keyboardist | capitalized | DF_total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | 2 | 4 | 1 | 2 | 16 | 14 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 810 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 13 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 276 |
| 2 | 2 | 0 | 1 | 0 | 0 | 2 | 6 | 10 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 549 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 14 | 11 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 541 |
| 4 | 1 | 6 | 1 | 0 | 0 | 0 | 25 | 20 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 840 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1995 | 0 | 2 | 0 | 0 | 0 | 1 | 16 | 25 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 853 |
| 1996 | 0 | 1 | 0 | 0 | 0 | 1 | 5 | 8 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 358 |
| 1997 | 0 | 0 | 0 | 0 | 0 | 1 | 32 | 31 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1190 |
| 1998 | 1 | 0 | 0 | 0 | 0 | 1 | 20 | 7 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 678 |
| 1999 | 1 | 0 | 2 | 0 | 0 | 0 | 26 | 20 | 0 | 0 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1113 |

2000 rows × 47840 columns

| label | plot | : | two | teen | couples | go | to | a | church | party | ... | cage-world | snoots | obstructions | obscuring | tangerine | timbre | powaqqatsi | keyboardist | capitalized | freq_df_total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| n | 0.001235 | 0.003704 | 0.002469 | 0.004938 | 0.001235 | 0.002469 | 0.019753 | 0.017284 | 0.001235 | 0.001235 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 |
| n | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.007246 | 0.047101 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 |
| n | 0.003643 | 0.000000 | 0.001821 | 0.000000 | 0.000000 | 0.003643 | 0.010929 | 0.018215 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 |
| n | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.025878 | 0.020333 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 |
| n | 0.001190 | 0.007143 | 0.001190 | 0.000000 | 0.000000 | 0.000000 | 0.029762 | 0.023810 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| p | 0.000000 | 0.002345 | 0.000000 | 0.000000 | 0.000000 | 0.001172 | 0.018757 | 0.029308 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 |
| p | 0.000000 | 0.002793 | 0.000000 | 0.000000 | 0.000000 | 0.002793 | 0.013966 | 0.022346 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 |
| p | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000840 | 0.026891 | 0.026050 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 |
| p | 0.001475 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.001475 | 0.029499 | 0.010324 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 |
| p | 0.000898 | 0.000000 | 0.001797 | 0.000000 | 0.000000 | 0.000000 | 0.023360 | 0.017969 | 0.000000 | 0.000000 | ... | 0.000898 | 0.000898 | 0.000898 | 0.000898 | 0.000898 | 0.000898 | 0.000898 | 0.000898 | 0.000898 | 1.0 |

2000 rows × 47839 columns
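
The frequencies above are consistent with dividing each raw count by that review's `DF_total`, which is why `freq_df_total` is 1.0 in every row. The sketch below checks this against the first ten counts of row 0 (total 810) from the count table. The helper name `normalize` is an assumption, and the notebook presumably performed the division on the whole DataFrame at once.

```python
def normalize(counts, total):
    """Divide each raw count by the row total to get a relative frequency."""
    if total == 0:
        # An empty review has no tokens; avoid dividing by zero.
        return [0.0 for _ in counts]
    return [c / total for c in counts]

# First ten counts and DF_total of row 0, copied from the count table
row0 = [1, 3, 2, 4, 1, 2, 16, 14, 1, 1]
freqs = normalize(row0, 810)

print([round(f, 6) for f in freqs])
```

Normalizing by the row total makes long and short reviews comparable, since a word's value then reflects its share of the review rather than the review's length.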