HW7: Comparing MNB & SVM with Kaggle Sentiment Data

OVERVIEW

VECTORIZERS USED:

CountVectorizer
TfidfVectorizer

MODELS USED:

Multinomial Naive Bayes (MNB)
Support Vector Machines (SVM)

VECTORIZATION PARAMS:

Binary
Stopwords
Unigrams, Bigrams
Min & Max df
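
To make the parameters listed above concrete, here is a minimal sketch of how each one changes what CountVectorizer produces. The three-document corpus and the helper name vocab_and_matrix are made up for illustration and are not part of the Kaggle data or the assignment code.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from the Kaggle data.
toy_docs = [
    "good movie good story",
    "bad movie",
    "the story is good",
]

def vocab_and_matrix(**params):
    """Fit a CountVectorizer with the given params; return (sorted vocabulary, dense matrix)."""
    toy_vec = CountVectorizer(**params)
    matrix = toy_vec.fit_transform(toy_docs).toarray()
    return sorted(toy_vec.vocabulary_), matrix

# Default settings: raw counts, so "good" is counted twice in the first document.
counts_vocab, counts = vocab_and_matrix()
# binary=True: records presence or absence only.
_, binary = vocab_and_matrix(binary=True)
# stop_words='english': drops common function words such as "the" and "is".
stop_vocab, _ = vocab_and_matrix(stop_words='english')
# ngram_range=(1, 2): keeps unigrams and adds bigrams such as "good movie".
bigram_vocab, _ = vocab_and_matrix(ngram_range=(1, 2))
# min_df=2: keeps only terms that occur in at least two documents.
min_df_vocab, _ = vocab_and_matrix(min_df=2)

good = counts_vocab.index("good")
print("count of 'good' in doc 0:", counts[0][good])
print("binary value of 'good' in doc 0:", binary[0][good])
print("without stop words:", stop_vocab)
print("with min_df=2:", min_df_vocab)
print("bigrams present:", [t for t in bigram_vocab if " " in t])
```

max_df works the same way in the other direction: it drops terms that appear in more than the given number (or fraction) of documents.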

TODO:

Stemming?
VADER + TextBlob
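
For the stemming item, one possible approach is to wrap the vectorizer's own analyzer so that every token is stemmed after stop-word removal. This is only a sketch under assumptions: it uses NLTK's PorterStemmer, and the names base_analyzer, stemmed_analyzer, and stemmed_cv are hypothetical and do not appear elsewhere in this notebook.

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = PorterStemmer()

# Reuse the standard preprocessing (lowercasing, tokenizing, stop-word removal).
base_analyzer = CountVectorizer(stop_words='english').build_analyzer()

def stemmed_analyzer(doc):
    """Hypothetical analyzer: run the standard analyzer, then stem each token."""
    return [stemmer.stem(token) for token in base_analyzer(doc)]

# Passing a callable as analyzer replaces the built-in tokenization pipeline.
stemmed_cv = CountVectorizer(analyzer=stemmed_analyzer)

# Toy phrases, made up for illustration.
stemmed_cv.fit(["The movies were funny", "A funny movie"])
print(sorted(stemmed_cv.vocabulary_))
```

With this setup "movie" and "movies" collapse to a single feature, which shrinks the vocabulary at the cost of less readable feature names.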

FUNCTION & PACKAGE PARTY

## =======================================================
## TOKENIZING
## =======================================================
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

## =======================================================
## VECTORIZING
## =======================================================
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

## ----- VECTORIZERS
unigram_bool_cv = CountVectorizer(encoding='latin-1', binary=True, min_df=5, stop_words='english', token_pattern=r'(?u)\b[a-zA-Z]{2,}\b' )
unigram_cv = CountVectorizer(encoding='latin-1', binary=False, min_df=5, stop_words='english')
bigram_cv = CountVectorizer(encoding='latin-1', ngram_range=(1,2), min_df=5, stop_words='english')
unigram_tv = TfidfVectorizer(encoding='latin-1', use_idf=True, min_df=5, stop_words='english')
bigram_tv = TfidfVectorizer(encoding='latin-1', use_idf=True, ngram_range=(1,2), min_df=5, stop_words='english')

## =======================================================
## MODELING
## =======================================================
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

## ----- CLASSIFIERS
mnb = MultinomialNB()
svm = LinearSVC(C=1)

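def get_test_train_vec(X,y,vectorizer):
    # Hold out 40% of the phrases for testing; random_state fixes the split.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
    # Fit the vocabulary on the training split only, then reuse it for the test split.
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)
    return X_train_vec, X_test_vec, y_train, y_test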

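def run_classifier(X_train_vec, X_test_vec, y_train, y_test, labels, target_names, classifier):
    # labels and target_names are accepted but not used yet; only accuracy is reported.
    clf = classifier
    clf.fit(X_train_vec,y_train)
    print(clf.score(X_test_vec,y_test))  # mean accuracy on the held-out split
    return clf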
    
def get_model(X, y, labels, target_names, classifier, vec):
    X_train_vec, X_test_vec, y_train, y_test = get_test_train_vec(X,y,vec)
    model = run_classifier(X_train_vec, X_test_vec, y_train, y_test, labels, target_names, classifier)
    return model
    
## =======================================================
## VISUALIZING
## =======================================================
from tabulate import tabulate
import pandas as pd

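def return_features(vec, model):
    # MultinomialNB stores per-class log probabilities in feature_log_prob_
    # (its coef_ alias was removed in scikit-learn 1.1); LinearSVC uses coef_.
    weights = model.feature_log_prob_ if hasattr(model, 'feature_log_prob_') else model.coef_
    feature_names = vec.get_feature_names_out()  # get_feature_names() before scikit-learn 1.0
    for i,feature_probability in enumerate(weights):
        print('============ Sentiment Score: ', i)
        # Ascending sort: the columns headed "Most Likely" hold the 10 lowest
        # scores and the columns headed "Least Likely" hold the 10 highest.
        ranked = sorted(zip(feature_probability, feature_names))
        df1 = pd.DataFrame(ranked[:10])
        df2 = pd.DataFrame(ranked[-10:])
        df3 = pd.concat([df1, df2], axis=1)
        print(tabulate(df3, tablefmt="fancy_grid", headers=["Most","Likely","Least","Likely"], floatfmt=".2f"))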
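
The cell above imports classification_report and confusion_matrix, but run_classifier only prints accuracy. A minimal sketch of how the two could be used for a per-class breakdown follows. The helper name report_classifier and the toy two-class data are assumptions made for illustration; with the real data, labels and target_names would be the five sentiment classes.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.naive_bayes import MultinomialNB

def report_classifier(clf, X_test_vec, y_test, labels, target_names):
    """Hypothetical helper: print per-class metrics and return the confusion matrix."""
    y_pred = clf.predict(X_test_vec)
    print(classification_report(y_test, y_pred, labels=labels,
                                target_names=target_names, zero_division=0))
    # Rows are true classes, columns are predicted classes, both in `labels` order.
    return confusion_matrix(y_test, y_pred, labels=labels)

# Toy two-class data, made up for illustration (not the Kaggle phrases).
toy_train_docs = ["good fun film", "great film", "bad dull film", "awful film"]
toy_train_y = [1, 1, 0, 0]
toy_test_docs = ["good great film", "bad awful film"]
toy_test_y = [1, 0]

toy_vec = CountVectorizer()
toy_clf = MultinomialNB().fit(toy_vec.fit_transform(toy_train_docs), toy_train_y)
toy_cm = report_classifier(toy_clf, toy_vec.transform(toy_test_docs), toy_test_y,
                           [0, 1], ['neg', 'pos'])
print(toy_cm)
```

On an imbalanced five-class problem, per-class precision and recall can show whether the overall accuracy is driven mostly by the majority class.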

DATA GOES HERE:

# import pandas as pd
train=pd.read_csv("kaggle-sentiment/train.tsv", delimiter='\t')
y=train['Sentiment'].values
X=train['Phrase'].values

TASK 1

vec = unigram_bool_cv
classifier = mnb

model = get_model(X,y,[0,1,2,3,4],['0','1','2','3','4'], classifier, vec)
return_features(vec, model)

0.6069780853517878
============ Sentiment Score:  0
╒════╤════════╤═════════════╤═════════╤════════════╕
│    │   Most │ Likely      │   Least │ Likely     │
╞════╪════════╪═════════════╪═════════╪════════════╡
│  0 │ -10.47 │ aaliyah     │   -5.94 │ time       │
├────┼────────┼─────────────┼─────────┼────────────┤
│  1 │ -10.47 │ abagnale    │   -5.93 │ minutes    │
├────┼────────┼─────────────┼─────────┼────────────┤
│  2 │ -10.47 │ abandoned   │   -5.92 │ characters │
├────┼────────┼─────────────┼─────────┼────────────┤
│  3 │ -10.47 │ abbreviated │   -5.92 │ story      │
├────┼────────┼─────────────┼─────────┼────────────┤
│  4 │ -10.47 │ abel        │   -5.90 │ comedy     │
├────┼────────┼─────────────┼─────────┼────────────┤
│  5 │ -10.47 │ abhors      │   -5.69 │ just       │
├────┼────────┼─────────────┼─────────┼────────────┤
│  6 │ -10.47 │ abiding     │   -5.19 │ like       │
├────┼────────┼─────────────┼─────────┼────────────┤
│  7 │ -10.47 │ ably        │   -5.06 │ bad        │
├────┼────────┼─────────────┼─────────┼────────────┤
│  8 │ -10.47 │ aborted     │   -4.84 │ film       │
├────┼────────┼─────────────┼─────────┼────────────┤
│  9 │ -10.47 │ abrahams    │   -4.31 │ movie      │
╘════╧════════╧═════════════╧═════════╧════════════╛
============ Sentiment Score:  1
╒════╤════════╤═══════════╤═════════╤════════════╕
│    │   Most │ Likely    │   Least │ Likely     │
╞════╪════════╪═══════════╪═════════╪════════════╡
│  0 │ -11.32 │ abagnale  │   -5.73 │ characters │
├────┼────────┼───────────┼─────────┼────────────┤
│  1 │ -11.32 │ abbott    │   -5.73 │ bad        │
├────┼────────┼───────────┼─────────┼────────────┤
│  2 │ -11.32 │ abdul     │   -5.65 │ rrb        │
├────┼────────┼───────────┼─────────┼────────────┤
│  3 │ -11.32 │ abel      │   -5.63 │ little     │
├────┼────────┼───────────┼─────────┼────────────┤
│  4 │ -11.32 │ abilities │   -5.48 │ story      │
├────┼────────┼───────────┼─────────┼────────────┤
│  5 │ -11.32 │ ably      │   -5.44 │ just       │
├────┼────────┼───────────┼─────────┼────────────┤
│  6 │ -11.32 │ abrahams  │   -5.42 │ does       │
├────┼────────┼───────────┼─────────┼────────────┤
│  7 │ -11.32 │ abroad    │   -5.05 │ like       │
├────┼────────┼───────────┼─────────┼────────────┤
│  8 │ -11.32 │ access    │   -4.68 │ film       │
├────┼────────┼───────────┼─────────┼────────────┤
│  9 │ -11.32 │ acclaim   │   -4.57 │ movie      │
╘════╧════════╧═══════════╧═════════╧════════════╛
============ Sentiment Score:  2
╒════╤════════╤═════════════╤═════════╤════════════╕
│    │   Most │ Likely      │   Least │ Likely     │
╞════╪════════╪═════════════╪═════════╪════════════╡
│  0 │ -11.83 │ abroad      │   -5.95 │ movies     │
├────┼────────┼─────────────┼─────────┼────────────┤
│  1 │ -11.83 │ acclaim     │   -5.90 │ characters │
├────┼────────┼─────────────┼─────────┼────────────┤
│  2 │ -11.83 │ acumen      │   -5.78 │ time       │
├────┼────────┼─────────────┼─────────┼────────────┤
│  3 │ -11.83 │ adding      │   -5.78 │ life       │
├────┼────────┼─────────────┼─────────┼────────────┤
│  4 │ -11.83 │ admirers    │   -5.59 │ lrb        │
├────┼────────┼─────────────┼─────────┼────────────┤
│  5 │ -11.83 │ affirms     │   -5.49 │ story      │
├────┼────────┼─────────────┼─────────┼────────────┤
│  6 │ -11.83 │ aggravating │   -5.33 │ rrb        │
├────┼────────┼─────────────┼─────────┼────────────┤
│  7 │ -11.83 │ aimlessly   │   -5.29 │ like       │
├────┼────────┼─────────────┼─────────┼────────────┤
│  8 │ -11.83 │ amaze       │   -4.74 │ movie      │
├────┼────────┼─────────────┼─────────┼────────────┤
│  9 │ -11.83 │ ambiguities │   -4.68 │ film       │
╘════╧════════╧═════════════╧═════════╧════════════╛
============ Sentiment Score:  3
╒════╤════════╤═════════════╤═════════╤══════════╕
│    │   Most │ Likely      │   Least │ Likely   │
╞════╪════════╪═════════════╪═════════╪══════════╡
│  0 │ -11.47 │ aaliyah     │   -5.77 │ lrb      │
├────┼────────┼─────────────┼─────────┼──────────┤
│  1 │ -11.47 │ abbreviated │   -5.75 │ love     │
├────┼────────┼─────────────┼─────────┼──────────┤
│  2 │ -11.47 │ abc         │   -5.68 │ rrb      │
├────┼────────┼─────────────┼─────────┼──────────┤
│  3 │ -11.47 │ abhorrent   │   -5.67 │ life     │
├────┼────────┼─────────────┼─────────┼──────────┤
│  4 │ -11.47 │ abhors      │   -5.57 │ like     │
├────┼────────┼─────────────┼─────────┼──────────┤
│  5 │ -11.47 │ abomination │   -5.50 │ story    │
├────┼────────┼─────────────┼─────────┼──────────┤
│  6 │ -11.47 │ aborted     │   -5.49 │ funny    │
├────┼────────┼─────────────┼─────────┼──────────┤
│  7 │ -11.47 │ abrupt      │   -5.10 │ good     │
├────┼────────┼─────────────┼─────────┼──────────┤
│  8 │ -11.47 │ absent      │   -4.80 │ movie    │
├────┼────────┼─────────────┼─────────┼──────────┤
│  9 │ -11.47 │ absurdities │   -4.48 │ film     │
╘════╧════════╧═════════════╧═════════╧══════════╛
============ Sentiment Score:  4
╒════╤════════╤═════════════╤═════════╤══════════════╕
│    │   Most │ Likely      │   Least │ Likely       │
╞════╪════════╪═════════════╪═════════╪══════════════╡
│  0 │ -10.62 │ aaliyah     │   -5.81 │ performance  │
├────┼────────┼─────────────┼─────────┼──────────────┤
│  1 │ -10.62 │ abagnale    │   -5.77 │ comedy       │
├────┼────────┼─────────────┼─────────┼──────────────┤
│  2 │ -10.62 │ abandoned   │   -5.72 │ great        │
├────┼────────┼─────────────┼─────────┼──────────────┤
│  3 │ -10.62 │ abbott      │   -5.69 │ story        │
├────┼────────┼─────────────┼─────────┼──────────────┤
│  4 │ -10.62 │ abbreviated │   -5.64 │ performances │
├────┼────────┼─────────────┼─────────┼──────────────┤
│  5 │ -10.62 │ abc         │   -5.46 │ good         │
├────┼────────┼─────────────┼─────────┼──────────────┤
│  6 │ -10.62 │ abdul       │   -5.23 │ funny        │
├────┼────────┼─────────────┼─────────┼──────────────┤
│  7 │ -10.62 │ abhorrent   │   -5.14 │ best         │
├────┼────────┼─────────────┼─────────┼──────────────┤
│  8 │ -10.62 │ abhors      │   -4.77 │ movie        │
├────┼────────┼─────────────┼─────────┼──────────────┤
│  9 │ -10.62 │ abiding     │   -4.25 │ film         │
╘════╧════════╧═════════════╧═════════╧══════════════╛