HW7: Comparing MNB and SVMs

INTRODUCTION

MNB and SVM

How do we take something with 3000 columns and turn it into something meaningful? In short, we, as humans, can't. But computers can!
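To make the "columns" concrete, here is a minimal hand-rolled bag-of-words sketch. It is an illustration only: the three toy sentences are invented for this example, and the notebook itself uses scikit-learn's CountVectorizer and TfidfVectorizer, which also handle tokenization, stop words, and `min_df`.

```python
# Toy documents, invented for illustration (not taken from either dataset)
docs = ["the joker was great", "the joker was bad", "great great movie"]

# One column per distinct word, in sorted order
vocabulary = sorted({word for doc in docs for word in doc.split()})

# One row per document; each cell counts how often that column's word occurs
matrix = [[doc.split().count(word) for word in vocabulary] for doc in docs]

print(vocabulary)
for row in matrix:
    print(row)
```

Three short sentences already produce six columns, so a real corpus reaching thousands of columns is unsurprising.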

ANALYSIS & MODELS

About the Data

In [55]:
## =======================================================
## IMPORTING
## =======================================================
import os
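def get_data_from_files(path):
    # Read every file in `path` and return the contents as a list of strings
    results = []
    for file in os.listdir(path):
        # os.path.join works whether or not `path` ends with a separator
        with open(os.path.join(path, file)) as f:
            results.append(f.read())
    return results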

## =======================================================
## MACHINE LEARNING
## =======================================================
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.svm import LinearSVC


def do_the_xy(X, y, labels, target_names):
    # Hold out 40% of the data for testing; the fixed seed keeps the split reproducible
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

    unigram_bool_vectorizer = CountVectorizer(encoding='latin-1', binary=True, min_df=5, stop_words='english')
    unigram_count_vectorizer = CountVectorizer(encoding='latin-1', binary=False, min_df=5, stop_words='english')
    gram12_count_vectorizer = CountVectorizer(encoding='latin-1', ngram_range=(1,2), min_df=5, stop_words='english')
    unigram_tfidf_vectorizer = TfidfVectorizer(encoding='latin-1', use_idf=True, min_df=5, stop_words='english')

#     X_train_vec = unigram_count_vectorizer.fit_transform(X_train)
#     X_test_vec = unigram_count_vectorizer.transform(X_test)

    X_train_vec = unigram_tfidf_vectorizer.fit_transform(X_train)
    X_test_vec = unigram_tfidf_vectorizer.transform(X_test)
    
    svm_clf = LinearSVC(C=1)
    svm_clf.fit(X_train_vec,y_train)

    y_pred = svm_clf.predict(X_test_vec)
    cm=confusion_matrix(y_test, y_pred, labels=labels)
    print('=====CONFUSION MATRIX=====')
    print(cm)

    print('=====CLASSIFICATION REPORT=====')
    # Pass labels so each report row is paired with the matching target name
    print(classification_report(y_test, y_pred, labels=labels, target_names=target_names))

    svm_confidence_scores = svm_clf.decision_function(X_test_vec)
    print('=====CONFIDENCE SCORES=====')
    print(svm_confidence_scores[0])
    print('=====SCORES=====')
    print(svm_clf.score(X_test_vec,y_test))
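The title compares MNB with SVMs, but the helper above only fits a LinearSVC. A minimal sketch of what an MNB counterpart might look like follows. The function name `do_the_xy_mnb` and its `min_df` parameter are hypothetical, and the choice of raw unigram counts with default smoothing is an assumption, not something the notebook specifies.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB


def do_the_xy_mnb(X, y, labels, target_names, min_df=5):
    """Hypothetical MNB counterpart to do_the_xy (same split, count features)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.4, random_state=0)

    # Assumption: raw unigram counts, since MNB models word counts
    vectorizer = CountVectorizer(encoding='latin-1', binary=False,
                                 min_df=min_df, stop_words='english')
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)

    mnb_clf = MultinomialNB()  # default alpha=1.0 (Laplace smoothing)
    mnb_clf.fit(X_train_vec, y_train)

    y_pred = mnb_clf.predict(X_test_vec)
    print('=====CONFUSION MATRIX=====')
    print(confusion_matrix(y_test, y_pred, labels=labels))
    print('=====CLASSIFICATION REPORT=====')
    print(classification_report(y_test, y_pred, labels=labels,
                                target_names=target_names))
    # Return the test accuracy so the two models can be compared directly
    return mnb_clf.score(X_test_vec, y_test)
```

Calling it with the same arguments as `do_the_xy` would give a like-for-like comparison on the same train/test split, because both use `random_state=0`.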

With Kaggle Sentiment Data

In [56]:
import pandas as pd
train=pd.read_csv("kaggle-sentiment/train.tsv", delimiter='\t')
y=train['Sentiment'].values
X=train['Phrase'].values
do_the_xy(X,y,[0,1,2,3,4],['0','1','2','3','4'])
=====CONFUSION MATRIX=====
[[  795  1387   624   117     8]
 [  589  4336  5245   629    25]
 [  163  2299 26557  2684   161]
 [   24   408  5604  6220   812]
 [    2    40   551  2010  1134]]
=====CLASSIFICATION REPORT=====
              precision    recall  f1-score   support

           0       0.51      0.27      0.35      2931
           1       0.51      0.40      0.45     10824
           2       0.69      0.83      0.75     31864
           3       0.53      0.48      0.50     13068
           4       0.53      0.30      0.39      3737

    accuracy                           0.63     62424
   macro avg       0.55      0.46      0.49     62424
weighted avg       0.61      0.63      0.61     62424

=====CONFIDENCE SCORES=====
[-1.01488498 -0.38031991  0.16542161 -0.9704731  -1.23293715]
=====SCORES=====
0.6254325259515571
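
The report numbers can be recomputed by hand from the confusion matrix, which helps when reading it: rows are true labels and columns are predicted labels. The sketch below copies the matrix printed above and uses plain Python only; the variable names are illustrative.

```python
# Confusion matrix copied from the Kaggle output (rows = true, columns = predicted)
cm = [
    [795, 1387, 624, 117, 8],
    [589, 4336, 5245, 629, 25],
    [163, 2299, 26557, 2684, 161],
    [24, 408, 5604, 6220, 812],
    [2, 40, 551, 2010, 1134],
]
n = len(cm)

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(n))  # diagonal = correct predictions
accuracy = correct / total

# Recall: correct predictions divided by the row total (all true members of the class)
recall = [cm[i][i] / sum(cm[i]) for i in range(n)]
# Precision: correct predictions divided by the column total (everything predicted as the class)
precision = [cm[i][i] / sum(cm[r][i] for r in range(n)) for i in range(n)]

print(round(accuracy, 4))
print([round(r, 2) for r in recall])
print([round(p, 2) for p in precision])
```

The off-diagonal counts sit mostly next to the diagonal and in the column for class 2, which suggests that most mistakes are between neighbouring sentiment levels or drift toward the neutral class.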

With Joker Review Data, Extremes

In [57]:
import pandas as pd
# Load the reviews first; the DataFrames below are built from these lists
neg = get_data_from_files('../NEG_JK_E/')
pos = get_data_from_files('../POS_JK_E/')
neg_df = pd.DataFrame(neg)
pos_df = pd.DataFrame(pos)
pos_df['PoN'] = 'P'
neg_df['PoN'] = 'N'
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
all_df = pd.concat([neg_df, pos_df])
y=all_df['PoN'].values
X=all_df[0].values
do_the_xy(X,y,['P','N'],['P','N'])
=====CONFUSION MATRIX=====
[[13 11]
 [ 4 12]]
=====CLASSIFICATION REPORT=====
              precision    recall  f1-score   support

           P       0.76      0.54      0.63        24
           N       0.52      0.75      0.62        16

    accuracy                           0.62        40
   macro avg       0.64      0.65      0.62        40
weighted avg       0.67      0.62      0.63        40

=====CONFIDENCE SCORES=====
0.05681948962308103
=====SCORES=====
0.625
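
The two runs print differently shaped confidence scores: five numbers for the first Kaggle test phrase, but a single number for the first Joker review. As far as scikit-learn's documented behaviour goes, a multi-class LinearSVC returns one score per class and predicts the class with the largest one, while a binary model returns one signed score where a positive value means the second entry of `classes_` (stored in sorted order). A small sketch using the printed values, with illustrative variable names:

```python
# Scores copied from the two outputs above
kaggle_scores = [-1.01488498, -0.38031991, 0.16542161, -0.9704731, -1.23293715]
kaggle_classes = [0, 1, 2, 3, 4]
# Multi-class: the predicted class is the one with the highest score
kaggle_pred = kaggle_classes[kaggle_scores.index(max(kaggle_scores))]

joker_score = 0.05681948962308103
joker_classes = sorted(['P', 'N'])  # classes_ is kept in sorted order: ['N', 'P']
# Binary: a positive score selects the second class, a negative one the first
joker_pred = joker_classes[1] if joker_score > 0 else joker_classes[0]

print(kaggle_pred, joker_pred)  # 2 P
```

The Joker score is very close to zero, so that review sits almost on the decision boundary.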