HW7: Comparing MNB and SVMs

INTRODUCTION

MNB and SVM

How do we take something with 3000 columns and turn it into something meaninful? In short, we, as humans, can't. But computers can!

ANALYSIS & MODELS

About the Data

In [46]:
## =======================================================
## IMPORTING
## =======================================================

import os
def get_data_from_files(path):
    directory = os.listdir(path)
    results = []
    for file in directory:
        f=open(path+file)
        results.append(f.read())
        f.close()
    return results


## =======================================================
## MACHINE LEARNING
## =======================================================

def do_the_xy(x,y,labels, target_names):
    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

    print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
    print(X_train[0])
    print(y_train[0])
    print(X_test[0])
    print(y_test[0])

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfVectorizer

    # several commonly used vectorizer setting

    #  unigram boolean vectorizer, set minimum document frequency to 5
    unigram_bool_vectorizer = CountVectorizer(encoding='latin-1', binary=True, min_df=5, stop_words='english')

    #  unigram term frequency vectorizer, set minimum document frequency to 5
    unigram_count_vectorizer = CountVectorizer(encoding='latin-1', binary=False, min_df=5, stop_words='english')

    #  unigram and bigram term frequency vectorizer, set minimum document frequency to 5
    gram12_count_vectorizer = CountVectorizer(encoding='latin-1', ngram_range=(1,2), min_df=5, stop_words='english')

    #  unigram tfidf vectorizer, set minimum document frequency to 5
    unigram_tfidf_vectorizer = TfidfVectorizer(encoding='latin-1', use_idf=True, min_df=5, stop_words='english')

    X_train_vec = unigram_count_vectorizer.fit_transform(X_train)
    X_test_vec = unigram_count_vectorizer.transform(X_test)


    # import the LinearSVC module
    from sklearn.svm import LinearSVC

    # initialize the LinearSVC model
    svm_clf = LinearSVC(C=1)

    # use the training data to train the model
    svm_clf.fit(X_train_vec,y_train)

    from sklearn.metrics import confusion_matrix
    y_pred = svm_clf.predict(X_test_vec)
    cm=confusion_matrix(y_test, y_pred, labels=labels)
    print(cm)
    print()

    from sklearn.metrics import classification_report
#     target_names = 
    target_names = target_names
    print(classification_report(y_test, y_pred, target_names=target_names))

    svm_confidence_scores = svm_clf.decision_function(X_test_vec)
    ## get the confidence score for the first test example
    print(svm_confidence_scores[0])
    print(svm_clf.score(X_test_vec,y_test))
In [ ]:
 

With Kaggle Sentiment Data

In [47]:
import pandas as pd
train=pd.read_csv("kaggle-sentiment/train.tsv", delimiter='\t')
y=train['Sentiment'].values
X=train['Phrase'].values
do_the_xy(X,y,[0,1,2,3,4],['0','1','2','3','4'])
(93636,) (93636,) (62424,) (62424,)
almost in a class with that of Wilde
3
escape movie
2
[[  918  1221   697    82    13]
 [  701  4080  5504   514    25]
 [  195  2106 27081  2310   172]
 [   34   396  6048  5533  1057]
 [    3    51   590  1772  1321]]

              precision    recall  f1-score   support

           0       0.50      0.31      0.38      2931
           1       0.52      0.38      0.44     10824
           2       0.68      0.85      0.75     31864
           3       0.54      0.42      0.48     13068
           4       0.51      0.35      0.42      3737

    accuracy                           0.62     62424
   macro avg       0.55      0.46      0.49     62424
weighted avg       0.60      0.62      0.60     62424

[-1.01718404 -0.50760032  0.22331211 -0.97514731 -1.24718844]
0.6236864026656415

With Joker Review Data, Extremes

In [48]:
import pandas as pd
import numpy as np
neg_df = pd.DataFrame(neg)
pos_df = pd.DataFrame(pos)
neg = get_data_from_files('../NEG_JK_E/')
pos = get_data_from_files('../POS_JK_E/')
pos_df['PoN'] = 'P'
neg_df['PoN'] = 'N'
all_df = neg_df.append(pos_df)
y=all_df['PoN'].values
X=all_df[0].values
do_the_xy(X,y,['P','N'],['P','N'])
(58,) (58,) (40,) (40,)
 What idiotic FIlm
I can say that Phoenix is master actor. Bt this does still not make a great movie. And thsi movie lives from blabla in every medium.This has nothing to do with the joker and the only thing behind is to make money with bad and good media. Todd Philips should maybe read some comics and don;t copy movies like taxi driver or similar.
N
 Unpopular opinion: Terrible movie
Just watched the movie, and I had to make an account just to put it out there, after hearing tons and tons of praises, that this movie has failed terribly in living up to the hype! I know that my opinion is extremely unpopular, but someone needs to say it. Joaquin Phoenix is brilliant as the joker, in regards to his acting, but that is probably the only positive thing about this movie. And I believe that a movie needs much more than just good acting. This movie does not even have a proper climate, and is basically about a mental illness, where I feel there is no hint of the charismatic, sick and brilliant joker we got to see in the batman series!
N
[[11 13]
 [ 4 12]]

              precision    recall  f1-score   support

           P       0.48      0.75      0.59        16
           N       0.73      0.46      0.56        24

    accuracy                           0.57        40
   macro avg       0.61      0.60      0.57        40
weighted avg       0.63      0.57      0.57        40

0.5249280105427925
0.575
In [ ]:
 
In [24]: