IST736 WK 4

1 minute read

Class Notes

  • There are more questions than answers, that is OK

TOKENS:

  1. In this class, tokens are words
  2. Tokens can also be bigrams (see the sketch below)
  3. Sentences are harder to get a frequency count for
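A quick sketch (my own, not from the lecture) of how bigram tokens differ from single-word tokens, using scikit-learn's CountVectorizer on a made-up sentence:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat"]   # hypothetical toy corpus

unigrams = CountVectorizer(ngram_range=(1, 1)).fit(docs)
bigrams = CountVectorizer(ngram_range=(2, 2)).fit(docs)

print(unigrams.get_feature_names_out())  # ['cat' 'mat' 'on' 'sat' 'the']
print(bigrams.get_feature_names_out())   # ['cat sat' 'on the' 'sat on' 'the cat' 'the mat']
```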

Naive Bayes

  • Supervised learning method
  • Works on high dimensional data (like SVM)
  • Like text data!!
  • NB can analyze quantitative and qualitative data types (unlike SVM)
  • Question to ask yourself: Should we normalize?
  • Normally, we must normalize
  • Once you get more than four or five categories, models aren’t effective
  • The closer you get to unbalanced data, the less reliable your model is

  • Question 1: What are my labels?
  • Question 2: Do I normalize?
  • Question 3: Is my data balanced? (do I have 10 neg, 1000 pos?)
  • Question 4: What can I do if I don’t have balanced data? Build a new, smaller dataset by resampling (see the sketch below)
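A hedged sketch of that resampling idea: downsample the majority class so both labels end up the same size. The data frame and counts below are made up to match the 10 neg / 1000 pos example.

```python
import pandas as pd

# Hypothetical unbalanced data: 10 neg, 1000 pos
df = pd.DataFrame({
    "text": [f"doc {i}" for i in range(1010)],
    "label": ["neg"] * 10 + ["pos"] * 1000,
})

n_min = df["label"].value_counts().min()     # size of the smallest class
balanced = pd.concat(
    group.sample(n=n_min, random_state=42)   # downsample each class to n_min rows
    for _, group in df.groupby("label")
)
print(balanced["label"].value_counts())      # neg 10, pos 10
```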

“What’s the probability of being a bird if I can fly?” Given this data vector, what’s the probability that it’s a mammal?

“Why is Naive Bayes, ‘Naive’? Because it assumes independence”

NORMALIZING:

  • Easiest way to normalize: turn each document into a row, then divide each word count by len(doc) (see the sketch below)
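A minimal sketch of that normalization, assuming we simply divide each word count by the document's token count so every row sums to 1 (the documents are made up):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["fly fly bird", "walk walk walk walk mammal"]   # hypothetical docs
X = CountVectorizer().fit_transform(docs).toarray()     # raw counts, one row per document

doc_lengths = X.sum(axis=1, keepdims=True)   # len(doc) in tokens kept by the vectorizer
X_norm = X / doc_lengths                     # each row now sums to 1
print(X_norm)
```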

Supervised

  • Needs labeled data

Unsupervised

  • Clustering
  • ARM (association rule mining)

“What type of data is that model willing to accept and in what format?”

NOTES FROM CHAT:

P([a1, a2, a3] | c) ≈ P(a1|c) * P(a2|c) * P(a3|c)

Independence assumption: P(A and B) = P(A) * P(B)
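A toy illustration of that product (the numbers are made up, not from class):

```python
# Made-up conditionals for one class c
p_a1_given_c = 0.8
p_a2_given_c = 0.5
p_a3_given_c = 0.1

# Naive independence: multiply the per-feature conditionals
p_joint_given_c = p_a1_given_c * p_a2_given_c * p_a3_given_c
print(p_joint_given_c)   # ≈ 0.04
```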

Reduce dimensionality AND improve results: improve NB by removing columns that are highly dependent on other columns (tried it with the Iris dataset; see the sketch below)
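A hedged sketch of that idea: measure pairwise correlation on the Iris features and drop one column from each highly correlated pair. The 0.9 threshold is my own choice, not from class.

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
X = iris.data                      # 4 numeric feature columns

corr = X.corr().abs()              # pairwise absolute correlation
cols = list(X.columns)
to_drop = set()
for i in range(len(cols)):
    for j in range(i + 1, len(cols)):
        if corr.iloc[i, j] > 0.9 and cols[i] not in to_drop:
            to_drop.add(cols[j])   # keep the first column of each correlated pair

X_reduced = X.drop(columns=sorted(to_drop))
print("dropped:", sorted(to_drop))
```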

NAIVE BAYES calculates probabilities and compares them

TO DO:

  1. YouTube CountVectorizer

RE COUNT VECTORIZER: It is a class – a new instance is created each time I call the constructor, so I can use multiple vectorizers (CV4, CV5, etc.; see the sketch below)
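A minimal sketch of that point: each constructor call gives an independent CountVectorizer with its own settings and vocabulary (the documents and instance names here are just for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

CV4 = CountVectorizer()                       # default unigram counts
CV5 = CountVectorizer(stop_words="english")   # a second, separately configured instance

docs = ["the plot was great", "the plot was terrible"]   # hypothetical docs
X4 = CV4.fit_transform(docs)
X5 = CV5.fit_transform(docs)

print(CV4.get_feature_names_out())   # includes 'the' and 'was'
print(CV5.get_feature_names_out())   # stop words removed
```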

Multinomial Naive Bayes = a finite number of labeled classes. We have to instantiate it: MyModelNB = MultinomialNB() (see the sketch below)
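A hedged sketch of instantiating and fitting it, on made-up documents and labels:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["loved this movie", "hated this movie", "loved it", "hated it"]
labels = ["pos", "neg", "pos", "neg"]            # hypothetical labels

vec = CountVectorizer()
X = vec.fit_transform(docs)                      # count features for NB

MyModelNB = MultinomialNB()                      # instantiate, as in the note above
MyModelNB.fit(X, labels)
print(MyModelNB.predict(vec.transform(["loved this"])))   # ['pos']
```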

The assignment is like a spec doc. Two different labels.

Use 10-fold cross-validation (10 confusion matrices). We’re going to do lie detection first, THEN sentiment.

  1. Separate out into sentiment and lie (two separate dfs). Run 10-fold for both sentiment and lie. Is it easier to predict sentiment or honesty? (See the cross-validation sketch below.)
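A minimal sketch of the 10-fold setup, run once per label column. The data frame and column names ('review', 'sentiment', 'lie') are assumptions, not the actual assignment files.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import MultinomialNB

def ten_fold_nb(texts, labels):
    """Return one confusion matrix per fold (10 total)."""
    X = CountVectorizer().fit_transform(texts)
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
    matrices = []
    for train_idx, test_idx in skf.split(X, labels):
        model = MultinomialNB().fit(X[train_idx], labels[train_idx])
        preds = model.predict(X[test_idx])
        matrices.append(confusion_matrix(labels[test_idx], preds))
    return matrices

# Hypothetical usage, once per question:
# sentiment_cms = ten_fold_nb(df["review"], df["sentiment"].to_numpy())
# lie_cms = ten_fold_nb(df["review"], df["lie"].to_numpy())
```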
