IST736 WK 4
Class Notes
- There are more questions than answers, and that is OK
- In this class, tokens are words
- Tokens can also be bigrams (pairs of adjacent words)
- Sentences are harder to get a frequency count for
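A quick sketch of word tokens versus bigram tokens, assuming scikit-learn's CountVectorizer (mentioned later in these notes). The two documents and the variable names are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus, for illustration only.
docs = ["the movie was great", "the movie was not great"]

# ngram_range=(1, 1) counts single words; (2, 2) counts bigrams only.
unigram_cv = CountVectorizer(ngram_range=(1, 1))
bigram_cv = CountVectorizer(ngram_range=(2, 2))

unigram_counts = unigram_cv.fit_transform(docs)
bigram_counts = bigram_cv.fit_transform(docs)

# vocabulary_ maps each token to its column index in the count matrix.
unigram_vocab = sorted(unigram_cv.vocabulary_)
bigram_vocab = sorted(bigram_cv.vocabulary_)

print(unigram_vocab)
print(bigram_vocab)
```

Note that the bigram "not great" becomes its own column, which single-word tokens cannot capture.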
Naive Bayes
- Supervised learning method
- Works on high dimensional data (like SVM)
- Like text data!!
- NB can analyze quantitative and qualitative data types (unlike SVM)
- Question to ask yourself: Should we normalize?
- Normally, we must normalize
- Once you get more than four or five categories (labels), models tend to become less effective
The closer you get to unbalanced data, the less reliable your model is
- Question 1: What are my labels?
- Question 2: Do I normalize?
- Question 3: Is my data balanced? (do I have 10 neg, 1000 pos?)
- Question 4: What can I do if I don’t have balanced data? Build a new smaller dataset (resample)
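One way to read Question 4 is downsampling: shrink every label to the size of the rarest one. The sketch below is only one possible approach, in plain Python. The `downsample` helper, the `label` key, and the 10 neg / 1000 pos rows are all hypothetical.

```python
import random
from collections import Counter

def downsample(rows, label_key="label", seed=0):
    """Return a balanced copy of rows by sampling each label down to the rarest label's size."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    smallest = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        # Sample without replacement so no row is duplicated.
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced

# Hypothetical unbalanced data: 10 neg, 1000 pos.
rows = [{"text": f"neg review {i}", "label": "neg"} for i in range(10)]
rows += [{"text": f"pos review {i}", "label": "pos"} for i in range(1000)]

balanced = downsample(rows)
label_counts = Counter(row["label"] for row in balanced)
print(label_counts)
```

The trade-off is that most of the majority-label rows are thrown away, so the new dataset is much smaller.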
“What’s the probability of being a bird if I can fly?”
Given this data vector, what’s the probability that it’s a mammal?
“Why is Naive Bayes ‘Naive’? Because it assumes independence” (each feature is treated as independent of the others, given the label)
- Easiest way to normalize is to turn each document into a row, then normalize each word count by len(doc)
- Needs labeled data
- Clustering
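The normalization note above (each document becomes a row, word counts divided by document length) could look roughly like this in plain Python. The `relative_frequencies` helper and the whitespace tokenization are assumptions for illustration.

```python
from collections import Counter

def relative_frequencies(doc):
    """Turn one document into a row of word -> count / len(doc)."""
    tokens = doc.lower().split()  # assumption: simple whitespace tokenization
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}

# Hypothetical document with 4 tokens.
row = relative_frequencies("good good movie bad")
print(row)
```

After this step a long document and a short document are on the same scale, because every row sums to 1.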
“What type of data is that model willing to accept and in what format?”
P([a1, a2, a3] | c) = P(a1|c) * P(a2|c) * P(a3|c)
This follows from the independence assumption: for independent events, P(A and B) = P(A) * P(B)
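A small worked version of the product rule above, using the bird/mammal question from class. All of the probabilities are invented numbers for illustration. The `posterior` helper is hypothetical, not a library function.

```python
from math import prod

# Invented priors P(c) and likelihoods P(a|c), for illustration only.
priors = {"bird": 0.5, "mammal": 0.5}
likelihoods = {
    "bird": {"can_fly": 0.9, "lays_eggs": 0.95, "has_fur": 0.01},
    "mammal": {"can_fly": 0.05, "lays_eggs": 0.02, "has_fur": 0.9},
}

def posterior(observed):
    """Score each label as P(c) * product of P(a|c), then normalize to sum to 1."""
    scores = {
        c: priors[c] * prod(likelihoods[c][a] for a in observed)
        for c in priors
    }
    total = sum(scores.values())
    return {c: score / total for c, score in scores.items()}

result = posterior(["can_fly", "lays_eggs"])
best = max(result, key=result.get)
print(result, best)
```

The label with the highest normalized score is the prediction.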
Reduce dimensionality AND improve results: improve NB by removing columns that are highly dependent on other columns (tried it with Iris)
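A rough sketch of the column-removal idea on Iris, assuming "highly dependent" is measured by absolute Pearson correlation. The 0.9 threshold and the greedy keep-first rule are assumptions, not something stated in class.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, names = iris.data, list(iris.feature_names)

# Absolute pairwise correlation between columns (features).
corr = np.abs(np.corrcoef(X, rowvar=False))
threshold = 0.9  # assumption: cutoff for "highly dependent"

# Greedy pass: keep a column only if it is not highly correlated
# with any column that has already been kept.
keep = []
for j in range(X.shape[1]):
    if all(corr[j, k] < threshold for k in keep):
        keep.append(j)

X_reduced = X[:, keep]
kept_names = [names[j] for j in keep]
print(kept_names)
```

Whether the reduced data actually improves results has to be checked by fitting NB on both versions and comparing.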
NAIVE BAYES calculates the probability of each label and compares them (the highest probability wins)
- YouTube
RE COUNT VECTORIZER: It is a class, so it is re-instantiated each time I call it. I can use multiple instances (CV4, CV5, etc.)
Multinomial Naive Bayes = finite number of labels. We have to instantiate it: MyModelNB = MultinomialNB()
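A minimal sketch of instantiating and using the model, assuming scikit-learn. The training documents, labels, and the name `MyCV` are invented for illustration. `MyModelNB` follows the name in the notes.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical labeled training data.
train_docs = [
    "great movie loved it",
    "wonderful great acting",
    "terrible movie hated it",
    "awful boring acting",
]
train_labels = ["pos", "pos", "neg", "neg"]

# Each CountVectorizer() call creates a fresh, independent instance.
MyCV = CountVectorizer()
X_train = MyCV.fit_transform(train_docs)

# Instantiate the model, then fit it on the count matrix and labels.
MyModelNB = MultinomialNB()
MyModelNB.fit(X_train, train_labels)

# New text must go through the SAME fitted vectorizer (transform, not fit_transform).
X_new = MyCV.transform(["great acting"])
prediction = MyModelNB.predict(X_new)[0]
print(prediction)
```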
The assignment is like a spec doc. There are two different labels.
Use 10-fold cross validation (10 confusion matrices). We’re going to first do lie detection, THEN sentiment.
- Separate out into sentiment and lie (two separate dfs)
- 10-fold for both sentiment and lie
- Is it easier to predict sentiment or honesty?
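A sketch of 10-fold cross validation that collects one confusion matrix per fold, assuming scikit-learn. The toy documents and labels are invented. Using StratifiedKFold (rather than plain KFold) and refitting the vectorizer inside each fold are assumptions, not assignment requirements. The same loop would be run once per label, on the lie df and on the sentiment df.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: 20 documents per label.
docs = np.array(
    [f"great tasty food visit {i}" for i in range(20)]
    + [f"awful cold food visit {i}" for i in range(20)]
)
labels = np.array(["pos"] * 20 + ["neg"] * 20)

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
matrices = []
for train_idx, test_idx in skf.split(docs, labels):
    # Fit the vectorizer on the training fold only, so test words do not leak in.
    cv = CountVectorizer()
    X_train = cv.fit_transform(docs[train_idx])
    X_test = cv.transform(docs[test_idx])

    model = MultinomialNB().fit(X_train, labels[train_idx])
    predicted = model.predict(X_test)

    # Rows are true labels, columns are predicted labels, in the order given.
    matrices.append(
        confusion_matrix(labels[test_idx], predicted, labels=["neg", "pos"])
    )

# Summing the 10 per-fold matrices gives one overall matrix.
total_matrix = np.sum(matrices, axis=0)
print(total_matrix)
```

Comparing the summed matrix for lie detection against the one for sentiment is one way to approach the "which is easier to predict" question.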