IST736 WK 4

1 minute read

Class Notes

  • There are more questions than answers, that is OK

TOKENS:

  1. In this class, tokens are words
  2. Tokens can also be bigrams (see the sketch below)
  3. Sentences are harder to get a frequency count for
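A quick sketch (my own, not from the lecture) of how bigram tokens differ from single-word tokens, using scikit-learn's CountVectorizer on a made-up sentence:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat"]   # hypothetical toy corpus

unigrams = CountVectorizer(ngram_range=(1, 1)).fit(docs)
bigrams = CountVectorizer(ngram_range=(2, 2)).fit(docs)

print(unigrams.get_feature_names_out())  # ['cat' 'mat' 'on' 'sat' 'the']
print(bigrams.get_feature_names_out())   # ['cat sat' 'on the' 'sat on' 'the cat' 'the mat']
```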

Naive Bayes

  • Supervised learning method
  • Works on high dimensional data (like SVM)
  • Like text data!!
  • NB can analyze quantitative and qualitative data types (unlike SVM)
  • Question to ask yourself: Should we normalize?
  • Normally, we must normalize
  • Once you get more than four or five categories, models aren’t effective
  • The closer you get to unbalanced data, the less reliable your model is

  • Question 1: What are my labels?
  • Question 2: Do I normalize?
  • Question 3: Is my data balanced? (do I have 10 neg, 1000 pos?)
  • Question 4: What can I do if I don’t have balanced data? Build a new, smaller dataset by resampling (see the sketch below)
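A hedged sketch of that resampling idea: downsample the majority class so both labels end up the same size. The data frame and counts below are made up to match the 10 neg / 1000 pos example.

```python
import pandas as pd

# Hypothetical unbalanced data: 10 neg, 1000 pos
df = pd.DataFrame({
    "text": [f"doc {i}" for i in range(1010)],
    "label": ["neg"] * 10 + ["pos"] * 1000,
})

n_min = df["label"].value_counts().min()     # size of the smallest class
balanced = pd.concat(
    group.sample(n=n_min, random_state=42)   # downsample each class to n_min rows
    for _, group in df.groupby("label")
)
print(balanced["label"].value_counts())      # neg 10, pos 10
```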

“What’s the probability of being a bird if I can fly?” Given this data vector, what’s the probability that it’s a mammal?

“Why is Naive Bayes, ‘Naive’? Because it assumes independence”

NORMALIZING:

  • Easiest way to normalize: turn each document into a row, then divide each word count by len(doc) (see the sketch below)
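A minimal sketch of that normalization, assuming we simply divide each word count by the document's token count so every row sums to 1 (the documents are made up):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["fly fly bird", "walk walk walk walk mammal"]   # hypothetical docs
X = CountVectorizer().fit_transform(docs).toarray()     # raw counts, one row per document

doc_lengths = X.sum(axis=1, keepdims=True)   # len(doc) in tokens kept by the vectorizer
X_norm = X / doc_lengths                     # each row now sums to 1
print(X_norm)
```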

Supervised

  • Needs labeled data

Unsupervised

  • Clustering
  • ARM (association rule mining)

“What type of data is that model willing to accept and in what format?”

NOTES FROM CHAT:

P([a1, a2, a3] | c) ≈ P(a1|c) * P(a2|c) * P(a3|c)

Independence assumption: P(A and B) = P(A) * P(B)
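A toy illustration of that product (the numbers are made up, not from class):

```python
# Made-up conditionals for one class c
p_a1_given_c = 0.8
p_a2_given_c = 0.5
p_a3_given_c = 0.1

# Naive independence: multiply the per-feature conditionals
p_joint_given_c = p_a1_given_c * p_a2_given_c * p_a3_given_c
print(p_joint_given_c)   # ≈ 0.04
```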

Reduce dimensionality AND improve results: improve NB by removing columns that are highly dependent on other columns (tried it with the Iris dataset; see the sketch below)
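A hedged sketch of that idea: measure pairwise correlation on the Iris features and drop one column from each highly correlated pair. The 0.9 threshold is my own choice, not from class.

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
X = iris.data                      # 4 numeric feature columns

corr = X.corr().abs()              # pairwise absolute correlation
cols = list(X.columns)
to_drop = set()
for i in range(len(cols)):
    for j in range(i + 1, len(cols)):
        if corr.iloc[i, j] > 0.9 and cols[i] not in to_drop:
            to_drop.add(cols[j])   # keep the first column of each correlated pair

X_reduced = X.drop(columns=sorted(to_drop))
print("dropped:", sorted(to_drop))
```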

NAIVE BAYES calculates probabilities and compares them

TO DO:

  1. YouTube CountVectorizer

RE COUNT VECTORIZER: It is a class – a new instance is created each time I call the constructor, so I can use multiple vectorizers (CV4, CV5, etc.; see the sketch below)
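A minimal sketch of that point: each constructor call gives an independent CountVectorizer with its own settings and vocabulary (the documents and instance names here are just for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

CV4 = CountVectorizer()                       # default unigram counts
CV5 = CountVectorizer(stop_words="english")   # a second, separately configured instance

docs = ["the plot was great", "the plot was terrible"]   # hypothetical docs
X4 = CV4.fit_transform(docs)
X5 = CV5.fit_transform(docs)

print(CV4.get_feature_names_out())   # includes 'the' and 'was'
print(CV5.get_feature_names_out())   # stop words removed
```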

Multinomial Naive Bayes = a finite number of labeled classes. We have to instantiate it: MyModelNB = MultinomialNB() (see the sketch below)
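A hedged sketch of instantiating and fitting it, on made-up documents and labels:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["loved this movie", "hated this movie", "loved it", "hated it"]
labels = ["pos", "neg", "pos", "neg"]            # hypothetical labels

vec = CountVectorizer()
X = vec.fit_transform(docs)                      # count features for NB

MyModelNB = MultinomialNB()                      # instantiate, as in the note above
MyModelNB.fit(X, labels)
print(MyModelNB.predict(vec.transform(["loved this"])))   # ['pos']
```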

The assignment is like a spec doc. Two different labels.

Use 10-fold cross-validation (10 confusion matrices). We’re going to do lie detection first, THEN sentiment.

  1. Separate out into sentiment and lie (two separate dfs). Run 10-fold for both sentiment and lie. Is it easier to predict sentiment or honesty? (See the cross-validation sketch below.)
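A minimal sketch of the 10-fold setup, run once per label column. The data frame and column names ('review', 'sentiment', 'lie') are assumptions, not the actual assignment files.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import MultinomialNB

def ten_fold_nb(texts, labels):
    """Return one confusion matrix per fold (10 total)."""
    X = CountVectorizer().fit_transform(texts)
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
    matrices = []
    for train_idx, test_idx in skf.split(X, labels):
        model = MultinomialNB().fit(X[train_idx], labels[train_idx])
        preds = model.predict(X[test_idx])
        matrices.append(confusion_matrix(labels[test_idx], preds))
    return matrices

# Hypothetical usage, once per question:
# sentiment_cms = ten_fold_nb(df["review"], df["sentiment"].to_numpy())
# lie_cms = ten_fold_nb(df["review"], df["lie"].to_numpy())
```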
