IST736 WK4

3 minute read

Week 4: Naive Bayes for Text Categorization

4.1 Readings

4.2 Review of Vectorization

I wish my understanding of computer hardware was more up to snuff to better understand exactly what the first part of this question is – however, I can definitely explain TF-IDF!

TF = Term Frequency. This is how frequent the term occurs within its own document.

DF = Document Frequency. This is how many document contain the term.

IDF = Inverse Document Frequency is the inverse of the DF, so if the term appears in 2 out of 100 documents the IDF is 100/2 .

I would get the TF by using something like Counter.

I would get the DF by checking each document to see if that word is present in the document. If so, increment.

I would get the IDF by taking the total number of documents and dividing it by the DF.

4.3 Decision Tree for Text Categorization

Use Weka’s FilteredClassifier to combine J48 algorithm and StringToWordVector to build a decision tree classifier for Movie Review Sentiment Classification. Use Weka’s percentage split evaluation method with default setting. Report (1) the classification accuracy, (2) size of the tree, and (3) time taken to build the model.

4.4 Naive Bayes for Text Categorization

Multinomial Naive Bayes for Text Categorization

Two common NB models for text classification

  • Multinomial model (use word frequency)
  • Benouli model (use word presence/absence)

4.4.2

Go to the online book, Introduction to Information Retrieval, section “naïve Bayse text classification.” Search “worked example” inside the page. Read the worked example. Show calculation for predicting the test example, “Macao Chinese Japan Shanghai Syracuse.”

My – naive– understanding of Naive Bayes (which was helped but also hurt by this online book) makes me think we are using past data to predict future outcomes. We see “Chinese” in every single document, “Japan” in two documents, “Shanghai” and “Macao” both only in one (and “Syracuse in none!). Common sense would dictate that the greater presence of “Chinese” in the “China” documents outweighs the negative “Tokyo Japan” in the “Not-China” documents. This is confirmed when we run a working example on our dataset:

First, we have to adjust the numbers because now we have SEVEN terms (instead of six) so our new formula looks like this: P(Chinese|c) = (5+1) / (8+7) = 6/15 = 2/5 P(Tokyo|c) = P(Japan|c) = (0+1)/(8+7) = 1/15 P(Shanghai|c) = P(Macao|c) = (1+1)/(8+7) = 2/15 P(Chinese|_c) = (1+1)/(3+7) = 2/10 = 1/5 P(Tokyo|_c) = P(Japan|c) = (0+1)/(3+7) = 2/10 = 1/5 P(Shanghai|_c) = P(Macao|_c) = (0+1)/(3+7) = 1/10 Annnd after doing that I realized we didn’t need Tokyo… But that’s ok because we need it for Japan

P(c) = 2/152/51/152/15 = 0.00047 P(_c) = 1/101/51/51/10 = 0.0004

So It would predict it IS in China documents


Go to the online book, Introduction to Information Retrieval, section “The Benoulli Model.” Search “worked example” inside the page. Read the worked example. Show calculation for predicting the test example, “Macao Chinese Japan Shanghai Syracuse.”

Since we are now only looking at “is in document, is not in document” our formula is slightly different –

OCCURRENCE: 3/42/54/51/5 *(1-2/5)(1-2/5) = 0.017 NON OCCURRENCE: 1/41/32/32/3(1-1/3)*(1-1/3) = 0.016

4.5 Multinomial Naive Bayes in Weka

4.5.3 Use MultinomialNB to build a sentiment classification model for movie review classification. Use percentage-split evaluation, and report the accuracy and time taken to build the model. Compare the multinomialNB model with the decision tree model using the these two performance measures.

So because the world is weird and wonderful, I actually already did this (like, used this exact dataset) for HW1 and HW2 as a benchmark https://danielcaraway.github.io/html/HW1_ALL_v3.html

Here is what I got for NB: Training classifier Evaluating NaiveBayesClassifier results… Accuracy: 0.8125 F-measure [neg]: 0.8259860788863109 F-measure [pos]: 0.7967479674796748 Precision [neg]: 0.7705627705627706 Precision [pos]: 0.8698224852071006 Recall [neg]: 0.89 Recall [pos]: 0.735

4.6 Feature Ranking in MultinomialNB

4.6.2 Build a MultinomialNB model for 20newsgroup classification. Make sure to remove stopwords. Edit the Weka output to a tsv file. Revise the sample script to print out 10 words with highest conditional probabilities in the category “alt.atheism.” Are these words indicative of the topic “atheism,” or are they just common words in all categories?

better tutorial

Oops I didn’t remove stopwords! alt.atheism: edu it and in you that is of to the

4.7 Multinomial NB vs Multinomial NBText

Updated: