IST736 WK5

3 minute read

Week 5 · Text Classifier Evaluation and Human Annotation

5.1 Readings

5.2 Model Overfitting

Training error vs. testing error (sklearn train/test plot)

8 min vid

Repeat the steps in the previous video to create a training set and a test set from the movie review data. Compare the training error and test error. Do you think overfitting occurs in this case?

Training classifier
Evaluating NaiveBayesClassifier results…
Accuracy: 0.8125
F-measure [neg]: 0.8259860788863109
F-measure [pos]: 0.7967479674796748
Precision [neg]: 0.7705627705627706
Precision [pos]: 0.8698224852071006
Recall [neg]: 0.89
Recall [pos]: 0.735

This is what I got from the movie review data – I don’t think overfitting is happening here.
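
The numbers above came from NLTK’s NaiveBayesClassifier. As a rough sketch, here is how the same training-vs-test comparison could be set up in sklearn; the 75/25 split, the random_state, and the CountVectorizer defaults are my assumptions, not the exact settings from the video.

# Sketch: compare training vs. test accuracy on the NLTK movie_reviews corpus.
from nltk.corpus import movie_reviews
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
docs = [movie_reviews.raw(fid) for fid in movie_reviews.fileids()]
labels = [movie_reviews.categories(fid)[0] for fid in movie_reviews.fileids()]
train_docs, test_docs, y_train, y_test = train_test_split(
    docs, labels, test_size=0.25, random_state=42, stratify=labels)
vec = CountVectorizer()
X_train = vec.fit_transform(train_docs)  # fit the vocabulary on training data only
X_test = vec.transform(test_docs)
clf = MultinomialNB().fit(X_train, y_train)
print("Training accuracy:", accuracy_score(y_train, clf.predict(X_train)))
print("Test accuracy:", accuracy_score(y_test, clf.predict(X_test)))

A large gap between those two numbers would be the telltale sign of overfitting.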

5.3 Evaluation Methods

8 min vid x2

Evaluate a MultinomialNB model on movie review classification data using five-fold cross-validation. Report the accuracy.

Using one of my new favs, my “no shared words” function, here’s what I got:

No Shared Words
Accuracy: 0.7866666666666666
Accuracy: 0.8016666666666666
Accuracy: 0.805
Accuracy: 0.8
Accuracy: 0.8166666666666667
AVERAGE ACCURACY: 0.8019999999999999
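
For reference, a minimal sketch of the five-fold cross-validation itself, assuming docs and labels are the lists from the earlier sketch; my “no shared words” filtering is not reproduced here, so the exact numbers will differ.

# Sketch: five-fold cross-validation of MultinomialNB on the movie review data.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score
pipe = make_pipeline(CountVectorizer(), MultinomialNB())
scores = cross_val_score(pipe, docs, labels, cv=5, scoring="accuracy")
for s in scores:
    print("Accuracy:", s)
print("AVERAGE ACCURACY:", scores.mean())

Putting the vectorizer inside the pipeline matters: the vocabulary is refit on each fold’s training split, so no test-fold words leak into training.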


  • Speed
  • Robustness (how does it handle noise?)
  • Scalability
  • Interpretability

Use your own words to summarize the MNB algorithm’s advantages in multiple aspects. Describe each advantage and explain why.

I really appreciated the “Interpretability” aspect highlighted in the video – something I’d never considered before in the same way I consider speed or scalability. The ability to “look under the hood” at what MNB is doing lets us, as researchers, do additional self-checking of our model and better understand what’s happening, so we don’t end up in a situation like the one the professor described, where her model had originally latched onto the names of fellow congresspersons.
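
As a sketch of that kind of under-the-hood check, assuming a fitted CountVectorizer (vec) and MultinomialNB (clf) like the ones above:

# Sketch: peek at which words MultinomialNB weights most heavily per class.
import numpy as np
feature_names = np.asarray(vec.get_feature_names_out())
for i, label in enumerate(clf.classes_):
    # feature_log_prob_[i] holds log P(word | class i); larger = more characteristic
    top = np.argsort(clf.feature_log_prob_[i])[-10:][::-1]
    print(label, feature_names[top])

Common words dominate these lists unless stopwords are removed, but data-collection artifacts (like those congresspersons’ names) also tend to surface here.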

5.4 Evaluation Measures

5 min vid x2

Random Guess Baseline

The random guess baseline is 20%, but because this is an uneven dataset, the majority vote baseline is preferable.

The distribution of a five-category data set is:

[A: 13, B: 24, C: 78, D: 33, E: 9].

To evaluate a classification model on this data set, what is the accuracy of a random guess baseline? What is the accuracy of a majority vote baseline? Which baseline is suitable for this evaluation?
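
A quick worked check of both baselines, assuming “random guess” means guessing uniformly across the five categories:

# Sketch: baseline accuracies for the distribution [A: 13, B: 24, C: 78, D: 33, E: 9].
counts = {"A": 13, "B": 24, "C": 78, "D": 33, "E": 9}
total = sum(counts.values())                  # 157 instances
random_guess = 1 / len(counts)                # uniform guess over 5 classes = 0.20
majority_vote = max(counts.values()) / total  # always predict C: 78/157 ≈ 0.497
print("Random guess baseline:", round(random_guess, 3))
print("Majority vote baseline:", round(majority_vote, 3))

Since always predicting C already scores about 49.7%, the majority vote baseline is the fairer (and harder) bar for a classifier on this data.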

MORE EVALUATION MEASURES

  • Precision
    • Of the items predicted as a class, how many are correct?
  • Recall
    • Of the items actually in a class, how many are correctly predicted?
  • F measure (combination of precision and recall)
  • Confusion matrix (contingency table to compare prediction results)

RECALL: 70 / (70 + 10) = 7/8; 10 / (10 + 10) = 1/2

PRECISION: 70 / (70 + 10) = 7/8; 10 / (10 + 10) = 1/2

F1-Measure = 2 × (7/8) × (7/8) / (7/8 + 7/8) = 7/8
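
As a check, here is a sketch that recovers these numbers with sklearn, assuming the confusion matrix behind them is [[70, 10], [10, 10]] (rows = actual, columns = predicted); the class names are placeholders.

# Sketch: per-class precision, recall, and F1 from the assumed confusion matrix.
from sklearn.metrics import precision_recall_fscore_support
cm = [[70, 10],
      [10, 10]]
labels = ["class1", "class2"]
# Expand the matrix into label lists so sklearn can score them.
y_true, y_pred = [], []
for i, row in enumerate(cm):
    for j, count in enumerate(row):
        y_true += [labels[i]] * count
        y_pred += [labels[j]] * count
prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, labels=labels)
print("Precision:", prec)  # [0.875 0.5]
print("Recall:", rec)      # [0.875 0.5]
print("F1:", f1)           # [0.875 0.5]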

I had a bunch of confusion matrices from past homeworks, so I plopped one into the online tool and got 0.2185.

5.5 Training Label Acquisition

11 min vid x2

Use sklearn.metrics.cohen_kappa_score to calculate kappa agreement between your NLTK and SentiStrength results in Homework 1. See attached sample code as an example.
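
A minimal sketch of the call; the two label lists below are placeholders standing in for the NLTK and SentiStrength labels from Homework 1, not the attached sample code.

# Sketch: Cohen's kappa between two sets of sentiment labels for the same items.
from sklearn.metrics import cohen_kappa_score
nltk_labels = ["pos", "neg", "neu", "pos", "neg", "pos"]
sentistrength_labels = ["pos", "neg", "pos", "pos", "neg", "neu"]
print("Cohen's kappa:", cohen_kappa_score(nltk_labels, sentistrength_labels))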

If you have created the confusion matrix, you can also use this online tool to calculate kappa.
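
For a rough idea of what that tool computes, here is a sketch of kappa taken straight from a 2×2 confusion matrix using the standard observed-versus-chance agreement formula; the counts are placeholders.

# Sketch: Cohen's kappa from a confusion matrix of two annotators' labels.
# Rows = annotator 1, columns = annotator 2; the counts are made up.
import numpy as np
cm = np.array([[40, 10],
               [15, 35]])
total = cm.sum()
p_o = np.trace(cm) / total                                 # observed agreement
p_e = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total**2   # chance agreement
print("Cohen's kappa:", (p_o - p_e) / (1 - p_e))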

5.6 Crowdsourcing Annotations

8 min vid

I tested my DIY Summarizer on this paper! Here were the results:

We investigate five tasks: affect recognition, word similarity, recognizing textual entailment, event temporal ordering, and word sense disambiguation. The tasks are: affect recognition, word similarity, recognizing textual entailment, event temporal ordering, and word sense disambiguation. Dolores Labs Blog, “AMT is fast, cheap, and good for machine learning data,” Brendan O’Connor, Sept. 9, 2008. 4 Annotation Tasks We analyze the quality of non-expert annotations on five tasks: affect recognition, word similarity, recognizing textual entailment, temporal event recognition, and word sense disambiguation. We did this by averaging the labels of each possible subset of n non-expert annotations, for value of n in {1, 2,… , 10}.

Clearly, I need to improve my methods!! This will likely include passing parameters such as “wikipedia” or “white paper” or “book” and more specific… cleaners.

Go to the AMT website, and browse or search the HITs. Describe a task that you find interesting.

Optional: If you start early, you may be able to get approval as a worker and do some annotations yourself.

This sounds interesting – Find the name, email, and LinkedIn profile for relevant finance professionals – simply because I can already outline my Beautiful Soup scraper in my head!! I also have experience working from a list (I’m assuming this would include a CSV of the ‘finance professionals’) and scraping from there. This sounds fun! But not for $0.02… but then again I did much more for no money on HW3…
