IST736 WK 7 – Async


7.1 Readings

7.2 Document Similarity Measure

QUESTION: Doc1: (book, book, music, video, video), Doc2: (music, music, video), Doc3: (book, book, video)

Using the boolean representation, calculate the cosine similarity between the documents. Then calculate it again using the word frequency representation. You can do the calculation by hand, or revise the sample code to do it in Python.
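A quick way to check the manual calculation is a small numpy/sklearn sketch; the vocabulary ordering (book, music, video) is my own choice:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Vocabulary order: book, music, video
freq = np.array([
    [2, 1, 2],   # Doc1: (book, book, music, video, video)
    [0, 2, 1],   # Doc2: (music, music, video)
    [2, 0, 1],   # Doc3: (book, book, video)
])
boolean = (freq > 0).astype(int)  # presence/absence representation

print("Boolean cosine similarity:\n", cosine_similarity(boolean))
print("Frequency cosine similarity:\n", cosine_similarity(freq))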

7.3 SVM for Text Categorization

  1. Why are linear text classifiers popular? (6m) – KNN is sensitive to noise and is slow at prediction time, since each new document is compared against the whole training set
  2. SVMs for text categorization (10m)
  3. Kernels in SVM (3m)
  4. SVMs for multi-class classification (3m)
  5. SVMs strengths and weaknesses (3m)

Strengths

– great for classification and categorization
– usually used as a starting point
– high tolerance to noisy data (we can adjust this)
– no requirements on input data
– can output confidence of prediction (turns distance into a probability; see the sketch after the weaknesses list)
– can scale up to a larger dataset

Weaknesses

– parameter tuning can be complicated
– the linear kernel is easy to explain; non-linear kernels are harder to explain
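The confidence strength above comes from Platt scaling: the distance from the decision boundary is pushed through a sigmoid to produce a probability. A minimal sketch of how that might look in sklearn, using made-up toy documents so it runs on its own:

from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Toy data invented just to make the sketch self-contained.
docs = ["great movie", "loved the acting", "wonderful film",
        "terrible movie", "hated the plot", "awful film"]
labels = [1, 1, 1, 0, 0, 0]
X = CountVectorizer().fit_transform(docs)

# LinearSVC has no predict_proba, so wrap it in CalibratedClassifierCV,
# which fits a sigmoid (Platt scaling) on the decision-function distances.
svm = LinearSVC(C=1.0)
calibrated = CalibratedClassifierCV(svm, method="sigmoid", cv=3)
calibrated.fit(X, labels)

print(calibrated.predict_proba(X))  # class probabilities instead of bare labels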

7.4 SVM in Weka

QUESTION:

Build SMO models for movie review classification:

  1. Keep C at its default value (C=1) and compare models with a linear kernel and an RBF kernel. Which one performs better?
  2. Using the winning kernel from Step 1, tune C between 0.1 and 5 with all other parameters at their default values, and report which C value gives the best performance, using three-fold CV as the evaluation method. If you change the StringToWordVector settings, report the parameters you changed. Note that sometimes changing C does not lead to a performance difference.
  3. Compare the performance of your best SVM model so far with a multinomial Naive Bayes model.

RESPONSE:

I’m skipping this Weka section so I can focus on doing this in Python with sklearn.
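For reference, a rough sklearn equivalent of the three Weka steps above might look like the sketch below; the placeholder documents and the specific C grid are my own stand-ins, not part of the assignment data:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# Placeholder documents; the real movie review texts and labels would go here.
reviews = ["great movie", "loved it", "wonderful acting",
           "terrible movie", "hated it", "awful acting"]
labels = [1, 1, 1, 0, 0, 0]
X = CountVectorizer().fit_transform(reviews)

# Step 1: linear vs RBF kernel at the default C=1, three-fold CV.
for kernel in ("linear", "rbf"):
    scores = cross_val_score(SVC(kernel=kernel, C=1.0), X, labels, cv=3)
    print(kernel, scores.mean())

# Step 2: tune C between 0.1 and 5 for the winning kernel (linear assumed here).
grid = GridSearchCV(SVC(kernel="linear"), {"C": [0.1, 0.5, 1, 2, 5]}, cv=3)
grid.fit(X, labels)
print("best C:", grid.best_params_, "CV accuracy:", grid.best_score_)

# Step 3: compare the best SVM with a multinomial Naive Bayes baseline.
print("MNB:", cross_val_score(MultinomialNB(), X, labels, cv=3).mean())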

QUESTION:

Build a linear SMO model for 20newsgroup data. Find the feature weights for the binary classifier “talk.politics.misc vs. talk.religion.misc,” which is the last of the 190 binary classifiers. Revise the sample script to print out the 10 most indicative features for “talk.politics.misc” and the 10 most indicative features for “talk.religion.misc.”
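Since I’m working in sklearn instead, a sketch of the same idea (train a LinearSVC on just those two newsgroups and read the top weights off coef_; the vectorizer settings here are my own choices, not the sample script’s) could look like:

import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Train only the politics-vs-religion binary classifier from the one-vs-one set.
cats = ["talk.politics.misc", "talk.religion.misc"]
train = fetch_20newsgroups(subset="train", categories=cats,
                           remove=("headers", "footers", "quotes"))

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(train.data)

clf = LinearSVC(C=1.0)
clf.fit(X, train.target)

# With two classes coef_ has one row: large negative weights point to class 0
# (talk.politics.misc), large positive weights to class 1 (talk.religion.misc).
features = np.asarray(vec.get_feature_names_out())  # get_feature_names() on older sklearn
order = np.argsort(clf.coef_[0])
print("talk.politics.misc:", features[order[:10]])
print("talk.religion.misc:", features[order[-10:][::-1]])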

7.5 LinearSVC in Sklearn

v1 (8m)

SVC vs LinearSVC (LinearSVC only uses a linear kernel)
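A minimal illustration of the difference (my own example, not from the lecture):

from sklearn.svm import SVC, LinearSVC

# SVC supports arbitrary kernels and uses one-vs-one for multi-class problems.
rbf_svm = SVC(kernel="rbf", C=1.0)
linear_kernel_svm = SVC(kernel="linear", C=1.0)

# LinearSVC is restricted to a linear kernel, but its liblinear solver scales
# better to large, sparse text matrices; multi-class is handled one-vs-rest.
linear_svm = LinearSVC(C=1.0)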

QUESTION 1:

Complete Exercise A in the sample script. Post your code and output here.

# Your code starts here
# feature_ranks holds the (weight, word) pairs built earlier in the sample script
very_positive_10 = feature_ranks[:10]
print("Very positive words")
for weight_word in very_positive_10:
    print(weight_word)
print()
Very positive words
(-1.8329269023625736, 'hawke')
(-1.7372807031773951, 'giddy')
(-1.6832953316384267, 'collar')
(-1.5847290994111467, 'swimfan')
(-1.5720764921491963, 'blue')
(-1.4801113025915054, 'dogtown')
(-1.4138361248310338, 'clamoring')
(-1.4093533095129984, 'joan')
(-1.3918159593359454, 'victim')
(-1.340000177793368, 'compulsively')

7.6 Sklearn Feature Engineering
