IST736 WK 6 – Async
WK 6 scikit-learn
6.1 Readings
NONE
6.2 Data Preparation
VIDEO: 13 m
NOTES:
- Big X is training CONTENT
- Little y is training LABEL
RESPONSE: In the sample script block Exercise A, follow the instructions to write code and test it. Post your code here.
6.3 Vectorization
VIDEO: 11 min
NOTES:
- TFIDF
- CountVectorizer – encoding can be latin-1 or utf-8 or anything, depends on dataset
QUESTION: Follow the instruction in the sample script block Exercise B. Write and test code for boolean vectorization. Post your code here.
RESPONSE:
This kept giving me this error
NotFittedError: CountVectorizer - Vocabulary wasn’t fitted.
test_bool = unigram_bool_vectorizer.transform(X_test)
train_bool = unigram_bool_vectorizer.transform(X_train)
So I did this
import sklearn vocabulary_to_load = X_test loaded_vectorizer = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(1, 2), binary = True, min_df=1, vocabulary=vocabulary_to_load) loaded_vectorizer._validate_vocabulary() print(‘loaded_vectorizer.get_feature_names(): {0}’. format(loaded_vectorizer.get_feature_names()))
6.4 Training and Explaining MultinomialNB Model
VIDEO: 7 min
RESPONSE: Follow the instruction in the sample script block Exercise C. Write and test code. Post your code here.
6.5 Use Model for Prediction and Explain Prediction Result
VIDEO: 10 min
QUESTION: Follow instructions in the sample script block Exercise D. Write and test code. Post your code and your error analysis comments here.
RESPONSE:
err_cnt = 0
for i in range(0, len(y_test)):
if(y_test[i]==1 and y_pred[i]==4):
print(X_test[i])
err_cnt = err_cnt+1
print("errors:", err_cnt)
Negation seems to be a good next step for this one, too.
6.6 Kaggle Submission
VIDEO: 5 min
RESPONSE: Follow the instruction in the sample script block Exercise E. Report your result here.
6.7 Compare BernoulliNB and MultinomialNB With Cross Validation
VIDEO: 5 min
RESPONSE: Follow the instructions in the sample script block Exercise F. Post your code and result here.