{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Tutorial - build MNB with sklearn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tutorial demonstrates how to use the scikit-learn (sklearn) package to build a Multinomial Naive Bayes model, rank features, and use the model for prediction.\n", "\n", "This tutorial uses the data from the Kaggle Sentiment Analysis on Movie Reviews competition. Check out the details of the data and the competition on Kaggle.\n", "https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews\n", "\n", "The tutorial also includes sample code to prepare your prediction results for submission to Kaggle. Although the competition is over, you can still submit your predictions to get an evaluation score." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This revised script changes the original script in two places:\n", "(1) Exercise A: replaced the outdated \"itemfreq\" function with a newer one, so the warnings no longer appear.\n", "(2) Exercise C: replaced the \"coef_\" variable with the \"feature_log_prob_\" variable. Although the sklearn manual says coef_ mirrors feature_log_prob_, we found that in the case of binary classification, coef_ has only one dimension, whereas feature_log_prob_ keeps the original two dimensions of positive and negative conditional probabilities. The code was also cleaned up and simplified.\n", "\n", "# Step 1: Read in data" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
 | PhraseId | SentenceId | Phrase | Sentiment |
---|---|---|---|---|
0 | 1 | 1 | A series of escapades demonstrating the adage ... | 1 |
1 | 2 | 1 | A series of escapades demonstrating the adage ... | 2 |
2 | 3 | 1 | A series | 2 |
3 | 4 | 1 | A | 2 |
4 | 5 | 1 | series | 2 |