{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Tutorial - kMeans document clustering with sklearn"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(85, 72)\n",
      "(85, 70)\n",
      "\n",
      "(77, 72)\n",
      "(77, 70)\n"
     ]
    }
   ],
   "source": [
    "# read in the csv file that contains function words and their normalized frequencies\n",
    "import pandas as pd\n",
    "from sklearn.cluster import KMeans\n",
    "papers2 = pd.read_csv('fedPapers85.csv')\n",
    "author_labels = papers2['author']\n",
    "filenames = papers2['filename']\n",
    "papers2_vecs = papers2.drop(['author', 'filename'], axis=1)\n",
    "print(papers2.shape)\n",
    "print(papers2_vecs.shape)\n",
    "print()\n",
    "\n",
    "H_subset = papers2[papers2['author']=='Hamilton']\n",
    "M_subset = papers2[papers2['author']=='Madison']\n",
    "D_subset = papers2[papers2['author']=='dispt']\n",
    "frames = [H_subset, M_subset, D_subset]\n",
    "HMD_subset = pd.concat(frames)\n",
    "HMD_subset = HMD_subset.reset_index(drop=True)\n",
    "\n",
    "HMD_subset_labels = HMD_subset['author'].values\n",
    "HMD_subset_filenames = HMD_subset['filename']\n",
    "HMD_vecs = HMD_subset.drop(['author', 'filename'], axis=1)\n",
    "print(HMD_subset.shape)\n",
    "print(HMD_vecs.shape)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "col_0      0   1\n",
      "row_0           \n",
      "Hamilton  26  25\n",
      "Madison    8   7\n",
      "dispt      5   6\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# clustering using hamilton and madison essays only\n",
    "k = 2\n",
    "km = KMeans(n_clusters=k, algorithm='auto', init='random', n_init=20, random_state=0, verbose=False)\n",
    "km.fit(HMD_vecs)\n",
    "\n",
    "cm = pd.crosstab(HMD_subset_labels, km.labels_)\n",
    "print(cm)\n",
    "print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This clustering result was not able to distinguish Hamilton and Madison's writing styles. It also attributes the authorship of disputed essays evenly to Hamilton and Madison.\n",
    "\n",
    "Now let's try using zscore transformation to transform all attribute values and redo kMeans clustering. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "col_0      0   1\n",
      "row_0           \n",
      "Hamilton   6  45\n",
      "Madison   15   0\n",
      "dispt     11   0\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from scipy.stats import zscore\n",
    "\n",
    "HMD_vecs = HMD_vecs.apply(zscore)\n",
    "\n",
    "k = 2\n",
    "km2 = KMeans(n_clusters=k, algorithm='auto', init='random', n_init=5, random_state=0, verbose=False)\n",
    "km2.fit(HMD_vecs)\n",
    "cm2 = pd.crosstab(HMD_subset_labels, km2.labels_)\n",
    "print(cm2)\n",
    "print()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "col_0      0   1\n",
      "author          \n",
      "HM         0   3\n",
      "Hamilton  48   3\n",
      "Jay        0   5\n",
      "Madison    0  15\n",
      "dispt      1  10\n"
     ]
    }
   ],
   "source": [
    "# clustering using all essays and all features\n",
    "# transform feature values with zscore transformation\n",
    "from scipy.stats import zscore\n",
    "papers2_vecs = papers2_vecs.apply(zscore)\n",
    "k=2\n",
    "km3 = KMeans(n_clusters=k, algorithm='auto', init='random', n_init=20, random_state=0, verbose=False)\n",
    "km3.fit(papers2_vecs)\n",
    "cm3 = pd.crosstab(author_labels, km3.labels_)\n",
    "print(cm3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This clustering result makes much better sense by clustering almost all Hamilton's essays into one cluster. It also shows the disputed essays were more similar to Madison's essays. \n",
    "\n",
    "This study shows that sklearn's kMeans algorithm does not do normalization as Weka's kMeans does. So users need to normalize the data themsevles.\n",
    "\n",
    "See this stackoverflow discussion on normalization for kmeans in sklearn:\n",
    "https://stackoverflow.com/questions/20027645/does-kmeans-normalize-features-automatically-in-sklearn"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
