{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Python web crawling example"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This tutorial is written based on chapters 11-13 from the book \"Python for Everyone\" https://www.py4e.com\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Step 1: import all necessary packages"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "import urllib\n",
    "from bs4 import BeautifulSoup\n",
    "import pprint\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Step 2: download a sample webpage. \n",
    "You can save the html page onto your computer and use text editor to view its content"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "url = \"http://www.metrolyrics.com/you-belong-with-me-lyrics-taylor-swift.html\"\n",
    "html = urllib.request.urlopen(url).read()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Step 3: use BeautifulSoup to parse the webpage and extract the lyrics content. The division that includes the lyrics starts from the html tag \"lyrics-body-text\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Taylor Swift - You Belong With Me Lyrics | MetroLyrics\n",
      "\n",
      "\n",
      "You're on the phone with your girlfriend—she's upset\n",
      "She's going off about something that you said\n",
      "'Cause she doesn't get your humor like I do.I'm in the room, it's a typical Tuesday night.\n",
      "I'm listening to the kind of music she doesn't like.\n",
      "And she'll never know your story like I doBut she wears short skirts\n",
      "I wear t-shirt\n",
      "She's cheer captain\n",
      "And I'm on the bleachersDreaming about the day when you wake up and find\n",
      "That what you're looking for has been here the whole time.\n",
      "\n",
      "\n",
      "\n",
      "Related\n",
      "\n",
      "\n",
      "\n",
      "\n",
      " \n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "11 Delicious Misheard Lyrics About Food\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "NEW SONG: Taylor Swift - 'Lover' - LYRICS\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "Prime Day concert by Amazon Music will be headlined by Taylor Swift\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      " \n",
      " \n",
      "\n",
      "If you could see\n",
      "That I'm the one\n",
      "Who understands you\n",
      "Been here all along\n",
      "So, why can't you see\n",
      "You belong with me\n",
      "You belong with me?Walk in the streets with you in your worn out jeans\n",
      "I can't help thinking this is how it ought to be.\n",
      "Laughing on a park bench thinking to myself\n",
      "\"Hey, isn't this easy?\"And you've got a smile\n",
      "That can light up this whole town\n",
      "I haven't seen it in awhile\n",
      "Since she brought you down.You say you're fine—I know you better than that\n",
      "Hey, what you doing with a girl like that?\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      " \n",
      "\n",
      "\n",
      "\n",
      "\n",
      "She wears high heels\n",
      "I wear sneakers\n",
      "She's cheer captain\n",
      "And I'm on the bleachersDreaming about the day when you wake up and find\n",
      "That what you're looking for has been here the whole timeIf you could see\n",
      "That I'm the one\n",
      "Who understands you\n",
      "Been here all along\n",
      "So, why can't you see\n",
      "You belong with me?Standing by and waiting at your backdoor.\n",
      "All this time how could you not know, baby?\n",
      "You belong with me\n",
      "You belong with meOh, I remember you driving to my house\n",
      "In the middle of the night\n",
      "I'm the one who makes you laugh\n",
      "When you know you're 'bout to cryI know your favorite songs\n",
      "And you tell me about your dreams\n",
      "Think I know where you belong\n",
      "Think I know it's with meCan't you se that I'm the one\n",
      "Who understands you?\n",
      "Been here all along\n",
      "So, why can't you see\n",
      "You belong with me?Standing by and waiting at your backdoor.\n",
      "All this time how could you not know, baby?\n",
      "You belong with me\n",
      "You belong with me\n",
      "You belong with me\n",
      "Have you ever thought just maybe\n",
      "You belong with me?\n",
      "You belong with me\n",
      "\n",
      "\n",
      "\n",
      "\n",
      " \n",
      "\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "soup = BeautifulSoup(html, 'html.parser')\n",
    "print(soup.title.string)\n",
    "\n",
    "text = soup.body.find_all(id='lyrics-body-text')\n",
    "text = text[0].text\n",
    "print(text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Step 4: split text into individual words"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[\"You're\", 'on', 'the', 'phone', 'with', 'your', \"girlfriend—she's\", 'upset', \"She's\", 'going', 'off', 'about', 'something', 'that', 'you', 'said', \"'Cause\", 'she', \"doesn't\", 'get', 'your', 'humor', 'like', 'I', \"do.I'm\", 'in', 'the', 'room,', \"it's\", 'a', 'typical', 'Tuesday', 'night.', \"I'm\", 'listening', 'to', 'the', 'kind', 'of', 'music', 'she', \"doesn't\", 'like.', 'And', \"she'll\", 'never', 'know', 'your', 'story', 'like', 'I', 'doBut', 'she', 'wears', 'short', 'skirts', 'I', 'wear', 't-shirt', \"She's\", 'cheer', 'captain', 'And', \"I'm\", 'on', 'the', 'bleachersDreaming', 'about', 'the', 'day', 'when', 'you', 'wake', 'up', 'and', 'find', 'That', 'what', \"you're\", 'looking', 'for', 'has', 'been', 'here', 'the', 'whole', 'time.', 'Related', '11', 'Delicious', 'Misheard', 'Lyrics', 'About', 'Food', 'NEW', 'SONG:', 'Taylor', 'Swift', '-', \"'Lover'\", '-', 'LYRICS', 'Prime', 'Day', 'concert', 'by', 'Amazon', 'Music', 'will', 'be', 'headlined', 'by', 'Taylor', 'Swift', 'If', 'you', 'could', 'see', 'That', \"I'm\", 'the', 'one', 'Who', 'understands', 'you', 'Been', 'here', 'all', 'along', 'So,', 'why', \"can't\", 'you', 'see', 'You', 'belong', 'with', 'me', 'You', 'belong', 'with', 'me?Walk', 'in', 'the', 'streets', 'with', 'you', 'in', 'your', 'worn', 'out', 'jeans', 'I', \"can't\", 'help', 'thinking', 'this', 'is', 'how', 'it', 'ought', 'to', 'be.', 'Laughing', 'on', 'a', 'park', 'bench', 'thinking', 'to', 'myself', '\"Hey,', \"isn't\", 'this', 'easy?\"And', \"you've\", 'got', 'a', 'smile', 'That', 'can', 'light', 'up', 'this', 'whole', 'town', 'I', \"haven't\", 'seen', 'it', 'in', 'awhile', 'Since', 'she', 'brought', 'you', 'down.You', 'say', \"you're\", 'fine—I', 'know', 'you', 'better', 'than', 'that', 'Hey,', 'what', 'you', 'doing', 'with', 'a', 'girl', 'like', 'that?', 'She', 'wears', 'high', 'heels', 'I', 'wear', 'sneakers', \"She's\", 'cheer', 'captain', 'And', \"I'm\", 'on', 'the', 'bleachersDreaming', 'about', 'the', 'day', 
'when', 'you', 'wake', 'up', 'and', 'find', 'That', 'what', \"you're\", 'looking', 'for', 'has', 'been', 'here', 'the', 'whole', 'timeIf', 'you', 'could', 'see', 'That', \"I'm\", 'the', 'one', 'Who', 'understands', 'you', 'Been', 'here', 'all', 'along', 'So,', 'why', \"can't\", 'you', 'see', 'You', 'belong', 'with', 'me?Standing', 'by', 'and', 'waiting', 'at', 'your', 'backdoor.', 'All', 'this', 'time', 'how', 'could', 'you', 'not', 'know,', 'baby?', 'You', 'belong', 'with', 'me', 'You', 'belong', 'with', 'meOh,', 'I', 'remember', 'you', 'driving', 'to', 'my', 'house', 'In', 'the', 'middle', 'of', 'the', 'night', \"I'm\", 'the', 'one', 'who', 'makes', 'you', 'laugh', 'When', 'you', 'know', \"you're\", \"'bout\", 'to', 'cryI', 'know', 'your', 'favorite', 'songs', 'And', 'you', 'tell', 'me', 'about', 'your', 'dreams', 'Think', 'I', 'know', 'where', 'you', 'belong', 'Think', 'I', 'know', \"it's\", 'with', \"meCan't\", 'you', 'se', 'that', \"I'm\", 'the', 'one', 'Who', 'understands', 'you?', 'Been', 'here', 'all', 'along', 'So,', 'why', \"can't\", 'you', 'see', 'You', 'belong', 'with', 'me?Standing', 'by', 'and', 'waiting', 'at', 'your', 'backdoor.', 'All', 'this', 'time', 'how', 'could', 'you', 'not', 'know,', 'baby?', 'You', 'belong', 'with', 'me', 'You', 'belong', 'with', 'me', 'You', 'belong', 'with', 'me', 'Have', 'you', 'ever', 'thought', 'just', 'maybe', 'You', 'belong', 'with', 'me?', 'You', 'belong', 'with', 'me']\n"
     ]
    }
   ],
   "source": [
    "words = text.split()\n",
    "print(words)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Remove stopwords"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[\"You're\", 'on', 'phone', 'with', 'your', \"girlfriend—she's\", 'upset', \"She's\", 'going', 'off', 'about', 'something', 'that', 'you', 'said', \"'Cause\", 'she', \"doesn't\", 'get', 'your', 'humor', 'like', 'I', \"do.I'm\", 'in', 'room,', \"it's\", 'typical', 'Tuesday', 'night.', \"I'm\", 'listening', 'to', 'kind', 'of', 'music', 'she', \"doesn't\", 'like.', 'And', \"she'll\", 'never', 'know', 'your', 'story', 'like', 'I', 'doBut', 'she', 'wears', 'short', 'skirts', 'I', 'wear', 't-shirt', \"She's\", 'cheer', 'captain', 'And', \"I'm\", 'on', 'bleachersDreaming', 'about', 'day', 'when', 'you', 'wake', 'up', 'and', 'find', 'That', 'what', \"you're\", 'looking', 'for', 'has', 'been', 'here', 'whole', 'time.', 'Related', '11', 'Delicious', 'Misheard', 'Lyrics', 'About', 'Food', 'NEW', 'SONG:', 'Taylor', 'Swift', '-', \"'Lover'\", '-', 'LYRICS', 'Prime', 'Day', 'concert', 'by', 'Amazon', 'Music', 'will', 'be', 'headlined', 'by', 'Taylor', 'Swift', 'If', 'you', 'could', 'see', 'That', \"I'm\", 'one', 'Who', 'understands', 'you', 'Been', 'here', 'all', 'along', 'So,', 'why', \"can't\", 'you', 'see', 'You', 'belong', 'with', 'me', 'You', 'belong', 'with', 'me?Walk', 'in', 'streets', 'with', 'you', 'in', 'your', 'worn', 'out', 'jeans', 'I', \"can't\", 'help', 'thinking', 'this', 'how', 'it', 'ought', 'to', 'be.', 'Laughing', 'on', 'park', 'bench', 'thinking', 'to', 'myself', '\"Hey,', \"isn't\", 'this', 'easy?\"And', \"you've\", 'got', 'smile', 'That', 'can', 'light', 'up', 'this', 'whole', 'town', 'I', \"haven't\", 'seen', 'it', 'in', 'awhile', 'Since', 'she', 'brought', 'you', 'down.You', 'say', \"you're\", 'fine—I', 'know', 'you', 'better', 'than', 'that', 'Hey,', 'what', 'you', 'doing', 'with', 'girl', 'like', 'that?', 'She', 'wears', 'high', 'heels', 'I', 'wear', 'sneakers', \"She's\", 'cheer', 'captain', 'And', \"I'm\", 'on', 'bleachersDreaming', 'about', 'day', 'when', 'you', 'wake', 'up', 'and', 'find', 'That', 'what', \"you're\", 'looking', 'for', 'has', 
'been', 'here', 'whole', 'timeIf', 'you', 'could', 'see', 'That', \"I'm\", 'one', 'Who', 'understands', 'you', 'Been', 'here', 'all', 'along', 'So,', 'why', \"can't\", 'you', 'see', 'You', 'belong', 'with', 'me?Standing', 'by', 'and', 'waiting', 'at', 'your', 'backdoor.', 'All', 'this', 'time', 'how', 'could', 'you', 'not', 'know,', 'baby?', 'You', 'belong', 'with', 'me', 'You', 'belong', 'with', 'meOh,', 'I', 'remember', 'you', 'driving', 'to', 'my', 'house', 'In', 'middle', 'of', 'night', \"I'm\", 'one', 'who', 'makes', 'you', 'laugh', 'When', 'you', 'know', \"you're\", \"'bout\", 'to', 'cryI', 'know', 'your', 'favorite', 'songs', 'And', 'you', 'tell', 'me', 'about', 'your', 'dreams', 'Think', 'I', 'know', 'where', 'you', 'belong', 'Think', 'I', 'know', \"it's\", 'with', \"meCan't\", 'you', 'se', 'that', \"I'm\", 'one', 'Who', 'understands', 'you?', 'Been', 'here', 'all', 'along', 'So,', 'why', \"can't\", 'you', 'see', 'You', 'belong', 'with', 'me?Standing', 'by', 'and', 'waiting', 'at', 'your', 'backdoor.', 'All', 'this', 'time', 'how', 'could', 'you', 'not', 'know,', 'baby?', 'You', 'belong', 'with', 'me', 'You', 'belong', 'with', 'me', 'You', 'belong', 'with', 'me', 'Have', 'you', 'ever', 'thought', 'just', 'maybe', 'You', 'belong', 'with', 'me?', 'You', 'belong', 'with', 'me']\n"
     ]
    }
   ],
   "source": [
    "stopwords = ['is', 'are', 'the', 'a', 'an']\n",
    "def removeStopwords(wordlist, stopwords):\n",
    "    return [w for w in wordlist if w not in stopwords]\n",
    "words = removeStopwords(words, stopwords)\n",
    "print(words)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "count word frequency"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'\"Hey,': 1,\n",
      " \"'Cause\": 1,\n",
      " \"'Lover'\": 1,\n",
      " \"'bout\": 1,\n",
      " '-': 2,\n",
      " '11': 1,\n",
      " 'About': 1,\n",
      " 'All': 2,\n",
      " 'Amazon': 1,\n",
      " 'And': 4,\n",
      " 'Been': 3,\n",
      " 'Day': 1,\n",
      " 'Delicious': 1,\n",
      " 'Food': 1,\n",
      " 'Have': 1,\n",
      " 'Hey,': 1,\n",
      " 'I': 9,\n",
      " \"I'm\": 7,\n",
      " 'If': 1,\n",
      " 'In': 1,\n",
      " 'LYRICS': 1,\n",
      " 'Laughing': 1,\n",
      " 'Lyrics': 1,\n",
      " 'Misheard': 1,\n",
      " 'Music': 1,\n",
      " 'NEW': 1,\n",
      " 'Prime': 1,\n",
      " 'Related': 1,\n",
      " 'SONG:': 1,\n",
      " 'She': 1,\n",
      " \"She's\": 3,\n",
      " 'Since': 1,\n",
      " 'So,': 3,\n",
      " 'Swift': 2,\n",
      " 'Taylor': 2,\n",
      " 'That': 5,\n",
      " 'Think': 2,\n",
      " 'Tuesday': 1,\n",
      " 'When': 1,\n",
      " 'Who': 3,\n",
      " 'You': 11,\n",
      " \"You're\": 1,\n",
      " 'about': 4,\n",
      " 'all': 3,\n",
      " 'along': 3,\n",
      " 'and': 4,\n",
      " 'at': 2,\n",
      " 'awhile': 1,\n",
      " 'baby?': 2,\n",
      " 'backdoor.': 2,\n",
      " 'be': 1,\n",
      " 'be.': 1,\n",
      " 'been': 2,\n",
      " 'belong': 12,\n",
      " 'bench': 1,\n",
      " 'better': 1,\n",
      " 'bleachersDreaming': 2,\n",
      " 'brought': 1,\n",
      " 'by': 4,\n",
      " 'can': 1,\n",
      " \"can't\": 4,\n",
      " 'captain': 2,\n",
      " 'cheer': 2,\n",
      " 'concert': 1,\n",
      " 'could': 4,\n",
      " 'cryI': 1,\n",
      " 'day': 2,\n",
      " \"do.I'm\": 1,\n",
      " 'doBut': 1,\n",
      " \"doesn't\": 2,\n",
      " 'doing': 1,\n",
      " 'down.You': 1,\n",
      " 'dreams': 1,\n",
      " 'driving': 1,\n",
      " 'easy?\"And': 1,\n",
      " 'ever': 1,\n",
      " 'favorite': 1,\n",
      " 'find': 2,\n",
      " 'fine—I': 1,\n",
      " 'for': 2,\n",
      " 'get': 1,\n",
      " 'girl': 1,\n",
      " \"girlfriend—she's\": 1,\n",
      " 'going': 1,\n",
      " 'got': 1,\n",
      " 'has': 2,\n",
      " \"haven't\": 1,\n",
      " 'headlined': 1,\n",
      " 'heels': 1,\n",
      " 'help': 1,\n",
      " 'here': 5,\n",
      " 'high': 1,\n",
      " 'house': 1,\n",
      " 'how': 3,\n",
      " 'humor': 1,\n",
      " 'in': 4,\n",
      " \"isn't\": 1,\n",
      " 'it': 2,\n",
      " \"it's\": 2,\n",
      " 'jeans': 1,\n",
      " 'just': 1,\n",
      " 'kind': 1,\n",
      " 'know': 6,\n",
      " 'know,': 2,\n",
      " 'laugh': 1,\n",
      " 'light': 1,\n",
      " 'like': 3,\n",
      " 'like.': 1,\n",
      " 'listening': 1,\n",
      " 'looking': 2,\n",
      " 'makes': 1,\n",
      " 'maybe': 1,\n",
      " 'me': 7,\n",
      " 'me?': 1,\n",
      " 'me?Standing': 2,\n",
      " 'me?Walk': 1,\n",
      " \"meCan't\": 1,\n",
      " 'meOh,': 1,\n",
      " 'middle': 1,\n",
      " 'music': 1,\n",
      " 'my': 1,\n",
      " 'myself': 1,\n",
      " 'never': 1,\n",
      " 'night': 1,\n",
      " 'night.': 1,\n",
      " 'not': 2,\n",
      " 'of': 2,\n",
      " 'off': 1,\n",
      " 'on': 4,\n",
      " 'one': 4,\n",
      " 'ought': 1,\n",
      " 'out': 1,\n",
      " 'park': 1,\n",
      " 'phone': 1,\n",
      " 'remember': 1,\n",
      " 'room,': 1,\n",
      " 'said': 1,\n",
      " 'say': 1,\n",
      " 'se': 1,\n",
      " 'see': 5,\n",
      " 'seen': 1,\n",
      " 'she': 4,\n",
      " \"she'll\": 1,\n",
      " 'short': 1,\n",
      " 'skirts': 1,\n",
      " 'smile': 1,\n",
      " 'sneakers': 1,\n",
      " 'something': 1,\n",
      " 'songs': 1,\n",
      " 'story': 1,\n",
      " 'streets': 1,\n",
      " 't-shirt': 1,\n",
      " 'tell': 1,\n",
      " 'than': 1,\n",
      " 'that': 3,\n",
      " 'that?': 1,\n",
      " 'thinking': 2,\n",
      " 'this': 5,\n",
      " 'thought': 1,\n",
      " 'time': 2,\n",
      " 'time.': 1,\n",
      " 'timeIf': 1,\n",
      " 'to': 5,\n",
      " 'town': 1,\n",
      " 'typical': 1,\n",
      " 'understands': 3,\n",
      " 'up': 3,\n",
      " 'upset': 1,\n",
      " 'waiting': 2,\n",
      " 'wake': 2,\n",
      " 'wear': 2,\n",
      " 'wears': 2,\n",
      " 'what': 3,\n",
      " 'when': 2,\n",
      " 'where': 1,\n",
      " 'who': 1,\n",
      " 'whole': 3,\n",
      " 'why': 3,\n",
      " 'will': 1,\n",
      " 'with': 15,\n",
      " 'worn': 1,\n",
      " 'you': 23,\n",
      " \"you're\": 4,\n",
      " \"you've\": 1,\n",
      " 'you?': 1,\n",
      " 'your': 8}\n"
     ]
    }
   ],
   "source": [
    "counts = dict()\n",
    "for word in words:\n",
    "    counts[word] = counts.get(word,0) + 1\n",
    "sorted(counts, key=counts.__getitem__, reverse=True)\n",
    "pprint.pprint(counts)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "sort words by frequency"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The above method uses loop, which needs quite a lot of programming, and is also slow. \n",
    "The following method uses the dataframe data structure in the pandas package to quickly count and sort words by frequencies. \n",
    "Pandas documentation includes more details on its powerful data structure\n",
    "https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "you          23\n",
      "with         15\n",
      "belong       12\n",
      "You          11\n",
      "I             9\n",
      "             ..\n",
      "listening     1\n",
      "short         1\n",
      "do.I'm        1\n",
      "time.         1\n",
      "just          1\n",
      "Name: word, Length: 186, dtype: int64\n"
     ]
    }
   ],
   "source": [
    "df=pd.DataFrame(words, columns=['word'])\n",
    "x=df[\"word\"].value_counts()\n",
    "pprint.pprint(x)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "You PRP\n"
     ]
    }
   ],
   "source": [
    "import nltk\n",
    "tokens = nltk.word_tokenize(text)\n",
    "tags = nltk.pos_tag(tokens)\n",
    "print(tags[0][0], tags[0][1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
