HOW TO SUMMARIZE IN PYTHON

Following this tutorial! | 10-13-19

STEP 1: GET THE DATA!!

Step 1a: Import libraries

In [2]:
import bs4 as bs
import urllib.request
import re

Step 1b: Use the libraries to scrape the WHOLE INTERNET!! (jk just this page)

In [97]:
# url = 'https://en.wikipedia.org/wiki/Lizard'
url = 'https://en.wikipedia.org/wiki/cat'
# url = 'https://en.wikipedia.org/wiki/Machine_learning' # good at 20 words
# url = 'https://en.wikipedia.org/wiki/Artificial_intelligence' # good at 30 words
# scraped_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Artificial_intelligence')
# scraped_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Harry_Potter_and_the_Philosopher%27s_Stone')
scraped_data = urllib.request.urlopen(url)
article = scraped_data.read()
parsed_article = bs.BeautifulSoup(article,'lxml')

Step 1c: Use find_all from BeautifulSoup to get all of the p tags

In [98]:
paragraphs = parsed_article.find_all('p')
article_text = ""
for p in paragraphs:
    article_text += p.text
In [99]:
article_text[:1000]
Out[99]:
'\n\nThe cat (Felis catus) is a small carnivorous mammal.[1][2] It is the only domesticated species in the family Felidae and often referred to as the domestic cat to distinguish it from wild members of the family.[4] The cat is either a house cat or a farm cat, which are pets, or a feral cat, which ranges freely and avoids human contact.[5]\nA house cat is valued by humans for companionship and for its ability to hunt rodents. About 60 cat breeds are recognized by various cat registries.[6]\nThe cat is similar in anatomy to the other felid species, has a strong flexible body, quick reflexes, sharp teeth and retractable claws adapted to killing small prey. Its night vision and sense of smell are well developed. Cat communication includes vocalizations like meowing, purring, trilling, hissing, growling and grunting as well as cat-specific body language. It is a solitary hunter, but a social species. It can hear sounds too faint or too high in frequency for human ears, such as those made by m'

STEP 2: CLEAN (& preprocess) THE DATA!!

Step 2a: Use regex and re.sub to remove square brackets and extra spaces from ORIGINAL article_text

In [100]:
article_text = re.sub(r'\[[0-9]*\]', '', article_text)
article_text = re.sub(r'\s+', ' ', article_text)
article_text[:1000]
Out[100]:
' The cat (Felis catus) is a small carnivorous mammal. It is the only domesticated species in the family Felidae and often referred to as the domestic cat to distinguish it from wild members of the family. The cat is either a house cat or a farm cat, which are pets, or a feral cat, which ranges freely and avoids human contact. A house cat is valued by humans for companionship and for its ability to hunt rodents. About 60 cat breeds are recognized by various cat registries. The cat is similar in anatomy to the other felid species, has a strong flexible body, quick reflexes, sharp teeth and retractable claws adapted to killing small prey. Its night vision and sense of smell are well developed. Cat communication includes vocalizations like meowing, purring, trilling, hissing, growling and grunting as well as cat-specific body language. It is a solitary hunter, but a social species. It can hear sounds too faint or too high in frequency for human ears, such as those made by mice and other sm'

Step 2b: Use regex and re.sub to remove extra characters and digits for a new FORMATTED_TEXT variable

In [101]:
formatted_article_text = re.sub('[^a-zA-Z]', ' ', article_text)
formatted_article_text = re.sub(r'\s+', ' ', formatted_article_text)
In [102]:
formatted_article_text[:1000]
Out[102]:
' The cat Felis catus is a small carnivorous mammal It is the only domesticated species in the family Felidae and often referred to as the domestic cat to distinguish it from wild members of the family The cat is either a house cat or a farm cat which are pets or a feral cat which ranges freely and avoids human contact A house cat is valued by humans for companionship and for its ability to hunt rodents About cat breeds are recognized by various cat registries The cat is similar in anatomy to the other felid species has a strong flexible body quick reflexes sharp teeth and retractable claws adapted to killing small prey Its night vision and sense of smell are well developed Cat communication includes vocalizations like meowing purring trilling hissing growling and grunting as well as cat specific body language It is a solitary hunter but a social species It can hear sounds too faint or too high in frequency for human ears such as those made by mice and other small mammals It is a predat'

STEP 3: TOKENIZE SENTENCES!!

In [103]:
import nltk
sentence_list = nltk.sent_tokenize(article_text)
sentence_list[:5]
Out[103]:
[' The cat (Felis catus) is a small carnivorous mammal.',
 'It is the only domesticated species in the family Felidae and often referred to as the domestic cat to distinguish it from wild members of the family.',
 'The cat is either a house cat or a farm cat, which are pets, or a feral cat, which ranges freely and avoids human contact.',
 'A house cat is valued by humans for companionship and for its ability to hunt rodents.',
 'About 60 cat breeds are recognized by various cat registries.']

STEP 4: FIND WORD FREQUENCY, WEIGHTED!!

Step 4a: Remove Stopwords

In [104]:
stopwords = nltk.corpus.stopwords.words('english')

Step 4b: Tokenize Words & DIY Frequency Distribution

In [105]:
word_frequencies = {}
for word in nltk.word_tokenize(formatted_article_text):
    if word not in stopwords:
        if word not in word_frequencies.keys():
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1

Step 4c: Calculate Weighted Frequency

In [106]:
max_frequency = max(word_frequencies.values())
for word in word_frequencies.keys():
    word_frequencies[word] = (word_frequencies[word]/max_frequency)

STEP 5: CALCULATE SENTENCE SCORES

In [107]:
## ILLUSTRATIVE EXAMPLE
## Nothing removed
for sent in sentence_list[:1]:
    for word in nltk.word_tokenize(sent.lower()):
        print(word)
the
cat
(
felis
catus
)
is
a
small
carnivorous
mammal
.
In [108]:
## ILLUSTRATIVE EXAMPLE
## Stopwords etc. removed
## We are ONLY assigning values/weights to the words in the sentences that are inside our freq dist!

for sent in sentence_list[:1]:
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies.keys():
            print(word)
cat
catus
small
carnivorous
mammal
In [113]:
sentence_scores = {}
for sent in sentence_list:
    for word in nltk.word_tokenize(sent.lower())[:50]:
        if word in word_frequencies.keys():
            if len(sent.split(' ')) < 30:
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]
In [114]:
sorted_sentences = sorted(sentence_scores.items(), key=lambda kv: kv[1], reverse=True)
sorted_sentences[:10]
Out[114]:
[('The cat is either a house cat or a farm cat, which are pets, or a feral cat, which ranges freely and avoids human contact.',
  3.326388888888889),
 ('Despite widespread concern about the welfare of free-roaming cats, the lifespans of neutered feral cats in managed colonies compare favorably with those of pet cats.',
  3.2986111111111116),
 ('Domestic cats are bred and shown as registered pedigreed cats, a hobby known as cat fancy.',
  3.118055555555556),
 ('House cats often mate with feral cats, producing hybrids such as the Kellas cat in Scotland.',
  3.055555555555556),
 ('Life in proximity to humans and other domestic animals has led to a symbiotic social adaptation in cats, and cats may express great affection toward humans or other animals.',
  3.0416666666666665),
 ('Spaying or neutering increases life expectancy: one study found neutered male cats live twice as long as intact males, while spayed female cats live 62% longer than intact females.',
  2.770833333333333),
 ("Domestic cats' scent rubbing behavior towards humans or other cats is thought to be a feline means for social bonding.",
  2.7222222222222223),
 ('Feral cats are domestic cats that were born in or have reverted to a wild state.',
  2.4097222222222228),
 ('The social behavior of the domestic cat ranges from widely dispersed individuals to feral cat colonies that gather around a food source, based on groups of co-operating females.',
  2.3819444444444446),
 ('Free-fed feral cats and house cats consume several small meals in a day.',
  2.326388888888889)]
In [115]:
summary = [sent[0] for sent in sorted_sentences[:5]]
''.join(summary)
Out[115]:
'The cat is either a house cat or a farm cat, which are pets, or a feral cat, which ranges freely and avoids human contact.Despite widespread concern about the welfare of free-roaming cats, the lifespans of neutered feral cats in managed colonies compare favorably with those of pet cats.Domestic cats are bred and shown as registered pedigreed cats, a hobby known as cat fancy.House cats often mate with feral cats, producing hybrids such as the Kellas cat in Scotland.Life in proximity to humans and other domestic animals has led to a symbiotic social adaptation in cats, and cats may express great affection toward humans or other animals.'
In [116]:
''.join(summary).strip()
Out[116]:
'The cat is either a house cat or a farm cat, which are pets, or a feral cat, which ranges freely and avoids human contact.Despite widespread concern about the welfare of free-roaming cats, the lifespans of neutered feral cats in managed colonies compare favorably with those of pet cats.Domestic cats are bred and shown as registered pedigreed cats, a hobby known as cat fancy.House cats often mate with feral cats, producing hybrids such as the Kellas cat in Scotland.Life in proximity to humans and other domestic animals has led to a symbiotic social adaptation in cats, and cats may express great affection toward humans or other animals.'
In [117]:
summary_2 = [sent[0] for sent in sentence_scores.items() if sent[1] > 3]
''.join(summary_2).strip()
Out[117]:
'The cat is either a house cat or a farm cat, which are pets, or a feral cat, which ranges freely and avoids human contact.Domestic cats are bred and shown as registered pedigreed cats, a hobby known as cat fancy.House cats often mate with feral cats, producing hybrids such as the Kellas cat in Scotland.Life in proximity to humans and other domestic animals has led to a symbiotic social adaptation in cats, and cats may express great affection toward humans or other animals.Despite widespread concern about the welfare of free-roaming cats, the lifespans of neutered feral cats in managed colonies compare favorably with those of pet cats.'
In [ ]:
 
In [ ]:
 
In [ ]: