HOW TO SUMMARIZE IN PYTHON

Following this tutorial! | 10-13-19

STEP 1: GET THE DATA!!

Step 1a: Import libraries

In [1]:
import bs4 as bs
import urllib.request
import re

Step 1b: Use the libraries to scrape the WHOLE INTERNET!! (jk just this page)

In [2]:
# url = 'https://en.wikipedia.org/wiki/Lizard'
# url = 'https://en.wikipedia.org/wiki/cat'
url = 'https://en.wikipedia.org/wiki/Naive_Bayes_classifier'
# url = 'https://en.wikipedia.org/wiki/Machine_learning' # good at 20 words
# url = 'https://en.wikipedia.org/wiki/Artificial_intelligence' # good at 30 words
# scraped_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Artificial_intelligence')
# scraped_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Harry_Potter_and_the_Philosopher%27s_Stone')
scraped_data = urllib.request.urlopen(url)
article = scraped_data.read()
parsed_article = bs.BeautifulSoup(article,'lxml')
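
Some sites reject requests coming from the default Python user agent. A hedged variant of the fetch above (the User-Agent string and timeout are my assumptions, not from the original tutorial) adds a browser-like header and basic error handling:

import urllib.request
import urllib.error
import bs4 as bs

# Assumed header value; any reasonable browser-like UA string works for most pages
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
try:
    with urllib.request.urlopen(req, timeout=10) as resp:
        article = resp.read()
except urllib.error.URLError as err:
    raise SystemExit(f"Could not fetch {url}: {err}")
parsed_article = bs.BeautifulSoup(article, 'lxml')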

Step 1c: Use find_all from BeautifulSoup to get all of the <p> tags and combine their text

In [3]:
paragraphs = parsed_article.find_all('p')
article_text = ""
for p in paragraphs:
    article_text += p.text
In [4]:
article_text[:1000]
Out[4]:
'In machine learning, naïve Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes\' theorem with strong (naïve) independence assumptions between the features. They are among the simplest Bayesian network models.[1]\nNaïve Bayes has been studied extensively since the 1960s. It was introduced (though not under that name) into the text retrieval community in the early 1960s,[2] and remains a popular (baseline) method for text categorization, the problem of judging documents as belonging to one category or the other (such as spam or legitimate, sports or politics, etc.) with word frequencies as the features. With appropriate pre-processing, it is competitive in this domain with more advanced methods including support vector machines.[3] It also finds application in automatic medical diagnosis.[4]\nNaïve Bayes classifiers are highly scalable, requiring a number of parameters linear in the number of variables (features/predictors) in a learning problem. Max'
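
If you prefer a one-liner, the same extraction can be written as a comprehension; joining on a space (an assumption on my part, not what the original cell does) also keeps words from running together across paragraph boundaries:

article_text = ' '.join(p.get_text() for p in paragraphs)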

STEP 2: CLEAN (& preprocess) THE DATA!!

Step 2a: Use regex and re.sub to remove citation markers (e.g. [1]) and extra whitespace from the ORIGINAL article_text

In [5]:
article_text = re.sub(r'\[[0-9]*\]', '', article_text)
article_text = re.sub(r'\s+', ' ', article_text)
article_text[:1000]
Out[5]:
'In machine learning, naïve Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes\' theorem with strong (naïve) independence assumptions between the features. They are among the simplest Bayesian network models. Naïve Bayes has been studied extensively since the 1960s. It was introduced (though not under that name) into the text retrieval community in the early 1960s, and remains a popular (baseline) method for text categorization, the problem of judging documents as belonging to one category or the other (such as spam or legitimate, sports or politics, etc.) with word frequencies as the features. With appropriate pre-processing, it is competitive in this domain with more advanced methods including support vector machines. It also finds application in automatic medical diagnosis. Naïve Bayes classifiers are highly scalable, requiring a number of parameters linear in the number of variables (features/predictors) in a learning problem. Maximum-likelih'
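
A quick throwaway check (the sample string is made up) of which substitution handles which artifact:

re.sub(r'\[[0-9]*\]', '', 'network models.[1]')   # -> 'network models.' (citation marker gone)
re.sub(r'\s+', ' ', 'models.\nNaïve  Bayes')      # -> 'models. Naïve Bayes' (newlines/extra spaces collapsed)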

Step 2b: Use regex and re.sub to strip out everything except letters (digits, punctuation, etc.) into a new formatted_article_text variable

In [6]:
formatted_article_text = re.sub('[^a-zA-Z]', ' ', article_text)
formatted_article_text = re.sub(r'\s+', ' ', formatted_article_text)
In [7]:
formatted_article_text[:1000]
Out[7]:
'In machine learning na ve Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes theorem with strong na ve independence assumptions between the features They are among the simplest Bayesian network models Na ve Bayes has been studied extensively since the s It was introduced though not under that name into the text retrieval community in the early s and remains a popular baseline method for text categorization the problem of judging documents as belonging to one category or the other such as spam or legitimate sports or politics etc with word frequencies as the features With appropriate pre processing it is competitive in this domain with more advanced methods including support vector machines It also finds application in automatic medical diagnosis Na ve Bayes classifiers are highly scalable requiring a number of parameters linear in the number of variables features predictors in a learning problem Maximum likelihood training can be done by evaluati'
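
Note that stripping everything outside a-zA-Z also splits accented words ("naïve" becomes "na ve") and deletes years like 1960s. Since formatted_article_text is only used for counting word frequencies that may be acceptable, but a hedged alternative (assuming Python 3's Unicode-aware \w) keeps accented letters while still dropping digits and punctuation:

# Replace any run of non-word characters, digits, or underscores with a single space
formatted_article_text = re.sub(r'[\W\d_]+', ' ', article_text)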

STEP 3: TOKENIZE SENTENCES!!

In [8]:
import nltk
sentence_list = nltk.sent_tokenize(article_text)
sentence_list[:5]
Out[8]:
['In machine learning, naïve Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes\' theorem with strong (naïve) independence assumptions between the features.',
 'They are among the simplest Bayesian network models.',
 'Naïve Bayes has been studied extensively since the 1960s.',
 'It was introduced (though not under that name) into the text retrieval community in the early 1960s, and remains a popular (baseline) method for text categorization, the problem of judging documents as belonging to one category or the other (such as spam or legitimate, sports or politics, etc.)',
 'with word frequencies as the features.']
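
On a fresh NLTK install, sent_tokenize (and the stopword list used in Step 4) need their data files downloaded once; if you hit a LookupError, this should fix it:

import nltk
nltk.download('punkt')      # models for sent_tokenize / word_tokenize
nltk.download('stopwords')  # English stopword list used below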

STEP 4: FIND WORD FREQUENCY, WEIGHTED!!

Step 4a: Load the NLTK Stopword List (the actual filtering happens in Step 4b)

In [9]:
stopwords = nltk.corpus.stopwords.words('english')
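
Quick sanity check (output not shown): the list is all lowercase, which matters later because the frequency counts in Step 4b come from mixed-case text.

stopwords[:5], len(stopwords)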

Step 4b: Tokenize Words & DIY Frequency Distribution

In [10]:
word_frequencies = {}
# Count every non-stopword token in the letters-only text
for word in nltk.word_tokenize(formatted_article_text):
    if word not in stopwords:
        if word not in word_frequencies:
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1
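
One subtlety: the text is not lowercased here, so "Bayes" and "bayes" get separate entries, and capitalized stopwords like "The" slip past the lowercase stopword list. A hedged variant (lowercasing is my assumption, and it slightly changes the weights) that also uses dict.get to shorten the counting:

word_frequencies = {}
for word in nltk.word_tokenize(formatted_article_text.lower()):
    if word not in stopwords:
        word_frequencies[word] = word_frequencies.get(word, 0) + 1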

Step 4c: Calculate Weighted Frequency

In [11]:
# Normalize every count by the most frequent word, so weights fall in (0, 1]
max_frequency = max(word_frequencies.values())
for word in word_frequencies.keys():
    word_frequencies[word] = word_frequencies[word] / max_frequency
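
A quick look (output will vary by article) at the heaviest-weighted words is a useful sanity check before scoring sentences:

sorted(word_frequencies.items(), key=lambda kv: kv[1], reverse=True)[:10]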

STEP 5: CALCULATE SENTENCE SCORES

In [12]:
## ILLUSTRATIVE EXAMPLE
## Nothing removed
for sent in sentence_list[:1]:
    for word in nltk.word_tokenize(sent.lower()):
        print(word)
in
machine
learning
,
naïve
bayes
classifiers
are
a
family
of
simple
``
probabilistic
classifiers
''
based
on
applying
bayes
'
theorem
with
strong
(
naïve
)
independence
assumptions
between
the
features
.
In [13]:
## ILLUSTRATIVE EXAMPLE
## Stopwords etc. removed
## We are ONLY assigning values/weights to the words in the sentences that are inside our freq dist!

for sent in sentence_list[:1]:
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies.keys():
            print(word)
machine
learning
classifiers
family
simple
probabilistic
classifiers
based
applying
theorem
strong
independence
assumptions
features
In [14]:
sentence_scores = {}
for sent in sentence_list:
    # Only the first 50 tokens of a sentence contribute to its score
    for word in nltk.word_tokenize(sent.lower())[:50]:
        if word in word_frequencies:
            # Skip very long sentences (30+ words) so the summary stays readable
            if len(sent.split(' ')) < 30:
                if sent not in sentence_scores:
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]
In [15]:
sorted_sentences = sorted(sentence_scores.items(), key=lambda kv: kv[1], reverse=True)
sorted_sentences[:10]
Out[15]:
[('We first segment the data by the class, and then compute the mean and variance of x {\\displaystyle x} in each class.',
  3.283018867924528),
 ('For example, the naive Bayes classifier will make the correct MAP decision rule classification so long as the correct class is more probable than any other class.',
  2.773584905660377),
 ('Like the multinomial model, this model is popular for document classification tasks, where binary term occurrence features are used rather than term frequencies.',
  2.7169811320754724),
 ('For example, suppose the training data contains a continuous attribute, x {\\displaystyle x} .',
  2.7169811320754715),
 ('The discussion so far has derived the independent feature model, that is, the naive Bayes probability model.',
  2.622641509433962),
 ('If a given class and feature value never occur together in the training data, then the frequency-based probability estimate will be zero.',
  2.4150943396226414),
 ('Note that a naive Bayes classifier with a Bernoulli event model is not the same as a multinomial NB classifier with frequency counts truncated to one.',
  2.377358490566038),
 ('The assumptions on distributions of features are called the event model of the Naive Bayes classifier.',
  2.2264150943396226),
 ('This is the event model typically used for document classification, with events representing the occurrence of a word in a single document (see bag of words assumption).',
  2.2264150943396226),
 ('In this manner, the overall classifier can be robust enough to ignore serious deficiencies in its underlying naive probability model.',
  2.2075471698113205)]
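
An equivalent way to grab the top-scoring sentences without building the full sorted list is heapq.nlargest, keyed on the score (top_sentences is just an illustrative name):

import heapq
top_sentences = heapq.nlargest(5, sentence_scores, key=sentence_scores.get)
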
In [16]:
summary = [sent[0] for sent in sorted_sentences[:5]]
''.join(summary)
Out[16]:
'We first segment the data by the class, and then compute the mean and variance of x {\\displaystyle x} in each class.For example, the naive Bayes classifier will make the correct MAP decision rule classification so long as the correct class is more probable than any other class.Like the multinomial model, this model is popular for document classification tasks, where binary term occurrence features are used rather than term frequencies.For example, suppose the training data contains a continuous attribute, x {\\displaystyle x} .The discussion so far has derived the independent feature model, that is, the naive Bayes probability model.'
In [17]:
''.join(summary).strip()
Out[17]:
'We first segment the data by the class, and then compute the mean and variance of x {\\displaystyle x} in each class.For example, the naive Bayes classifier will make the correct MAP decision rule classification so long as the correct class is more probable than any other class.Like the multinomial model, this model is popular for document classification tasks, where binary term occurrence features are used rather than term frequencies.For example, suppose the training data contains a continuous attribute, x {\\displaystyle x} .The discussion so far has derived the independent feature model, that is, the naive Bayes probability model.'
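
Joining on the empty string runs the sentences together ("class.For example"); joining on a single space reads better:

' '.join(summary)
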
In [18]:
summary_2 = [sent[0] for sent in sentence_scores.items() if sent[1] > 3]
''.join(summary_2).strip()
Out[18]:
'We first segment the data by the class, and then compute the mean and variance of x {\\displaystyle x} in each class.'
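
The fixed cutoff of 3 is tuned to this particular article. A hedged, more portable option (the 1.2 multiplier is an arbitrary assumption) is to set the threshold relative to the average sentence score:

import statistics

# Keep any sentence scoring above 1.2x the mean score
threshold = 1.2 * statistics.mean(sentence_scores.values())
summary_3 = ' '.join(s for s, score in sentence_scores.items() if score > threshold)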