HOW TO SUMMARIZE TEXT IN PYTHON

Following this tutorial! | 10-13-19

STEP 1: GET THE DATA!!

Step 1a: Import libraries

In [102]:
import bs4 as bs
import urllib.request
import re

Step 1b: Use the libraries to scrape the WHOLE INTERNET!! (jk just this page)

In [103]:
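# Download the raw HTML of the Wikipedia article
scraped_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Artificial_intelligence')
article = scraped_data.read()
# Parse the HTML with the lxml parser (the lxml package must be installed)
parsed_article = bs.BeautifulSoup(article, 'lxml')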

Step 1c: Use find_all from BeautifulSoup to get all of the <p> (paragraph) tags, then glue their text together

In [104]:
paragraphs = parsed_article.find_all('p')
# Concatenate the text of every paragraph into one string
article_text = "".join(p.text for p in paragraphs)
In [105]:
article_text[:1000]
Out[105]:
'\nIn computer science,  artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans. Leading AI textbooks define the field as the study of "intelligent agents": any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals.[1] Colloquially, the term "artificial intelligence" is often used to describe machines (or computers) that mimic "cognitive" functions that humans associate with the human mind, such as "learning" and "problem solving".[2]\nAs machines become increasingly capable, tasks considered to require "intelligence" are often removed from the definition of AI, a phenomenon known as the AI effect.[3] A quip in Tesler\'s Theorem says "AI is whatever hasn\'t been done yet."[4] For instance, optical character recognition is frequently excluded from things considered to be AI, having become a routine technology.[5] Mo'

STEP 2: CLEAN (& preprocess) THE DATA!!

Step 2a: Use regex and re.sub to remove citation markers in square brackets (like [1]) and extra whitespace from the ORIGINAL article_text

In [106]:
article_text = re.sub(r'\[[0-9]*\]', '', article_text)
article_text = re.sub(r'\s+', ' ', article_text)
article_text[:1000]
Out[106]:
' In computer science, artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans. Leading AI textbooks define the field as the study of "intelligent agents": any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals. Colloquially, the term "artificial intelligence" is often used to describe machines (or computers) that mimic "cognitive" functions that humans associate with the human mind, such as "learning" and "problem solving". As machines become increasingly capable, tasks considered to require "intelligence" are often removed from the definition of AI, a phenomenon known as the AI effect. A quip in Tesler\'s Theorem says "AI is whatever hasn\'t been done yet." For instance, optical character recognition is frequently excluded from things considered to be AI, having become a routine technology. Modern machine cap'
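
To see what those two re.sub calls do on their own, here is a minimal sketch on a made-up string (the sample text is an assumption, not part of the scraped article). The first pattern drops bracketed citation numbers, the second squeezes every run of whitespace (including newlines) down to one space.

```python
import re

# Hypothetical sample text, not taken from the scraped article.
sample = "AI is hard.[12]\nIt   is also fun.[3][4]  "

# Remove bracketed citation markers such as [12].
no_citations = re.sub(r'\[[0-9]*\]', '', sample)

# Collapse every run of whitespace (spaces, newlines, tabs) into one space.
cleaned = re.sub(r'\s+', ' ', no_citations)

print(repr(cleaned))  # 'AI is hard. It is also fun. '
```

Note the trailing space survives, just like the leading space in the output above; .strip() would remove it.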

Step 2b: Use regex and re.sub to replace everything that is not a letter (punctuation, digits) with a space, saved in a new formatted_article_text variable

In [107]:
formatted_article_text = re.sub('[^a-zA-Z]', ' ', article_text)
formatted_article_text = re.sub(r'\s+', ' ', formatted_article_text)
In [108]:
formatted_article_text[:1000]
Out[108]:
' In computer science artificial intelligence AI sometimes called machine intelligence is intelligence demonstrated by machines in contrast to the natural intelligence displayed by humans Leading AI textbooks define the field as the study of intelligent agents any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals Colloquially the term artificial intelligence is often used to describe machines or computers that mimic cognitive functions that humans associate with the human mind such as learning and problem solving As machines become increasingly capable tasks considered to require intelligence are often removed from the definition of AI a phenomenon known as the AI effect A quip in Tesler s Theorem says AI is whatever hasn t been done yet For instance optical character recognition is frequently excluded from things considered to be AI having become a routine technology Modern machine capabilities generally classified as A'
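
A small sketch of the same letters-only cleanup on a made-up sentence (the sample string is an assumption). It shows why the output above contains fragments like "Tesler s" and "hasn t": the apostrophe is not a letter, so it becomes a space and the word splits in two.

```python
import re

# Hypothetical sample sentence for illustration.
sample = 'A quip says "AI is whatever hasn\'t been done yet" in 1970.'

# Replace every character that is not a-z or A-Z with a space.
letters_only = re.sub(r'[^a-zA-Z]', ' ', sample)

# Collapse the resulting runs of spaces.
letters_only = re.sub(r'\s+', ' ', letters_only)

print(repr(letters_only))  # 'A quip says AI is whatever hasn t been done yet in '
```

The stray "t" and "s" tokens are harmless for this tutorial only if they end up filtered or ignored later, so it is worth checking the frequency table for them.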

STEP 3: TOKENIZE SENTENCES!!

In [109]:
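import nltk
# Tokenize the punctuated article_text (not formatted_article_text, which has no periods left).
# sent_tokenize needs NLTK's Punkt tokenizer data; if it is missing, NLTK raises a LookupError
# that names the resource to download.
sentence_list = nltk.sent_tokenize(article_text)
sentence_list[:5]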
Out[109]:
[' In computer science, artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans.',
 'Leading AI textbooks define the field as the study of "intelligent agents": any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals.',
 'Colloquially, the term "artificial intelligence" is often used to describe machines (or computers) that mimic "cognitive" functions that humans associate with the human mind, such as "learning" and "problem solving".',
 'As machines become increasingly capable, tasks considered to require "intelligence" are often removed from the definition of AI, a phenomenon known as the AI effect.',
 'A quip in Tesler\'s Theorem says "AI is whatever hasn\'t been done yet."']
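
For intuition about what a sentence tokenizer has to get right, here is a deliberately naive regex splitter. This is NOT how nltk.sent_tokenize works; the function name naive_sent_split and the sample text are made up for illustration. It splits after every ., ! or ?, which breaks on abbreviations.

```python
import re

def naive_sent_split(text):
    """Split on whitespace that follows '.', '!' or '?' (hypothetical helper)."""
    return re.split(r'(?<=[.!?])\s+', text.strip())

# Hypothetical sample text with an abbreviation in it.
sample = "AI is old. Dr. Smith disagrees. Is it?"
sentences = naive_sent_split(sample)

print(sentences)  # ['AI is old.', 'Dr.', 'Smith disagrees.', 'Is it?']
```

The bogus 'Dr.' sentence is the kind of mistake a trained tokenizer like the one behind nltk.sent_tokenize is designed to avoid, which is a good reason to use it here instead of a hand-rolled regex.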

STEP 4: FIND WORD FREQUENCY, WEIGHTED!!

Step 4a: Load the Stopword list (the actual removing happens in Step 4b)

In [110]:
stopwords = nltk.corpus.stopwords.words('english')

Step 4b: Tokenize Words & DIY Frequency Distribution

In [111]:
word_frequencies = {}
for word in nltk.word_tokenize(formatted_article_text):
    # NLTK's stopwords are lowercase, so capitalized forms like 'The' are still counted
    if word not in stopwords:
        # Start at 0 the first time we see a word, then add 1
        word_frequencies[word] = word_frequencies.get(word, 0) + 1
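
The DIY counting loop can also be written with collections.Counter from the standard library. This is a sketch on made-up data: the tiny stopword set and the token list are assumptions standing in for the NLTK stopwords and nltk.word_tokenize output.

```python
from collections import Counter

# Hypothetical stand-ins for the NLTK stopword list and the tokenized article.
stopwords = {"the", "is", "of"}
tokens = "the study of AI is the study of intelligent agents".split()

# Count every token that is not a stopword.
word_frequencies = Counter(word for word in tokens if word not in stopwords)

print(word_frequencies["study"])  # 2
print(word_frequencies["the"])    # 0, because stopwords were never counted
```

A Counter behaves like a dict, so the later steps that read word_frequencies[word] would work the same way.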

Step 4c: Calculate Weighted Frequency

In [112]:
max_frequency = max(word_frequencies.values())
for word in word_frequencies.keys():
    word_frequencies[word] = (word_frequencies[word]/max_frequency)
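
A tiny worked example of the weighting step, using made-up counts (the three words and their counts are assumptions). Every count is divided by the largest count, so the most frequent word gets weight 1.0 and everything else lands between 0 and 1.

```python
# Hypothetical raw counts for illustration.
word_frequencies = {"intelligence": 4, "machine": 2, "learning": 1}

# The most frequent word sets the scale.
max_frequency = max(word_frequencies.values())

# Build the weighted table in a new dict rather than overwriting the counts.
weighted = {word: count / max_frequency for word, count in word_frequencies.items()}

print(weighted)  # {'intelligence': 1.0, 'machine': 0.5, 'learning': 0.25}
```

Keeping the raw counts in a separate dict is a design choice for this sketch; it makes it possible to re-run the weighting cell without dividing twice.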

STEP 5: CALCULATE SENTENCE SCORES

In [113]:
## ILLUSTRATIVE EXAMPLE
## Nothing removed
for sent in sentence_list[:1]:
    for word in nltk.word_tokenize(sent.lower()):
        print(word)
in
computer
science
,
artificial
intelligence
(
ai
)
,
sometimes
called
machine
intelligence
,
is
intelligence
demonstrated
by
machines
,
in
contrast
to
the
natural
intelligence
displayed
by
humans
.
In [114]:
## ILLUSTRATIVE EXAMPLE
## Stopwords etc. removed
## We are ONLY assigning values/weights to the words in the sentences that are inside our freq dist!

for sent in sentence_list[:1]:
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies.keys():
            print(word)
computer
science
artificial
intelligence
sometimes
called
machine
intelligence
intelligence
demonstrated
machines
contrast
natural
intelligence
displayed
humans
In [115]:
sentence_scores = {}
for sent in sentence_list:
    # Skip long sentences: only those under 30 space-separated words get a score
    if len(sent.split(' ')) >= 30:
        continue
    # Add up the weights of the first 50 tokens that are inside our freq dist
    for word in nltk.word_tokenize(sent.lower())[:50]:
        if word in word_frequencies:
            sentence_scores[sent] = sentence_scores.get(sent, 0) + word_frequencies[word]
In [116]:
sorted_sentences = sorted(sentence_scores.items(), key=lambda kv: kv[1], reverse=True)
sorted_sentences[:10]
Out[116]:
[(' In computer science, artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans.',
  3.021276595744681),
 ('Neural networks can be applied to the problem of intelligent control (for robotics) or learning, using such techniques as Hebbian learning ("fire together, wire together"), GMDH or competitive learning.',
  1.8829787234042552),
 ('Musk also funds companies developing artificial intelligence such as Google DeepMind and Vicarious to "just keep an eye on what\'s going on with artificial intelligence.',
  1.8191489361702127),
 ('Many of the problems in this article may also require general intelligence, if machines are to solve the problems as well as people do.',
  1.8031914893617018),
 ('IBM has created its own artificial intelligence computer, the IBM Watson, which has beaten human intelligence (at some levels).',
  1.7925531914893618),
 ('"robotics" or "machine learning"), the use of particular tools ("logic" or artificial neural networks), or deep philosophical differences.',
  1.7553191489361701),
 ('A superintelligence, hyperintelligence, or superhuman intelligence is a hypothetical agent that would possess intelligence far surpassing that of the brightest and most gifted human mind.',
  1.7021276595744679),
 ('Many tools are used in AI, including versions of search and mathematical optimization, artificial neural networks, and methods based on statistics, probability and economics.',
  1.6223404255319152),
 ('The traditional problems (or goals) of AI research include reasoning, knowledge representation, planning, learning, natural language processing, perception and the ability to move and manipulate objects.',
  1.6117021276595742),
 ('The overall research goal of artificial intelligence is to create technology that allows computers and machines to function in an intelligent manner.',
  1.5744680851063828)]
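
Here is the scoring and ranking idea on toy data, as a sketch. The weights and sentences are made up, and str.split() with punctuation stripping stands in for nltk.word_tokenize. One thing to notice: the keys here are lowercase so they match the lowercased tokens. In the notebook, word_frequencies was built from text that was never lowercased, which would explain why "ai" is missing from the illustrative output earlier.

```python
# Hypothetical weighted frequencies and sentences for illustration.
word_frequencies = {"intelligence": 1.0, "machines": 0.5, "humans": 0.25}
sentence_list = [
    "Machines show intelligence.",
    "Humans show natural intelligence too.",
    "This sentence mentions none of them.",
]

sentence_scores = {}
for sent in sentence_list:
    # Simple stand-in tokenizer: lowercase, split on spaces, strip punctuation.
    words = [w.strip(".,") for w in sent.lower().split()]
    # A sentence's score is the sum of the weights of its known words.
    score = sum(word_frequencies.get(w, 0) for w in words)
    if score > 0:
        sentence_scores[sent] = score

# Highest-scoring sentences first.
ranked = sorted(sentence_scores.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

Because the score is a plain sum, sentences with more matching words tend to win, which is one reason the notebook only scores sentences under 30 words.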
In [117]:
summary = [sent[0] for sent in sorted_sentences[:10]]
''.join(summary)
Out[117]:
' In computer science, artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans.Neural networks can be applied to the problem of intelligent control (for robotics) or learning, using such techniques as Hebbian learning ("fire together, wire together"), GMDH or competitive learning.Musk also funds companies developing artificial intelligence such as Google DeepMind and Vicarious to "just keep an eye on what\'s going on with artificial intelligence.Many of the problems in this article may also require general intelligence, if machines are to solve the problems as well as people do.IBM has created its own artificial intelligence computer, the IBM Watson, which has beaten human intelligence (at some levels)."robotics" or "machine learning"), the use of particular tools ("logic" or artificial neural networks), or deep philosophical differences.A superintelligence, hyperintelligence, or superhuman intelligence is a hypothetical agent that would possess intelligence far surpassing that of the brightest and most gifted human mind.Many tools are used in AI, including versions of search and mathematical optimization, artificial neural networks, and methods based on statistics, probability and economics.The traditional problems (or goals) of AI research include reasoning, knowledge representation, planning, learning, natural language processing, perception and the ability to move and manipulate objects.The overall research goal of artificial intelligence is to create technology that allows computers and machines to function in an intelligent manner.'
In [118]:
''.join(summary).strip()
Out[118]:
'In computer science, artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans.Neural networks can be applied to the problem of intelligent control (for robotics) or learning, using such techniques as Hebbian learning ("fire together, wire together"), GMDH or competitive learning.Musk also funds companies developing artificial intelligence such as Google DeepMind and Vicarious to "just keep an eye on what\'s going on with artificial intelligence.Many of the problems in this article may also require general intelligence, if machines are to solve the problems as well as people do.IBM has created its own artificial intelligence computer, the IBM Watson, which has beaten human intelligence (at some levels)."robotics" or "machine learning"), the use of particular tools ("logic" or artificial neural networks), or deep philosophical differences.A superintelligence, hyperintelligence, or superhuman intelligence is a hypothetical agent that would possess intelligence far surpassing that of the brightest and most gifted human mind.Many tools are used in AI, including versions of search and mathematical optimization, artificial neural networks, and methods based on statistics, probability and economics.The traditional problems (or goals) of AI research include reasoning, knowledge representation, planning, learning, natural language processing, perception and the ability to move and manipulate objects.The overall research goal of artificial intelligence is to create technology that allows computers and machines to function in an intelligent manner.'
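
The joined summary above runs sentences together ("humans.Neural networks") because the separator is an empty string. A minimal sketch of a tidier join, using made-up (sentence, score) pairs shaped like sorted_sentences:

```python
# Hypothetical ranked (sentence, score) pairs, shaped like sorted_sentences.
sorted_sentences = [
    (" First sentence.", 3.0),
    ("Second sentence.", 1.9),
    ("Third sentence.", 1.8),
]

# Keep the top 2 sentences and strip stray whitespace from each one.
summary = [sent.strip() for sent, _score in sorted_sentences[:2]]

# Join with a single space so sentences do not run together.
summary_text = " ".join(summary)

print(summary_text)  # First sentence. Second sentence.
```

With the real data, swapping ''.join(summary) for ' '.join(summary) should be enough to get readable spacing.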