# lines = open('../summary_test_positive.txt').readlines()
lines = open('HP1.txt').readlines()
lines[:10]
all_text = ""
for line in lines:
    all_text += line
type(all_text)
all_text = all_text.replace("\n", " ")  # collapse line breaks into spaces
all_text = all_text.replace("'", "")    # strip apostrophes before tokenizing
import re
# article_text = re.sub(r'\[[0-9]*\]', '', article_text)
all_text = re.sub(r'[0-9]', '', all_text)  # drop all digits
chapters = all_text.split('CHAPTER ')      # split the book on its chapter headings
chapters[1][:100]
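## ILLUSTRATIVE EXAMPLE
## Sanity-check the split: chapters[0] is whatever text precedes the first 'CHAPTER ' heading
print(len(chapters))
for chapter in chapters[1:4]:
    print(chapter[:60])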
## re.sub to remove square brackets and extra spaces from the ORIGINAL article_text
import re
article_text = all_text  # work with the cleaned full text from here on
# article_text = re.sub(r'\[[0-9]*\]', '', article_text)
formatted_article_text = re.sub(r'\n+', ' ', article_text)
formatted_article_text[:1000]
## re.sub to remove extra characters and digits for a new FORMATTED_TEXT variable
# formatted_article_text = re.sub('[^a-zA-Z]', ' ', article_text)
# formatted_article_text = re.sub(r'\s+', ' ', formatted_article_text)
# formatted_article_text[:1000]
import nltk
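# sent_tokenize/word_tokenize and the stopword list below need two NLTK resources;
# run these once if they are not installed yet:
# nltk.download('punkt')
# nltk.download('stopwords')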
sentence_list = nltk.sent_tokenize(article_text)
sentence_list[:5]
stopwords = nltk.corpus.stopwords.words('english')
word_frequencies = {}
# lowercase here so the counts line up with the lowercased sentences scored below;
# skip punctuation tokens, since the non-letter cleanup above is left commented out
for word in nltk.word_tokenize(formatted_article_text.lower()):
    if word.isalpha() and word not in stopwords:
        if word not in word_frequencies:
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1
max_frequency = max(word_frequencies.values())
# scale every count by the highest count so weights fall in (0, 1]
for word in word_frequencies:
    word_frequencies[word] = word_frequencies[word] / max_frequency
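## ILLUSTRATIVE EXAMPLE
## A compact equivalent of the loop above, sketched with collections.Counter
## (left commented out; the loop above is what actually runs)
# from collections import Counter
# counts = Counter(w for w in nltk.word_tokenize(formatted_article_text.lower())
#                  if w.isalpha() and w not in stopwords)
# max_count = counts.most_common(1)[0][1]
# word_frequencies = {w: c / max_count for w, c in counts.items()}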
## ILLUSTRATIVE EXAMPLE
## Nothing removed
for sent in sentence_list[:1]:
    for word in nltk.word_tokenize(sent.lower()):
        print(word)
## ILLUSTRATIVE EXAMPLE
## Stopwords etc. removed
## Only the words in a sentence that appear in our frequency distribution contribute weight
for sent in sentence_list[:1]:
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies:
            print(word)
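## ILLUSTRATIVE EXAMPLE
## The same filter, but also printing the normalised weight each surviving word contributes
for sent in sentence_list[:1]:
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies:
            print(word, round(word_frequencies[word], 3))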
sentence_scores = {}
for sent in sentence_list:
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies:
            if len(sent.split(' ')) < 30:   # only score reasonably short sentences
                if sent not in sentence_scores:
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]
sorted_sentences = sorted(sentence_scores.items(), key=lambda kv: kv[1], reverse=True)
sorted_sentences[:5]
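## ILLUSTRATIVE EXAMPLE
## heapq.nlargest picks the same top sentences without sorting the whole dict
## (a sketch equivalent to the sorted() call above, left commented out)
# import heapq
# top_sentences = heapq.nlargest(10, sentence_scores, key=sentence_scores.get)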
summary = [sent[0] for sent in sorted_sentences[:10]]
' '.join(summary)
lolsummary = ' '.join(summary).strip()
# lolsummary = re.sub(r'\\', ' ', lolsummary)
# lolsummary
lolsummary = lolsummary.replace("'", "")
lolsummary
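## ILLUSTRATIVE EXAMPLE
## The summary above comes out in score order; re-emitting the same sentences in the
## order they occur in the book reads more naturally (optional sketch, commented out)
# ordered = sorted(summary, key=sentence_list.index)
# print(' '.join(ordered))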
# type(lolsummary)