If you’re interested in Data Analytics, you will find learning about Natural Language Processing (NLP) very useful. A good project to start learning about NLP is to write a summarizer – an algorithm that reduces a body of text while keeping its original meaning, or that gives great insight into the original text.
There are many libraries for NLP. For this project, we will be using NLTK – the Natural Language Toolkit.
Let’s start by writing down the steps necessary to build our project.
4 steps to build a Summarizer
- Remove stop words (defined below) for the analysis
- Create frequency table of words – how many times each word appears in the text
- Assign score to each sentence depending on the words it contains and the frequency table
- Build summary by adding every sentence above a certain score threshold
That’s it! And the Python implementation is also short and straightforward.
What are stop words?
Any word that does not add value to the meaning of a sentence. For example, let’s say we have the sentence
A group of people run every day from a bank in Alafaya to the nearest Chipotle
By removing the sentence’s stop words, we can reduce the number of words and preserve the meaning:
Group of people run every day from bank Alafaya to nearest Chipotle
We usually remove stop words from the analyzed text because knowing their frequency doesn’t give any insight into the body of text. In this example, we removed the instances of the words a, in, and the.
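If you want to see exactly which words NLTK treats as stop words, here is a quick sketch (it assumes the stopwords corpus has already been downloaded with nltk.download('stopwords')):
from nltk.corpus import stopwords

englishStops = stopwords.words("english")
print(len(englishStops))    # how many English stop words NLTK ships with
print(englishStops[:10])    # a few of them, e.g. "i", "me", "my", ...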
Now, let’s start!
There are two NLTK libraries that will be necessary for building an efficient summarizer.
from nltk.corpus import stopwords               # requires nltk.download('stopwords') on first use
from nltk.tokenize import word_tokenize, sent_tokenize   # requires nltk.download('punkt') on first use
Note: There are more libraries that can make our summarizer better; one example is discussed at the end of this article.
Corpus
Corpus means a collection of text. It could be data sets of poems by a certain poet, bodies of work by a certain author, etc. In this case, we are going to use a data set of pre-determined stop words.
Tokenizers
Basically, a tokenizer divides a text into a series of tokens. There are three main tokenizers – word, sentence, and regex tokenizers. For this specific project, we will only use the word and sentence tokenizers.
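As a quick illustration (a minimal sketch – the sample string is just made up for this example), the two tokenizers behave like this:
from nltk.tokenize import word_tokenize, sent_tokenize

sample = "NLTK is fun. It makes text processing easy."
print(sent_tokenize(sample))   # splits the string into its two sentences
print(word_tokenize(sample))   # splits the string into individual words and punctuation tokens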
Removing stop words and making a frequency table
First, we create two collections – a set of stop words, and a list of every word in the body of text. Let’s use text as the original body of text.
stopWords = set(stopwords.words("english"))
words = word_tokenize(text)
Second, we create a dictionary for the word frequency table. For this, we should only count the words that are not part of the stopWords set.
freqTable = dict()
for word in words:
    word = word.lower()
    if word in stopWords:
        continue
    if word in freqTable:
        freqTable[word] += 1
    else:
        freqTable[word] = 1
Now, we can use the freqTable dictionary over every sentence to know which sentences have the most relevant insight to the overall purpose of the text.
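If you’re curious about what the table contains at this point, here is a quick inspection sketch (purely optional; it just prints the ten most frequent non-stop words):
for word, count in sorted(freqTable.items(), key=lambda item: item[1], reverse=True)[:10]:
    print(word, count)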
Assigning a score to every sentence
We already have a sentence tokenizer, so we just need to run the sent_tokenize() method to create the array of sentences. Second, we will need a dictionary to keep the score of each sentence; this way, we can later go through the dictionary to generate the summary.
sentences = sent_tokenize(text)
sentenceValue = dict()
Now it’s time to go through every sentence and give it a score depending on the words it has. There are many algorithms to do this – basically, any consistent way to score a sentence by its words will work. I went for a basic algorithm: adding the frequency of every non-stop word in a sentence.
for sentence in sentences:
    for wordValue in freqTable.items():
        if wordValue[0] in sentence.lower():
            if sentence[:12] in sentenceValue:
                sentenceValue[sentence[:12]] += wordValue[1]
            else:
                sentenceValue[sentence[:12]] = wordValue[1]
Note: Index 0 of wordValue will return the word itself. Index 1 the number of instances.
If sentence[:12] caught your eye, nice catch. This is just a simple way to hash each sentence into the dictionary – we use the first 12 characters of the sentence as its key.
Notice that a potential issue with our score algorithm is that long sentences will have an advantage over short sentences. To solve this, divide every sentence score by the number of words in the sentence.
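For reference, here is one way that normalization could look – a sketch, not part of the algorithm used in the rest of this article; it divides each stored score by the sentence’s number of non-stop words:
# Optional: divide each sentence's score by its number of non-stop words
# so that long sentences don't automatically win.
for sentence in sentences:
    key = sentence[:12]
    if key in sentenceValue:
        wordCount = len([w for w in word_tokenize(sentence) if w.lower() not in stopWords])
        if wordCount > 0:
            sentenceValue[key] = sentenceValue[key] / wordCount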
So, what value can we use to compare our scores to?
A simple approach to this question is to find the average score of a sentence. From there, finding a threshold will be easy peasy lemon squeezy.
sumValues = 0
for sentence in sentenceValue:
    sumValues += sentenceValue[sentence]

# Average value of a sentence from the original text
average = int(sumValues / len(sentenceValue))
So, what’s a good threshold? The wrong value could give a summary that is too short or too long.
The average itself can be a good threshold. For my project, I decided to go for a shorter summary, so the threshold I use is one-and-a-half times the average.
Now, let’s apply our threshold and store our sentences in order into our summary.
summary = ''
for sentence in sentences:
    if sentence[:12] in sentenceValue and sentenceValue[sentence[:12]] > (1.5 * average):
        summary += " " + sentence
You made it!! You can now print(summary) and you’ll see how good our summary is.
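If you’d like all of the steps in one place, here is a sketch that wraps the code from this article into a single function (the summarize name and the default 1.5 threshold are just my choices, not an NLTK API):
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

def summarize(text, thresholdFactor=1.5):
    stopWords = set(stopwords.words("english"))

    # Steps 1 & 2: build the frequency table of non-stop words
    freqTable = dict()
    for word in word_tokenize(text):
        word = word.lower()
        if word in stopWords:
            continue
        freqTable[word] = freqTable.get(word, 0) + 1

    # Step 3: score each sentence by the frequencies of the words it contains
    sentences = sent_tokenize(text)
    sentenceValue = dict()
    for sentence in sentences:
        for word, freq in freqTable.items():
            if word in sentence.lower():
                sentenceValue[sentence[:12]] = sentenceValue.get(sentence[:12], 0) + freq

    # Step 4: keep the sentences that score above the threshold
    average = sum(sentenceValue.values()) / len(sentenceValue)
    summary = ''
    for sentence in sentences:
        if sentenceValue.get(sentence[:12], 0) > thresholdFactor * average:
            summary += " " + sentence
    return summary
Calling print(summarize(text)) should give the same kind of summary we built step by step above.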
Optional enhancement: Make smarter word frequency tables
Sometimes, we want two very similar words to add importance to the same entry, e.g., mother, mom, and mommy. For this, we use a Stemmer – an algorithm that reduces words to their root form.
To implement a Stemmer, we can use the NLTK stemmers’ library. You’ll notice there are many stemmers; each one is a different algorithm to find the root word, and one algorithm may be better than another for specific scenarios.
from nltk.stem import PorterStemmer
ps = PorterStemmer()
Then, pass every word through the stemmer before adding it to our freqTable. It is also important to stem every word when going through each sentence before adding up the scores of the words in it.
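Here is roughly how that could look with the code from this article – a sketch where only the ps.stem() calls and the stemmed word lists are new:
# Build the frequency table on stemmed words
freqTable = dict()
for word in words:
    word = ps.stem(word.lower())
    if word in stopWords:
        continue
    freqTable[word] = freqTable.get(word, 0) + 1

# When scoring, stem each sentence's words too, so they match the table keys
sentenceValue = dict()
for sentence in sentences:
    stemmedWords = [ps.stem(w.lower()) for w in word_tokenize(sentence)]
    for word, freq in freqTable.items():
        if word in stemmedWords:
            sentenceValue[sentence[:12]] = sentenceValue.get(sentence[:12], 0) + freq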
And we’re done!
Congratulations! Let me know if you have any other questions or enhancements to this summarizer.
Thanks for reading my first article! Good vibes