If you’re interested in Data Analytics, you will find learning about Natural Language Processing (NLP) very useful. A good project to start learning about NLP is to write a summarizer – an algorithm that reduces a body of text while keeping its original meaning, or that gives great insight into the original text.
There are many libraries for NLP. For this project, we will be using NLTK – the Natural Language Toolkit.
Let’s start by writing down the steps necessary to build our project.
4 steps to build a Summarizer
- Remove stop words (defined below) for the analysis
- Create frequency table of words – how many times each word appears in the text
- Assign score to each sentence depending on the words it contains and the frequency table
- Build summary by adding every sentence above a certain score threshold
That’s it! And the Python implementation is also short and straightforward.
What are stop words?
Any word that does not add value to the meaning of a sentence. For example, let’s say we have the sentence
A group of people run every day from a bank in Alafaya to the nearest Chipotle
By removing the sentence’s stop words, we can reduce the number of words and preserve the meaning:
Group of people run every day from bank Alafaya to nearest Chipotle
We usually remove stop words from the analyzed text because knowing their frequency doesn’t give any insight into the body of text. In this example, we removed the instances of the words a, in, and the.
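If you want to see exactly which words NLTK treats as stop words, here is a quick sketch (it assumes the stopwords corpus has already been downloaded with nltk.download('stopwords')):
from nltk.corpus import stopwords

englishStops = stopwords.words("english")
print(len(englishStops))    # how many English stop words NLTK ships with
print(englishStops[:10])    # a few of them, e.g. "i", "me", "my", ...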
Now, let’s start!
There are two NLTK libraries that will be necessary for building an efficient summarizer.
from nltk.corpus import stopwords               # requires nltk.download('stopwords') on first use
from nltk.tokenize import word_tokenize, sent_tokenize   # requires nltk.download('punkt') on first use
Note: There are more libraries that can make our summarizer better; one example is discussed at the end of this article.
Corpus
Corpus means a collection of text. It could be data sets of poems by a certain poet, bodies of work by a certain author, etc. In this case, we are going to use a data set of pre-determined stop words.
Tokenizers
Basically, a tokenizer divides a text into a series of tokens. There are three main tokenizers – word, sentence, and regex tokenizers. For this specific project, we will only use the word and sentence tokenizers.
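As a quick illustration (a minimal sketch – the sample string is just made up for this example), the two tokenizers behave like this:
from nltk.tokenize import word_tokenize, sent_tokenize

sample = "NLTK is fun. It makes text processing easy."
print(sent_tokenize(sample))   # splits the string into its two sentences
print(word_tokenize(sample))   # splits the string into individual words and punctuation tokens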
Removing stop words and making a frequency table
First, we create two collections – a set of stop words, and a list of every word in the body of text. Let’s use text as the original body of text.
stopWords = set(stopwords.words("english"))
words = word_tokenize(text)
Second, we create a dictionary for the word frequency table. For this, we should only count the words that are not part of the stopWords set.
freqTable = dict()
for word in words:
    word = word.lower()
    if word in stopWords:
        continue
    if word in freqTable:
        freqTable[word] += 1
    else:
        freqTable[word] = 1
Now, we can use the freqTable dictionary over every sentence to know which sentences have the most relevant insight to the overall purpose of the text.
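If you’re curious about what the table contains at this point, here is a quick inspection sketch (purely optional; it just prints the ten most frequent non-stop words):
for word, count in sorted(freqTable.items(), key=lambda item: item[1], reverse=True)[:10]:
    print(word, count)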
Assigning a score to every sentence
We already have a sentence tokenizer, so we just need to run the sent_tokenize() method to create the array of sentences. Second, we will need a dictionary to keep the score of each sentence; this way, we can later go through the dictionary to generate the summary.
sentences = sent_tokenize(text)
sentenceValue = dict()
Now it’s time to go through every sentence and give it a score depending on the words it has. There are many algorithms to do this – basically, any consistent way to score a sentence by its words will work. I went for a basic algorithm: adding the frequency of every non-stop word in a sentence.
for sentence in sentences:
    for wordValue in freqTable.items():
        if wordValue[0] in sentence.lower():
            if sentence[:12] in sentenceValue:
                sentenceValue[sentence[:12]] += wordValue[1]
            else:
                sentenceValue[sentence[:12]] = wordValue[1]
Note: Index 0 of wordValue will return the word itself. Index 1 the number of instances.
If sentence[:12] caught your eye, nice catch. This is just a simple way to hash each sentence into the dictionary – we use the first 12 characters of the sentence as its key.
Notice that a potential issue with our score algorithm is that long sentences will have an advantage over short sentences. To solve this, divide every sentence score by the number of words in the sentence.
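For reference, here is one way that normalization could look – a sketch, not part of the algorithm used in the rest of this article; it divides each stored score by the sentence’s number of non-stop words:
# Optional: divide each sentence's score by its number of non-stop words
# so that long sentences don't automatically win.
for sentence in sentences:
    key = sentence[:12]
    if key in sentenceValue:
        wordCount = len([w for w in word_tokenize(sentence) if w.lower() not in stopWords])
        if wordCount > 0:
            sentenceValue[key] = sentenceValue[key] / wordCount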
So, what value can we use to compare our scores to?
A simple approach to this question is to find the average score of a sentence. From there, finding a threshold will be easy peasy lemon squeezy.
sumValues = 0
for sentence in sentenceValue:
    sumValues += sentenceValue[sentence]

# Average value of a sentence from the original text
average = int(sumValues / len(sentenceValue))
So, what’s a good threshold? The wrong value could give a summary that is too short or too long.
The average itself can be a good threshold. For my project, I decided to go for a shorter summary, so the threshold I use is one-and-a-half times the average.
Now, let’s apply our threshold and store our sentences in order into our summary.
summary = ''
for sentence in sentences:
    if sentence[:12] in sentenceValue and sentenceValue[sentence[:12]] > (1.5 * average):
        summary += " " + sentence
You made it!! You can now print(summary) and you’ll see how good our summary is.
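If you’d like all of the steps in one place, here is a sketch that wraps the code from this article into a single function (the summarize name and the default 1.5 threshold are just my choices, not an NLTK API):
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

def summarize(text, thresholdFactor=1.5):
    stopWords = set(stopwords.words("english"))

    # Steps 1 & 2: build the frequency table of non-stop words
    freqTable = dict()
    for word in word_tokenize(text):
        word = word.lower()
        if word in stopWords:
            continue
        freqTable[word] = freqTable.get(word, 0) + 1

    # Step 3: score each sentence by the frequencies of the words it contains
    sentences = sent_tokenize(text)
    sentenceValue = dict()
    for sentence in sentences:
        for word, freq in freqTable.items():
            if word in sentence.lower():
                sentenceValue[sentence[:12]] = sentenceValue.get(sentence[:12], 0) + freq

    # Step 4: keep the sentences that score above the threshold
    average = sum(sentenceValue.values()) / len(sentenceValue)
    summary = ''
    for sentence in sentences:
        if sentenceValue.get(sentence[:12], 0) > thresholdFactor * average:
            summary += " " + sentence
    return summary
Calling print(summarize(text)) should give the same kind of summary we built step by step above.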
Optional enhancement: Make smarter word frequency tables
Sometimes, we want two very similar words to add importance to the same entry, e.g., mother, mom, and mommy. For this, we use a Stemmer – an algorithm that reduces words to their root form.
To implement a Stemmer, we can use the NLTK stemmers’ library. You’ll notice there are many stemmers; each one is a different algorithm to find the root word, and one algorithm may be better than another for specific scenarios.
from nltk.stem import PorterStemmer
ps = PorterStemmer()
Then, pass every word through the stemmer before adding it to our freqTable. It is also important to stem every word when going through each sentence before adding up the scores of the words in it.
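Here is roughly how that could look with the code from this article – a sketch where only the ps.stem() calls and the stemmed word lists are new:
# Build the frequency table on stemmed words
freqTable = dict()
for word in words:
    word = ps.stem(word.lower())
    if word in stopWords:
        continue
    freqTable[word] = freqTable.get(word, 0) + 1

# When scoring, stem each sentence's words too, so they match the table keys
sentenceValue = dict()
for sentence in sentences:
    stemmedWords = [ps.stem(w.lower()) for w in word_tokenize(sentence)]
    for word, freq in freqTable.items():
        if word in stemmedWords:
            sentenceValue[sentence[:12]] = sentenceValue.get(sentence[:12], 0) + freq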
And we’re done!
Congratulations! Let me know if you have any other questions or enhancements to this summarizer.
Thanks for reading my first article! Good vibes