Word Analysis with NLTK’s WordNet Corpus

Language analysis can be accomplished computationally in many different ways. Neural networks and deep learning, anyone? As much fun as it can be to toss your documents into an algorithm and let it spit out important features for you, feature analysis at the word or sentence level can also be a great way to analyze text – especially if you are just getting started with natural language processing (NLP).

We will use Python’s nltk library, and specifically its WordNet corpus, to analyze words.

What’s WordNet?

WordNet is a special kind of English dictionary. When you look up a word in WordNet, it returns not just a single definition but every sense associated with that word.

Each sense is stored as a synset (short for “synonym set”): a group of words that share one meaning. Each synset represents a distinct concept associated with the word.

For instance, the word “bat” can be a noun that refers to a nocturnal creature; another noun sense is the club a Red Sox player uses to hit a home run, and as a verb it means to hit the ball with that club. Now, let’s get into some examples of how we can use the WordNet corpus in Python.

Quick Installation Note

If you don’t have the nltk library installed, open up your terminal and type the command:

pip install nltk

Now that you have nltk installed, we can go ahead and import the WordNet corpus like so:

from nltk.corpus import wordnet

Looking Up Words

Since we are working with a dictionary here, the next logical step is to look up a word. The wind is a-blowing and I’m feeling pretty positive, so let’s look up the word “breezy.”

breezy_syns = wordnet.synsets("breezy")
breezy_syns
>>> [Synset('breezy.s.01'), Synset('blowy.s.01')]

We can see that “breezy” has two senses. Let’s dissect the first sense to get a better understanding of the kind of information each synset contains. We can start by taking the first element of the list and calling its name() method.

breezy_syns[0].name()
>>> 'breezy.s.01'

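The name of “breezy’s” first sense packs in three pieces of information, the same kind you will find in the WordNet dictionary online:

breezy: the lemma that names the synset (by convention, the first word in the synset).
s: the part of speech. WordNet uses n for noun, v for verb, a for adjective, s for satellite adjective (an adjective grouped around a core adjective), and r for adverb.
01: the sense number, which tells this synset apart from other senses of the same lemma (it’s the first here, folks!)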
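
If you ever need those three parts separately, plain string handling is enough. The sketch below is only an illustration: split_synset_name and POS_LABELS are names I made up, not part of nltk (a synset object also has its own pos() method).

```python
# Part-of-speech tags that WordNet uses in synset names.
POS_LABELS = {
    "n": "noun",
    "v": "verb",
    "a": "adjective",
    "s": "adjective satellite",
    "r": "adverb",
}


def split_synset_name(name):
    """Split a name like 'breezy.s.01' into (lemma, pos_label, sense_number)."""
    # Split from the right so a lemma that itself contains a dot stays intact.
    lemma, pos, number = name.rsplit(".", 2)
    return lemma, POS_LABELS.get(pos, pos), int(number)


print(split_synset_name("breezy.s.01"))  # ('breezy', 'adjective satellite', 1)
```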

We can also look up the definition of that particular word sense as well as examples to go along with it.

breezy_syns[0].definition()
>>> 'fresh and animated'

breezy_syns[0].examples()
>>> ['her breezy nature']

Synonyms and Antonyms

Beyond just looking up words, WordNet can also be used to find synonyms or antonyms for a particular word. This definitely would have been useful for me while I was studying literature in college and trying to find the most preposterous adjective to describe my very mundane noun. Since “breezy” doesn’t have many synonyms, let’s look up a word whose synsets offer more variety, such as “small.”

for sim in wordnet.synsets('small'):
    print(sim.name(), sim.lemma_names())

>>> small.n.01 ['small']
>>> small.n.02 ['small']
>>> small.a.01 ['small', 'little']
>>> minor.s.10 ['minor', 'modest', 'small', 'small-scale', 'pocket-size', 'pocket-sized']
>>> little.s.03 ['little', 'small']
>>> small.s.04 ['small']
>>> humble.s.01 ['humble', 'low', 'lowly', 'modest', 'small']
>>> little.s.07 ['little', 'minuscule', 'small']
>>> little.s.05 ['little', 'small']
>>> small.s.08 ['small']
>>> modest.s.02 ['modest', 'small']
>>> belittled.s.01 ['belittled', 'diminished', 'small']
>>> small.r.01 ['small']

In the above code, I’m iterating over every synset associated with “small” and printing its name along with its lemma names, which are the synonyms that share that sense. “Small” certainly has a large vocabulary in this sense!

Of course, we may want to take a look at “small’s” antonyms as well.

for opp in wordnet.synsets('small'):
    # Antonyms are attached to lemmas, so look at each synset's first lemma.
    antonyms = opp.lemmas()[0].antonyms()
    if antonyms:
        print(opp.name(), antonyms[0].name())

>>> small.a.01 large
>>> small.r.01 big

A particular word sense may well have no antonyms, so we want to skip any empty lists (hence the conditional above). Note that antonyms belong to lemmas rather than to synsets, so for each synset I am checking its first lemma and printing that lemma’s first antonym.

Word Similarities

WordNet also takes it a step further by allowing you to perform simple word comparisons to detect the similarity between two word senses. This could be useful in NLP tasks such as word generation, similarity detection, or search/relevance. Let’s try it out!

dog = wordnet.synset('dog.n.01')
cat = wordnet.synset('cat.n.01')

collar = wordnet.synset('collar.n.01')

dog.wup_similarity(cat)
>>> 0.8571428571428571

dog.wup_similarity(collar)
>>> 0.47058823529411764

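We can see that “dog” and “cat” score as quite similar. That is not because they tend to appear beside one another in text: wup_similarity (Wu-Palmer similarity) is computed from WordNet’s hierarchy of broader and narrower concepts. The two senses sit close together under the same mammal category, with dogs being the superior choice, of course!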

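“Dog” and “collar,” while seemingly associated with one another in conversation, aren’t as similar given the actual meanings of the words: a collar is an object rather than an animal, so the two senses only share a much more general ancestor.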
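
To get a feel for where scores like these come from, here is a toy version of the Wu-Palmer idea: the deeper the closest shared ancestor of two concepts, the more similar they are. The taxonomy below is hand-written and heavily simplified, so its numbers will not match WordNet’s, and nltk’s real implementation has to deal with synsets that have several paths to the root. All names here are my own.

```python
# Hand-written toy taxonomy (child -> parent); NOT the real WordNet hierarchy.
PARENT = {
    "animal": "entity",
    "mammal": "animal",
    "carnivore": "mammal",
    "canine": "carnivore",
    "dog": "canine",
    "feline": "carnivore",
    "cat": "feline",
    "artifact": "entity",
    "collar": "artifact",
}


def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))


def wup(a, b):
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_of_a = set(path_to_root(a))
    # Walking up from b, the first shared node is the deepest common ancestor.
    lcs = next(n for n in path_to_root(b) if n in ancestors_of_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))


print(round(wup("dog", "cat"), 3))     # 0.667
print(round(wup("dog", "collar"), 3))  # 0.222
```

In the toy tree, “dog” and “cat” meet at “carnivore,” which is four levels deep, while “dog” and “collar” only meet at the root, which is why the second score is so much lower.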

It’s always interesting to see the similarity scores returned from word comparisons in WordNet. If you try this out with some words, I’d love to see your results or hear your thoughts!

***

Well, that’s it for now! I hope I’ve encouraged you to explore some of what WordNet has to offer. I’m still working to learn about it myself and wanted to share with you what I’ve learned so far.
