Advanced NLP Tasks: Taking Text Processing to the Next Level

Basic text preprocessing cleans and structures raw text, but advanced NLP tasks help models understand meaning, context, and structure better. These techniques improve accuracy in chatbots, search engines, sentiment analysis, and text summarization.

Let’s explore these key advanced NLP preprocessing tasks with examples and code!


1️⃣ Handling Dates & Times – Standardizing Temporal Data

Problem:

Dates and times are inconsistent in text data:

  • "Jan 1st, 2024"
  • "1/1/24"
  • "2024-01-01"

NLP models need a uniform format to process dates correctly.

Solution: Use dateparser to standardize dates into ISO 8601 (YYYY-MM-DD).

from dateparser import parse

date_text = "Jan 1st, 2024"
normalized_date = parse(date_text).strftime("%Y-%m-%d")

print(normalized_date)

Output:

"2024-01-01"

Why is this useful?

  • Helps event-based NLP applications like scheduling bots, timeline analysis, and news tracking.
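As a sketch of the same normalization without a third-party parser, the standard library can try a list of known formats. The format list below covers only the examples above and would need extending for real data:

```python
import re
from datetime import datetime

# Formats covering the example date strings; extend as needed.
KNOWN_FORMATS = ["%b %d, %Y", "%m/%d/%y", "%Y-%m-%d"]

def normalize_date(text):
    """Return an ISO 8601 date string, or None if no format matches."""
    # Strip ordinal suffixes ("1st" -> "1"), which strptime cannot parse.
    cleaned = re.sub(r"(\d)(st|nd|rd|th)\b", r"\1", text)
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None

for s in ["Jan 1st, 2024", "1/1/24", "2024-01-01"]:
    print(normalize_date(s))  # each prints 2024-01-01
```

Unlike dateparser, this fallback only recognizes the formats you list, so it is best suited to data whose date styles are known in advance.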

2️⃣ Text Augmentation – Generating Synthetic Data

Problem:

NLP models require a lot of labeled data, but collecting it is expensive.

Solution: Generate synthetic data using back-translation, synonym replacement, or paraphrasing.

Example (back-translation via the deep_translator library's GoogleTranslator)

from deep_translator import GoogleTranslator

text = "The weather is amazing today!"
translated_text = GoogleTranslator(source="auto", target="fr").translate(text)
augmented_text = GoogleTranslator(source="fr", target="en").translate(translated_text)

print(augmented_text)

Output (paraphrased; exact wording will vary by translation service):

"Today's weather is wonderful!"

Why is this useful?

  • Helps train models on low-resource languages.
  • Improves sentiment analysis and chatbot response diversity.
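Back-translation is one option; synonym replacement is another. Here is a minimal sketch using a hand-rolled synonym map (in practice the map would come from WordNet or an embedding model):

```python
import random

# Hand-rolled synonym map for illustration; in practice drawn from WordNet
# or an embedding model.
SYNONYMS = {
    "amazing": ["wonderful", "fantastic"],
    "weather": ["climate"],
}

def synonym_augment(sentence, seed=0):
    """Return a variant of `sentence` with mapped words swapped for synonyms."""
    rng = random.Random(seed)
    out = []
    for token in sentence.split():
        key = token.lower().strip("!.,?")
        if key in SYNONYMS:
            # Keep any trailing punctuation from the original token.
            suffix = token[len(key):]
            out.append(rng.choice(SYNONYMS[key]) + suffix)
        else:
            out.append(token)
    return " ".join(out)

print(synonym_augment("The weather is amazing today!"))
```

Varying the seed (or sampling several times) yields multiple paraphrases of each training sentence without any external API calls.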

3️⃣ Handling Negations – Understanding Not Bad ≠ Bad

Problem:

Negations change sentence meaning:

  • "This movie is not bad""This movie is bad"

Solution: Detect negations and adjust sentiment scores.

from textblob import TextBlob

text1 = "This movie is bad."
text2 = "This movie is not bad."

print(TextBlob(text1).sentiment.polarity)  # ≈ -0.7
print(TextBlob(text2).sentiment.polarity)  # ≈ 0.35 (TextBlob flips and damps negated scores)

Why is this useful?

  • Essential for sentiment analysis and opinion mining.
  • Prevents incorrect model predictions.
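TextBlob handles negation internally, but the underlying rule can be sketched by hand with a tiny lexicon. The words and scores below are illustrative, not TextBlob's actual values:

```python
# A tiny sentiment lexicon; real systems use much larger ones.
LEXICON = {"bad": -0.7, "good": 0.7, "amazing": 0.9}
NEGATORS = {"not", "never", "no"}

def simple_sentiment(text):
    """Average lexicon scores, flipping and damping a score when the
    previous token is a negator (similar in spirit to TextBlob's rule)."""
    tokens = text.lower().replace(".", "").split()
    scores = []
    for i, tok in enumerate(tokens):
        if tok in LEXICON:
            score = LEXICON[tok]
            if i > 0 and tokens[i - 1] in NEGATORS:
                score *= -0.5  # "not bad" becomes mildly positive
            scores.append(score)
    return sum(scores) / len(scores) if scores else 0.0

print(simple_sentiment("This movie is bad."))      # -0.7
print(simple_sentiment("This movie is not bad."))  # 0.35
```

The -0.5 multiplier captures the intuition that "not bad" is mildly positive rather than strongly positive; a plain sign flip would overshoot.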

4️⃣ Dependency Parsing – Understanding Sentence Structure

Problem:

Sentence structure matters:

  • "I love NLP"“love” is the verb, “NLP” is the object

Solution: Use spaCy to analyze grammatical relationships.

import spacy

nlp = spacy.load("en_core_web_sm")
text = "I love NLP."
doc = nlp(text)

for token in doc:
    print(token.text, "→", token.dep_, "→", token.head.text)

Output:

I → nsubj → love
love → ROOT → love
NLP → dobj → love

Why is this useful?

  • Helps chatbots understand user intent.
  • Improves machine translation and grammar checking.

5️⃣ Text Chunking – Grouping Words into Meaningful Phrases

Problem:

A sentence contains phrases that should be treated as a unit:

  • "New York" should be a proper noun phrase instead of two separate words.

Solution: Use NLTK for chunking noun phrases.

import nltk
nltk.download("punkt")                       # tokenizer models for word_tokenize
nltk.download("averaged_perceptron_tagger")  # POS tagger models

from nltk import pos_tag, word_tokenize
from nltk.chunk import RegexpParser

text = "I visited New York last summer."
tokens = word_tokenize(text)
pos_tags = pos_tag(tokens)

chunker = RegexpParser(r"NP: {<DT>?<JJ>*<NN.*>+}")
tree = chunker.parse(pos_tags)

print(tree)

Why is this useful?

  • Helps NER, question answering, and text summarization.
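The same grouping idea can be sketched without NLTK: given POS-tagged (word, tag) pairs, collect each adjectives-then-nouns run into a phrase. This is a simplified version of the NP rule above with determiners omitted, and the tags below are hard-coded for illustration:

```python
# POS-tagged tokens as (word, tag) pairs; in the NLTK example above these
# come from pos_tag(word_tokenize(text)). Hard-coded here for illustration.
tagged = [("I", "PRP"), ("visited", "VBD"), ("New", "NNP"),
          ("York", "NNP"), ("last", "JJ"), ("summer", "NN"), (".", ".")]

def noun_phrases(tagged_tokens):
    """Group each JJ*-then-NN+ run of tags into one phrase string."""
    phrases, current, seen_noun = [], [], False
    for word, tag in tagged_tokens:
        is_noun = tag.startswith("NN")
        is_adj = tag == "JJ"
        if is_adj and seen_noun:
            # An adjective after nouns starts a new phrase (JJ* NN+ pattern).
            phrases.append(" ".join(current))
            current, seen_noun = [word], False
        elif is_noun or is_adj:
            current.append(word)
            seen_noun = seen_noun or is_noun
        else:
            if current and seen_noun:
                phrases.append(" ".join(current))
            current, seen_noun = [], False
    if current and seen_noun:
        phrases.append(" ".join(current))
    return phrases

print(noun_phrases(tagged))  # ['New York', 'last summer']
```

This keeps "New York" and "last summer" as units, matching what the regex chunker produces for this sentence.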

6️⃣ Handling Synonyms – Replacing Words with Similar Meanings

Problem:

Different words have the same meaning, but NLP models treat them separately:

  • "big""large"
  • "fast""quick"

Solution: Use WordNet to replace words with synonyms.

import nltk
nltk.download("wordnet")  # WordNet corpus
from nltk.corpus import wordnet

word = "happy"
synonyms = set()

for syn in wordnet.synsets(word):
    for lemma in syn.lemmas():
        synonyms.add(lemma.name())

print(synonyms)  # e.g. {'happy', 'glad', 'felicitous', 'well-chosen'}

Why is this useful?

  • Helps improve search engines and document clustering.
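As a sketch of the search-engine use case, query terms can be expanded with synonyms before matching. The synonym table below is hand-rolled for illustration; in practice it would come from WordNet as shown above:

```python
# Hand-rolled synonym table standing in for WordNet lookups.
SYNONYMS = {
    "big": {"large", "huge"},
    "fast": {"quick", "rapid"},
}

def expand_query(query):
    """Expand each query term with its synonyms for broader matching."""
    expanded = set()
    for term in query.lower().split():
        expanded.add(term)
        expanded |= SYNONYMS.get(term, set())
    return expanded

docs = ["a large dog", "a quick fox", "a slow cat"]
terms = expand_query("big fast")
hits = [d for d in docs if terms & set(d.split())]
print(hits)  # ['a large dog', 'a quick fox']
```

Without expansion, the query "big fast" would match none of these documents; with it, both synonym-bearing documents are retrieved.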

7️⃣ Handling Rare Words – Replacing Uncommon Words

Problem:

Some words appear very rarely and can be replaced with <UNK> to improve model performance.

Solution: Replace words below a frequency threshold (here, words that appear only once) with <UNK>.

from collections import Counter

corpus = ["apple", "banana", "banana", "apple", "cherry", "dragonfruit", "mango"]
word_counts = Counter(corpus)

processed_corpus = [word if word_counts[word] > 1 else "<UNK>" for word in corpus]
print(processed_corpus)

Output:

['apple', 'banana', 'banana', 'apple', '<UNK>', '<UNK>', '<UNK>']

Why is this useful?

  • Helps reduce vocabulary size for deep learning models.
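Building on the <UNK> step, a deep-learning vocabulary can then map every surviving word to an integer id. A minimal sketch, with id 0 reserved for <UNK>:

```python
from collections import Counter

corpus = ["apple", "banana", "banana", "apple", "cherry", "dragonfruit", "mango"]
counts = Counter(corpus)

# Reserve id 0 for <UNK>; assign ids only to words kept by the frequency cutoff.
vocab = {"<UNK>": 0}
for word in sorted(w for w, c in counts.items() if c > 1):
    vocab[word] = len(vocab)

# Encode the corpus, sending every rare word to the <UNK> id.
ids = [vocab.get(word, vocab["<UNK>"]) for word in corpus]
print(vocab)  # {'<UNK>': 0, 'apple': 1, 'banana': 2}
print(ids)    # [1, 2, 2, 1, 0, 0, 0]
```

The resulting integer sequence is what an embedding layer actually consumes; the three rare fruits collapse into a single <UNK> row instead of three separate ones.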

8️⃣ Text Normalization for Social Media – Fixing Informal Text

Problem:

Social media text is messy and informal:

  • "gonna""going to"
  • "u""you"

Solution: Use custom dictionaries to normalize text.

import re

slang_dict = {
    "gonna": "going to",
    "u": "you",
    "btw": "by the way",
}

text = "I'm gonna text u btw."
for slang, expanded in slang_dict.items():
    # \b word boundaries prevent rewriting "u" inside other words.
    text = re.sub(rf"\b{re.escape(slang)}\b", expanded, text)

print(text)  # "I'm going to text you by the way."

Why is this useful?

  • Helps chatbots understand informal messages.

Wrapping Up: Advanced NLP Preprocessing

We explored advanced NLP techniques to enhance text processing:

Handling Dates & Times → Standardizes dates into a common format.

Text Augmentation → Creates more training data.

Handling Negations → Prevents incorrect sentiment analysis.

Dependency Parsing → Extracts sentence structure.

Text Chunking → Groups words into meaningful phrases.

Handling Synonyms → Improves search relevance.

Handling Rare Words → Reduces vocabulary size.

Social Media Normalization → Converts informal text to standard English.

These techniques help NLP models understand language more accurately.

Next Up: Deep learning-based NLP methods like transformers and word embeddings!
