From Chaos to Clarity: The Journey of Text Cleaning in NLP (Part 2)

We’ve already taken our first steps in cleaning messy text—lowercasing, tokenization, punctuation removal, and filtering stopwords. Now, let’s dig deeper and refine our text even more with techniques that help AI truly understand meaning and structure.

Just as a librarian doesn’t just organize books but also ensures they have correct titles and summaries, we must refine our text further for **optimal machine understanding**.


6 Stemming: Reducing Words to Their Root Form

Problem: Words like “running,” “runs,” and “ran” are all forms of the word “run”, but an NLP model might treat them as separate entities, increasing complexity.

Solution: Stemming chops words down to their root form using rule-based reductions.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

words = ["running", "flies", "easily", "loving"]
stemmed_words = [ps.stem(word) for word in words]  # apply rule-based stemming to each word

print(stemmed_words)
```

Before: [“running”, “flies”, “easily”, “loving”]

After: [“run”, “fli”, “easili”, “love”]

Why is this useful?

  • Reduces vocabulary size, improving computational efficiency.
  • Helps NLP models group similar words together.

Limitations: Stemming is a crude process that doesn’t consider actual word meanings (e.g., “flies” became “fli”). That’s where lemmatization comes in!
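
Before moving on, here is a quick way to see the vocabulary-reduction effect in practice. A minimal sketch (the token list is made up purely for illustration):

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()
tokens = ["run", "running", "runs", "runner", "easily", "easier"]

# Count how many distinct entries remain once every surface form is reduced to its stem
stems = {ps.stem(t) for t in tokens}
print(f"{len(set(tokens))} surface forms -> {len(stems)} stems")
```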


7 Lemmatization: Converting Words to Their Dictionary Base Form

Problem: Stemming blindly cuts words, sometimes producing meaningless roots. Instead, we need a smarter way to map words to their **actual base form**.

Solution: Lemmatization converts words to their dictionary base form using linguistic rules.

```python
from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('wordnet')  # WordNet data is required by the lemmatizer

lemmatizer = WordNetLemmatizer()

words = ["running", "flies", "easily", "better"]
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]

print(lemmatized_words)
```

Before: [“running”, “flies”, “easily”, “better”]

After: [“running”, “fly”, “easily”, “better”]

Why is this useful?

  • More accurate than stemming (e.g., “flies” becomes “fly”, not “fli”).
  • Keeps words intelligible and meaningful.
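
Note that lemmatize() treats every word as a noun unless told otherwise, which is why “running” and “better” came through unchanged above. Supplying a part-of-speech tag gives better results; a small sketch:

```python
from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

# pos="v" marks a verb, pos="a" marks an adjective (the default is noun)
print(lemmatizer.lemmatize("running", pos="v"))  # run
print(lemmatizer.lemmatize("better", pos="a"))   # good
```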

8 Removing Numbers: Filtering Out Non-Useful Digits

Problem: Many texts contain numbers that aren’t useful for NLP tasks. Consider:

  • “The price is 50 dollars” → Here, 50 is important.
  • “Call me at 9876543210” → The number isn’t useful for NLP processing.

Solution: Remove numbers only when they don’t add meaning.

```python
import re

text = "The AI model trained on 50000 samples in 2023."
clean_text = re.sub(r'\d+', '', text)  # strip every run of digits
print(clean_text)
```

Before: “The AI model trained on 50000 samples in 2023.”

After: “The AI model trained on  samples in .” (removing digits leaves stray spaces behind, which a later whitespace-cleanup step can collapse)

Why is this useful?

  • Helps models focus on actual language instead of numerical noise.
  • Reduces irrelevant variations in text.
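
The snippet above removes every number indiscriminately. If some numbers carry meaning (prices, years) while others are noise (phone numbers), a more targeted rule helps. A minimal sketch, assuming phone-like noise shows up as long runs of digits:

```python
import re

text = "The price is 50 dollars. Call me at 9876543210."

# Drop only long digit runs (7+ digits), keeping short, meaningful numbers like prices
clean_text = re.sub(r'\b\d{7,}\b', '', text)
print(clean_text)  # "The price is 50 dollars. Call me at ."
```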

9 Handling Contractions: Expanding Words for Better Understanding

Problem: Text in conversations and social media often contains contractions like:

  • “I’m” → “I am”
  • “won’t” → “will not”
  • “they’ve” → “they have”

Solution: Expand contractions into full words for clarity.

```python
import contractions  # third-party package (pip install contractions)

text = "I'm learning NLP because it's awesome!"
expanded_text = contractions.fix(text)
print(expanded_text)
```

Before: “I’m learning NLP because it’s awesome!”

After: “I am learning NLP because it is awesome!”

Why is this useful?

  • Improves text clarity by converting informal language into standard English.
  • Helps NLP models understand intent and meaning better.
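
If you’d rather not add the third-party dependency, a tiny dictionary-based fallback covers the most common cases. A minimal sketch (the mapping below is illustrative, not exhaustive):

```python
import re

# Illustrative (non-exhaustive) mapping of common contractions
CONTRACTIONS = {
    "i'm": "i am",
    "it's": "it is",
    "won't": "will not",
    "can't": "cannot",
    "they've": "they have",
}

def expand_contractions(text):
    pattern = re.compile("|".join(re.escape(k) for k in CONTRACTIONS), re.IGNORECASE)
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(expand_contractions("I'm learning NLP because it's awesome!"))
# "i am learning NLP because it is awesome!"
```

This simple version lowercases the replacement (“I’m” becomes “i am”), which is usually fine because lowercasing is part of the pipeline anyway.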

10 Removing Special Characters: Eliminating Unnecessary Symbols

Problem: Text often contains special symbols like @, #, $, %, and &, which usually add little value for NLP tasks.

Solution: Strip special characters while keeping meaningful text intact.

```python
import re

text = "Hello @world! This is #NLP with $100% efficiency."
clean_text = re.sub(r'[^a-zA-Z0-9\s]', '', text)  # keep only letters, digits, and whitespace
print(clean_text)
```

Before: “Hello @world! This is #NLP with $100% efficiency.”

After: “Hello world This is NLP with 100 efficiency”

Why is this useful?

  • Removes unnecessary noise from text.
  • Focuses on words that add value to NLP models.
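
One caveat: the character class [^a-zA-Z0-9\s] also strips accented letters, so “café” becomes “caf”. If your text isn’t purely English, a Unicode-aware pattern is safer; a small sketch:

```python
import re

text = "Café #NLP: ¡está genial! 100%"

# \w matches Unicode letters and digits in Python 3, so accented characters survive
clean_text = re.sub(r'[^\w\s]', '', text)
print(clean_text)  # "Café NLP está genial 100"
```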

Bringing It All Together: The Power of Preprocessing

Each of these text-cleaning steps plays a critical role in preparing data for machine learning models.

  • Lowercasing ensures uniformity.
  • Tokenization breaks text into meaningful chunks.
  • Punctuation & stopword removal reduce noise.
  • Stemming & lemmatization standardize words for better comprehension.
  • Handling contractions, numbers, and special characters refines the text further.

Before Preprocessing:

“I’m LOVING NLP! Running & learning AI with 50,000 samples is fun!!”

After Preprocessing:

“I am love NLP run learn AI sample fun”
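
To tie the steps together, here is one way to chain them into a single function. A minimal sketch (the exact output differs slightly from the stylized example above: stopword removal also drops words like “I” and “am”, and the Porter stemmer produces “sampl”):

```python
import re
import nltk
import contractions  # pip install contractions
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')

def preprocess(text):
    text = contractions.fix(text)              # expand contractions ("I'm" -> "I am")
    text = text.lower()                        # lowercase
    text = re.sub(r'\d+', '', text)            # remove numbers
    text = re.sub(r'[^a-z\s]', ' ', text)      # remove punctuation and special characters
    tokens = word_tokenize(text)               # tokenize
    stop_words = set(stopwords.words('english'))
    tokens = [t for t in tokens if t not in stop_words]   # remove stopwords
    stemmer = PorterStemmer()
    return ' '.join(stemmer.stem(t) for t in tokens)      # stem

print(preprocess("I'm LOVING NLP! Running & learning AI with 50,000 samples is fun!!"))
# e.g. "love nlp run learn ai sampl fun"
# (the exact result depends on your stopword list and whether you stem or lemmatize)
```

Swapping the stemmer for the lemmatizer from section 7 keeps the output closer to dictionary forms, at the cost of a little extra processing.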


What’s Next?

With clean text, we can now move into advanced NLP techniques, such as **POS tagging, Named Entity Recognition (NER), vectorization, and deep learning-based embeddings**.

Want to go deeper? Stay tuned for the next part of our NLP journey where we transform cleaned text into structured machine-readable data!
