From Chaos to Clarity: The Journey of Text Cleaning in NLP (Part 2)

We’ve already taken our first steps in cleaning messy text—lowercasing, tokenization, punctuation removal, and filtering stopwords. Now, let’s dig deeper and refine our text even more with techniques that help AI truly understand meaning and structure.

Just as a librarian doesn’t just organize books but also ensures they have correct titles and summaries, we must refine our text further for **optimal machine understanding**.


6 Stemming: Reducing Words to Their Root Form

Problem: Words like “running,” “runs,” and “ran” are all forms of the word “run”, but an NLP model might treat them as separate entities, increasing complexity.

Solution: Stemming chops words down to their root form using rule-based reductions.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

words = ["running", "flies", "easily", "loving"]
stemmed_words = [ps.stem(word) for word in words]  # apply rule-based stemming to each word

print(stemmed_words)
```

Before: [“running”, “flies”, “easily”, “loving”]

After: [“run”, “fli”, “easili”, “love”]

Why is this useful?

  • Reduces vocabulary size, improving computational efficiency.
  • Helps NLP models group similar words together.

Limitations: Stemming is a crude process that doesn’t consider actual word meanings (e.g., “flies” became “fli”). That’s where lemmatization comes in!
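
Before moving on, here is a quick way to see the vocabulary-reduction effect in practice. A minimal sketch (the token list is made up purely for illustration):

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()
tokens = ["run", "running", "runs", "runner", "easily", "easier"]

# Count how many distinct entries remain once every surface form is reduced to its stem
stems = {ps.stem(t) for t in tokens}
print(f"{len(set(tokens))} surface forms -> {len(stems)} stems")
```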


7 Lemmatization: Converting Words to Their Dictionary Base Form

Problem: Stemming blindly cuts words, sometimes producing meaningless roots. Instead, we need a smarter way to map words to their **actual base form**.

Solution: Lemmatization converts words to their dictionary base form using linguistic rules.

```python
from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('wordnet')  # WordNet data is required by the lemmatizer

lemmatizer = WordNetLemmatizer()

words = ["running", "flies", "easily", "better"]
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]

print(lemmatized_words)
```

Before: [“running”, “flies”, “easily”, “better”]

After: [“running”, “fly”, “easily”, “better”]

Why is this useful?

  • More accurate than stemming (e.g., “flies” becomes “fly”, not “fli”).
  • Keeps words intelligible and meaningful.
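
Note that lemmatize() treats every word as a noun unless told otherwise, which is why “running” and “better” came through unchanged above. Supplying a part-of-speech tag gives better results; a small sketch:

```python
from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

# pos="v" marks a verb, pos="a" marks an adjective (the default is noun)
print(lemmatizer.lemmatize("running", pos="v"))  # run
print(lemmatizer.lemmatize("better", pos="a"))   # good
```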

8 Removing Numbers: Filtering Out Non-Useful Digits

Problem: Many texts contain numbers that aren’t useful for NLP tasks. Consider:

  • “The price is 50 dollars” → Here, 50 is important.
  • “Call me at 9876543210” → The number isn’t useful for NLP processing.

Solution: Remove numbers only when they don’t add meaning.

```python
import re

text = "The AI model trained on 50000 samples in 2023."
clean_text = re.sub(r'\d+', '', text)  # strip every run of digits
print(clean_text)
```

Before: “The AI model trained on 50000 samples in 2023.”

After: “The AI model trained on  samples in .” (removing digits leaves stray spaces behind, which a later whitespace-cleanup step can collapse)

Why is this useful?

  • Helps models focus on actual language instead of numerical noise.
  • Reduces irrelevant variations in text.
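
The snippet above removes every number indiscriminately. If some numbers carry meaning (prices, years) while others are noise (phone numbers), a more targeted rule helps. A minimal sketch, assuming phone-like noise shows up as long runs of digits:

```python
import re

text = "The price is 50 dollars. Call me at 9876543210."

# Drop only long digit runs (7+ digits), keeping short, meaningful numbers like prices
clean_text = re.sub(r'\b\d{7,}\b', '', text)
print(clean_text)  # "The price is 50 dollars. Call me at ."
```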

9 Handling Contractions: Expanding Words for Better Understanding

Problem: Text in conversations and social media often contains contractions like:

  • “I’m” → “I am”
  • “won’t” → “will not”
  • “they’ve” → “they have”

Solution: Expand contractions into full words for clarity.

```python
import contractions  # third-party package (pip install contractions)

text = "I'm learning NLP because it's awesome!"
expanded_text = contractions.fix(text)
print(expanded_text)
```

Before: “I’m learning NLP because it’s awesome!”

After: “I am learning NLP because it is awesome!”

Why is this useful?

  • Improves text clarity by converting informal language into standard English.
  • Helps NLP models understand intent and meaning better.
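
If you’d rather not add the third-party dependency, a tiny dictionary-based fallback covers the most common cases. A minimal sketch (the mapping below is illustrative, not exhaustive):

```python
import re

# Illustrative (non-exhaustive) mapping of common contractions
CONTRACTIONS = {
    "i'm": "i am",
    "it's": "it is",
    "won't": "will not",
    "can't": "cannot",
    "they've": "they have",
}

def expand_contractions(text):
    pattern = re.compile("|".join(re.escape(k) for k in CONTRACTIONS), re.IGNORECASE)
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(expand_contractions("I'm learning NLP because it's awesome!"))
# "i am learning NLP because it is awesome!"
```

This simple version lowercases the replacement (“I’m” becomes “i am”), which is usually fine because lowercasing is part of the pipeline anyway.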

10 Removing Special Characters: Eliminating Unnecessary Symbols

Problem: Text often contains special symbols like @, #, $, %, and &, which usually add little value for NLP tasks.

Solution: Strip special characters while keeping meaningful text intact.

```python
import re

text = "Hello @world! This is #NLP with $100% efficiency."
clean_text = re.sub(r'[^a-zA-Z0-9\s]', '', text)  # keep only letters, digits, and whitespace
print(clean_text)
```

Before: “Hello @world! This is #NLP with $100% efficiency.”

After: “Hello world This is NLP with 100 efficiency”

Why is this useful?

  • Removes unnecessary noise from text.
  • Focuses on words that add value to NLP models.
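
One caveat: the character class [^a-zA-Z0-9\s] also strips accented letters, so “café” becomes “caf”. If your text isn’t purely English, a Unicode-aware pattern is safer; a small sketch:

```python
import re

text = "Café #NLP: ¡está genial! 100%"

# \w matches Unicode letters and digits in Python 3, so accented characters survive
clean_text = re.sub(r'[^\w\s]', '', text)
print(clean_text)  # "Café NLP está genial 100"
```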

Bringing It All Together: The Power of Preprocessing

Each of these text-cleaning steps plays a critical role in preparing data for machine learning models.

  • Lowercasing ensures uniformity.
  • Tokenization breaks text into meaningful chunks.
  • Punctuation & stopword removal reduce noise.
  • Stemming & lemmatization standardize words for better comprehension.
  • Handling contractions, numbers, and special characters refines the text further.

Before Preprocessing:

“I’m LOVING NLP! Running & learning AI with 50,000 samples is fun!!”

After Preprocessing:

“I am love NLP run learn AI sample fun”
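
To tie the steps together, here is one way to chain them into a single function. A minimal sketch (the exact output differs slightly from the stylized example above: stopword removal also drops words like “I” and “am”, and the Porter stemmer produces “sampl”):

```python
import re
import nltk
import contractions  # pip install contractions
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')

def preprocess(text):
    text = contractions.fix(text)              # expand contractions ("I'm" -> "I am")
    text = text.lower()                        # lowercase
    text = re.sub(r'\d+', '', text)            # remove numbers
    text = re.sub(r'[^a-z\s]', ' ', text)      # remove punctuation and special characters
    tokens = word_tokenize(text)               # tokenize
    stop_words = set(stopwords.words('english'))
    tokens = [t for t in tokens if t not in stop_words]   # remove stopwords
    stemmer = PorterStemmer()
    return ' '.join(stemmer.stem(t) for t in tokens)      # stem

print(preprocess("I'm LOVING NLP! Running & learning AI with 50,000 samples is fun!!"))
# e.g. "love nlp run learn ai sampl fun"
# (the exact result depends on your stopword list and whether you stem or lemmatize)
```

Swapping the stemmer for the lemmatizer from section 7 keeps the output closer to dictionary forms, at the cost of a little extra processing.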


What’s Next?

With clean text, we can now move into advanced NLP techniques, such as **POS tagging, Named Entity Recognition (NER), vectorization, and deep learning-based embeddings**.

Want to go deeper? Stay tuned for the next part of our NLP journey where we transform cleaned text into structured machine-readable data!
