Build a Spam Classifier in Python

Introduction

Ever since email was introduced, we have been plagued by spam. Any promotional message or advertisement that lands in our inbox can be categorised as spam, since it provides no value and often irritates us.

Overview of the Dataset used

We will make use of the SMS spam classification data.

The SMS Spam Collection is a set of tagged SMS messages collected for SMS spam research. It contains 5,574 English SMS messages, each tagged as ham (legitimate) or spam.

The data was obtained from UCI’s Machine Learning Repository. Alternatively, I have also uploaded the dataset and the completed Jupyter notebook to my GitHub repo.

In this article, we’ll discuss:

Data processing

Import the required packages
Loading the Dataset
Remove the unwanted data columns
Preprocessing and Exploring the Dataset
Build word clouds to see which words are common in spam and in ham
Remove the stop words and punctuation
Convert the text data into vectors

Building a classification model

Split the data into train and test sets
Use Sklearn built-in classifiers to build the models
Train the data on the model
Make predictions on new data

Import the required packages

%matplotlib inline
import matplotlib.pyplot as plt
import csv
import sklearn
import pickle
from wordcloud import WordCloud
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.tree import DecisionTreeClassifier 
from sklearn.model_selection import GridSearchCV,train_test_split,StratifiedKFold,cross_val_score,learning_curve

Please note: you might find that I re-import some of these packages later in the article. This is just for ease of reuse if I copy those code blocks into future projects; you may omit the repeated imports.

Loading the Dataset

data = pd.read_csv('dataset/spam.csv', encoding='latin-1')
data.head()

Removing unwanted columns

From the output above, we can see that there are some unnamed columns and that the label and text column names are not intuitive, so let’s fix those in this step.

data = data.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
data = data.rename(columns={"v2" : "text", "v1":"label"})
data[1990:2000]

Now that the data is looking pretty, let’s move on.

data['label'].value_counts()

//OUTPUT
ham     4825
spam     747
Name: label, dtype: int64
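
The counts above show that the two classes are imbalanced, which matters later when reading accuracy scores. As a quick sketch of the arithmetic, using only the two counts printed above, here is the spam share and the accuracy a hypothetical classifier would get by always predicting ham:

```python
# Class counts copied from the value_counts() output above.
counts = {"ham": 4825, "spam": 747}

total = sum(counts.values())
spam_share = counts["spam"] / total
# Accuracy of a hypothetical baseline that labels every message as ham.
majority_baseline = counts["ham"] / total

print(f"total messages: {total}")
print(f"spam share: {spam_share:.1%}")
print(f"always-ham baseline accuracy: {majority_baseline:.1%}")
```

Any model worth keeping should beat that always-ham baseline by a clear margin.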

Preprocessing and Exploring the Dataset

If you are completely new to NLTK and Natural Language Processing (NLP), I would recommend checking out this short article before continuing.
Introduction to Word Frequencies in NLP

# Import nltk packages and Punkt Tokenizer Models
import nltk
nltk.download("punkt")
import warnings
warnings.filterwarnings('ignore')

Build word cloud to see which message is spam and which is not

In this dataset, “ham” simply means the opposite of spam, that is, a legitimate message. 🤷‍️ Yeah, I also don’t have any clue why it is called that.

ham_words = ''
spam_words = ''

# Creating a corpus of spam messages
for val in data[data['label'] == 'spam'].text:
    text = val.lower()
    tokens = nltk.word_tokenize(text)
    for words in tokens:
        spam_words = spam_words + words + ' '

# Creating a corpus of ham messages
for val in data[data['label'] == 'ham'].text:
    text = val.lower()
    tokens = nltk.word_tokenize(text)
    for words in tokens:
        ham_words = ham_words + words + ' '
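
Repeated string concatenation inside a loop can get slow on larger corpora. A minimal alternative sketch collects the tokens in a list and joins them once at the end. The regex tokenizer is a simplified stand-in (an assumption) for `nltk.word_tokenize` so that the example runs without any downloads, and `sample_spam` is made-up data, not taken from the dataset.

```python
import re
from collections import Counter

def build_corpus(messages):
    """Lowercase each message, tokenize it, and join all tokens with spaces."""
    tokens = []
    for message in messages:
        # Simplified tokenizer (assumption): runs of letters, digits, apostrophes.
        tokens.extend(re.findall(r"[a-z0-9']+", message.lower()))
    return " ".join(tokens)

# Hypothetical sample messages, not taken from the dataset.
sample_spam = ["FREE entry! Call now", "Claim your FREE prize, call today"]
spam_corpus = build_corpus(sample_spam)

print(spam_corpus)
print(Counter(spam_corpus.split()).most_common(2))
```

The same helper could be called once for the spam rows and once for the ham rows.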

Let’s use the corpora built above to create a spam word cloud and a ham word cloud.

spam_wordcloud = WordCloud(width=500, height=300).generate(spam_words)
ham_wordcloud = WordCloud(width=500, height=300).generate(ham_words)

# Spam word cloud
plt.figure(figsize=(10, 8), facecolor='w')
plt.imshow(spam_wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()

# Ham word cloud
plt.figure(figsize=(10, 8), facecolor='g')
plt.imshow(ham_wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()

From the spam word cloud, we can see that “free” is most often used in spam.

Now, we can convert ham and spam into 0 and 1 respectively so that the model can work with numeric labels.

# Map only the label column: ham -> 0, spam -> 1
data['label'] = data['label'].map({'ham': 0, 'spam': 1})
data.head(10)

Removing punctuation and stopwords from the messages

Punctuation and stop words contribute little to our model, so we remove them. The NLTK library makes this easy.

import nltk
nltk.download('stopwords')

# Remove punctuation and stop words
import string

# Build the stop word set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))

def text_process(text):
    text = text.translate(str.maketrans('', '', string.punctuation))
    words = [word for word in text.split() if word.lower() not in stop_words]
    return " ".join(words)

data['text'] = data['text'].apply(text_process)
data.head()
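
To see what this cleaning step does to a single message, here is a small sketch. It uses a tiny hand-written stop word set (`SAMPLE_STOP_WORDS`, an assumption for illustration) instead of NLTK's list so that it runs without downloading anything; `clean_message` is a hypothetical name for the same logic as `text_process`.

```python
import string

# Tiny illustrative stop word set (assumption); NLTK's English list is much longer.
SAMPLE_STOP_WORDS = {"i", "a", "the", "is", "to", "you", "have", "for"}

def clean_message(text, stop_words=SAMPLE_STOP_WORDS):
    """Strip punctuation, then drop stop words (compared case-insensitively)."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    kept = [word for word in no_punct.split() if word.lower() not in stop_words]
    return " ".join(kept)

cleaned = clean_message("Congratulations! You have WON a prize, call now to claim.")
print(cleaned)
```

One side effect worth knowing: because punctuation is stripped first, a contraction such as "don't" becomes "dont" and may no longer match its stop word entry, so tokens like that can survive the filter.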

Now, create a data frame from the processed data before moving to the next step.

text = pd.DataFrame(data['text'])
label = pd.DataFrame(data['label'])

Converting words to vectors

We can convert words to vectors using either a Count Vectorizer or a TF-IDF Vectorizer.

TF-IDF is often a better choice than raw counts because it considers not only how frequently a word occurs in a message but also how informative it is: words that appear in most messages are weighted down. We can then remove the words that are less important for analysis, making model building less complex by reducing the input dimensions.

I have included both methods for your reference.
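
To make the TF-IDF weighting concrete, here is a small hand-rolled sketch in plain Python. It uses the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation, which is the variant scikit-learn documents as the `TfidfVectorizer` default. The three tokenized messages are made up for illustration.

```python
import math
from collections import Counter

# Three made-up, already-tokenized messages (assumption for illustration).
docs = [["free", "prize", "call"], ["call", "me", "later"], ["free", "free", "entry"]]

n_docs = len(docs)
# Document frequency: in how many messages does each term appear?
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Return L2-normalised TF-IDF weights for one tokenized message."""
    term_counts = Counter(doc)
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in term_counts.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

first_doc_weights = tfidf(docs[0])
print(first_doc_weights)
```

In the first message, "prize" appears in only one document, so it ends up with a higher weight than "free" or "call", which each appear in two.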

Converting words to vectors using Count Vectorizer

## Counting how many times a word appears in the dataset 
from collections import Counter

total_counts = Counter()
for i in range(len(text)):
    for word in text.values[i][0].split(" "):
        total_counts[word] += 1

print("Total words in data set: ", len(total_counts))

// OUTPUT
Total words in data set:  11305

# Sorting in decreasing order (word with highest frequency appears first)
vocab = sorted(total_counts, key=total_counts.get, reverse=True)
print(vocab[:60])

// OUTPUT
['u', '2', 'call', 'U', 'get', 'Im', 'ur', '4', 'ltgt', 'know', 'go', 'like', 'dont', 'come', 'got', 'time', 'day', 'want', 'Ill', 'lor', 'Call', 'home', 'send', 'going', 'one', 'need', 'Ok', 'good', 'love', 'back', 'n', 'still', 'text', 'im', 'later', 'see', 'da', 'ok', 'think', 'Ì', 'free', 'FREE', 'r', 'today', 'Sorry', 'week', 'phone', 'mobile', 'cant', 'tell', 'take', 'much', 'night', 'way', 'Hey', 'reply', 'work', 'make', 'give', 'new']

# Mapping from words to index 
vocab_size = len(vocab)
word2idx = {}
# print(vocab_size)
for i, word in enumerate(vocab):
    word2idx[word] = i

# Text to Vector
def text_to_vector(text):
    word_vector = np.zeros(vocab_size)
    for word in text.split(" "):
        if word2idx.get(word) is None:
            continue
        else:
            word_vector[word2idx.get(word)] += 1
    return np.array(word_vector)

# Convert all messages to vectors
word_vectors = np.zeros((len(text), len(vocab)), dtype=np.int_)
for i, (_, text_) in enumerate(text.iterrows()):
    word_vectors[i] = text_to_vector(text_.iloc[0])

word_vectors.shape

// OUTPUT
(5572, 11305)
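
The manual steps above mirror what scikit-learn's `CountVectorizer` does in one call. Here is a minimal sketch on made-up messages. Its defaults lowercase the text and ignore single-character tokens, so on the real data the vocabulary size would not match the hand-built one exactly.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up messages for illustration (not from the dataset).
messages = ["free entry call now", "call me later", "free free prize"]

count_vectorizer = CountVectorizer()
# Sparse matrix: one row per message, one column per vocabulary term.
counts = count_vectorizer.fit_transform(messages)

print(count_vectorizer.vocabulary_)  # term -> column index
print(counts.toarray())
```

The result is stored as a sparse matrix, which is far lighter on memory than a dense array when the vocabulary has thousands of columns.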

Converting words to vectors using TF-IDF Vectorizer

# Convert the text data into vectors
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(data['text'])
vectors.shape

// OUTPUT
(5572, 9376)

# features = word_vectors
features = vectors

Splitting into training and test set

# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(features, data['label'], test_size=0.15, random_state=111)
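
Because spam is the minority class (747 of 5,572 messages), a purely random split can leave the test set with a slightly different spam ratio than the training set. An optional variation, not used in the split above, is to pass `stratify` to `train_test_split`. The sketch below uses made-up labels to show the effect.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Made-up labels with a similar imbalance: 20 spam (1) among 150 messages.
labels = [1] * 20 + [0] * 130
samples = list(range(len(labels)))  # stand-in for feature rows

X_tr, X_te, y_tr, y_te = train_test_split(
    samples, labels, test_size=0.2, random_state=111, stratify=labels
)

print("train:", Counter(y_tr))
print("test:", Counter(y_te))
```

With stratification, both parts keep the same spam-to-ham proportion as the full label list.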

Classifying using sklearn pre-built classifiers

# Import sklearn packages for building classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Initialize multiple classification models
svc = SVC(kernel='sigmoid', gamma=1.0)
knc = KNeighborsClassifier(n_neighbors=49)
mnb = MultinomialNB(alpha=0.2)
dtc = DecisionTreeClassifier(min_samples_split=7, random_state=111)
lrc = LogisticRegression(solver='liblinear', penalty='l1')
rfc = RandomForestClassifier(n_estimators=31, random_state=111)

# Create a dictionary of variables and models
clfs = {'SVC' : svc,'KN' : knc, 'NB': mnb, 'DT': dtc, 'LR': lrc, 'RF': rfc}

# Fit the data onto the models
def train(clf, features, targets):
    clf.fit(features, targets)

def predict(clf, features):
    return (clf.predict(features))

pred_scores_word_vectors = []
for k,v in clfs.items():
    train(v, X_train, y_train)
    pred = predict(v, X_test)
    pred_scores_word_vectors.append((k, [accuracy_score(y_test , pred)]))

Predictions using the TF-IDF Vectorizer

pred_scores_word_vectors

// OUTPUT
[('SVC', [0.9784688995215312]),
 ('KN', [0.9330143540669856]),
 ('NB', [0.9880382775119617]),
 ('DT', [0.9605263157894737]),
 ('LR', [0.9533492822966507]),
 ('RF', [0.9796650717703349])]
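
The list is easier to compare once it is sorted. A small sketch that ranks the models by accuracy, using the scores copied from the output above (`pred_scores` is just a local copy for this example):

```python
# Accuracy scores copied from the output above.
pred_scores = [
    ("SVC", [0.9784688995215312]),
    ("KN", [0.9330143540669856]),
    ("NB", [0.9880382775119617]),
    ("DT", [0.9605263157894737]),
    ("LR", [0.9533492822966507]),
    ("RF", [0.9796650717703349]),
]

# Sort by accuracy, best first.
ranked = sorted(pred_scores, key=lambda item: item[1][0], reverse=True)
for name, (accuracy,) in ranked:
    print(f"{name:<4}{accuracy:.4f}")
```

Naive Bayes comes out on top. With `test_size=0.15` on 5,572 rows the test set holds 836 messages, so its score corresponds to 826 correct predictions and 10 errors.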

Model predictions

# Write a function to report whether the message is spam or not
def find(x):
    if x == 1:
        print ("Message is SPAM")
    else:
        print ("Message is NOT Spam")

newtext = ["Free entry"]
integers = vectorizer.transform(newtext)

x = mnb.predict(integers)
find(x)        

// OUTPUT
Message is SPAM
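
One thing to watch: the training text went through `text_process`, while `newtext` is vectorized as-is. A possible way to keep the steps consistent is to bundle the vectorizer and the classifier in a scikit-learn `Pipeline`, so that every prediction goes through the same transformations. The sketch below is not the setup used above; the training messages are made up and far too few for a real model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up training set (assumption); 1 = spam, 0 = ham.
train_texts = [
    "free entry win prize now",
    "free cash prize call now",
    "are we meeting for lunch today",
    "see you at home later",
]
train_labels = [1, 1, 0, 0]

spam_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB(alpha=0.2)),
])
spam_pipeline.fit(train_texts, train_labels)

# The pipeline vectorizes raw text itself before classifying it.
prediction = spam_pipeline.predict(["Free entry"])[0]
print("Message is SPAM" if prediction == 1 else "Message is NOT Spam")
```

A cleaning function could also be handed to the vectorizer through its `preprocessor` argument, so that new messages are cleaned the same way as the training data.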

Checking Classification Results with Confusion Matrix

from sklearn.metrics import confusion_matrix
import seaborn as sns
# Naive Bayes
y_pred_nb = mnb.predict(X_test)
y_true_nb = y_test
cm = confusion_matrix(y_true_nb, y_pred_nb)
f, ax = plt.subplots(figsize =(5,5))
sns.heatmap(cm,annot = True,linewidths=0.5,linecolor="red",fmt = ".0f",ax=ax)
plt.xlabel("y_pred_nb")
plt.ylabel("y_true_nb")
plt.show()
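
The heatmap is drawn from a 2x2 array in which rows are the true labels and columns are the predicted labels, so with labels 0 and 1 it reads `[[tn, fp], [fn, tp]]`. Since accuracy alone can hide mistakes on the minority class, it helps to derive precision and recall from those four cells. Here is a plain-Python sketch with made-up labels (`confusion_counts` is a hypothetical helper, not part of scikit-learn):

```python
def confusion_counts(y_true, y_pred):
    """Count (tn, fp, fn, tp) for binary labels where 1 = spam and 0 = ham."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    return tn, fp, fn, tp

# Made-up labels and predictions for illustration only.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)  # of messages flagged as spam, how many were spam
recall = tp / (tp + fn)     # of actual spam, how much was caught
print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

For a spam filter, false positives (ham flagged as spam) are usually the costlier mistake, so precision on the spam class deserves a close look.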

From the confusion matrix, we can see that the Naive Bayes model is balanced. That’s it!
We have successfully created a spam classifier.

Thanks for the read!

You can find the code and dataset here.

If you liked this tutorial please do share it with your friends or on social media!

If you have any suggestions or corrections, or if you want to chat about data science, ping me on Twitter.
