Low F1-Score due to Imbalanced Dataset even after resampling

I am performing binary classification on an imbalanced dataset:

0: 16,263

1: 214
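To put those counts in perspective, here is a quick back-of-the-envelope calculation. It only uses the two class counts given above. The "always predict 0" baseline is an assumption added for comparison, and it presumes the test split has roughly the same class mix as these counts.

```python
# Class counts as given in the question.
n_negative = 16_263  # class 0
n_positive = 214     # class 1

total = n_negative + n_positive
positive_share = n_positive / total
imbalance_ratio = n_negative / n_positive

# Accuracy of a trivial model that always predicts class 0,
# evaluated on data with this same class mix (assumption).
majority_baseline_accuracy = n_negative / total

print(f"total samples:           {total}")
print(f"share of class 1:        {positive_share:.2%}")
print(f"class 0 : class 1 ratio: {imbalance_ratio:.1f} : 1")
print(f"always-predict-0 acc.:   {majority_baseline_accuracy:.3f}")
```

With a minority class this small, accuracy on its own says very little about how well class 1 is being detected, which is why the per-class metrics matter here.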

I have tried multiple oversampling, undersampling, and combined techniques. I obtained plots of the resampled data with this piece of code:


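import seaborn as sns  # needed for sns.despine


def plot_resampling(X, y, sampler, ax, title=None):
    """Scatter-plot the first two features of (X, y) after resampling with `sampler`."""
    X_res, y_res = sampler.fit_resample(X, y)
    # Assumes X is a NumPy array; only the first two columns are drawn.
    ax.scatter(X_res[:, 0], X_res[:, 1], c=y_res, alpha=0.8, edgecolor="k")
    if title is None:
        title = f"Resampling with {sampler.__class__.__name__}"
    ax.set_title(title)
    sns.despine(ax=ax, offset=10)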

Clarification: X and y here are X_train and y_train; I used them to show the distribution of my data points before and after resampling.

For RandomUnderSampler, the first plot is without replacement and the second is with replacement=True.
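For readers unsure what the two settings change, here is a hand-rolled sketch using only Python's `random` module. It is not imbalanced-learn's implementation. It assumes the majority class is cut down to the minority count (214), and the index list is a hypothetical stand-in for the majority-class rows.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical row indices standing in for the 16,263 majority-class rows.
majority_indices = list(range(16_263))
n_minority = 214  # target size: match the minority class (assumption)

# Without replacement: every kept row is distinct.
kept_without = rng.sample(majority_indices, k=n_minority)

# With replacement: the same row may be drawn more than once.
kept_with = rng.choices(majority_indices, k=n_minority)

print(len(set(kept_without)), "distinct rows without replacement")
print(len(set(kept_with)), "distinct rows with replacement")
```

Either way, only 214 of the 16,263 majority rows survive, so the two plots should look broadly similar; replacement can only add duplicates on top of that.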

Note also that my dataset has multiple outliers, so several columns are skewed. I therefore chose models that I understand to be less sensitive to skewness, such as:

SVC
Naive Bayes Classifier
XGBoost (ensemble)
KNN

So far, the best result I have obtained is with SVC(kernel="rbf") and SMOTE (the resampling is of course applied only to the training set, since the test set should represent the real population):

Test Accuracy: 0.75
Training Accuracy: 0.88

But the classification report is not good: the F1-score is 0.51, so there is still a real issue with class 1 even after resampling.

(The classification report and the confusion matrix were posted as images and are not reproduced here.)
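As a reference for how accuracy and per-class F1 relate through a confusion matrix, here is a small worked example. The counts are hypothetical, chosen only so that accuracy comes out at 0.75 with roughly 1.3% positives; they are not the actual confusion matrix from this post, and the helper name `per_class_f1` is illustrative.

```python
def per_class_f1(tp, fp, fn):
    """F1 for one class, given its true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# HYPOTHETICAL counts for a 1,000-row test set with 13 positives.
tn, fp, fn, tp = 742, 245, 5, 8

total = tn + fp + fn + tp
accuracy = (tp + tn) / total

# Class 1 is the "positive" class in the usual sense.
f1_class1 = per_class_f1(tp=tp, fp=fp, fn=fn)
# For class 0 the roles flip: its hits are tn, its false alarms are fn, its misses are fp.
f1_class0 = per_class_f1(tp=tn, fp=fn, fn=fp)
macro_f1 = (f1_class0 + f1_class1) / 2

print(f"accuracy:   {accuracy:.2f}")
print(f"F1 class 0: {f1_class0:.3f}")
print(f"F1 class 1: {f1_class1:.3f}")
print(f"macro F1:   {macro_f1:.3f}")
```

In this made-up case most of the minority rows are found, yet the false positives drawn from the much larger class 0 swamp the precision of class 1, so its F1 stays low while accuracy still looks moderate.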

Can you please help me improve the F1-score? What is your analysis of the situation, and what are your suggestions?
