How To Build a Machine Learning Model For Heart Failure Prediction From Scratch

Hi everyone! Today I will show you how to build a Machine Learning model for heart failure prediction from scratch. For this tutorial, we will use a dataset from kaggle.com called Heart Failure Prediction Dataset.

You can learn more about this dataset from the following link:
https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction

You can download the dataset to follow along with this tutorial.

Alright, open your Jupyter Notebook and let’s get started!


Step 1: Data Loading

First of all, let’s load the data from the file heart.csv using pandas, and check it with the df.head() function:

import pandas as pd

df = pd.read_csv('heart.csv')
df.head()


If you successfully load the data, the first five rows of data will be shown in your notebook.

Step 2: Data Inspection

Next, let’s dig deeper into the data information using the function df.info().

df.info()


Run this code, and you will get more insight into the dataset. The following is the output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             918 non-null    int64  
 1   Sex             918 non-null    object 
 2   ChestPainType   918 non-null    object 
 3   RestingBP       918 non-null    int64  
 4   Cholesterol     918 non-null    int64  
 5   FastingBS       918 non-null    int64  
 6   RestingECG      918 non-null    object 
 7   MaxHR           918 non-null    int64  
 8   ExerciseAngina  918 non-null    object 
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object 
 11  HeartDisease    918 non-null    int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB


From this information, we can see that columns such as Sex, ChestPainType, RestingECG, ExerciseAngina, and ST_Slope have the object datatype, and we have to handle them so that all columns end up with a numerical datatype.
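Instead of reading the df.info() output by eye, you can also list the object columns programmatically. Here is a minimal sketch (not part of the original tutorial) using a hypothetical two-row frame that mimics a few of the dataset's columns:

```python
import pandas as pd

# Hypothetical sample mirroring a few of the dataset's columns
sample = pd.DataFrame({
    'Age': [40, 49],
    'Sex': ['M', 'F'],
    'ChestPainType': ['ATA', 'NAP'],
})

# Columns pandas stored as `object` are the ones that still need encoding
object_cols = sample.select_dtypes(include='object').columns.tolist()
print(object_cols)
```

On the real dataset, `df.select_dtypes(include='object').columns` would return the five columns named above.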

Step 3: Data Cleaning

In this step, we will handle the columns with the object datatype. Keep in mind that object here usually means a Python string. To validate that, let’s take the ChestPainType column and look at its unique values:

df['ChestPainType'].value_counts()


And you will get this result:

ChestPainType
ASY    496
NAP    203
ATA    173
TA      46
Name: count, dtype: int64


Based on that result, all the values are indeed strings.

Binary column encoding

Let’s start with the columns that have only two unique values: Sex and ExerciseAngina.

To process these, we can use the pandas map() function:

# Binary: Sex, ExerciseAngina
df['Sex'] = df['Sex'].map({'F': 0, 'M': 1})
df['ExerciseAngina'] = df['ExerciseAngina'].map({'N': 0, 'Y': 1})

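To see what map() actually does, here is a minimal sketch on a standalone Series (the values are illustrative, not from the dataset). One caveat worth knowing: any value that is missing from the mapping dict becomes NaN, so the dict must cover every category.

```python
import pandas as pd

s = pd.Series(['F', 'M', 'M', 'F'])
mapped = s.map({'F': 0, 'M': 1})  # values absent from the dict would become NaN
print(mapped.tolist())
```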

One-hot encoding

For the ChestPainType, RestingECG, and ST_Slope columns, we need a technique called one-hot encoding. It transforms categorical variables into a binary format so they can be used effectively for machine learning model training.

To do this, we can use the pandas get_dummies() function to generate the additional columns, join them into the existing DataFrame, and drop the original column:

# One-hot encoding: ChestPainType, RestingECG, ST_Slope
df = df.join(pd.get_dummies(df['ChestPainType'], prefix='ChestPainType', dtype=int)).drop(['ChestPainType'], axis=1)
df = df.join(pd.get_dummies(df['RestingECG'], prefix='RestingECG', dtype=int)).drop(['RestingECG'], axis=1)
df = df.join(pd.get_dummies(df['ST_Slope'], prefix='ST_Slope', dtype=int)).drop(['ST_Slope'], axis=1)

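If you want to see one-hot encoding in isolation, here is a small sketch on a standalone Series (the values are illustrative): get_dummies() produces one 0/1 column per category, named with the given prefix.

```python
import pandas as pd

s = pd.Series(['ASY', 'ATA', 'NAP', 'ASY'], name='ChestPainType')
dummies = pd.get_dummies(s, prefix='ChestPainType', dtype=int)
print(dummies.columns.tolist())
print(dummies.iloc[0].tolist())  # the first row is ASY, so only that column is 1
```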

After that, let’s see our “cleaned” dataset

df.info()


You will see that all columns now have a numerical datatype, and that some new columns have been added.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 19 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Age                918 non-null    int64  
 1   Sex                918 non-null    int64  
 2   RestingBP          918 non-null    int64  
 3   Cholesterol        918 non-null    int64  
 4   FastingBS          918 non-null    int64  
 5   MaxHR              918 non-null    int64  
 6   ExerciseAngina     918 non-null    int64  
 7   Oldpeak            918 non-null    float64
 8   HeartDisease       918 non-null    int64  
 9   ChestPainType_ASY  918 non-null    int64  
 10  ChestPainType_ATA  918 non-null    int64  
 11  ChestPainType_NAP  918 non-null    int64  
 12  ChestPainType_TA   918 non-null    int64  
 13  RestingECG_LVH     918 non-null    int64  
 14  RestingECG_Normal  918 non-null    int64  
 15  RestingECG_ST      918 non-null    int64  
 16  ST_Slope_Down      918 non-null    int64  
 17  ST_Slope_Flat      918 non-null    int64  
 18  ST_Slope_Up        918 non-null    int64  
dtypes: float64(1), int64(18)
memory usage: 136.4 KB


By the way, you can inspect the dataset visually using the following code

df.hist(figsize=(20, 15))


Step 4: Train Machine Learning Model

Finally, let’s create the machine learning model!

Define Feature & Target Data

First of all, we have to separate the “feature data” and the “target data”

X, y = df.drop(['HeartDisease'], axis=1), df['HeartDisease']


Train Test Split

Then, each feature & target data needs to be split into a “train” dataset and a “test” dataset. The training dataset will be used to train the model, and the test dataset will be used to evaluate the performance of the trained model.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

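As a quick sanity check (not part of the original tutorial), here is a sketch of how the 80/20 split plays out on 918 rows, using a synthetic stand-in for the dataset: scikit-learn rounds the test size up, so test_size=0.2 yields 184 test rows and 734 training rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 918-row dataset
X_demo = np.arange(918).reshape(-1, 1)
y_demo = np.zeros(918, dtype=int)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
print(len(X_tr), len(X_te))  # 734 184
```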

Train Model

In the next step, we will train the machine learning model using the training data. Since the objective of this model is to classify whether or not a patient has heart failure, this is a classification problem. Several machine learning algorithms are suitable for classification; two of them are Logistic Regression and the Random Forest Classifier. We will implement both and compare their performance!

Logistic Regression

Let’s start with logistic regression. To train this model, you can use LogisticRegression from sklearn.linear_model:

from sklearn.linear_model import LogisticRegression

log_model = LogisticRegression()
log_model.fit(X_train, y_train)

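One practical note: with its default max_iter, LogisticRegression can emit a ConvergenceWarning on datasets whose columns have very different magnitudes (like Age vs. Cholesterol here). A common remedy, not used in the original tutorial, is to scale the features first with a pipeline. The sketch below demonstrates this on synthetic data with deliberately mismatched feature scales:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: three features with wildly different scales,
# label determined by the sign of the first feature
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3)) * [1, 100, 1000]
y_demo = (X_demo[:, 0] > 0).astype(int)

# StandardScaler normalizes each feature before the classifier sees it
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_demo, y_demo)
print(model.score(X_demo, y_demo))
```

You could drop the same pipeline into the tutorial in place of the bare LogisticRegression; the scaler is fit on the training data and applied automatically at prediction time.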

Random Forest Classifier

Next, let’s see the random forest classifier implementation. We will use RandomForestClassifier from sklearn.ensemble:

from sklearn.ensemble import RandomForestClassifier

rfc_model = RandomForestClassifier()
rfc_model.fit(X_train, y_train)


Models Comparison

Finally, let’s see how those models perform

print(f"Logistic Regression Score: {log_model.score(X_test, y_test)}")
print(f"Random Forest Classifier Score: {rfc_model.score(X_test, y_test)}")


After I run the code above, I get the following output:

Logistic Regression Score: 0.8369565217391305
Random Forest Classifier Score: 0.8532608695652174


From the result, it can be seen that the Random Forest Classifier scored better, at around 85%.
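For a classifier, score() returns accuracy: the fraction of test samples predicted correctly. A minimal sketch of the same computation by hand, on made-up labels:

```python
import numpy as np

# Illustrative true labels and predictions (not from the dataset)
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0])

# Accuracy = mean of the elementwise "was this prediction right?" flags
accuracy = (y_true == y_pred).mean()
print(accuracy)  # 0.8
```

So the 0.8532… printed above means the random forest classified about 85% of the 184 test patients correctly.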


Full code: https://github.com/luthfisauqi17/machine-learning-predictions/blob/main/heart_failure_prediction.ipynb


There you go, that is how you can make a machine learning model to predict heart failure. Feel free to tweak the code, and let me know if you find a solution that produces a model with a better score!

Thanks for reading this article, and have a nice day!
