Hi everyone! Today I will show you how to build a Machine Learning model for heart failure prediction from scratch. For this tutorial, we will use a dataset from kaggle.com called Heart Failure Prediction Dataset.
You can learn more about this dataset from the following link:
https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction
You can download the dataset to follow along with this tutorial.
Alright, open your Jupyter Notebook and let’s get started!
Step 1: Data Loading
First of all, let’s load the data from the file `heart.csv` using pandas, and check it with the `df.head()` function:
```python
import pandas as pd

df = pd.read_csv('heart.csv')
df.head()
```
If you successfully load the data, the first five rows of data will be shown in your notebook.
Step 2: Data Inspection
Next, let’s dig deeper into the data using the function `df.info()`.
```python
df.info()
```
Run this code, and you will get the column names, non-null counts, and datatypes of the dataset. The following is the result of this code:
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Age             918 non-null    int64
 1   Sex             918 non-null    object
 2   ChestPainType   918 non-null    object
 3   RestingBP       918 non-null    int64
 4   Cholesterol     918 non-null    int64
 5   FastingBS       918 non-null    int64
 6   RestingECG      918 non-null    object
 7   MaxHR           918 non-null    int64
 8   ExerciseAngina  918 non-null    object
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object
 11  HeartDisease    918 non-null    int64
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB
```
From this information, we can see that the columns `Sex`, `ChestPainType`, `RestingECG`, `ExerciseAngina`, and `ST_Slope` have the datatype `object`. We have to convert them so that all columns have a numerical datatype.
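If you would rather not read the column list by eye, here is a minimal sketch of how the non-numeric columns could be picked out programmatically. The `toy` frame and its rows are made up for illustration; only the column names follow `heart.csv`.

```python
import pandas as pd

# Toy frame with made-up rows; only the column names follow heart.csv.
toy = pd.DataFrame({
    "Age": [40, 49, 37],
    "Sex": ["M", "F", "M"],
    "ChestPainType": ["ATA", "NAP", "ASY"],
    "Oldpeak": [0.0, 1.0, 0.5],
})

# Keep the names of every column that pandas does not treat as numeric.
non_numeric = [
    col for col in toy.columns
    if not pd.api.types.is_numeric_dtype(toy[col])
]
print(non_numeric)

# Count the distinct values per column: two values suggests a simple 0/1
# mapping, more than two suggests one-hot encoding.
unique_counts = toy[non_numeric].nunique()
print(unique_counts)
```

On the real dataset you would run the same two steps on `df` instead of `toy`.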
Step 3: Data Cleaning
At this step, we will handle the columns with the `object` datatype. Keep in mind that in pandas, an `object` column usually holds Python strings. To validate that, let’s take the `ChestPainType` column and look at its unique values:
```python
df['ChestPainType'].value_counts()
```
And you will get this result
```
               count
ChestPainType
ASY              496
NAP              203
ATA              173
TA                46
dtype: int64
```
Based on that result, all the values are indeed strings.
“Binary” column processor
Let’s start with the columns that have only two unique values, in this case `Sex` and `ExerciseAngina`. To process them, we can use the pandas `map` function:
```python
# Binary: Sex, ExerciseAngina
df['Sex'] = df['Sex'].map({'F': 0, 'M': 1})
df['ExerciseAngina'] = df['ExerciseAngina'].map({'N': 0, 'Y': 1})
```
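One thing worth knowing about `map` with a dictionary: any value that is not a key in the dictionary silently becomes `NaN`. The sketch below shows this on a made-up series (the lowercase `"m"` is a deliberate typo, not something from the dataset), together with a quick check you could run after mapping.

```python
import pandas as pd

# Made-up values; "m" is a deliberate typo to show what map() does with it.
sex = pd.Series(["M", "F", "m", "F"], name="Sex")

mapped = sex.map({"F": 0, "M": 1})

# Values missing from the dictionary become NaN instead of raising an error,
# so counting NaN after the mapping reveals unexpected categories.
n_unmapped = int(mapped.isna().sum())
print(n_unmapped)
```

If the count is greater than zero on your own data, inspect the original column with `value_counts()` before training.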
One-hot encoding
The columns `ChestPainType`, `RestingECG`, and `ST_Slope` have more than two categories, so they require a technique called one-hot encoding. This technique transforms a categorical variable into a set of binary (0/1) columns, one per category, which is a form that machine learning models can train on.
To do this, we can use the pandas function `get_dummies()` to generate the new columns, join them to the existing dataframe, and drop the original column:
```python
# One-hot encoding: ChestPainType, RestingECG, ST_Slope
df = df.join(pd.get_dummies(df['ChestPainType'], prefix='ChestPainType', dtype=int)).drop(['ChestPainType'], axis=1)
df = df.join(pd.get_dummies(df['RestingECG'], prefix='RestingECG', dtype=int)).drop(['RestingECG'], axis=1)
df = df.join(pd.get_dummies(df['ST_Slope'], prefix='ST_Slope', dtype=int)).drop(['ST_Slope'], axis=1)
```
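As a side note, `get_dummies()` can also take the whole dataframe plus a `columns` list, which encodes several columns and drops the originals in one call. This is an alternative to the join-and-drop approach, not what the tutorial code uses; the `toy` frame below is made up for illustration.

```python
import pandas as pd

# Made-up rows; only the column names follow heart.csv.
toy = pd.DataFrame({
    "Age": [40, 49, 37],
    "ChestPainType": ["ATA", "NAP", "ASY"],
    "ST_Slope": ["Up", "Flat", "Up"],
})

# Passing columns=[...] encodes each listed column, uses its name as the
# prefix, and removes the original column from the result.
encoded = pd.get_dummies(toy, columns=["ChestPainType", "ST_Slope"], dtype=int)
print(encoded.columns.tolist())
```

Note that the new columns are created only for categories that actually appear in the data, which is why `ST_Slope_Down` is missing from this toy result.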
After that, let’s see our “cleaned” dataset
```python
df.info()
```
You will see that all columns now have a numerical datatype, and that several new columns have been added:
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 19 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Age                918 non-null    int64
 1   Sex                918 non-null    int64
 2   RestingBP          918 non-null    int64
 3   Cholesterol        918 non-null    int64
 4   FastingBS          918 non-null    int64
 5   MaxHR              918 non-null    int64
 6   ExerciseAngina     918 non-null    int64
 7   Oldpeak            918 non-null    float64
 8   HeartDisease       918 non-null    int64
 9   ChestPainType_ASY  918 non-null    int64
 10  ChestPainType_ATA  918 non-null    int64
 11  ChestPainType_NAP  918 non-null    int64
 12  ChestPainType_TA   918 non-null    int64
 13  RestingECG_LVH     918 non-null    int64
 14  RestingECG_Normal  918 non-null    int64
 15  RestingECG_ST      918 non-null    int64
 16  ST_Slope_Down      918 non-null    int64
 17  ST_Slope_Flat      918 non-null    int64
 18  ST_Slope_Up        918 non-null    int64
dtypes: float64(1), int64(18)
memory usage: 136.4 KB
```
By the way, you can inspect the dataset visually using the following code
```python
df.hist(figsize=(20, 15))
```
Step 4: Train Machine Learning Model
Finally, let’s create the machine learning model!
Define Feature & Target Data
First of all, we have to separate the “feature data” and the “target data”
```python
X, y = df.drop(['HeartDisease'], axis=1), df['HeartDisease']
```
Train Test Split
Then, both the feature data and the target data need to be split into a “train” dataset and a “test” dataset. The training dataset will be used to train the model, and the test dataset will be used to evaluate the performance of the trained model.
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
```
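An optional variation is to pass `stratify=y`, so that the train and test sets keep the same share of positive cases as the full dataset. The tutorial code does not do this; the sketch below uses synthetic data (100 made-up rows, 30 of them positive) just to show the effect.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 100 rows, 30 positive cases.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"feature": rng.normal(size=100)})
y_demo = pd.Series([1] * 30 + [0] * 70, name="HeartDisease")

# stratify=y_demo keeps the 30/70 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```

Both printed means come out at 0.3 here, because stratification allocates the positive cases proportionally to each split.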
Train Model
Next, we will train the machine learning model using the training data. Since the objective of this model is to classify whether the patient has heart failure or not, this is a classification problem. There are several algorithms for classification problems, and two of them are “Logistic Regression” and “Random Forest Classifier”. We will implement both and compare their performance!
Logistic Regression
Let’s start with logistic regression. To train this model, you can use `LogisticRegression` from `sklearn.linear_model`:
```python
from sklearn.linear_model import LogisticRegression

log_model = LogisticRegression()
log_model.fit(X_train, y_train)
```
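With default settings and unscaled features, `LogisticRegression` may print a `ConvergenceWarning`, because columns like `Cholesterol` have a much larger range than the 0/1 columns. One common way to deal with that, sketched below under the assumption that you are happy to standardize every feature, is to put a `StandardScaler` in front of the model and raise `max_iter`. The data comes from `make_classification` and is purely synthetic; the name `scaled_log_model` is made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the heart dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Blow up one column's scale to mimic a wide-range feature next to small ones.
X_demo[:, 0] *= 100

# The pipeline standardizes the features, then fits the logistic regression.
scaled_log_model = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
scaled_log_model.fit(X_demo, y_demo)

train_acc = scaled_log_model.score(X_demo, y_demo)
print(f"Training accuracy: {train_acc:.3f}")
```

On the real data you would call `fit(X_train, y_train)` and `score(X_test, y_test)` on the pipeline exactly as with `log_model`. The score may differ from the one reported in this article.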
Random Forest Classifier
Next, let’s see how to implement the random forest classifier. We will use `RandomForestClassifier` from `sklearn.ensemble`:
```python
from sklearn.ensemble import RandomForestClassifier

rfc_model = RandomForestClassifier()
rfc_model.fit(X_train, y_train)
```
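A random forest is built with randomness, so two runs of the code above can give slightly different models. If you want repeatable results, you can pass `random_state`. The sketch below also reads the fitted model’s `feature_importances_` to see which columns it relied on most. The data and the column names `f0` to `f3` are synthetic placeholders, not the heart dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with placeholder column names.
X_arr, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3"])

# A fixed random_state makes the fitted forest reproducible between runs.
rfc_demo = RandomForestClassifier(n_estimators=100, random_state=0)
rfc_demo.fit(X_demo, y_demo)

# feature_importances_ holds one non-negative weight per column; they sum to 1.
importances = pd.Series(
    rfc_demo.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances)
```

On the real data, replacing `X_demo` with `X_train` would list the heart dataset columns in order of importance.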
Models Comparison
Finally, let’s see how those models perform
```python
print(f"Logistic Regression Score: {log_model.score(X_test, y_test)}")
print(f"Random Forest Classifier Score: {rfc_model.score(X_test, y_test)}")
```
After I run the code above, I get the following output:
```
Logistic Regression Score: 0.8369565217391305
Random Forest Classifier Score: 0.8532608695652174
```
From the result, it can be seen that the Random Forest Classifier scored better, with an accuracy of around 85% on the test set.
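Keep in mind that the test set has 184 rows, so the gap between the two scores corresponds to just three patients (154 versus 157 correct predictions). A single split can be a bit lucky or unlucky, so one way to get a steadier comparison is k-fold cross-validation. The sketch below is only an illustration on synthetic data from `make_classification`; the names `candidates` and `cv_means` are made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for the heart dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest Classifier": RandomForestClassifier(random_state=0),
}

cv_means = {}
for name, model in candidates.items():
    # cv=5 trains and scores each model on five different train/test folds.
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    cv_means[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

On the real data you would pass `X` and `y` instead of `X_demo` and `y_demo`, and compare the mean scores together with their spread.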
There you go, that is how you can make a machine-learning model to predict heart failure. You can tweak the code around, and let me know if you found a better solution to make a model with a better score!
Thanks for reading this article, and have a nice day!