Predicting Heart Disease with Machine Learning

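Introduction

Heart disease refers to a range of conditions that affect how your heart works. It includes blood vessel diseases such as coronary artery disease, heart rhythm problems (arrhythmias), and heart defects you are born with (congenital heart defects), among others.

Heart disease is one of the leading causes of illness and death worldwide. Predicting cardiovascular disease is considered one of the most important topics in clinical data science, and the healthcare industry generates an enormous volume of data.

In this data science project, I apply machine learning techniques to predict whether a person has heart disease.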

Preprocessing

Importing the required modules

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import rcParams
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

Here we will experiment with KNeighborsClassifier:

from sklearn.neighbors import KNeighborsClassifier

Now let's take a closer look at the data:

df = pd.read_csv('heart.csv')
print(df.head())

print(df.info())

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 303 entries, 0 to 302
    Data columns (total 14 columns):
     #   Column    Non-Null Count  Dtype  
    ---  ------    --------------  -----  
     0   age       303 non-null    int64  
     1   sex       303 non-null    int64  
     2   cp        303 non-null    int64  
     3   trestbps  303 non-null    int64  
     4   chol      303 non-null    int64  
     5   fbs       303 non-null    int64  
     6   restecg   303 non-null    int64  
     7   thalach   303 non-null    int64  
     8   exang     303 non-null    int64  
     9   oldpeak   303 non-null    float64
     10  slope     303 non-null    int64  
     11  ca        303 non-null    int64  
     12  thal      303 non-null    int64  
     13  target    303 non-null    int64  
    dtypes: float64(1), int64(13)
    memory usage: 33.3 KB
    None
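
The `info()` output above reports 303 non-null entries for all 14 columns, so there is nothing to impute. As a minimal sketch of how that could be confirmed programmatically, the snippet below uses a small made-up frame in place of `heart.csv` (the values and the name `toy` are hypothetical, and one value is missing on purpose to show what a gap looks like):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data; one value is missing on purpose.
toy = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "chol": [233, 250, np.nan, 236],
    "target": [1, 1, 1, 0],
})

# Count missing values per column; on the real data every count should be 0.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# True only if the whole frame is free of missing values.
has_no_missing = not toy.isnull().values.any()
print(has_no_missing)
```

Running the same two checks on `df` should report zero missing values for every column, matching the `info()` summary.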

print(df.describe())

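Feature Selection

It is always worth checking how the features correlate with one another, so that we can see which are positively correlated and which are negatively correlated. Let's compute the correlation between every pair of features in the dataset.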

corrmat = df.corr()
plt.figure(figsize=(16, 16))
# Plot the correlation matrix as an annotated heat map
g = sns.heatmap(corrmat, annot=True, cmap="RdYlGn")
plt.show()
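
A 14 x 14 heat map can be hard to read at a glance. One possible complement is to rank the features by their correlation with `target` alone. The sketch below assumes a synthetic frame (the column names `pos_feature`, `neg_feature` and `noise` are invented for illustration and are not columns of the heart dataset):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the heart-disease frame (generated values, not real data).
rng = np.random.default_rng(0)
n = 200
target = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "pos_feature": target + rng.normal(0, 0.5, size=n),   # built to rise with target
    "neg_feature": -target + rng.normal(0, 0.5, size=n),  # built to fall with target
    "noise": rng.normal(0, 1, size=n),                    # unrelated to target
    "target": target,
})

# Pearson correlation of every feature with the target.
corr_with_target = toy.corr()["target"].drop("target")

# Order by absolute strength, keeping the sign so direction stays visible.
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)
print(ranked)
```

Applied to the real data, the same three lines with `df` in place of `toy` would list which columns move with or against the target most strongly.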

It is good practice to work with a dataset whose target classes are roughly equal in size. Let's check whether that holds here:

sns.set_style('whitegrid')
sns.countplot(x='target',data=df,palette='RdBu_r')
plt.show()
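
The count plot gives a visual impression of balance; the same check can be expressed as numbers. This is a small sketch on hypothetical labels (the series `labels` stands in for `df['target']` and does not reproduce the real class counts):

```python
import pandas as pd

# Hypothetical labels standing in for df['target'] (not the real data).
labels = pd.Series([1, 0, 1, 1, 0, 1, 0, 1, 0, 1], name="target")

# Share of each class; values near 0.5 for a binary target indicate balance.
class_share = labels.value_counts(normalize=True)
print(class_share)

# Ratio of the smallest class to the largest; 1.0 means perfectly balanced.
balance_ratio = class_share.min() / class_share.max()
print(round(balance_ratio, 2))
```

A ratio close to 1.0 suggests plain accuracy is a reasonable metric; a much smaller ratio would be a reason to look at other metrics or resampling.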

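Data Processing

After exploring the dataset, I observed that some categorical variables need to be converted into dummy variables, and that the numeric values should be scaled before training the machine learning models.

First, I use the get_dummies method to create dummy columns for the categorical variables.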

dataset = pd.get_dummies(df, columns = ['sex', 'cp', 
                                        'fbs','restecg', 
                                        'exang', 'slope', 
                                        'ca', 'thal'])
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
standardScaler = StandardScaler()
columns_to_scale = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
dataset[columns_to_scale] = standardScaler.fit_transform(dataset[columns_to_scale])
dataset.head()
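
To make the two transformations above concrete, here is a minimal sketch on a three-row toy frame (the values are hypothetical). It shows that `get_dummies` creates one indicator column per category value that actually occurs, and that `StandardScaler` centers a column on zero and divides by its population standard deviation:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy rows; 'cp' is treated as categorical, 'age' as numeric.
toy = pd.DataFrame({"age": [40, 50, 60], "cp": [0, 2, 0]})

# One indicator column per observed category value of 'cp'.
encoded = pd.get_dummies(toy, columns=["cp"])
print(list(encoded.columns))

# Standardize 'age': subtract the mean, divide by the population standard deviation.
encoded[["age"]] = StandardScaler().fit_transform(encoded[["age"]])
print(encoded["age"].round(3).tolist())
```

Note that only the category values present in the data get a column, so the encoded frame's width depends on the data it was built from.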

y = dataset['target']
X = dataset.drop(['target'], axis = 1)

from sklearn.model_selection import cross_val_score
knn_scores = []
for k in range(1,21):
    knn_classifier = KNeighborsClassifier(n_neighbors = k)
    score=cross_val_score(knn_classifier,X,y,cv=10)
    knn_scores.append(score.mean())

plt.plot([k for k in range(1, 21)], knn_scores, color = 'red')
for i in range(1,21):
    plt.text(i, knn_scores[i-1], (i, knn_scores[i-1]))
plt.xticks([i for i in range(1, 21)])
plt.xlabel('Number of Neighbors (K)')
plt.ylabel('Scores')
plt.title('K Neighbors Classifier scores for different K values')
plt.show()
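
Instead of reading the best K off the plot, it can be picked from the score list directly. The sketch below uses made-up scores for K = 1 to 5 (illustrative numbers, not the article's results); with the real `knn_scores` list the same two lines apply:

```python
import numpy as np

# Hypothetical mean CV scores for k = 1..5 (not the article's results).
knn_scores = [0.75, 0.80, 0.82, 0.84, 0.83]

# np.argmax returns a 0-based position, while k starts at 1, so add 1.
best_k = int(np.argmax(knn_scores)) + 1
best_score = knn_scores[best_k - 1]
print(best_k, best_score)
```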

knn_classifier = KNeighborsClassifier(n_neighbors = 12)
score=cross_val_score(knn_classifier,X,y,cv=10)
score.mean()

OUTPUT: 0.8448387096774195
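
One caveat about the score above: the scaler was fitted on the full dataset before cross-validation, so each validation fold contributed to the scaling statistics. A common alternative, which is not what the code in this post does, is to place the scaler inside a scikit-learn pipeline so that it is refitted on the training part of every fold. The sketch below assumes synthetic data from `make_classification` (the names `X_demo` and `y_demo` are placeholders), so its score says nothing about the heart dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 303 rows to mirror the dataset size; values are generated.
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)

# The scaler is refitted on the training portion of each fold,
# so the validation portion never influences the scaling statistics.
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=12))
fold_scores = cross_val_score(pipeline, X_demo, y_demo, cv=10)
print(fold_scores.mean())
```

On a dataset of this size the difference is usually small, but the pipeline form keeps the reported score an honest estimate of performance on unseen data.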

Random Forest Classification

from sklearn.ensemble import RandomForestClassifier
randomforest_classifier= RandomForestClassifier(n_estimators=10)
score=cross_val_score(randomforest_classifier,X,y,cv=10)
score.mean()

OUTPUT: 0.8183870967741935
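
Because the forest above is created without a `random_state`, rerunning the cell will generally give a slightly different score. A minimal sketch of making the result repeatable, assuming synthetic data (`X_demo`, `y_demo`) and a hypothetical helper `mean_cv_score`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; values are generated, not taken from heart.csv.
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)

def mean_cv_score(seed):
    """Mean 10-fold accuracy of a 10-tree forest built with the given seed."""
    forest = RandomForestClassifier(n_estimators=10, random_state=seed)
    return cross_val_score(forest, X_demo, y_demo, cv=10).mean()

# Same seed, same folds, same trees: the two runs agree exactly.
first_run = mean_cv_score(42)
second_run = mean_cv_score(42)
print(first_run == second_run)
```

With an integer `cv`, the folds for a classifier are stratified and not shuffled, so the seed on the forest is the only source of randomness that needs fixing here.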

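Conclusion

1. We visualized and analyzed the data with respect to the target variable, including univariate and bivariate analysis.

2. Judging by the mean 10-fold cross-validation accuracy of the models above, KNN (K = 12) reached about 84%, while the random forest reached about 82%.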
