Predicting Heart Disease - Lecture Notes | MGS 8040, Papers of Business Management and Analysis

Material Type: Paper; Class: DATA MINING; Subject: MANAGERIAL SCIENCES; University: Georgia State University; Term: Spring 2007;

PREDICTING HEART DISEASE

SEARCHING FOR COMMON FACTORS IN THE DIAGNOSIS OF HEART DISEASE

HEART RESEARCH CONSULTANTS, INC.
PARIMAL DESAI | MICHAEL REINHOLD | PAUL SPOSITO
MGS 8040
SPRING 2007

• the performance of the model used to forecast future data
• recommendations derived from the analysis and the prediction model

REGRESSION METHODOLOGY

The main objective of this paper is to create a model by which physicians and patients can clearly see the risk factors surrounding heart disease and how greatly (or minimally) each of these factors contributes to the disease. This model will be created by way of the following processes:

• data description
• data preparation
• datasets
• dummy variables
• regression

DATA DESCRIPTION

Source

The data used for this model comes from CorMac Technologies, which collected the data from four participating hospitals: Cleveland Clinic Foundation, Hungarian Institute of Cardiology, V.A. Medical Center (Long Beach, CA), and University Hospital (in Zurich, Switzerland). The data is publicly available at http://www.cormactech.com/neunet.

Independent Variables

The raw database contains 76 attributes of patients who were and were not diagnosed with heart disease. However, only fourteen of these 76 attributes were actually used. The following table displays the independent variables beneath their specific type of variable:

Demographic   Numerical                           Classification (Non-Numerical in Nature)
Age           Resting Blood Pressure (in mm Hg)   Chest Pain Type
Sex           Cholesterol (in mg/dl)              Exercise-Induced Angina
              Fasting Blood Sugar                 Resting ECG
              ST Depression Induced               Slope of Peak Exercise ST
              Maximum Heart Rate Achieved         Defect Classification
              Vessels Colored by Fluoroscopy

The complete list of variables and their descriptions is located in Appendix B. The list of variables actually used in the prediction model is located in Appendix A.

Dependent Variable

The 58th variable (“num”) is the dependent variable, “diagnosis of heart disease.” This binary variable records the presence of heart disease as “1” (>50% narrowing of blood vessels) and its absence as “0” (<50% narrowing of blood vessels).

Observations

The total number of observations for all four participating hospitals is

  1. However, only the Cleveland dataset was used, because it contains the most appropriate data; it contains only 303 observations.

DATA PREPARATION

The data originated from comma-separated text files, which were loaded directly into SAS with minimal additional formatting. No variables were added to or removed from the dataset.

DATASETS

The Cleveland dataset used for the model has 303 observations. It was split into a development dataset containing 151 observations and a validation dataset containing 152 observations. The observations were split using a random integer, with the validation set taking the lowest half of the observations and the development set taking the highest.

DUMMY VARIABLES

In determining the dummy variables of the dataset, frequency tables were created for each variable. Each frequency table grouped observation values into ranges of between 2% and 10%, in order to identify outliers, missing values, and other special cases. These frequency tables are located in their entirety in Appendix C.

Normally, we would take this opportunity to find variables that show no correlation with the dependent variable or have comparatively small sample sizes, and then remove them from consideration in the model. However,
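The development/validation split described under DATASETS can be sketched as follows. This is a minimal illustration in Python rather than the SAS the authors used; the function name, the seed, and the range of the random integers are assumptions made for the example, not details from the paper.

```python
import random


def split_development_validation(observations, seed=2007):
    """Split observations using a random integer key per observation.

    The lowest half of the keys goes to validation and the highest half
    to development, as described in the text. The seed is hypothetical
    and only makes the sketch reproducible.
    """
    rng = random.Random(seed)
    # Attach a random integer to each observation; the index breaks ties.
    keyed = [(rng.randrange(1_000_000), i, obs)
             for i, obs in enumerate(observations)]
    keyed.sort(key=lambda item: (item[0], item[1]))
    # With 303 observations the lower half holds 152 and the upper 151.
    n_validation = (len(keyed) + 1) // 2
    validation = [obs for _, _, obs in keyed[:n_validation]]
    development = [obs for _, _, obs in keyed[n_validation:]]
    return development, validation


# Stand-in data: 303 observation IDs instead of real patient records.
development, validation = split_development_validation(list(range(303)))
print(len(development), len(validation))  # 151 152
```

Sorting on a random key rather than slicing the file in its stored order avoids carrying any ordering in the source file into the two sets.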

• condition of angina
• the number of blood vessels colored by fluoroscopy

Type of Heart Defect

The type of heart defect a patient has can play a significant role in whether that patient will develop heart disease. The data in the original report classifies the presence of heart defects into three groups:

• normal (no defect)
• reversible defect
• fixed defect (no cure)

A “normal” classification means no defect at all, and it has a small number of cases where patients have been diagnosed with heart disease. The reversible defect seems to result in a higher incidence of heart disease than no defect does, and a fixed (or incurable) defect has a higher incidence still.

Resting Electrocardiograph Results

Like the heart defect classification, the ECG results rate each patient on the severity of the ECG reading as follows:

• normal
• having ST-T wave abnormality
• showing probable or definite left ventricular hypertrophy

These classifications of an ECG readout can be read similarly to the results of the heart defect “scale.” “Normal” would classify a healthy patient, with little incidence of heart disease. “ST-T wave abnormality” is of greater importance to the patient, though it is not directly tied to the onset of heart disease, as it can also be the effect of certain prescription drugs, neurogenic factors, and metabolic factors such as hypoglycemia.^2 However, left ventricular hypertrophy, a very serious condition, is “a thickening of your heart muscle’s main pumping chamber (left ventricle).” It causes the muscle in this area to become overworked, which leads to it wearing out and eventually failing. A patient with this condition would be extremely susceptible to heart disease.^3 Our calculations and analysis found this to be the case.

Severity of Angina

The angina variable simply classifies the patient’s angina as either “good” or “bad.” Angina is, simply, chest pain. It usually occurs “when your heart muscle does not get enough blood.” A symptom of coronary heart disease, it is frequently a sign of atherosclerosis and can eventually lead to a heart attack.^4 There are three different types of angina identified by physicians; however, the data from this research only classifies angina into “good” and “bad,” which doesn’t allow for much in-depth analysis. Our regression analysis confidently arrived at the obvious: “bad” results pointed toward a diagnosis of heart disease, while “good” identified a healthy patient.

SCORECARD

Variable         Range              Points
Age
Angina           good               -1.
Blood Pressure
Chest Pain       Typical Angina     0.
                 Non-Anginal Pain   -1.

[Figure: K-S Test for Response Model. Cumulative percentage of P_Resp and P_NonResp plotted against score (score cutoff for heart attack, from 1 down to 0).]
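A K-S chart of this kind plots the cumulative percentage of responders (P_Resp) and non-responders (P_NonResp) captured at each score cutoff; the K-S statistic is the largest vertical gap between the two curves. The sketch below shows that calculation under stated assumptions: the function name and the toy scores are hypothetical and are not the paper's data or results.

```python
def ks_statistic(scores, labels):
    """Return (largest gap, cutoff) between the cumulative shares of
    responders (label 1) and non-responders (label 0).

    Scores are walked from highest to lowest, matching a chart whose
    score axis runs from 1 down to 0.
    """
    n_resp = sum(1 for y in labels if y == 1)
    n_non = len(labels) - n_resp
    if n_resp == 0 or n_non == 0:
        raise ValueError("need at least one responder and one non-responder")

    pairs = sorted(zip(scores, labels), reverse=True)
    cum_resp = cum_non = 0
    best_gap, best_cutoff = 0.0, None
    i = 0
    while i < len(pairs):
        cutoff = pairs[i][0]
        # Consume every observation tied at this score before comparing.
        while i < len(pairs) and pairs[i][0] == cutoff:
            if pairs[i][1] == 1:
                cum_resp += 1
            else:
                cum_non += 1
            i += 1
        gap = abs(cum_resp / n_resp - cum_non / n_non)
        if gap > best_gap:
            best_gap, best_cutoff = gap, cutoff
    return best_gap, best_cutoff


# Toy example (hypothetical scores): four responders, four non-responders.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
toy_labels = [1, 1, 1, 0, 1, 0, 0, 0]
print(ks_statistic(toy_scores, toy_labels))  # (0.75, 0.7)
```

A larger gap means the score separates the two groups more cleanly, and the cutoff at which the gap peaks is a natural candidate threshold.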

RECOMMENDATIONS

POSITIVE DIAGNOSIS INDICATORS

According to the model, the patients most likely to be diagnosed with heart disease possess the following conditions:

• over the age of 59
• significant angina
• blood pressure over 150
• heart rate over 156

NEGATIVE DIAGNOSIS INDICATORS

The model also leads us to the following conditions most likely to identify patients without heart disease:

• under the age of 54
• blood pressure less than 120
• heart rate between 118 and 142
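The indicator lists above can be expressed as a simple tally. The sketch below only counts how many positive and negative indicators a patient meets, using the thresholds stated in the lists; it is not the paper's scorecard or regression model. The function name, the boolean encoding of "significant angina," and the inclusive bounds on the heart rate range are assumptions.

```python
def count_indicators(age, blood_pressure, heart_rate, significant_angina):
    """Count the positive and negative diagnosis indicators a patient meets.

    Thresholds come from the lists in the text. Treating the 118-142
    heart rate range as inclusive is an assumption.
    """
    positive = sum([
        age > 59,
        bool(significant_angina),
        blood_pressure > 150,
        heart_rate > 156,
    ])
    negative = sum([
        age < 54,
        blood_pressure < 120,
        118 <= heart_rate <= 142,
    ])
    return positive, negative


# Two hypothetical patients, one at each extreme.
print(count_indicators(63, 160, 160, True))   # (4, 0)
print(count_indicators(45, 110, 130, False))  # (0, 3)
```

A patient between the thresholds (for example, age 56) meets neither the positive nor the negative age indicator, so the two counts are not complements of each other.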

SUGGESTIONS

Due to the relatively small number of observations available to researchers, the model is less than completely reliable. A larger dataset would help to solidify the researchers' findings, as well as fine-tune the model to accommodate a larger population.

Physicians and lab scientists could use this model to formulate a decision tree in order to quickly and accurately diagnose patients admitted to an emergency room. This would cut down on the likelihood of misdiagnosis, which could lead to either death or unnecessary costs for both the patient and the hospital.

APPENDIX

APPENDIX A: USED VARIABLES

 #  Code      Description (Values)
 3  age       age in years

              (0 = female)
 5  painloc   chest pain location (1 = substernal; 0 = otherwise)
 6  painexer  pain provoked by exertion? (1 = yes; 0 = no)
 7  relrest   relieved after rest? (1 = yes; 0 = no)
 8  pncaden   sum of #5, #6, and #
 9  cp        chest pain type (1 = typical angina; 2 = atypical angina; 3 = non-anginal pain; 4 = asymptomatic)
10  trestbps  resting blood pressure (in mmHg on admission to the hospital)
11  htn
12  chol      serum cholesterol in mg/dl
13  smoke     smoke cigarettes? (1 = yes; 0 = no)
14  cigs      cigarettes per day
15  years     number of years as a smoker
16  fbs       fasting blood sugar > 120 mg/dl? (1 = true; 0 = false)
17  dm        history of diabetes? (1 = yes; 0 = no)
18  famhist   family history of coronary artery disease (1 = yes; 0 = no)
19  restecg   resting electrocardiographic results (0 = normal; 1 = having ST-T wave abnormality; 2 = showing probable or definite left ventricular hypertrophy by Estes' criteria)
20  ekgmo     month of exercise ECG reading
21  ekgday    day of exercise ECG reading
22  ekgyr     year of exercise ECG reading
23  dig       digitalis used during exercise ECG (1 = yes; 0 = no)
24  prop      beta blocker used during exercise ECG (1 = yes; 0 = no)
25  nitr      nitrates used during exercise ECG (1 = yes; 0 = no)
26  pro       calcium channel blocker used during exercise ECG (1 = yes; 0 = no)
27  diuretic  diuretic used during exercise ECG (1 = yes; 0 = no)
28  proto     exercise protocol (1 = Bruce; 2 = Kottus; 3 = McHenry; 4 = fast Balke; 5 = Balke; 6 = Noughton; 7 = bike 150 kpa/min; 8 = bike 125 kpa/min; 9 = bike 100 kpa/min; 10 = bike 75 kpa/min; 11 = bike 50 kpa/min; 12 = arm ergometer)
29  thaldur   duration of exercise test in minutes
30  thaltime  time when ST measure depression was noted
31  met       mets achieved
32  thalach   maximum heart rate achieved
33  thalrest  resting heart rate
34  tpeakbps  peak exercise blood pressure (first of 2 parts)
35  tpeakbpd  peak exercise blood pressure (second of 2 parts)
36  dummy
37  trestbpd  resting blood pressure
38  exang     exercise induced angina (1 = yes; 0 = no)
39  xhypo     (1 = yes; 0 = no)
40  oldpeak   ST depression induced by exercise relative to rest
41  slope     the slope of the peak exercise ST segment (1 = upsloping; 2 = flat; 3 = downsloping)
42  rldv5     height at rest
43  rldv5e    height at peak exercise
44  ca        number of major vessels colored by fluoroscopy (0 - 3)
45  restckm   irrelevant
46  exerckm   irrelevant
47  restef    rest radionuclide ejection fraction
48  restwm    rest wall motion abnormality (0 = none; 1 = mild or moderate; 2 = moderate or severe; 3 = akinesis or dyskmem)
49  exeref    exercise radionuclide ejection fraction
50  exerwm    exercise wall motion
51  thal      type of defect (3 = normal; 6 = fixed defect; 7 = reversible defect)
52  thalsev   not used
53  thalpul   not used
54  earlobe   not used
55  cmo       month of cardiac cath
56  cday      day of cardiac cath