A data mining project using the National Health and Nutrition Examination Survey dataset

Atrial fibrillation (AF) is the most common sustained heart rhythm disturbance. At present, 25% of the New Zealand population who are 40 years old or more will experience AF in their lifetime. AF increases morbidity and mortality.


Summer of Research

Project by Josh Atwal, supervised by Dr. Jichao Zhao (University of Auckland).


The aim of this project was to utilise the publicly available American National Health and Nutrition Examination Survey (NHANES) dataset to further our understanding of AF by using data mining / statistical techniques to identify key risk factors for AF.

Over the course of this project, a computational framework was developed in the open source language R that uses logistic regression and risk analysis to identify risk factors for AF, or any other disease of interest. The NHANES dataset is large and has a complex survey design, but six key risk factors were ultimately identified, along with numerous other interesting findings.
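One consequence of a complex survey design is that respondents carry sample weights, so simple counts can misstate how common a condition is in the population. The short sketch below illustrates the idea. It is written in Python rather than R purely for illustration, it is not the project's script, and the records and weights are invented.

```python
def weighted_prevalence(records):
    """Share of the weighted population that has the condition.

    records: list of (has_condition, sample_weight) pairs.
    """
    total_weight = sum(weight for _, weight in records)
    case_weight = sum(weight for has_condition, weight in records if has_condition)
    return case_weight / total_weight


# Hypothetical respondents: each weight is the number of people in the
# population that the respondent is assumed to represent.
records = [(True, 1000.0), (False, 3000.0), (False, 5000.0), (True, 1000.0)]

unweighted = sum(1 for has_condition, _ in records if has_condition) / len(records)
weighted = weighted_prevalence(records)
print(f"unweighted: {unweighted:.2f}, weighted: {weighted:.2f}")
```

With these invented numbers, half of the respondents have the condition, but they represent only a fifth of the weighted population. A full analysis would also need to account for strata and clusters when estimating uncertainty, which this sketch does not attempt.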

Overall, the factors associated with the greatest risk were age, hypertension (high blood pressure) and congestive heart failure, each of which increased the risk of developing AF by 2.5 to 3 times. Thyroid condition, obesity and White Non-Hispanic ethnicity were also significant, increasing the risk of AF by around 1.5 times, although obesity was significant only in the relative risk analysis, not in the regression.
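For readers unfamiliar with relative risk, the following minimal sketch shows how such a figure can be computed from a two-by-two table, together with an approximate 95% confidence interval based on the standard error of the log relative risk. It is written in Python rather than R purely for illustration, it is not the project's script, it ignores survey weights, and the counts are hypothetical rather than taken from NHANES.

```python
import math


def relative_risk(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Return (relative risk, lower 95% bound, upper 95% bound)."""
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    rr = risk_exposed / risk_unexposed
    # Standard error of log(RR) for two independent binomial samples.
    se_log_rr = math.sqrt(
        1 / exposed_cases - 1 / exposed_total
        + 1 / unexposed_cases - 1 / unexposed_total
    )
    lower = math.exp(math.log(rr) - 1.96 * se_log_rr)
    upper = math.exp(math.log(rr) + 1.96 * se_log_rr)
    return rr, lower, upper


# Hypothetical counts: 60 AF cases among 1000 people with a risk factor,
# and 40 AF cases among 2000 people without it.
rr, lower, upper = relative_risk(60, 1000, 40, 2000)
print(f"RR = {rr:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

A risk factor is usually called significant at the 5% level when the interval excludes 1, as it does for these made-up counts.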

When performing the regression analysis, the effect of adjusting for certain risk factors was also investigated. When adjusting for age, the only variables that remained significant were the other two high-risk factors: congestive heart failure and high blood pressure. The same pattern was observed when adjusting for congestive heart failure or for high blood pressure: in each case, only the other two high-risk factors remained significant.
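To illustrate what adjusting for a factor means in a logistic regression, here is a minimal sketch. It is written in Python rather than R purely for illustration, it is not the project's script, and it ignores survey weights. The counts are invented so that hypertension is more common in older people but has no effect on AF within an age group. The helper names `solve` and `fit_logistic` are hypothetical.

```python
import math


def solve(A, b):
    """Solve a small linear system A x = b by Gauss-Jordan elimination."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_logistic(groups, n_iter=25):
    """Fit a logistic regression to grouped data by Newton-Raphson.

    groups: list of (x, n, k) where x is the covariate tuple (starting with
    1 for the intercept), n the group size and k the number of cases.
    """
    p = len(groups[0][0])
    beta = [0.0] * p
    for _ in range(n_iter):
        grad = [0.0] * p
        hess = [[0.0] * p for _ in range(p)]
        for x, n, k in groups:
            eta = sum(b * xi for b, xi in zip(beta, x))
            mu = 1.0 / (1.0 + math.exp(-eta))
            for i in range(p):
                grad[i] += x[i] * (k - n * mu)
                for j in range(p):
                    hess[i][j] += x[i] * x[j] * n * mu * (1.0 - mu)
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
    return beta


# Invented cells: (older, hypertension, people, AF cases).
cells = [(0, 0, 800, 8), (0, 1, 200, 2), (1, 0, 400, 20), (1, 1, 600, 30)]

crude = fit_logistic([((1, h), n, k) for _, h, n, k in cells])
adjusted = fit_logistic([((1, a, h), n, k) for a, h, n, k in cells])

crude_or = math.exp(crude[1])        # hypertension, age ignored
adjusted_or = math.exp(adjusted[2])  # hypertension, adjusted for age
age_or = math.exp(adjusted[1])       # age, adjusted for hypertension
print(f"crude OR {crude_or:.2f}, adjusted OR {adjusted_or:.2f}, age OR {age_or:.2f}")
```

In this constructed example the crude odds ratio for hypertension is about 1.74, but it falls to 1.0 once age is included in the model, because age accounts for the whole association. In real data the adjusted estimate would rarely be exactly 1, and significance would be judged from standard errors that respect the survey design.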

One of the most important results from this project was the development of a computational framework in the open source language R that takes a database and a disease of interest (not limited to AF) and identifies important risk factors for that disease through two independent modes of analysis: logistic regression and relative risk analysis.

The dataset can be downloaded here and the R script here.

Josh Atwal is among a group of students who took part in the summer of research programme funded by Precision Driven Health. While at an elementary stage and considered to be a ‘proof of concept’, these summer projects offer fresh insights into what the world of healthcare will look like when precision medicine is fully implemented.