We will work with the HANES data set. Information on HANES can be found here. We will first import the curated HANES data set into RStudio:
# Load the package RCurl
library(RCurl)
# Import the HANES data set from GitHub; break the string into two for readability
# (note how splitting the long URL keeps the code readable)
URL_text_1 <- "https://raw.githubusercontent.com/kannan-kasthuri/kannan-kasthuri.github.io"
URL_text_2 <- "/master/Datasets/HANES/NYC_HANES_DIAB.csv"
# Paste it to constitute a single URL
URL <- paste(URL_text_1,URL_text_2, sep="")
HANES <- read.csv(text=getURL(URL))
# Rename the GENDER factor for identification
HANES$GENDER <- factor(HANES$GENDER, labels=c("M","F"))
# Rename the AGEGROUP factor for identification
HANES$AGEGROUP <- factor(HANES$AGEGROUP, labels=c("20-39","40-59","60+"))
# Rename the HSQ_1 factor for identification
HANES$HSQ_1 <- factor(HANES$HSQ_1, labels=c("Excellent","Very Good","Good", "Fair", "Poor"))
# Rename the DX_DBTS factor for identification
HANES$DX_DBTS <- factor(HANES$DX_DBTS, labels=c("DIAB","DIAB NO_DX","NO DIAB"))
# Omit all NA from the data frame
HANES <- na.omit(HANES)
# Observe the structure
str(HANES)
## 'data.frame': 1112 obs. of 23 variables:
## $ KEY : Factor w/ 1527 levels "133370A","133370B",..: 28 43 44 53 55 70 84 90 100 107 ...
## $ GENDER : Factor w/ 2 levels "M","F": 1 1 1 1 1 1 1 1 1 1 ...
## $ SPAGE : int 29 28 27 24 30 26 31 32 34 32 ...
## $ AGEGROUP : Factor w/ 3 levels "20-39","40-59",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ HSQ_1 : Factor w/ 5 levels "Excellent","Very Good",..: 2 2 2 1 1 3 1 2 1 3 ...
## $ UCREATININE : int 105 53 314 105 163 150 46 36 177 156 ...
## $ UALBUMIN : num 0.707 1 8 4 3 2 2 0.707 4 3 ...
## $ UACR : num 0.00673 2 3 4 2 ...
## $ MERCURYU : num 0.37 0.106 0.487 2.205 0.979 ...
## $ DX_DBTS : Factor w/ 3 levels "DIAB","DIAB NO_DX",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ A1C : num 5 5.2 4.8 5.1 4.3 5.2 4.8 5.2 4.8 5.2 ...
## $ CADMIUM : num 0.2412 0.1732 0.0644 0.0929 0.1202 ...
## $ LEAD : num 1.454 1.019 0.863 1.243 0.612 ...
## $ MERCURYTOTALBLOOD: num 2.34 2.57 1.32 14.66 2.13 ...
## $ HDL : int 42 51 42 61 52 50 57 56 42 44 ...
## $ CHOLESTEROLTOTAL : int 184 157 145 206 120 155 156 235 156 120 ...
## $ GLUCOSESI : num 4.61 4.77 5.16 5 5.11 ...
## $ CREATININESI : num 74.3 73 80 84.9 66 ...
## $ CREATININE : num 0.84 0.83 0.91 0.96 0.75 0.99 0.9 0.84 0.93 1.09 ...
## $ TRIGLYCERIDE : int 156 43 108 65 51 29 31 220 82 35 ...
## $ GLUCOSE : int 83 86 93 90 92 85 72 87 96 92 ...
## $ COTININE : num 31.5918 0.0635 0.035 0.0514 0.035 ...
## $ LDLESTIMATE : int 111 97 81 132 58 99 93 135 98 69 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:415] 2 15 16 24 26 28 33 34 35 39 ...
## .. ..- attr(*, "names")= chr [1:415] "2" "15" "16" "24" ...
A tibble, or tbl_df, is a modern reimagining of the data frame, keeping what has proven to be effective and throwing out what is not. To convert a data frame to a tibble, we can use the function as_tibble(df), where df is a data frame. We will convert the HANES data frame into a tibble and use the tibble from now on.
# Load the tidyverse library
library(tidyverse)
# Convert HANES data frame into a tibble and observe it
HANES_TIB <- as_tibble(HANES)
HANES_TIB
## # A tibble: 1,112 x 23
## KEY GENDER SPAGE AGEGROUP HSQ_1 UCREATININE UALBUMIN UACR MERCURYU
## * <fct> <fct> <int> <fct> <fct> <int> <dbl> <dbl> <dbl>
## 1 1340… M 29 20-39 Very… 105 0.707 0.00673 0.370
## 2 1344… M 28 20-39 Very… 53 1.00 2.00 0.106
## 3 1344… M 27 20-39 Very… 314 8.00 3.00 0.487
## 4 1346… M 24 20-39 Exce… 105 4.00 4.00 2.21
## 5 1346… M 30 20-39 Exce… 163 3.00 2.00 0.979
## 6 1352… M 26 20-39 Good 150 2.00 1.00 1.48
## 7 1354… M 31 20-39 Exce… 46 2.00 4.00 0.106
## 8 1357… M 32 20-39 Very… 36 0.707 0.0196 0.238
## 9 1360… M 34 20-39 Exce… 177 4.00 2.00 2.30
## 10 1362… M 32 20-39 Good 156 3.00 2.00 1.51
## # ... with 1,102 more rows, and 14 more variables: DX_DBTS <fct>,
## # A1C <dbl>, CADMIUM <dbl>, LEAD <dbl>, MERCURYTOTALBLOOD <dbl>,
## # HDL <int>, CHOLESTEROLTOTAL <int>, GLUCOSESI <dbl>,
## # CREATININESI <dbl>, CREATININE <dbl>, TRIGLYCERIDE <int>,
## # GLUCOSE <int>, COTININE <dbl>, LDLESTIMATE <int>
When we take the logarithm of the variables A1C and UACR, we notice there are two clusters -
# Make a ggplot for the log(A1C) and log(UACR) variables with aesthetic color for the variable DX_DBTS
ggplot(data = HANES_TIB) +
geom_point(mapping = aes(x = log(A1C), y = log(UACR), color=DX_DBTS))
These two clusters are primarily composed of non-diabetic people. Within the set of all non-diabetic people, if we call the lower cluster (i.e., log(UACR) <= -2) LOW-UACR and the upper cluster (i.e., log(UACR) > -2) HIGH-UACR, we would like to construct classification trees to predict these clusters using all variables except UACR. Therefore, we will classify them into these two classes -
# We can use the mutate function to make this class label and remove non-necessary variables
mydata <- HANES_TIB %>% filter(DX_DBTS == "NO DIAB") %>%
mutate(Cluster = ifelse(log(UACR) <= -2, 'LU', 'HU')) %>%
select(everything(), -KEY, -GENDER, -AGEGROUP, -HSQ_1, -DX_DBTS)
# Change names to shorter ones for tree labeling purposes
library(data.table)
setnames(mydata, old = c('MERCURYTOTALBLOOD','CREATININESI', 'CHOLESTEROLTOTAL',
'LDLESTIMATE', 'CADMIUM', 'MERCURYU', 'TRIGLYCERIDE'),
new=c('MERTOTBLD','CREATSI', 'CHOLTOT', 'LDLE', 'CAD', 'MERU', 'TRIG'))
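As a quick sanity check on the class labels we just created, we can tally how many non-diabetic observations fall into each cluster. This is a minimal sketch using the mydata tibble defined above (dplyr's count() comes with the already loaded tidyverse); the counts themselves depend on the data:
# Tally the observations per cluster label
mydata %>% count(Cluster)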
The tree library can be used to construct classification and regression trees. We can use the tree() function to fit a classification tree in order to predict the Cluster variable:
# Load the tree library
library(tree)
# We need to convert the Cluster class into factor to use the tree function
mydata$Cluster <- as.factor(mydata$Cluster)
# Use the tree function to construct the classification tree, removing UACR and
# the variables UALBUMIN and UCREATININE from which UACR is derived
tree.HANES <- tree(Cluster~.-UACR-UALBUMIN-UCREATININE, data = mydata)
# The summary() function lists the variables that are used as internal nodes
# in the tree, the number of terminal nodes, and the training error rate
summary(tree.HANES)
##
## Classification tree:
## tree(formula = Cluster ~ . - UACR - UALBUMIN - UCREATININE, data = mydata)
## Variables actually used in tree construction:
## [1] "MERU" "LDLE" "MERTOTBLD" "CREATSI" "CAD"
## Number of terminal nodes: 10
## Residual mean deviance: 0.6167 = 596.3 / 967
## Misclassification error rate: 0.1218 = 119 / 977
We see that the training error is 12%. For classification trees, the reported deviance is given by,
\[ -2 \sum_{m}\sum_{k}n_{mk}\log \hat{p}_{mk}\]
where \(n_{mk}\) is the number of observations in the \(m\)th terminal node that belong to the \(k\)th class. A small deviance indicates a good fit. The residual mean deviance is simply the deviance divided by \(n-|T_{0}|\), which in this case is \(977-10 = 967\). The plot() and text() functions can be handy in plotting the tree and viewing it -
# Plot the tree and label it
plot(tree.HANES)
text(tree.HANES, use.n=TRUE, all=TRUE, cex=.6)
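As an aside, the residual mean deviance reported by summary() can be recovered directly from the fitted tree object. The following is a minimal sketch, assuming the tree package's deviance() method for tree objects and that mydata still contains all 977 rows used to fit tree.HANES:
# Total deviance of the fitted classification tree
tot_dev <- deviance(tree.HANES)
# Number of terminal (leaf) nodes, |T0|
n_leaves <- sum(tree.HANES$frame$var == "<leaf>")
# Residual mean deviance = deviance / (n - |T0|); should match the summary() output above
tot_dev / (nrow(mydata) - n_leaves)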
We should technically use the test error instead of the training error. We will split the observations into a training set and a test set, build the tree using the training set, and evaluate its performance on the test data. The predict() function can be used for this purpose. In the case of a classification tree, the argument type="class" instructs R to return the actual class prediction.
# Set the seed and sample 700 observations for the training set
set.seed(2)
train <- sample(1:nrow(mydata), 700)
# Form the test data by subtracting the training data indices
test <- setdiff(seq(1,nrow(mydata)), train)
# Make HANES test data
HANES.test <- mydata[test, ]
# Form the vector of true cluster labels for the test set, for tabulating the accuracy
Cluster.test <- as.character(mydata$Cluster[test])
# Make the tree on the training data
tree.mydata <- tree(Cluster~.-UACR-UALBUMIN-UCREATININE, mydata, subset=train)
# Use predict() function to make a prediction
tree.pred <- predict(tree.mydata, HANES.test, type="class")
# Form the accuracy table/confusion matrix
table(tree.pred,Cluster.test)
## Cluster.test
## tree.pred HU LU
## HU 227 25
## LU 20 5
We see the accuracy is (227+5)/277 ≈ 0.84, or about 84%.
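Instead of working this out by hand from the confusion matrix, we can obtain the same number directly in R; a minimal sketch using the tree.pred and Cluster.test objects from above:
# Overall accuracy = fraction of test observations whose predicted class matches the true class
conf_mat <- table(tree.pred, Cluster.test)
sum(diag(conf_mat)) / sum(conf_mat)
# Equivalently
mean(as.character(tree.pred) == Cluster.test)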
Next we will check whether pruning the tree improves the accuracy. We can use the cv.tree() function, which performs cross-validation to determine the optimal tree complexity. It reports the number of terminal nodes of each tree considered (size), as well as the corresponding error rate and the value of the cost-complexity parameter (k).
# Set the seed
set.seed(3)
# Use cv.tree() function to do cross-validation to prune and find the optimal tree
cv.mydata <- cv.tree(tree.mydata, FUN=prune.misclass)
# List and show the cross-validated data frame returned by the cv.tree() function
names(cv.mydata)
## [1] "size" "dev" "k" "method"
cv.mydata
## $size
## [1] 27 22 19 15 9 6 5 1
##
## $dev
## [1] 115 116 115 114 100 100 102 95
##
## $k
## [1] -Inf 0.0000000 0.3333333 0.7500000 1.5000000 1.6666667 2.0000000
## [8] 3.2500000
##
## $method
## [1] "misclass"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
# Plot the deviance vs. the size of the tree and the pruning parameter k/alpha
par(mfrow=c(1,2))
plot(cv.mydata$size, cv.mydata$dev, type="b")
plot(cv.mydata$k, cv.mydata$dev, type="b")
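A common heuristic is to pick the size with the smallest cross-validated error (with FUN=prune.misclass, the dev component holds the number of misclassified observations). Here is a minimal sketch using the cv.mydata object from above; note that in this case the rule would select the trivial single-node tree (95 errors), so below we instead prune to 6 terminal nodes, whose cross-validated error (100) is nearly as small while keeping an informative tree:
# Size with the smallest cross-validated misclassification count
best.size <- cv.mydata$size[which.min(cv.mydata$dev)]
best.size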
# Based on the cross-validation results, prune the tree to 6 terminal nodes
prune.mydata <- prune.misclass(tree.mydata, best=6)
# Plot the pruned tree
plot(prune.mydata)
text(prune.mydata,cex=.6)
# Check whether the accuracy increases on the pruned tree by applying the test data
tree.pred <- predict(prune.mydata, HANES.test, type="class")
table(tree.pred,Cluster.test)
## Cluster.test
## tree.pred HU LU
## HU 239 26
## LU 8 4
We see the test accuracy with the pruned tree is (239+4)/277 ≈ 0.88, or about 88%. Not only has pruning via cross-validation increased the accuracy, it has also produced a more interpretable tree.