In this workshop we will do exploratory data analysis (EDA) with the HANES data set. Import the data set and change data types to reflect their correct annotation (housekeeping).
# Load the package RCurl
library(RCurl)
# Import the HANES data set from GitHub; break the string into two for readability
# (Please note this readability aspect very carefully)
URL_text_1 <- "https://raw.githubusercontent.com/kannan-kasthuri/kannan-kasthuri.github.io"
URL_text_2 <- "/master/Datasets/HANES/NYC_HANES_DIAB.csv"
# Paste it to constitute a single URL
URL <- paste(URL_text_1,URL_text_2, sep="")
HANES <- read.csv(text=getURL(URL))
# Rename the GENDER factor for identification
HANES$GENDER <- factor(HANES$GENDER, labels=c("M","F"))
# Rename the AGEGROUP factor for identification
HANES$AGEGROUP <- factor(HANES$AGEGROUP, labels=c("20-39","40-59","60+"))
# Rename the HSQ_1 factor for identification
HANES$HSQ_1 <- factor(HANES$HSQ_1, labels=c("Excellent","Very Good","Good", "Fair", "Poor"))
# Rename the DX_DBTS as a factor
HANES$DX_DBTS <- factor(HANES$DX_DBTS, labels=c("DIAB","DIAB NO_DX","NO DIAB"))
# Omit all NA from the data frame
HANES <- na.omit(HANES)
# Observe the structure
str(HANES)
# Load the tidyverse library
library(tidyverse)
## 'data.frame': 1112 obs. of 23 variables:
## $ KEY : Factor w/ 1527 levels "133370A","133370B",..: 28 43 44 53 55 70 84 90 100 107 ...
## $ GENDER : Factor w/ 2 levels "M","F": 1 1 1 1 1 1 1 1 1 1 ...
## $ SPAGE : int 29 28 27 24 30 26 31 32 34 32 ...
## $ AGEGROUP : Factor w/ 3 levels "20-39","40-59",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ HSQ_1 : Factor w/ 5 levels "Excellent","Very Good",..: 2 2 2 1 1 3 1 2 1 3 ...
## $ UCREATININE : int 105 53 314 105 163 150 46 36 177 156 ...
## $ UALBUMIN : num 0.707 1 8 4 3 2 2 0.707 4 3 ...
## $ UACR : num 0.00673 2 3 4 2 ...
## $ MERCURYU : num 0.37 0.106 0.487 2.205 0.979 ...
## $ DX_DBTS : Factor w/ 3 levels "DIAB","DIAB NO_DX",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ A1C : num 5 5.2 4.8 5.1 4.3 5.2 4.8 5.2 4.8 5.2 ...
## $ CADMIUM : num 0.2412 0.1732 0.0644 0.0929 0.1202 ...
## $ LEAD : num 1.454 1.019 0.863 1.243 0.612 ...
## $ MERCURYTOTALBLOOD: num 2.34 2.57 1.32 14.66 2.13 ...
## $ HDL : int 42 51 42 61 52 50 57 56 42 44 ...
## $ CHOLESTEROLTOTAL : int 184 157 145 206 120 155 156 235 156 120 ...
## $ GLUCOSESI : num 4.61 4.77 5.16 5 5.11 ...
## $ CREATININESI : num 74.3 73 80 84.9 66 ...
## $ CREATININE : num 0.84 0.83 0.91 0.96 0.75 0.99 0.9 0.84 0.93 1.09 ...
## $ TRIGLYCERIDE : int 156 43 108 65 51 29 31 220 82 35 ...
## $ GLUCOSE : int 83 86 93 90 92 85 72 87 96 92 ...
## $ COTININE : num 31.5918 0.0635 0.035 0.0514 0.035 ...
## $ LDLESTIMATE : int 111 97 81 132 58 99 93 135 98 69 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:415] 2 15 16 24 26 28 33 34 35 39 ...
## .. ..- attr(*, "names")= chr [1:415] "2" "15" "16" "24" ...
EDA is informally a set of procedures where we try to obtain directed insights about the data. They may comprise of the following steps:
Generate questions about the data.
Search for answers by visualising, transforming, and modelling the data.
Use the information to generate new questions.
It is natural for the same data to vary between measurements.
There are several types of variations, including batch effects. Variations gives raise to distributions.
There are ways to visualize distributions and they may depend on whether the distribution has categorical values or continuous.
To examine the distribution of a categorical variable, we can use a geom_bar()
function:
# Find the distribution of people with diabetes diagnostic status
ggplot(data = HANES) +
geom_bar(mapping = aes(x = DX_DBTS))
Manually, these values can be found with the count()
function.
# Manually find the counts
HANES %>% count(DX_DBTS)
## # A tibble: 3 x 2
## DX_DBTS n
## <fctr> <int>
## 1 DIAB 104
## 2 DIAB NO_DX 31
## 3 NO DIAB 977
For continous variables (such as GLUCOSE, for example), we can use the histogram geom:
# Find the distribution of the GLUCOSE levels in patients
ggplot(data = HANES) +
geom_histogram(mapping = aes(x = GLUCOSE))
Multiple histogram plots can be overlayed by the geom freqpoly()
.
# Find the distribution of the GLUCOSE levels in different age groups
ggplot(data = HANES, mapping = aes(x = GLUCOSE, colour = AGEGROUP)) +
geom_freqpoly()
In both bar charts and histograms, tall bars show the common values of a variable, and shorter bars show less-common values. Places that do not have bars reveal values that are not seen in this data. To turn this information into useful questions, we have to look for unexpected things:
Which values are the most common? Why? Are they typical?
Which values are rare? Why? Does that match your expectations?
Can we see any unusual patterns? What might explain them?
As an example, the histogram below suggests an interesting question:
# Find the distribution of the HDL levels in different health status
ggplot(data = HANES, mapping = aes(x = HDL, colour = HSQ_1)) +
geom_freqpoly()
Why poor health has lesser mean HDL?
Consider the following histogram:
# Find the distribution of log(UACR) for different age groups
ggplot(data = HANES, mapping = aes(x = log(UACR), colour = AGEGROUP)) +
geom_freqpoly()
Bimodality suggests that subgroups exist in our data. To understand the subgroups, we need to ask:
How are the observations within each cluster similar to each other, in terms of other variables?
How are the observations in separate clusters different from each other, in terms of other variables?
Do the clusters have biological significance?
Why might the appearance of clusters be misleading?
Outliers can be extremely difficult to spot.
Sometimes outliers are data entry errors; other times outliers suggest important new science.
When we have a lot of data, outliers are sometimes difficult to see in a histogram.
For example, take the distribution of the CREATININE variable from our HANES dataset.
# Find the distribution of the CREATININE levels in patients
ggplot(data = HANES) +
geom_histogram(mapping = aes(x = COTININE))
We can hardly spot any outlier.
The only evidence of outliers is the unusually wide limits on the x-axis.
However, zooming the x-axis by restricting the limits between 0 and 0.5 we see disconnected values suggesting the presense of outliers.
# Find the distribution of the CREATININE levels in patients
# restricted to the limits 0 to 0.5
ggplot(data = HANES) +
geom_histogram(mapping = aes(x = COTININE)) + xlim(c(0,0.5))
Once we have visually inspected the outliers, we can isolate them out by data transformation tools discussed in the previous lecture.
# From the HANES data set
unusual <- HANES %>%
# filter people with CREATININE levels between 0.42 and 0.37
filter(CREATININE <= 0.42 & CREATININE > 0.375) %>%
# select only the people id (key) and gender
select(CREATININE, KEY, GENDER) %>%
# and order CREATININE as a first variable
arrange(CREATININE)
# And display it as a tibble
as.tibble(unusual)
## # A tibble: 7 x 3
## CREATININE KEY GENDER
## <dbl> <fctr> <fctr>
## 1 0.38 141520B F
## 2 0.38 139330A F
## 3 0.40 443100A F
## 4 0.40 140660A F
## 5 0.41 223570A F
## 6 0.42 220280A F
## 7 0.42 139150B F
We see that all of these people are females.