Data science fundamentals 03: Exploratory data analysis (Sept. 28, 2017)


Recorded Stream



In this workshop we will do exploratory data analysis (EDA) with the HANES data set. Import the data set and change data types to reflect their correct annotation (housekeeping).

  # Load the package RCurl
  library(RCurl)
  # Import the HANES data set from GitHub; break the string into two for readability
  # (Please note this readability aspect very carefully)
  URL_text_1 <- "https://raw.githubusercontent.com/kannan-kasthuri/kannan-kasthuri.github.io"
  URL_text_2 <- "/master/Datasets/HANES/NYC_HANES_DIAB.csv"
  # Paste it to constitute a single URL 
  URL <- paste(URL_text_1,URL_text_2, sep="")
  HANES <- read.csv(text=getURL(URL))
  # Rename the GENDER factor for identification
  HANES$GENDER <- factor(HANES$GENDER, labels=c("M","F"))
  # Rename the AGEGROUP factor for identification
  HANES$AGEGROUP <- factor(HANES$AGEGROUP, labels=c("20-39","40-59","60+"))
  # Rename the HSQ_1 factor for identification
  HANES$HSQ_1 <- factor(HANES$HSQ_1, labels=c("Excellent","Very Good","Good", "Fair", "Poor"))
  # Rename the DX_DBTS as a factor
  HANES$DX_DBTS <- factor(HANES$DX_DBTS, labels=c("DIAB","DIAB NO_DX","NO DIAB"))
  # Omit all NA from the data frame
  HANES <- na.omit(HANES)
  # Observe the structure
  str(HANES)
## 'data.frame':    1112 obs. of  23 variables:
##  $ KEY              : Factor w/ 1527 levels "133370A","133370B",..: 28 43 44 53 55 70 84 90 100 107 ...
##  $ GENDER           : Factor w/ 2 levels "M","F": 1 1 1 1 1 1 1 1 1 1 ...
##  $ SPAGE            : int  29 28 27 24 30 26 31 32 34 32 ...
##  $ AGEGROUP         : Factor w/ 3 levels "20-39","40-59",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ HSQ_1            : Factor w/ 5 levels "Excellent","Very Good",..: 2 2 2 1 1 3 1 2 1 3 ...
##  $ UCREATININE      : int  105 53 314 105 163 150 46 36 177 156 ...
##  $ UALBUMIN         : num  0.707 1 8 4 3 2 2 0.707 4 3 ...
##  $ UACR             : num  0.00673 2 3 4 2 ...
##  $ MERCURYU         : num  0.37 0.106 0.487 2.205 0.979 ...
##  $ DX_DBTS          : Factor w/ 3 levels "DIAB","DIAB NO_DX",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ A1C              : num  5 5.2 4.8 5.1 4.3 5.2 4.8 5.2 4.8 5.2 ...
##  $ CADMIUM          : num  0.2412 0.1732 0.0644 0.0929 0.1202 ...
##  $ LEAD             : num  1.454 1.019 0.863 1.243 0.612 ...
##  $ MERCURYTOTALBLOOD: num  2.34 2.57 1.32 14.66 2.13 ...
##  $ HDL              : int  42 51 42 61 52 50 57 56 42 44 ...
##  $ CHOLESTEROLTOTAL : int  184 157 145 206 120 155 156 235 156 120 ...
##  $ GLUCOSESI        : num  4.61 4.77 5.16 5 5.11 ...
##  $ CREATININESI     : num  74.3 73 80 84.9 66 ...
##  $ CREATININE       : num  0.84 0.83 0.91 0.96 0.75 0.99 0.9 0.84 0.93 1.09 ...
##  $ TRIGLYCERIDE     : int  156 43 108 65 51 29 31 220 82 35 ...
##  $ GLUCOSE          : int  83 86 93 90 92 85 72 87 96 92 ...
##  $ COTININE         : num  31.5918 0.0635 0.035 0.0514 0.035 ...
##  $ LDLESTIMATE      : int  111 97 81 132 58 99 93 135 98 69 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:415] 2 15 16 24 26 28 33 34 35 39 ...
##   .. ..- attr(*, "names")= chr [1:415] "2" "15" "16" "24" ...
  # Load the tidyverse library
  library(tidyverse)

EDA is informally a set of procedures where we try to obtain directed insights about the data. They may comprise of the following steps:

  1. Generate questions about the data.

  2. Search for answers by visualising, transforming, and modelling the data.

  3. Use the information to generate new questions.

Visualizing variations

It is natural for the same data to vary between measurements. There are several types of variations, including batch effects. Variations gives raise to distributions. There are ways to visualize distributions and they may depend on whether the distribution has categorical values or continuous. To examine the distribution of a categorical variable, we can use a geom_bar() function:

  # Find the distribution of people with diabetes diagnostic status 
  ggplot(data = HANES) +
    geom_bar(mapping = aes(x = DX_DBTS))

Manually, these values can be found with the count() function.

  # Manually find the counts
  HANES %>% count(DX_DBTS)
## # A tibble: 3 x 2
##      DX_DBTS     n
##       <fctr> <int>
## 1       DIAB   104
## 2 DIAB NO_DX    31
## 3    NO DIAB   977

For continous variables (such as GLUCOSE, for example), we can use the histogram geom:

  # Find the distribution of the GLUCOSE levels in patients
  ggplot(data = HANES) +
    geom_histogram(mapping = aes(x = GLUCOSE))

Multiple histogram plots can be overlayed by the geom freqpoly().

  # Find the distribution of the GLUCOSE levels in different age groups
  ggplot(data = HANES, mapping = aes(x = GLUCOSE, colour = AGEGROUP)) +
    geom_freqpoly()


Typical values

In both bar charts and histograms, tall bars show the common values of a variable, and shorter bars show less-common values. Places that do not have bars reveal values that are not seen in this data. To turn this information into useful questions, we have to look for unexpected things:

  1. Which values are the most common? Why? Are they typical?

  2. Which values are rare? Why? Does that match your expectations?

  3. Can we see any unusual patterns? What might explain them?

As an example, the histogram below suggests an interesting question:

  # Find the distribution of the HDL levels in different health status
  ggplot(data = HANES, mapping = aes(x = HDL, colour = HSQ_1)) +
    geom_freqpoly()

Why poor health has lesser mean HDL?

Consider the following histogram:

  # Find the distribution of log(UACR) for different age groups
  ggplot(data = HANES, mapping = aes(x = log(UACR), colour = AGEGROUP)) +
    geom_freqpoly()

Bimodality suggests that subgroups exist in our data. To understand the subgroups, we need to ask:

  1. How are the observations within each cluster similar to each other, in terms of other variables?

  2. How are the observations in separate clusters different from each other, in terms of other variables?

  3. Do the clusters have biological significance?

  4. Why might the appearance of clusters be misleading?


Unusual values and outlier detection

Outliers can be extremely difficult to spot. Sometimes outliers are data entry errors; other times outliers suggest important new science. When we have a lot of data, outliers are sometimes difficult to see in a histogram. For example, take the distribution of the CREATININE variable from our HANES dataset.

  # Find the distribution of the CREATININE levels in patients
  ggplot(data = HANES) +
    geom_histogram(mapping = aes(x = COTININE))

We can hardly spot any outlier. The only evidence of outliers is the unusually wide limits on the x-axis. However, zooming the x-axis by restricting the limits between 0 and 0.5 we see disconnected values suggesting the presense of outliers.

  # Find the distribution of the CREATININE levels in patients
  # restricted to the limits 0 to 0.5
  ggplot(data = HANES) +
    geom_histogram(mapping = aes(x = COTININE)) + xlim(c(0,0.5)) 

Once we have visually inspected the outliers, we can isolate them out by data transformation tools discussed in the previous lecture.

  # From the HANES data set
  unusual <- HANES %>% 
  # filter people with CREATININE levels between 0.42 and 0.37
  filter(CREATININE <= 0.42 & CREATININE > 0.375) %>% 
  # select only the people id (key) and gender
  select(CREATININE, KEY, GENDER) %>%
  # and order CREATININE as a first variable
  arrange(CREATININE)
  # And display it as a tibble
  as.tibble(unusual) 
## # A tibble: 7 x 3
##   CREATININE     KEY GENDER
##        <dbl>  <fctr> <fctr>
## 1       0.38 141520B      F
## 2       0.38 139330A      F
## 3       0.40 443100A      F
## 4       0.40 140660A      F
## 5       0.41 223570A      F
## 6       0.42 220280A      F
## 7       0.42 139150B      F

We see that all of these people are females.


Selected materials and references

R for Data Science - Exploratory Data Analysis Part