Data science fundamentals 04: Exploratory data analysis (Oct. 03, 2017)

Recorded Stream

This is the concluding part of EDA workshop. Import the data set and change data types to reflect their correct annotation (housekeeping).

  # Load the package RCurl
  library(RCurl)
  # Import the HANES data set from GitHub; break the string into two for readability
  # (Please note this readability aspect very carefully)
  URL_text_1 <- "https://raw.githubusercontent.com/kannan-kasthuri/kannan-kasthuri.github.io"
  URL_text_2 <- "/master/Datasets/HANES/NYC_HANES_DIAB.csv"
  # Paste it to constitute a single URL 
  URL <- paste(URL_text_1,URL_text_2, sep="")
  HANES <- read.csv(text=getURL(URL))
  # Rename the GENDER factor for identification
  HANES$GENDER <- factor(HANES$GENDER, labels=c("M","F"))
  # Rename the AGEGROUP factor for identification
  HANES$AGEGROUP <- factor(HANES$AGEGROUP, labels=c("20-39","40-59","60+"))
  # Rename the HSQ_1 factor for identification
  HANES$HSQ_1 <- factor(HANES$HSQ_1, labels=c("Excellent","Very Good","Good", "Fair", "Poor"))
  # Rename the DX_DBTS as a factor
  HANES$DX_DBTS <- factor(HANES$DX_DBTS, labels=c("DIAB","DIAB NO_DX","NO DIAB"))
  # Omit all NA from the data frame
  HANES <- na.omit(HANES)
  # Observe the structure
  str(HANES)

## 'data.frame':    1112 obs. of  23 variables:
##  $ KEY              : Factor w/ 1527 levels "133370A","133370B",..: 28 43 44 53 55 70 84 90 100 107 ...
##  $ GENDER           : Factor w/ 2 levels "M","F": 1 1 1 1 1 1 1 1 1 1 ...
##  $ SPAGE            : int  29 28 27 24 30 26 31 32 34 32 ...
##  $ AGEGROUP         : Factor w/ 3 levels "20-39","40-59",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ HSQ_1            : Factor w/ 5 levels "Excellent","Very Good",..: 2 2 2 1 1 3 1 2 1 3 ...
##  $ UCREATININE      : int  105 53 314 105 163 150 46 36 177 156 ...
##  $ UALBUMIN         : num  0.707 1 8 4 3 2 2 0.707 4 3 ...
##  $ UACR             : num  0.00673 2 3 4 2 ...
##  $ MERCURYU         : num  0.37 0.106 0.487 2.205 0.979 ...
##  $ DX_DBTS          : Factor w/ 3 levels "DIAB","DIAB NO_DX",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ A1C              : num  5 5.2 4.8 5.1 4.3 5.2 4.8 5.2 4.8 5.2 ...
##  $ CADMIUM          : num  0.2412 0.1732 0.0644 0.0929 0.1202 ...
##  $ LEAD             : num  1.454 1.019 0.863 1.243 0.612 ...
##  $ MERCURYTOTALBLOOD: num  2.34 2.57 1.32 14.66 2.13 ...
##  $ HDL              : int  42 51 42 61 52 50 57 56 42 44 ...
##  $ CHOLESTEROLTOTAL : int  184 157 145 206 120 155 156 235 156 120 ...
##  $ GLUCOSESI        : num  4.61 4.77 5.16 5 5.11 ...
##  $ CREATININESI     : num  74.3 73 80 84.9 66 ...
##  $ CREATININE       : num  0.84 0.83 0.91 0.96 0.75 0.99 0.9 0.84 0.93 1.09 ...
##  $ TRIGLYCERIDE     : int  156 43 108 65 51 29 31 220 82 35 ...
##  $ GLUCOSE          : int  83 86 93 90 92 85 72 87 96 92 ...
##  $ COTININE         : num  31.5918 0.0635 0.035 0.0514 0.035 ...
##  $ LDLESTIMATE      : int  111 97 81 132 58 99 93 135 98 69 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:415] 2 15 16 24 26 28 33 34 35 39 ...
##   .. ..- attr(*, "names")= chr [1:415] "2" "15" "16" "24" ...

  # Load the tidyverse library
  library(tidyverse)

Missing values

It may be tempting to ignore unsual values, but it is best to replace unusual values by NA. This can be done by ifelse() function:

  # Input HANSES
  HANES_repl_unusual <- HANES %>% 
  # Replace using ifelse
  mutate(CREATININE = ifelse(CREATININE <= 0.42 & CREATININE > 0.375, NA, CREATININE))

ifelse() has three arguments. The first argument test should be a logical vector. The result will contain the value of the second argument, yes, when test is TRUE, and the value of the third argument, no, when it is false.

Sometimes NA could just mean no other option which we may want to compare with the rest of the options. For example, the HANES data set was collected at different times. Some of them who did not have their A1C levels measured may have GLUCOSE levels measured. We may want to compare, say, GLUCOSE levels of people with no A1C information with the levels for the people with A1C information.

# Load the package RCurl
  library(RCurl)
  # Import the HANES data set from GitHub; break the string into two for readability
  # (Please note this readability aspect very carefully)
  URL_text_1 <- "https://raw.githubusercontent.com/kannan-kasthuri/kannan-kasthuri.github.io"
  URL_text_2 <- "/master/Datasets/HANES/HANES.original.csv"
  # Paste it to constitute a single URL 
  URL <- paste(URL_text_1,URL_text_2, sep="")
  HANES_with_NA <- read.csv(text=getURL(URL))

And here is the code to do this -

  # From the HANES_with_NA dataset
  HANES_with_NA %>% 
  # Find the glucose levels of people with no A1C through the
  # logical operator/function is.na()
  mutate(no_A1C = is.na(A1C),
         glucose_level = GLUCOSE) %>% 
  # Now plot the result for the logical categories
  ggplot(mapping = aes(glucose_level)) + 
    geom_freqpoly(mapping = aes(colour = no_A1C), binwidth = 1/4)

We immediately infer that there are only a few such cases (to our delight)!

Covariation

Variation describes the behavior within a variable, covariation describes the behavior between variables.

Covariation is the tendency for the values of two or more variables to vary together in a related way.

The best way to spot covariation is to visualise the relationship between two or more variables.

First we will consider the variation between continuous and categorical variables.

Variation between continuous and categorical variable

One can plot the frequency distribution between continuous and categorical variable. For example, in the HANES data set, let us plot the frequency distribution of mercury in urine (MERCURYU variable) for different age groups.

  # Plot the frequency distribution of mercury for different age groups
  ggplot(data = HANES, mapping = aes(x = MERCURYU)) + 
    geom_freqpoly(mapping = aes(colour = AGEGROUP), binwidth = 1)

This plot that says people in age groups 20-39 is the largest group with average mercury in urine, followed by age group 40-59 and then by 60+, is misleading because the number of people in the group might differ.

This is the case as we can see:

  # Find the distribution of people in each age group
  ggplot(HANES) + 
    geom_bar(mapping = aes(x = AGEGROUP))

Number of people in 20-39 > # of people in 40-59 > # of people in 60+.

To make the comparison easier we need to swap what is displayed on the y-axis.

Instead of displaying count, we’ll display density, which is the count standardised so that the area under each frequency polygon is one.

  # Plot density instead of count for MERCURYU variable
  ggplot(data = HANES, mapping = aes(x = MERCURYU, y = ..density..)) + 
    geom_freqpoly(mapping = aes(colour = AGEGROUP), binwidth = 1)

We can right away see that all groups have equal number of people with average mercury-in-urine levels. In contrast, if we plot the density of blood hemoglobin (A1C) levels for these age groups, we see:

the group 20-39 has the smallest variation in blood hemoglobin levels.

Another alternative to display the distribution of a continuous variable broken down by a categorical variable is the boxplot. A boxplot is a type of visual shorthand for a distribution of values that is popular among statisticians. Each boxplot consists of:

A box that stretches from the 25th percentile of the distribution to the 75th percentile, a distance known as the interquartile range (IQR). In the middle of the box is a line that displays the median, i.e. 50th percentile, of the distribution. These three lines give you a sense of the spread of the distribution and whether or not the distribution is symmetric about the median or skewed to one side.
Visual points that display observations that fall more than 1.5 times the IQR from either edge of the box. These outlying points are unusual so are plotted individually.
A line (or whisker) that extends from each end of the box and goes to the farthest non-outlier point in the distribution.

We can look at the variation between A1C and AGEGROUP through a box plot.

  # Make a box plot for A1C and AGEGROUP
  ggplot(data = HANES, mapping = aes(x = AGEGROUP, y = A1C)) +
    geom_boxplot()

Thus the box plot confirms our observation that the group 20-39 has the smallest variation in blood hemoglobin levels. AGEGROUP is an ordinal (meaning which can be ordered). Some categorical variables such as health status (HSQ_1) cannot be ordered and understanding the trend might be difficult.

  # Make a box plot for HDL and HSQ_1
  ggplot(data = HANES, mapping = aes(x = HSQ_1, y = HDL)) +
    geom_boxplot()

For such comparisons we can make use of reorder argument, say, based on median:

  # Make a box plot for HDL and HSQ_1 based on reordering by median values
  ggplot(data = HANES, mapping = aes(x = reorder(HSQ_1, HDL, FUN = median), y = HDL)) +
    geom_boxplot()

Such reordering helps us to see that average HDL increases as the health status becomes better.

If we have long variable names, geom_boxplot() will work better if you flip it 90°. You can do that with coord_flip().

  # Make a box plot for HDL and HSQ_1 based on reordering by median values
  # and flip coordinates
  ggplot(data = HANES, mapping = aes(x = reorder(HSQ_1, HDL, FUN = median), y = HDL)) +
    geom_boxplot() + coord_flip()

Variation between two categorical variables

To visualise the covariation between categorical variables we can rely on the built-in geom_count() function:

  # Plot two categorical variables AGEGROUP and diabetes status DX_DBTS
  # using geom_count
  ggplot(data = HANES) +
    geom_count(mapping = aes(x = AGEGROUP, y = DX_DBTS))

The size of each circle in the plot displays how many observations occurred at each combination of values. Covariation will appear as a strong correlation between specific x values and specific y values.

We can make a more continuous map using the geom_tile() function:

  # Count the people in age group and health status
  HANES %>% 
  count(AGEGROUP, HSQ_1) %>%  
  ggplot(mapping = aes(x = AGEGROUP, y = HSQ_1)) +
  # and use the geom_tile to fill in the aesthetics
    geom_tile(mapping = aes(fill = n))

Variation between two continuous variables

Covariation between two continuous variables can be viewed using geom_point() function:

  # Plot the variation between LDLESTIMATE and CHOLESTEROLTOTAL
  ggplot(data = HANES) +
    geom_point(mapping = aes(x = LDLESTIMATE, y = CHOLESTEROLTOTAL))

We see the near linear relationship of LDL and cholesterol levels in people.

Scatterplots become less useful as the size of the dataset grows, because points begin to overplot, and pile up into areas of uniform black (as above). One way to fix the problem is using the alpha aesthetic to add transparency.

  # Plot the variation between LDLESTIMATE and CHOLESTEROLTOTAL 
  # adding alpha aesthetics
  ggplot(data = HANES) +
    geom_point(mapping = aes(x = LDLESTIMATE, y = CHOLESTEROLTOTAL), alpha = 1 / 5)

But using transparency can be challenging for very large datasets. Another solution is to use bin.

geom_bin2d() and geom_hex() divide the coordinate plane into 2d bins and then use a fill color to display how many points fall into each bin.

geom_bin2d() creates rectangular bins.

geom_hex() creates hexagonal bins.

Note: We need to install hexbin package to use geom_hex().

  # Plot the variation between LDLESTIMATE and CHOLESTEROLTOTAL 
  # using geom_bin2d
  ggplot(data = HANES) +
    geom_bin2d(mapping = aes(x = LDLESTIMATE, y = CHOLESTEROLTOTAL))

  # install.packages("hexbin")
  # Plot the variation between LDLESTIMATE and CHOLESTEROLTOTAL 
  # using geom_hex
  ggplot(data = HANES) +
    geom_hex(mapping = aes(x = LDLESTIMATE, y = CHOLESTEROLTOTAL))

Another option is to bin one continuous variable so it acts like a categorical variable. Then we can use one of the techniques for visualising the combination of a categorical and a continuous variable that we learned.

For example, we could bin LDLESTIMATE and then for each group, display a boxplot with variable width representing the number of points in the category. This can be accomplished by the argument cut_width() available as an aesthetic along with the varwidth argument available in the geom geom_boxplot. cut_width(x, width) divides x into bins of width width while varwidth = TRUE defines variable width for the box plot depending on the count of the points in the bin.

  # Bin LDLESTIMATE using cut_width argument and do a box plot like we 
  # did between categorical and continuous variable allowing the size 
  # of the box plot represent the number of points - varwidth argument
  ggplot(data = HANES, mapping = aes(x = LDLESTIMATE, y = CHOLESTEROLTOTAL)) + 
    geom_boxplot(mapping = aes(group = cut_width(LDLESTIMATE, 20)), varwidth = TRUE)

Another approach would be to display the same number of points for each bin which can be defined by the argument cut_number.

  # Use the same number of points for LDLESTIMATE using cut_number argument 
  # and do a box plot like we did between categorical and continuous variable 
  ggplot(data = HANES, mapping = aes(x = LDLESTIMATE, y = CHOLESTEROLTOTAL)) + 
    geom_boxplot(mapping = aes(group = cut_number(LDLESTIMATE, 20)))

Patterns and models

Relationships in data can be determined when we observe patterns. Systematic relationship between two variables will result in a pattern. The following are key questions to ask when we observe patterns:

Could the pattern be due to random chance?
If not how can we describe the relationship implied by the pattern?
How strong is the relationship implied by the pattern?
What other variables might affect the relationship?
Does the relationship change if we look at individual subgroups of the data?

For example, we already saw:

  # Make a ggplot out of log(A1C) and log(UACR) variables
  ggplot(data = HANES) + 
    geom_point(mapping = aes(x = log(A1C), y = log(UACR)))

Patterns provide one of the most useful tools for data scientists because they reveal covariation.

If we think of variation as a phenomenon that creates uncertainty, covariation is a phenomenon that reduces it.

If two variables covary, we can use the values of one variable to make better predictions about the values of the second.

If the covariation is due to a causal relationship (a special case), then we can use the value of one variable to control the value of the second.

Models are a tool for extracting patterns out of data which we will cover in more detail later.

For example, in HANES, it’s hard to understand the relationship between health status (HSQ_1) and total cholesterol, because health status and LDL cholesterol, and LDL cholesterol and total cholesterol are tightly related. It’s possible to use a model to remove the very strong relationship between total cholesterol and LDL cholesterol so we can explore the subtleties that remain. The following code fits a model that predicts total cholesterol from LDL cholesterol and then computes the residuals (the difference between the predicted value and the actual value). The residuals give us a view of the total cholesterol of the patients, once the effect of LDL cholesterol has been removed.

  # Load the modelr library
  library(modelr)
  # Compute the model using the `lm` - linear regression function
  mod <- lm(CHOLESTEROLTOTAL ~ LDLESTIMATE , data = HANES)
  # add the residuals to the data and name it as resid
  HANES_mod <- HANES %>% 
    add_residuals(mod) %>% 
      mutate(resid = resid)
  # Plot the total cholesterol and residuals
  ggplot(data = HANES_mod) + 
    geom_point(mapping = aes(x = CHOLESTEROLTOTAL, y = resid))
  # Boxplot the health status and residuals that gives the 
  # effect of total cholesterol removing the effects of LDL
  # and reorder by the median
  ggplot(data = HANES_mod) + 
    geom_boxplot(mapping = aes(x = reorder(HSQ_1, resid, FUN = median), y = resid))

This exercise shows people with “Good” and “Very Good” health have low total cholesterol compared to people with poor health.

Selected materials and references

R for Data Science - Exploratory Data Analysis Part