Data science fundamentals 02: Transform and explore (Sept. 26, 2017)

Recorded Stream

We will work with HANES data set. In the spirit of the data transformation section in R for Data Science, we will look into dplyr() functions. As usual, we will first import the curated HANES data set in RStudio:

  # Load the package RCurl
  library(RCurl)
  # Import the HANES data set from GitHub; break the string into two for readability
  # (Please note this readability aspect very carefully)
  URL_text_1 <- "https://raw.githubusercontent.com/kannan-kasthuri/kannan-kasthuri.github.io"
  URL_text_2 <- "/master/Datasets/HANES/NYC_HANES_DIAB.csv"
  # Paste it to constitute a single URL 
  URL <- paste(URL_text_1,URL_text_2, sep="")
  HANES <- read.csv(text=getURL(URL))
  # Rename the GENDER factor for identification
  HANES$GENDER <- factor(HANES$GENDER, labels=c("M","F"))
  # Rename the AGEGROUP factor for identification
  HANES$AGEGROUP <- factor(HANES$AGEGROUP, labels=c("20-39","40-59","60+"))
  # Rename the HSQ_1 factor for identification
  HANES$HSQ_1 <- factor(HANES$HSQ_1, labels=c("Excellent","Very Good","Good", "Fair", "Poor"))
  # Rename the DX_DBTS as a factor
  HANES$DX_DBTS <- factor(HANES$DX_DBTS, labels=c("DIAB","DIAB NO_DX","NO DIAB"))
  # Omit all NA from the data frame
  HANES <- na.omit(HANES)
  # Observe the structure
  str(HANES)

## 'data.frame':    1112 obs. of  23 variables:
##  $ KEY              : Factor w/ 1527 levels "133370A","133370B",..: 28 43 44 53 55 70 84 90 100 107 ...
##  $ GENDER           : Factor w/ 2 levels "M","F": 1 1 1 1 1 1 1 1 1 1 ...
##  $ SPAGE            : int  29 28 27 24 30 26 31 32 34 32 ...
##  $ AGEGROUP         : Factor w/ 3 levels "20-39","40-59",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ HSQ_1            : Factor w/ 5 levels "Excellent","Very Good",..: 2 2 2 1 1 3 1 2 1 3 ...
##  $ UCREATININE      : int  105 53 314 105 163 150 46 36 177 156 ...
##  $ UALBUMIN         : num  0.707 1 8 4 3 2 2 0.707 4 3 ...
##  $ UACR             : num  0.00673 2 3 4 2 ...
##  $ MERCURYU         : num  0.37 0.106 0.487 2.205 0.979 ...
##  $ DX_DBTS          : Factor w/ 3 levels "DIAB","DIAB NO_DX",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ A1C              : num  5 5.2 4.8 5.1 4.3 5.2 4.8 5.2 4.8 5.2 ...
##  $ CADMIUM          : num  0.2412 0.1732 0.0644 0.0929 0.1202 ...
##  $ LEAD             : num  1.454 1.019 0.863 1.243 0.612 ...
##  $ MERCURYTOTALBLOOD: num  2.34 2.57 1.32 14.66 2.13 ...
##  $ HDL              : int  42 51 42 61 52 50 57 56 42 44 ...
##  $ CHOLESTEROLTOTAL : int  184 157 145 206 120 155 156 235 156 120 ...
##  $ GLUCOSESI        : num  4.61 4.77 5.16 5 5.11 ...
##  $ CREATININESI     : num  74.3 73 80 84.9 66 ...
##  $ CREATININE       : num  0.84 0.83 0.91 0.96 0.75 0.99 0.9 0.84 0.93 1.09 ...
##  $ TRIGLYCERIDE     : int  156 43 108 65 51 29 31 220 82 35 ...
##  $ GLUCOSE          : int  83 86 93 90 92 85 72 87 96 92 ...
##  $ COTININE         : num  31.5918 0.0635 0.035 0.0514 0.035 ...
##  $ LDLESTIMATE      : int  111 97 81 132 58 99 93 135 98 69 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:415] 2 15 16 24 26 28 33 34 35 39 ...
##   .. ..- attr(*, "names")= chr [1:415] "2" "15" "16" "24" ...

We will see more about tibbles in the wrangle section but we will make a short note of it now.

A tibble, or tbl_df, is a modern reimagining of the data frame, keeping what time has proven to be effective, and throwing out what is not. Tibbles are data.frames that are lazy and surly: they do less (i.e. they don’t change variable names or types, and don’t do partial matching) and complain more (e.g. when a variable does not exist). This forces one to confront problems earlier, typically leading to cleaner, more expressive code. Tibbles also have an enhanced print method() which makes them easier to use with large datasets containing complex objects.

To convert a data frame to a tibble, we can use the function as.tibble(df) where df is a data frame. We will convert the HANES data frame into tibble and use the tibble from now on.

  # Load the tidyverse library
  library(tidyverse)
  # Convert HANES data frame into a tibble and observe it
  HANES_TIB <- as.tibble(HANES)
  HANES_TIB

## # A tibble: 1,112 x 23
##        KEY GENDER SPAGE AGEGROUP     HSQ_1 UCREATININE UALBUMIN
##  *  <fctr> <fctr> <int>   <fctr>    <fctr>       <int>    <dbl>
##  1 134040A      M    29    20-39 Very Good         105    0.707
##  2 134460B      M    28    20-39 Very Good          53    1.000
##  3 134490A      M    27    20-39 Very Good         314    8.000
##  4 134620A      M    24    20-39 Excellent         105    4.000
##  5 134670A      M    30    20-39 Excellent         163    3.000
##  6 135210B      M    26    20-39      Good         150    2.000
##  7 135420A      M    31    20-39 Excellent          46    2.000
##  8 135740A      M    32    20-39 Very Good          36    0.707
##  9 136020A      M    34    20-39 Excellent         177    4.000
## 10 136250A      M    32    20-39      Good         156    3.000
## # ... with 1,102 more rows, and 16 more variables: UACR <dbl>,
## #   MERCURYU <dbl>, DX_DBTS <fctr>, A1C <dbl>, CADMIUM <dbl>, LEAD <dbl>,
## #   MERCURYTOTALBLOOD <dbl>, HDL <int>, CHOLESTEROLTOTAL <int>,
## #   GLUCOSESI <dbl>, CREATININESI <dbl>, CREATININE <dbl>,
## #   TRIGLYCERIDE <int>, GLUCOSE <int>, COTININE <dbl>, LDLESTIMATE <int>

Tranformation basics are six key dplyr library functions that will allow us to solve vast majority of data manipulation challenges:

We can pick observations by their values using the function filter().
Reorder the rows applying arrange() operation to a data set.
We can choose variables by their names using select().
Create new variables through mutate().
Find the summary using summarise() function.
And the most important of all which is the group_by() operation which can be used with any of the above five.

The `filter()` function

The filter function allows us to pick a subset of information (or rows) from our dataset.

  # Pick all the records of patients with age == 45 and HDL value of 50
  filter(HANES, SPAGE == 45, HDL == 50)

##       KEY GENDER SPAGE AGEGROUP HSQ_1 UCREATININE UALBUMIN UACR  MERCURYU
## 1 329120A      M    45    40-59  Good          31        2    6 0.1822157
## 2 329270B      F    45    40-59  Good         150        5    3 0.5092553
##   DX_DBTS A1C CADMIUM   LEAD MERCURYTOTALBLOOD HDL CHOLESTEROLTOTAL
## 1 NO DIAB 4.8  0.9012 3.9006            3.2254  50              192
## 2    DIAB 6.8  0.1798 0.8202            0.8664  50              144
##   GLUCOSESI CREATININESI CREATININE TRIGLYCERIDE GLUCOSE COTININE
## 1      4.72           92       1.04          108      85   0.0757
## 2      9.05           57       0.64          133     163   0.3407
##   LDLESTIMATE
## 1         120
## 2          67

Note: All dplyr functions write the result to a new data frame. If we want to store the data, we need to assign the result to a variable.

We can also use comparison and logical operators. For instance,

  # Pick all the records of patients who are between 45 and 46 and HDL value of 50
  filter(HANES, SPAGE >= 45 & SPAGE <= 46, HDL == 50)

##       KEY GENDER SPAGE AGEGROUP HSQ_1 UCREATININE UALBUMIN UACR  MERCURYU
## 1 329120A      M    45    40-59  Good          31        2    6 0.1822157
## 2 550110A      F    46    40-59  Good          54        3    6 0.3860837
## 3 329270B      F    45    40-59  Good         150        5    3 0.5092553
##   DX_DBTS A1C CADMIUM   LEAD MERCURYTOTALBLOOD HDL CHOLESTEROLTOTAL
## 1 NO DIAB 4.8  0.9012 3.9006            3.2254  50              192
## 2 NO DIAB 5.1  0.3584 0.9674            1.2028  50              188
## 3    DIAB 6.8  0.1798 0.8202            0.8664  50              144
##   GLUCOSESI CREATININESI CREATININE TRIGLYCERIDE GLUCOSE COTININE
## 1      4.72           92       1.04          108      85   0.0757
## 2      5.16           59       0.67           77      93   0.0350
## 3      9.05           57       0.64          133     163   0.3407
##   LDLESTIMATE
## 1         120
## 2         123
## 3          67

will list all the records for patients with age between \(45\) and \(46\) and who have the HDL value of \(50\).

Note: filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA values.

Classwork/Homework:

Find all patients who

Are females and in between ages \(45\) and \(60\)
Are in excellent or good health status
Who have diabetes but not yet diagnosed
Have total cholesterol between \(150\) and \(200\)
Have glucose levels greater than \(75\)
Who had poor health status but with more than median HDL values
High content of mercury (more than median) in their urine

How many patients have a missing UACR value in the HANES dataset?

The `arrange()` function

arrange() works similar to filter function except that instead of selecting rows, it changes their order. It takes a data frame and a set of column names (or more complicated expressions) to order by. If we provide more than one column name, each additional column will be used to break ties in the values of preceding columns:

  # Arrange the data frame arranging GENDER and AGE with descending order of the variable UCREATININE
  a <- arrange(HANES, GENDER, SPAGE, desc(UCREATININE))
  atib <- as.tibble(a)
  atib

## # A tibble: 1,112 x 23
##        KEY GENDER SPAGE AGEGROUP     HSQ_1 UCREATININE UALBUMIN
##     <fctr> <fctr> <int>   <fctr>    <fctr>       <int>    <dbl>
##  1 134310B      M    20    20-39 Excellent         612   26.000
##  2 550030A      M    20    20-39 Very Good         217   12.000
##  3 220620A      M    20    20-39 Very Good         198   15.000
##  4 333190A      M    20    20-39      Good         153    5.000
##  5 443500A      M    20    20-39 Very Good         148    1.000
##  6 140330B      M    20    20-39 Very Good         132    3.000
##  7 447750A      M    20    20-39 Excellent         131    3.000
##  8 223980B      M    21    20-39 Excellent         269    1.000
##  9 223930B      M    21    20-39 Very Good         205    0.707
## 10 328780B      M    21    20-39 Very Good         144   25.000
## # ... with 1,102 more rows, and 16 more variables: UACR <dbl>,
## #   MERCURYU <dbl>, DX_DBTS <fctr>, A1C <dbl>, CADMIUM <dbl>, LEAD <dbl>,
## #   MERCURYTOTALBLOOD <dbl>, HDL <int>, CHOLESTEROLTOTAL <int>,
## #   GLUCOSESI <dbl>, CREATININESI <dbl>, CREATININE <dbl>,
## #   TRIGLYCERIDE <int>, GLUCOSE <int>, COTININE <dbl>, LDLESTIMATE <int>

Note: Missing values are always sorted at the end.

To view the last few entries of the data frame/tibble, one can use the tail() function.

  tail(atib)

## # A tibble: 6 x 23
##       KEY GENDER SPAGE AGEGROUP  HSQ_1 UCREATININE UALBUMIN        UACR
##    <fctr> <fctr> <int>   <fctr> <fctr>       <int>    <dbl>       <dbl>
## 1 332900A      F    88      60+   Fair         107   11.000 10.00000000
## 2 224220A      F    88      60+   Fair          89    8.000  9.00000000
## 3 446510A      F    88      60+   Fair          60    0.707  0.01178333
## 4 138450B      F    89      60+   Good         181    7.000  3.87000000
## 5 442720A      F    90      60+   Good         309   59.000 19.00000000
## 6 140950A      F    92      60+   Good          73   15.000 20.00000000
## # ... with 15 more variables: MERCURYU <dbl>, DX_DBTS <fctr>, A1C <dbl>,
## #   CADMIUM <dbl>, LEAD <dbl>, MERCURYTOTALBLOOD <dbl>, HDL <int>,
## #   CHOLESTEROLTOTAL <int>, GLUCOSESI <dbl>, CREATININESI <dbl>,
## #   CREATININE <dbl>, TRIGLYCERIDE <int>, GLUCOSE <int>, COTININE <dbl>,
## #   LDLESTIMATE <int>

Classwork/Homework:

Sort HANES to find patients with highest cholesterol. Do the same to find patients with least cholesterol.
Sort HANES to find pateints with high A1C.

The `select()` function

The orginal HANES data set has \(704\) variables/columns. It is common for real world data sets to be large and we may want to quickly zoom into the variables of interest. The select() function allows us to do that. Here is a code to select only A1C and GLUCOSE variables:

  # Select only the A1C and GLUCOSE variables from the HANES data
  s <- select(HANES, A1C, GLUCOSE)
  as.tibble(s)

## # A tibble: 1,112 x 2
##      A1C GLUCOSE
##  * <dbl>   <int>
##  1   5.0      83
##  2   5.2      86
##  3   4.8      93
##  4   5.1      90
##  5   4.3      92
##  6   5.2      85
##  7   4.8      72
##  8   5.2      87
##  9   4.8      96
## 10   5.2      92
## # ... with 1,102 more rows

An useful helper arument is the everything() function that allows keeping all the variables and pushing the variables of interest in the front.

  # Push the A1C and GLUCOSE variables to the front keeping everything
  s <- select(HANES, A1C, GLUCOSE, everything())
  as.tibble(s)

## # A tibble: 1,112 x 23
##      A1C GLUCOSE     KEY GENDER SPAGE AGEGROUP     HSQ_1 UCREATININE
##  * <dbl>   <int>  <fctr> <fctr> <int>   <fctr>    <fctr>       <int>
##  1   5.0      83 134040A      M    29    20-39 Very Good         105
##  2   5.2      86 134460B      M    28    20-39 Very Good          53
##  3   4.8      93 134490A      M    27    20-39 Very Good         314
##  4   5.1      90 134620A      M    24    20-39 Excellent         105
##  5   4.3      92 134670A      M    30    20-39 Excellent         163
##  6   5.2      85 135210B      M    26    20-39      Good         150
##  7   4.8      72 135420A      M    31    20-39 Excellent          46
##  8   5.2      87 135740A      M    32    20-39 Very Good          36
##  9   4.8      96 136020A      M    34    20-39 Excellent         177
## 10   5.2      92 136250A      M    32    20-39      Good         156
## # ... with 1,102 more rows, and 15 more variables: UALBUMIN <dbl>,
## #   UACR <dbl>, MERCURYU <dbl>, DX_DBTS <fctr>, CADMIUM <dbl>, LEAD <dbl>,
## #   MERCURYTOTALBLOOD <dbl>, HDL <int>, CHOLESTEROLTOTAL <int>,
## #   GLUCOSESI <dbl>, CREATININESI <dbl>, CREATININE <dbl>,
## #   TRIGLYCERIDE <int>, COTININE <dbl>, LDLESTIMATE <int>

The `mutate()` function

We can add new columns to the data set using the mutate() function. For example, this code makes a new variable that is the sum of HDL and LDL estimate, stores it in a variable “H_L” to find the ratio between cholesterol and this new variable, H_L:

   # Select only the variables LDLESTIMATE,HDL,CHOLESTEROLTOTAL,TRIGLYCERIDE
   HANES_sub <- select(HANES,LDLESTIMATE,HDL,CHOLESTEROLTOTAL,TRIGLYCERIDE)
   # Add a new variable H_L that is a sum of HDL and LDL
   m <- mutate(HANES_sub, H_L = HDL+LDLESTIMATE, ratio=CHOLESTEROLTOTAL/H_L)
   as.tibble(m)

## # A tibble: 1,112 x 6
##    LDLESTIMATE   HDL CHOLESTEROLTOTAL TRIGLYCERIDE   H_L    ratio
##          <int> <int>            <int>        <int> <int>    <dbl>
##  1         111    42              184          156   153 1.202614
##  2          97    51              157           43   148 1.060811
##  3          81    42              145          108   123 1.178862
##  4         132    61              206           65   193 1.067358
##  5          58    52              120           51   110 1.090909
##  6          99    50              155           29   149 1.040268
##  7          93    57              156           31   150 1.040000
##  8         135    56              235          220   191 1.230366
##  9          98    42              156           82   140 1.114286
## 10          69    44              120           35   113 1.061947
## # ... with 1,102 more rows

The transmute() function retains only the created variables.

   # Select only the variables LDLESTIMATE,HDL,CHOLESTEROLTOTAL,TRIGLYCERIDE
   HANES_sub <- select(HANES,LDLESTIMATE,HDL,CHOLESTEROLTOTAL,TRIGLYCERIDE)
   # Add a new variable H_L that is a sum of HDL and LDL
   m <- transmute(HANES_sub, H_L = HDL+LDLESTIMATE, ratio=CHOLESTEROLTOTAL/H_L)
   as.tibble(m)

## # A tibble: 1,112 x 2
##      H_L    ratio
##    <int>    <dbl>
##  1   153 1.202614
##  2   148 1.060811
##  3   123 1.178862
##  4   193 1.067358
##  5   110 1.090909
##  6   149 1.040268
##  7   150 1.040000
##  8   191 1.230366
##  9   140 1.114286
## 10   113 1.061947
## # ... with 1,102 more rows

There are several useful functions for data manipulation such as lag(), lead(), cumsum(), min_rank() etc. and wrappers around them. One of them is the top_n() function that is made out of the filter() and min_rank() functions that would list the top \(n\) values from the data. For example, here is a code to select the top \(5\) values in the above ratio.

  # Select the top 5 ratios in the above transmuted data frame m
  top_5_ratio <-  top_n(m, 5, ratio)
  top_5_ratio

##   H_L    ratio
## 1 121 1.570248
## 2 141 1.588652
## 3 107 1.654206
## 4 121 1.586777
## 5  85 1.752941

The `group_by()` and `summarise()` functions

The group_by() is an extremely useful function that facililates grouped analysis and along with summarise() they can provide vital statistics for a group. Let us say we want to find the mean cholesterol level for each gender, the group_by() and summarise() operation can be extremely handy.

  # Group the HANES data by GENDER 
  HANES_by_gender <- group_by(HANES, GENDER)
  # Find the mean cholesterol level for the two GENDERS
  summarise(HANES_by_gender, mean_cholesterol = mean(CHOLESTEROLTOTAL))

## # A tibble: 2 x 2
##   GENDER mean_cholesterol
##   <fctr>            <dbl>
## 1      M         183.6392
## 2      F         187.3903

We find the mean cholesterol levels for females are little higher than males.

The summarise() has a useful function called the count() that will count the number of items appearing in a group. For example, here is an analysis that shows the relationship between HDL and LDL for each age, grouped by the number of patients with the particular age.

  # First group the patients by age
  by_AGE <- group_by(HANES, SPAGE)
  # Summarize by counting the number of people in that age, and finding the mean HDL and LDL values
  DL <- summarise(by_AGE,
                   count = n(),
                   H = mean(HDL),
                   L = mean(LDLESTIMATE))
  # We remove the groups with five people or less
  DL <- filter(DL,count > 5)
  # Plot this data, with point for count estimates and smoothing it by line
  ggplot(data = DL, mapping = aes(x = H, y = L)) +
    geom_point(aes(size = count), alpha = 1/3) +
    geom_smooth(se = FALSE)

The big disadvantage in the above code is the necessity to re-type the variables again and again, which can be eliminated through the use of %>% symbol. As the analysis becomes complicated, piping becomes extremely useful. The following code is the piped version of the above code producing the exact same output.

  # Pipe the HANES data to the group_by function, grouping by age
  DL <- HANES %>% group_by(SPAGE) %>% 
  # and pipe this to summarize function
  summarise(count = n(),
            H = mean(HDL),
            L = mean(LDLESTIMATE)) %>% 
  # And pipe this result to filter the count < 5
  filter(count > 5)
  # Plot this data, with point for count estimates and smoothing it by line
  ggplot(data = DL, mapping = aes(x = H, y = L)) +
    geom_point(aes(size = count), alpha = 1/3) +
    geom_smooth(se = FALSE)

Missing values

The use of the function na.rm() inside summarise can be extremely useful as this example shows:

(Note: Since we already removed ‘NA’ from our dataset, we import the new data with NA in it)

  # Load the package RCurl
  library(RCurl)
  # Import the HANES data set from GitHub; break the string into two for readability
  # (Please note this readability aspect very carefully)
  URL_text_1 <- "https://raw.githubusercontent.com/kannan-kasthuri/kannan-kasthuri.github.io"
  URL_text_2 <- "/master/Datasets/HANES/NYC_HANES_DIAB.csv"
  # Paste it to constitute a single URL 
  URL <- paste(URL_text_1,URL_text_2, sep="")
  HANES_with_NA <- read.csv(text=getURL(URL))

Without removing NA

  # Group by AGE and GENDER and
  HANES_with_NA %>% 
  group_by(SPAGE,GENDER) %>% 
  # summarize the mean of cholesterol for each group without removing NA
  summarise(mean = mean(CHOLESTEROLTOTAL))

## # A tibble: 143 x 3
## # Groups:   SPAGE [?]
##    SPAGE GENDER     mean
##    <int>  <int>    <dbl>
##  1    20      1       NA
##  2    20      2       NA
##  3    21      1       NA
##  4    21      2       NA
##  5    22      1 153.4286
##  6    22      2       NA
##  7    23      1       NA
##  8    23      2       NA
##  9    24      1       NA
## 10    24      2       NA
## # ... with 133 more rows

Removing NA

  # Group by AGE and GENDER and 
  HANES_with_NA %>% 
  group_by(SPAGE,GENDER) %>% 
  # summarize the mean of cholesterol for each group by removing NA
  summarise(mean = mean(CHOLESTEROLTOTAL, na.rm=TRUE))

## # A tibble: 143 x 3
## # Groups:   SPAGE [?]
##    SPAGE GENDER     mean
##    <int>  <int>    <dbl>
##  1    20      1 165.1250
##  2    20      2 157.3333
##  3    21      1 151.8000
##  4    21      2 154.2222
##  5    22      1 153.4286
##  6    22      2 167.2222
##  7    23      1 165.9286
##  8    23      2 169.7000
##  9    24      1 169.4286
## 10    24      2 173.9500
## # ... with 133 more rows

We can also filter through the missing values first -

  # Filter the data and get no NA values 
  HANES_with_NA %>% filter(!is.na(CHOLESTEROLTOTAL)) %>%
  # group by AGE and GENDER 
  group_by(SPAGE,GENDER) %>% 
  # and find the mean of cholesterol for each group
  summarise(mean = mean(CHOLESTEROLTOTAL, na.rm=TRUE))

## # A tibble: 140 x 3
## # Groups:   SPAGE [?]
##    SPAGE GENDER     mean
##    <int>  <int>    <dbl>
##  1    20      1 165.1250
##  2    20      2 157.3333
##  3    21      1 151.8000
##  4    21      2 154.2222
##  5    22      1 153.4286
##  6    22      2 167.2222
##  7    23      1 165.9286
##  8    23      2 169.7000
##  9    24      1 169.4286
## 10    24      2 173.9500
## # ... with 130 more rows

Whenever we do aggregations using group_by() it is good to include count information that will give the information on the distribution and extreme values.

  # Group by AGE, GENDER
  cholesterol <- HANES %>% 
  group_by(SPAGE,GENDER) %>%
  # and summarise mean cholesterol
  summarise(mean_cholesterol = mean(CHOLESTEROLTOTAL))

  # We can then plot the count information
  ggplot(data = cholesterol, mapping = aes(x = mean_cholesterol)) + 
  geom_freqpoly(binwidth = 10)

This reveals that there are patients with extremely high cholesterol (\(300\) or more). How does the cholesterol vary over the age group? We can get more insight if we draw a plot of the ratio HDL/LDL and cholesterol for each agegroup:

  # For each AGEGROUP find the mean H_L ratio and cholesterol information
  CHO_L_D <- HANES %>% 
  group_by(AGEGROUP) %>% 
  summarise(H_L = mean(HDL) / mean(LDLESTIMATE),
            CHO = mean(CHOLESTEROLTOTAL))

  # Plot this information 
  CHO_L_D %>% 
  ggplot(mapping = aes(x = H_L, y = CHO)) +
  geom_point() + 
  geom_smooth(se = FALSE)

This reveals the cholesterol peaks at the age group \(40-59\) and remains almost stable, while the age group \(20-39\) has the least cholesterol.

There are several useful summary functions:

Example 1: Finding the mean of CHOLESTEROLTOTAL and the mean of CHOLESTEROLTOTAL if GLUCOSE >100 for each age group.

  DL <- HANES %>% 
    # For each AGEGROUP,
    group_by(AGEGROUP) %>%
    # find the mean of CHOLESTEROLTOTAL and the mean of CHOLESTEROLTOTAL when GLUCOSE content is > 100
    summarise(count = n(),
              H = mean(CHOLESTEROLTOTAL),
              H_glu = mean(CHOLESTEROLTOTAL[GLUCOSE>100])) 
  # And print it
  DL

## # A tibble: 3 x 4
##   AGEGROUP count        H    H_glu
##     <fctr> <int>    <dbl>    <dbl>
## 1    20-39   498 176.2209 182.6667
## 2    40-59   391 194.2199 188.9057
## 3      60+   223 192.3857 186.2857

Example 2: Finding the standard deviation of CHOLESTEROLTOTAL by age group

  DL <- HANES %>% 
  # For each AGEGROUP,
  group_by(AGEGROUP) %>% 
  # sumamarise the standard deviation of CHOLESTEROLTOTAL and arrange in descending order
  summarise(CHO_sd = sd(CHOLESTEROLTOTAL)) %>% 
            arrange(desc(CHO_sd))
  # and print it
  DL

## # A tibble: 3 x 2
##   AGEGROUP   CHO_sd
##     <fctr>    <dbl>
## 1      60+ 41.75793
## 2    40-59 36.72472
## 3    20-39 34.19996

Example 3: Finding largest and smallest CHOLESTEROLTOTAL for each diabetes diagnosis group.

  DL <- HANES %>% 
  # For each DX_DBTS,
  group_by(DX_DBTS) %>% 
  # find the largest and smallest CHOLESTEROLTOTAL value
  summarise(Lartgest_CHO = max(CHOLESTEROLTOTAL), Smallest_CHO=min(CHOLESTEROLTOTAL)) 
  # and print it
  DL

## # A tibble: 3 x 3
##      DX_DBTS Lartgest_CHO Smallest_CHO
##       <fctr>        <dbl>        <dbl>
## 1       DIAB          315          105
## 2 DIAB NO_DX          246          132
## 3    NO DIAB          335           95

Example 4: Finding largest and smallest CHOLESTEROLTOTAL for each diabetes diagnosis group and by age group.

  DL <- HANES %>% 
  # For each DX_DBTS and AGEGROUP,
  group_by(DX_DBTS, AGEGROUP) %>% 
  # find the largest and smallest CHOLESTEROLTOTAL value
  summarise(Lartgest_CHO = max(CHOLESTEROLTOTAL), Smallest_CHO=min(CHOLESTEROLTOTAL)) 
  # and print it
  DL

## # A tibble: 9 x 4
## # Groups:   DX_DBTS [?]
##      DX_DBTS AGEGROUP Lartgest_CHO Smallest_CHO
##       <fctr>   <fctr>        <dbl>        <dbl>
## 1       DIAB    20-39          228          138
## 2       DIAB    40-59          315          112
## 3       DIAB      60+          304          105
## 4 DIAB NO_DX    20-39          221          157
## 5 DIAB NO_DX    40-59          238          141
## 6 DIAB NO_DX      60+          246          132
## 7    NO DIAB    20-39          308           95
## 8    NO DIAB    40-59          335           98
## 9    NO DIAB      60+          313           97

Grouped transformations

Grouping variables is extremely useful when used with the summarise() function, but it can also be used with other operations such as mutate() and filter().

Examples:

Among the diabetes diagnostic within age groups, find the members (top 3 ranks) with high glucose levels.

  # Group by diabetes status and age group,
   HANES %>% group_by(DX_DBTS, AGEGROUP) %>% 
  # filter for the top 3 ranks
   filter(rank(desc(GLUCOSE)) < 3) %>% 
  # select only the relevant variables
  select(DX_DBTS,AGEGROUP,GLUCOSE) %>% 
  # and arrange in descending order of glucose levels
  arrange(desc(GLUCOSE))

## # A tibble: 19 x 3
## # Groups:   DX_DBTS, AGEGROUP [9]
##       DX_DBTS AGEGROUP GLUCOSE
##        <fctr>   <fctr>   <int>
##  1       DIAB      60+     490
##  2       DIAB    40-59     320
##  3       DIAB    40-59     297
##  4       DIAB      60+     281
##  5 DIAB NO_DX    40-59     219
##  6 DIAB NO_DX    40-59     214
##  7 DIAB NO_DX    40-59     214
##  8       DIAB    20-39     212
##  9       DIAB    20-39     186
## 10 DIAB NO_DX      60+     178
## 11 DIAB NO_DX      60+     171
## 12    NO DIAB    40-59     159
## 13    NO DIAB    20-39     149
## 14    NO DIAB      60+     147
## 15 DIAB NO_DX    20-39     140
## 16    NO DIAB      60+     136
## 17    NO DIAB    40-59     125
## 18    NO DIAB    20-39     120
## 19 DIAB NO_DX    20-39      87

Find age group (not the AGEGROUP variable) greater than a threshold (say, \(30\)) and standardise A1C to compute per group metric.

  # Group by SPAGE and filter groups that contain more than 30 individuals
  AG_GT_30 <- HANES %>% group_by(SPAGE) %>%
                        filter(n() > 30)

  # Filter for all records with A1C variable > 0 to avoid NaNs,
  AG_GT_30 %>% filter(A1C > 0) %>%
  # standardise A1C dividing by the total sum of A1C in the group
  mutate(standardised_A1C = A1C / sum(A1C)) %>%
  # and seelct only A1C and standardised A1C
  select(A1C,standardised_A1C)

## # A tibble: 226 x 3
## # Groups:   SPAGE [7]
##    SPAGE   A1C standardised_A1C
##    <int> <dbl>            <dbl>
##  1    29   5.0       0.03203075
##  2    28   5.2       0.03121248
##  3    27   4.8       0.03009404
##  4    24   5.1       0.03240152
##  5    30   4.3       0.02426637
##  6    30   5.3       0.02990971
##  7    28   4.7       0.02821128
##  8    30   4.7       0.02652370
##  9    23   4.9       0.02909739
## 10    30   5.2       0.02934537
## # ... with 216 more rows

Selected materials and references

R for Data Science - Data Transformation Part

Data science fundamentals 02: Transform and explore (Sept. 26, 2017)

Recorded Stream

The filter() function

The arrange() function

The select() function

The mutate() function

The group_by() and summarise() functions