We will work with HANES data set.
In the spirit of the data transformation section in R for Data Science, we will look into dplyr()
functions.
As usual, we will first import the curated HANES data set in RStudio:
# Load the package RCurl
library(RCurl)
# Import the HANES data set from GitHub; break the string into two for readability
# (Please note this readability aspect very carefully)
URL_text_1 <- "https://raw.githubusercontent.com/kannan-kasthuri/kannan-kasthuri.github.io"
URL_text_2 <- "/master/Datasets/HANES/NYC_HANES_DIAB.csv"
# Paste it to constitute a single URL
URL <- paste(URL_text_1,URL_text_2, sep="")
HANES <- read.csv(text=getURL(URL))
# Rename the GENDER factor for identification
HANES$GENDER <- factor(HANES$GENDER, labels=c("M","F"))
# Rename the AGEGROUP factor for identification
HANES$AGEGROUP <- factor(HANES$AGEGROUP, labels=c("20-39","40-59","60+"))
# Rename the HSQ_1 factor for identification
HANES$HSQ_1 <- factor(HANES$HSQ_1, labels=c("Excellent","Very Good","Good", "Fair", "Poor"))
# Rename the DX_DBTS as a factor
HANES$DX_DBTS <- factor(HANES$DX_DBTS, labels=c("DIAB","DIAB NO_DX","NO DIAB"))
# Omit all NA from the data frame
HANES <- na.omit(HANES)
# Observe the structure
str(HANES)
## 'data.frame': 1112 obs. of 23 variables:
## $ KEY : Factor w/ 1527 levels "133370A","133370B",..: 28 43 44 53 55 70 84 90 100 107 ...
## $ GENDER : Factor w/ 2 levels "M","F": 1 1 1 1 1 1 1 1 1 1 ...
## $ SPAGE : int 29 28 27 24 30 26 31 32 34 32 ...
## $ AGEGROUP : Factor w/ 3 levels "20-39","40-59",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ HSQ_1 : Factor w/ 5 levels "Excellent","Very Good",..: 2 2 2 1 1 3 1 2 1 3 ...
## $ UCREATININE : int 105 53 314 105 163 150 46 36 177 156 ...
## $ UALBUMIN : num 0.707 1 8 4 3 2 2 0.707 4 3 ...
## $ UACR : num 0.00673 2 3 4 2 ...
## $ MERCURYU : num 0.37 0.106 0.487 2.205 0.979 ...
## $ DX_DBTS : Factor w/ 3 levels "DIAB","DIAB NO_DX",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ A1C : num 5 5.2 4.8 5.1 4.3 5.2 4.8 5.2 4.8 5.2 ...
## $ CADMIUM : num 0.2412 0.1732 0.0644 0.0929 0.1202 ...
## $ LEAD : num 1.454 1.019 0.863 1.243 0.612 ...
## $ MERCURYTOTALBLOOD: num 2.34 2.57 1.32 14.66 2.13 ...
## $ HDL : int 42 51 42 61 52 50 57 56 42 44 ...
## $ CHOLESTEROLTOTAL : int 184 157 145 206 120 155 156 235 156 120 ...
## $ GLUCOSESI : num 4.61 4.77 5.16 5 5.11 ...
## $ CREATININESI : num 74.3 73 80 84.9 66 ...
## $ CREATININE : num 0.84 0.83 0.91 0.96 0.75 0.99 0.9 0.84 0.93 1.09 ...
## $ TRIGLYCERIDE : int 156 43 108 65 51 29 31 220 82 35 ...
## $ GLUCOSE : int 83 86 93 90 92 85 72 87 96 92 ...
## $ COTININE : num 31.5918 0.0635 0.035 0.0514 0.035 ...
## $ LDLESTIMATE : int 111 97 81 132 58 99 93 135 98 69 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:415] 2 15 16 24 26 28 33 34 35 39 ...
## .. ..- attr(*, "names")= chr [1:415] "2" "15" "16" "24" ...
We will see more about tibbles in the wrangle section but we will make a short note of it now.
A tibble, or tbl_df
, is a modern reimagining of the data frame, keeping what time has proven to be effective, and throwing out what is not.
Tibbles are data.frames that are lazy and surly: they do less (i.e. they don’t change variable names or types, and don’t do partial matching) and complain more (e.g. when a variable does not exist).
This forces one to confront problems earlier, typically leading to cleaner, more expressive code.
Tibbles also have an enhanced print method()
which makes them easier to use with large datasets containing complex objects.
To convert a data frame to a tibble, we can use the function as.tibble(df)
where df is a data frame.
We will convert the HANES data frame into tibble and use the tibble from now on.
# Load the tidyverse library
library(tidyverse)
# Convert HANES data frame into a tibble and observe it
HANES_TIB <- as.tibble(HANES)
HANES_TIB
## # A tibble: 1,112 x 23
## KEY GENDER SPAGE AGEGROUP HSQ_1 UCREATININE UALBUMIN
## * <fctr> <fctr> <int> <fctr> <fctr> <int> <dbl>
## 1 134040A M 29 20-39 Very Good 105 0.707
## 2 134460B M 28 20-39 Very Good 53 1.000
## 3 134490A M 27 20-39 Very Good 314 8.000
## 4 134620A M 24 20-39 Excellent 105 4.000
## 5 134670A M 30 20-39 Excellent 163 3.000
## 6 135210B M 26 20-39 Good 150 2.000
## 7 135420A M 31 20-39 Excellent 46 2.000
## 8 135740A M 32 20-39 Very Good 36 0.707
## 9 136020A M 34 20-39 Excellent 177 4.000
## 10 136250A M 32 20-39 Good 156 3.000
## # ... with 1,102 more rows, and 16 more variables: UACR <dbl>,
## # MERCURYU <dbl>, DX_DBTS <fctr>, A1C <dbl>, CADMIUM <dbl>, LEAD <dbl>,
## # MERCURYTOTALBLOOD <dbl>, HDL <int>, CHOLESTEROLTOTAL <int>,
## # GLUCOSESI <dbl>, CREATININESI <dbl>, CREATININE <dbl>,
## # TRIGLYCERIDE <int>, GLUCOSE <int>, COTININE <dbl>, LDLESTIMATE <int>
Tranformation basics are six key dplyr
library functions that will allow us to solve vast majority of data manipulation challenges:
We can pick observations by their values using the function filter()
.
Reorder the rows applying arrange()
operation to a data set.
We can choose variables by their names using select()
.
Create new variables through mutate()
.
Find the summary using summarise()
function.
And the most important of all which is the group_by()
operation which can be used with any of the above five.
filter()
functionThe filter function allows us to pick a subset of information (or rows) from our dataset.
# Pick all the records of patients with age == 45 and HDL value of 50
filter(HANES, SPAGE == 45, HDL == 50)
## KEY GENDER SPAGE AGEGROUP HSQ_1 UCREATININE UALBUMIN UACR MERCURYU
## 1 329120A M 45 40-59 Good 31 2 6 0.1822157
## 2 329270B F 45 40-59 Good 150 5 3 0.5092553
## DX_DBTS A1C CADMIUM LEAD MERCURYTOTALBLOOD HDL CHOLESTEROLTOTAL
## 1 NO DIAB 4.8 0.9012 3.9006 3.2254 50 192
## 2 DIAB 6.8 0.1798 0.8202 0.8664 50 144
## GLUCOSESI CREATININESI CREATININE TRIGLYCERIDE GLUCOSE COTININE
## 1 4.72 92 1.04 108 85 0.0757
## 2 9.05 57 0.64 133 163 0.3407
## LDLESTIMATE
## 1 120
## 2 67
Note: All dplyr
functions write the result to a new data frame. If we want to store the data, we need to assign the result to a variable.
We can also use comparison and logical operators. For instance,
# Pick all the records of patients who are between 45 and 46 and HDL value of 50
filter(HANES, SPAGE >= 45 & SPAGE <= 46, HDL == 50)
## KEY GENDER SPAGE AGEGROUP HSQ_1 UCREATININE UALBUMIN UACR MERCURYU
## 1 329120A M 45 40-59 Good 31 2 6 0.1822157
## 2 550110A F 46 40-59 Good 54 3 6 0.3860837
## 3 329270B F 45 40-59 Good 150 5 3 0.5092553
## DX_DBTS A1C CADMIUM LEAD MERCURYTOTALBLOOD HDL CHOLESTEROLTOTAL
## 1 NO DIAB 4.8 0.9012 3.9006 3.2254 50 192
## 2 NO DIAB 5.1 0.3584 0.9674 1.2028 50 188
## 3 DIAB 6.8 0.1798 0.8202 0.8664 50 144
## GLUCOSESI CREATININESI CREATININE TRIGLYCERIDE GLUCOSE COTININE
## 1 4.72 92 1.04 108 85 0.0757
## 2 5.16 59 0.67 77 93 0.0350
## 3 9.05 57 0.64 133 163 0.3407
## LDLESTIMATE
## 1 120
## 2 123
## 3 67
will list all the records for patients with age between \(45\) and \(46\) and who have the HDL value of \(50\).
Note: filter()
only includes rows where the condition is TRUE; it excludes both FALSE and NA values.
Classwork/Homework:
arrange()
functionarrange(
) works similar to filter function except that instead of selecting rows, it changes their order.
It takes a data frame and a set of column names (or more complicated expressions) to order by.
If we provide more than one column name, each additional column will be used to break ties in the values of preceding columns:
# Arrange the data frame arranging GENDER and AGE with descending order of the variable UCREATININE
a <- arrange(HANES, GENDER, SPAGE, desc(UCREATININE))
atib <- as.tibble(a)
atib
## # A tibble: 1,112 x 23
## KEY GENDER SPAGE AGEGROUP HSQ_1 UCREATININE UALBUMIN
## <fctr> <fctr> <int> <fctr> <fctr> <int> <dbl>
## 1 134310B M 20 20-39 Excellent 612 26.000
## 2 550030A M 20 20-39 Very Good 217 12.000
## 3 220620A M 20 20-39 Very Good 198 15.000
## 4 333190A M 20 20-39 Good 153 5.000
## 5 443500A M 20 20-39 Very Good 148 1.000
## 6 140330B M 20 20-39 Very Good 132 3.000
## 7 447750A M 20 20-39 Excellent 131 3.000
## 8 223980B M 21 20-39 Excellent 269 1.000
## 9 223930B M 21 20-39 Very Good 205 0.707
## 10 328780B M 21 20-39 Very Good 144 25.000
## # ... with 1,102 more rows, and 16 more variables: UACR <dbl>,
## # MERCURYU <dbl>, DX_DBTS <fctr>, A1C <dbl>, CADMIUM <dbl>, LEAD <dbl>,
## # MERCURYTOTALBLOOD <dbl>, HDL <int>, CHOLESTEROLTOTAL <int>,
## # GLUCOSESI <dbl>, CREATININESI <dbl>, CREATININE <dbl>,
## # TRIGLYCERIDE <int>, GLUCOSE <int>, COTININE <dbl>, LDLESTIMATE <int>
Note: Missing values are always sorted at the end.
To view the last few entries of the data frame/tibble, one can use the tail()
function.
tail(atib)
## # A tibble: 6 x 23
## KEY GENDER SPAGE AGEGROUP HSQ_1 UCREATININE UALBUMIN UACR
## <fctr> <fctr> <int> <fctr> <fctr> <int> <dbl> <dbl>
## 1 332900A F 88 60+ Fair 107 11.000 10.00000000
## 2 224220A F 88 60+ Fair 89 8.000 9.00000000
## 3 446510A F 88 60+ Fair 60 0.707 0.01178333
## 4 138450B F 89 60+ Good 181 7.000 3.87000000
## 5 442720A F 90 60+ Good 309 59.000 19.00000000
## 6 140950A F 92 60+ Good 73 15.000 20.00000000
## # ... with 15 more variables: MERCURYU <dbl>, DX_DBTS <fctr>, A1C <dbl>,
## # CADMIUM <dbl>, LEAD <dbl>, MERCURYTOTALBLOOD <dbl>, HDL <int>,
## # CHOLESTEROLTOTAL <int>, GLUCOSESI <dbl>, CREATININESI <dbl>,
## # CREATININE <dbl>, TRIGLYCERIDE <int>, GLUCOSE <int>, COTININE <dbl>,
## # LDLESTIMATE <int>
Classwork/Homework:
Sort HANES to find patients with highest cholesterol. Do the same to find patients with least cholesterol.
Sort HANES to find pateints with high A1C.
select()
functionThe orginal HANES data set has \(704\) variables/columns.
It is common for real world data sets to be large and we may want to quickly zoom into the variables of interest.
The select()
function allows us to do that. Here is a code to select only A1C and GLUCOSE variables:
# Select only the A1C and GLUCOSE variables from the HANES data
s <- select(HANES, A1C, GLUCOSE)
as.tibble(s)
## # A tibble: 1,112 x 2
## A1C GLUCOSE
## * <dbl> <int>
## 1 5.0 83
## 2 5.2 86
## 3 4.8 93
## 4 5.1 90
## 5 4.3 92
## 6 5.2 85
## 7 4.8 72
## 8 5.2 87
## 9 4.8 96
## 10 5.2 92
## # ... with 1,102 more rows
An useful helper arument is the everything()
function that allows keeping all the variables and pushing the variables of interest in the front.
# Push the A1C and GLUCOSE variables to the front keeping everything
s <- select(HANES, A1C, GLUCOSE, everything())
as.tibble(s)
## # A tibble: 1,112 x 23
## A1C GLUCOSE KEY GENDER SPAGE AGEGROUP HSQ_1 UCREATININE
## * <dbl> <int> <fctr> <fctr> <int> <fctr> <fctr> <int>
## 1 5.0 83 134040A M 29 20-39 Very Good 105
## 2 5.2 86 134460B M 28 20-39 Very Good 53
## 3 4.8 93 134490A M 27 20-39 Very Good 314
## 4 5.1 90 134620A M 24 20-39 Excellent 105
## 5 4.3 92 134670A M 30 20-39 Excellent 163
## 6 5.2 85 135210B M 26 20-39 Good 150
## 7 4.8 72 135420A M 31 20-39 Excellent 46
## 8 5.2 87 135740A M 32 20-39 Very Good 36
## 9 4.8 96 136020A M 34 20-39 Excellent 177
## 10 5.2 92 136250A M 32 20-39 Good 156
## # ... with 1,102 more rows, and 15 more variables: UALBUMIN <dbl>,
## # UACR <dbl>, MERCURYU <dbl>, DX_DBTS <fctr>, CADMIUM <dbl>, LEAD <dbl>,
## # MERCURYTOTALBLOOD <dbl>, HDL <int>, CHOLESTEROLTOTAL <int>,
## # GLUCOSESI <dbl>, CREATININESI <dbl>, CREATININE <dbl>,
## # TRIGLYCERIDE <int>, COTININE <dbl>, LDLESTIMATE <int>
mutate()
functionWe can add new columns to the data set using the mutate()
function.
For example, this code makes a new variable that is the sum of HDL and LDL estimate, stores it in a variable “H_L” to find the ratio between cholesterol and this new variable, H_L:
# Select only the variables LDLESTIMATE,HDL,CHOLESTEROLTOTAL,TRIGLYCERIDE
HANES_sub <- select(HANES,LDLESTIMATE,HDL,CHOLESTEROLTOTAL,TRIGLYCERIDE)
# Add a new variable H_L that is a sum of HDL and LDL
m <- mutate(HANES_sub, H_L = HDL+LDLESTIMATE, ratio=CHOLESTEROLTOTAL/H_L)
as.tibble(m)
## # A tibble: 1,112 x 6
## LDLESTIMATE HDL CHOLESTEROLTOTAL TRIGLYCERIDE H_L ratio
## <int> <int> <int> <int> <int> <dbl>
## 1 111 42 184 156 153 1.202614
## 2 97 51 157 43 148 1.060811
## 3 81 42 145 108 123 1.178862
## 4 132 61 206 65 193 1.067358
## 5 58 52 120 51 110 1.090909
## 6 99 50 155 29 149 1.040268
## 7 93 57 156 31 150 1.040000
## 8 135 56 235 220 191 1.230366
## 9 98 42 156 82 140 1.114286
## 10 69 44 120 35 113 1.061947
## # ... with 1,102 more rows
The transmute()
function retains only the created variables.
# Select only the variables LDLESTIMATE,HDL,CHOLESTEROLTOTAL,TRIGLYCERIDE
HANES_sub <- select(HANES,LDLESTIMATE,HDL,CHOLESTEROLTOTAL,TRIGLYCERIDE)
# Add a new variable H_L that is a sum of HDL and LDL
m <- transmute(HANES_sub, H_L = HDL+LDLESTIMATE, ratio=CHOLESTEROLTOTAL/H_L)
as.tibble(m)
## # A tibble: 1,112 x 2
## H_L ratio
## <int> <dbl>
## 1 153 1.202614
## 2 148 1.060811
## 3 123 1.178862
## 4 193 1.067358
## 5 110 1.090909
## 6 149 1.040268
## 7 150 1.040000
## 8 191 1.230366
## 9 140 1.114286
## 10 113 1.061947
## # ... with 1,102 more rows
There are several useful functions for data manipulation such as lag()
, lead()
, cumsum()
, min_rank()
etc. and wrappers around them.
One of them is the top_n()
function that is made out of the filter()
and min_rank()
functions that would list the top \(n\) values from the data.
For example, here is a code to select the top \(5\) values in the above ratio.
# Select the top 5 ratios in the above transmuted data frame m
top_5_ratio <- top_n(m, 5, ratio)
top_5_ratio
## H_L ratio
## 1 121 1.570248
## 2 141 1.588652
## 3 107 1.654206
## 4 121 1.586777
## 5 85 1.752941
group_by()
and summarise()
functionsThe group_by()
is an extremely useful function that facililates grouped analysis and along with summarise()
they can provide vital statistics for a group.
Let us say we want to find the mean cholesterol level for each gender, the group_by()
and summarise()
operation can be extremely handy.
# Group the HANES data by GENDER
HANES_by_gender <- group_by(HANES, GENDER)
# Find the mean cholesterol level for the two GENDERS
summarise(HANES_by_gender, mean_cholesterol = mean(CHOLESTEROLTOTAL))
## # A tibble: 2 x 2
## GENDER mean_cholesterol
## <fctr> <dbl>
## 1 M 183.6392
## 2 F 187.3903
We find the mean cholesterol levels for females are little higher than males.
The summarise()
has a useful function called the count()
that will count the number of items appearing in a group.
For example, here is an analysis that shows the relationship between HDL and LDL for each age, grouped by the number of patients with the particular age.
# First group the patients by age
by_AGE <- group_by(HANES, SPAGE)
# Summarize by counting the number of people in that age, and finding the mean HDL and LDL values
DL <- summarise(by_AGE,
count = n(),
H = mean(HDL),
L = mean(LDLESTIMATE))
# We remove the groups with five people or less
DL <- filter(DL,count > 5)
# Plot this data, with point for count estimates and smoothing it by line
ggplot(data = DL, mapping = aes(x = H, y = L)) +
geom_point(aes(size = count), alpha = 1/3) +
geom_smooth(se = FALSE)
The big disadvantage in the above code is the necessity to re-type the variables again and again, which can be eliminated through the use of %>%
symbol.
As the analysis becomes complicated, piping becomes extremely useful. The following code is the piped version of the above code producing the exact same output.
# Pipe the HANES data to the group_by function, grouping by age
DL <- HANES %>% group_by(SPAGE) %>%
# and pipe this to summarize function
summarise(count = n(),
H = mean(HDL),
L = mean(LDLESTIMATE)) %>%
# And pipe this result to filter the count < 5
filter(count > 5)
# Plot this data, with point for count estimates and smoothing it by line
ggplot(data = DL, mapping = aes(x = H, y = L)) +
geom_point(aes(size = count), alpha = 1/3) +
geom_smooth(se = FALSE)
The use of the function na.rm()
inside summarise can be extremely useful as this example shows:
(Note: Since we already removed ‘NA’ from our dataset, we import the new data with NA in it)
# Load the package RCurl
library(RCurl)
# Import the HANES data set from GitHub; break the string into two for readability
# (Please note this readability aspect very carefully)
URL_text_1 <- "https://raw.githubusercontent.com/kannan-kasthuri/kannan-kasthuri.github.io"
URL_text_2 <- "/master/Datasets/HANES/NYC_HANES_DIAB.csv"
# Paste it to constitute a single URL
URL <- paste(URL_text_1,URL_text_2, sep="")
HANES_with_NA <- read.csv(text=getURL(URL))
Without removing NA
# Group by AGE and GENDER and
HANES_with_NA %>%
group_by(SPAGE,GENDER) %>%
# summarize the mean of cholesterol for each group without removing NA
summarise(mean = mean(CHOLESTEROLTOTAL))
## # A tibble: 143 x 3
## # Groups: SPAGE [?]
## SPAGE GENDER mean
## <int> <int> <dbl>
## 1 20 1 NA
## 2 20 2 NA
## 3 21 1 NA
## 4 21 2 NA
## 5 22 1 153.4286
## 6 22 2 NA
## 7 23 1 NA
## 8 23 2 NA
## 9 24 1 NA
## 10 24 2 NA
## # ... with 133 more rows
Removing NA
# Group by AGE and GENDER and
HANES_with_NA %>%
group_by(SPAGE,GENDER) %>%
# summarize the mean of cholesterol for each group by removing NA
summarise(mean = mean(CHOLESTEROLTOTAL, na.rm=TRUE))
## # A tibble: 143 x 3
## # Groups: SPAGE [?]
## SPAGE GENDER mean
## <int> <int> <dbl>
## 1 20 1 165.1250
## 2 20 2 157.3333
## 3 21 1 151.8000
## 4 21 2 154.2222
## 5 22 1 153.4286
## 6 22 2 167.2222
## 7 23 1 165.9286
## 8 23 2 169.7000
## 9 24 1 169.4286
## 10 24 2 173.9500
## # ... with 133 more rows
We can also filter through the missing values first -
# Filter the data and get no NA values
HANES_with_NA %>% filter(!is.na(CHOLESTEROLTOTAL)) %>%
# group by AGE and GENDER
group_by(SPAGE,GENDER) %>%
# and find the mean of cholesterol for each group
summarise(mean = mean(CHOLESTEROLTOTAL, na.rm=TRUE))
## # A tibble: 140 x 3
## # Groups: SPAGE [?]
## SPAGE GENDER mean
## <int> <int> <dbl>
## 1 20 1 165.1250
## 2 20 2 157.3333
## 3 21 1 151.8000
## 4 21 2 154.2222
## 5 22 1 153.4286
## 6 22 2 167.2222
## 7 23 1 165.9286
## 8 23 2 169.7000
## 9 24 1 169.4286
## 10 24 2 173.9500
## # ... with 130 more rows
Whenever we do aggregations using group_by()
it is good to include count information that will give the information on the distribution and extreme values.
# Group by AGE, GENDER
cholesterol <- HANES %>%
group_by(SPAGE,GENDER) %>%
# and summarise mean cholesterol
summarise(mean_cholesterol = mean(CHOLESTEROLTOTAL))
# We can then plot the count information
ggplot(data = cholesterol, mapping = aes(x = mean_cholesterol)) +
geom_freqpoly(binwidth = 10)
This reveals that there are patients with extremely high cholesterol (\(300\) or more).
How does the cholesterol vary over the age group? We can get more insight if we draw a plot of the ratio HDL/LDL and cholesterol for each agegroup:
# For each AGEGROUP find the mean H_L ratio and cholesterol information
CHO_L_D <- HANES %>%
group_by(AGEGROUP) %>%
summarise(H_L = mean(HDL) / mean(LDLESTIMATE),
CHO = mean(CHOLESTEROLTOTAL))
# Plot this information
CHO_L_D %>%
ggplot(mapping = aes(x = H_L, y = CHO)) +
geom_point() +
geom_smooth(se = FALSE)
This reveals the cholesterol peaks at the age group \(40-59\) and remains almost stable, while the age group \(20-39\) has the least cholesterol.
There are several useful summary functions:
Example 1: Finding the mean of CHOLESTEROLTOTAL and the mean of CHOLESTEROLTOTAL if GLUCOSE >100 for each age group.
DL <- HANES %>%
# For each AGEGROUP,
group_by(AGEGROUP) %>%
# find the mean of CHOLESTEROLTOTAL and the mean of CHOLESTEROLTOTAL when GLUCOSE content is > 100
summarise(count = n(),
H = mean(CHOLESTEROLTOTAL),
H_glu = mean(CHOLESTEROLTOTAL[GLUCOSE>100]))
# And print it
DL
## # A tibble: 3 x 4
## AGEGROUP count H H_glu
## <fctr> <int> <dbl> <dbl>
## 1 20-39 498 176.2209 182.6667
## 2 40-59 391 194.2199 188.9057
## 3 60+ 223 192.3857 186.2857
Example 2: Finding the standard deviation of CHOLESTEROLTOTAL by age group
DL <- HANES %>%
# For each AGEGROUP,
group_by(AGEGROUP) %>%
# sumamarise the standard deviation of CHOLESTEROLTOTAL and arrange in descending order
summarise(CHO_sd = sd(CHOLESTEROLTOTAL)) %>%
arrange(desc(CHO_sd))
# and print it
DL
## # A tibble: 3 x 2
## AGEGROUP CHO_sd
## <fctr> <dbl>
## 1 60+ 41.75793
## 2 40-59 36.72472
## 3 20-39 34.19996
Example 3: Finding largest and smallest CHOLESTEROLTOTAL for each diabetes diagnosis group.
DL <- HANES %>%
# For each DX_DBTS,
group_by(DX_DBTS) %>%
# find the largest and smallest CHOLESTEROLTOTAL value
summarise(Lartgest_CHO = max(CHOLESTEROLTOTAL), Smallest_CHO=min(CHOLESTEROLTOTAL))
# and print it
DL
## # A tibble: 3 x 3
## DX_DBTS Lartgest_CHO Smallest_CHO
## <fctr> <dbl> <dbl>
## 1 DIAB 315 105
## 2 DIAB NO_DX 246 132
## 3 NO DIAB 335 95
Example 4: Finding largest and smallest CHOLESTEROLTOTAL for each diabetes diagnosis group and by age group.
DL <- HANES %>%
# For each DX_DBTS and AGEGROUP,
group_by(DX_DBTS, AGEGROUP) %>%
# find the largest and smallest CHOLESTEROLTOTAL value
summarise(Lartgest_CHO = max(CHOLESTEROLTOTAL), Smallest_CHO=min(CHOLESTEROLTOTAL))
# and print it
DL
## # A tibble: 9 x 4
## # Groups: DX_DBTS [?]
## DX_DBTS AGEGROUP Lartgest_CHO Smallest_CHO
## <fctr> <fctr> <dbl> <dbl>
## 1 DIAB 20-39 228 138
## 2 DIAB 40-59 315 112
## 3 DIAB 60+ 304 105
## 4 DIAB NO_DX 20-39 221 157
## 5 DIAB NO_DX 40-59 238 141
## 6 DIAB NO_DX 60+ 246 132
## 7 NO DIAB 20-39 308 95
## 8 NO DIAB 40-59 335 98
## 9 NO DIAB 60+ 313 97
Grouping variables is extremely useful when used with the summarise()
function, but it can also be used with other operations such as mutate()
and filter()
.
Examples:
Among the diabetes diagnostic within age groups, find the members (top 3 ranks) with high glucose levels.
# Group by diabetes status and age group,
HANES %>% group_by(DX_DBTS, AGEGROUP) %>%
# filter for the top 3 ranks
filter(rank(desc(GLUCOSE)) < 3) %>%
# select only the relevant variables
select(DX_DBTS,AGEGROUP,GLUCOSE) %>%
# and arrange in descending order of glucose levels
arrange(desc(GLUCOSE))
## # A tibble: 19 x 3
## # Groups: DX_DBTS, AGEGROUP [9]
## DX_DBTS AGEGROUP GLUCOSE
## <fctr> <fctr> <int>
## 1 DIAB 60+ 490
## 2 DIAB 40-59 320
## 3 DIAB 40-59 297
## 4 DIAB 60+ 281
## 5 DIAB NO_DX 40-59 219
## 6 DIAB NO_DX 40-59 214
## 7 DIAB NO_DX 40-59 214
## 8 DIAB 20-39 212
## 9 DIAB 20-39 186
## 10 DIAB NO_DX 60+ 178
## 11 DIAB NO_DX 60+ 171
## 12 NO DIAB 40-59 159
## 13 NO DIAB 20-39 149
## 14 NO DIAB 60+ 147
## 15 DIAB NO_DX 20-39 140
## 16 NO DIAB 60+ 136
## 17 NO DIAB 40-59 125
## 18 NO DIAB 20-39 120
## 19 DIAB NO_DX 20-39 87
Find age group (not the AGEGROUP variable) greater than a threshold (say, \(30\)) and standardise A1C to compute per group metric.
# Group by SPAGE and filter groups that contain more than 30 individuals
AG_GT_30 <- HANES %>% group_by(SPAGE) %>%
filter(n() > 30)
# Filter for all records with A1C variable > 0 to avoid NaNs,
AG_GT_30 %>% filter(A1C > 0) %>%
# standardise A1C dividing by the total sum of A1C in the group
mutate(standardised_A1C = A1C / sum(A1C)) %>%
# and seelct only A1C and standardised A1C
select(A1C,standardised_A1C)
## # A tibble: 226 x 3
## # Groups: SPAGE [7]
## SPAGE A1C standardised_A1C
## <int> <dbl> <dbl>
## 1 29 5.0 0.03203075
## 2 28 5.2 0.03121248
## 3 27 4.8 0.03009404
## 4 24 5.1 0.03240152
## 5 30 4.3 0.02426637
## 6 30 5.3 0.02990971
## 7 28 4.7 0.02821128
## 8 30 4.7 0.02652370
## 9 23 4.9 0.02909739
## 10 30 5.2 0.02934537
## # ... with 216 more rows