We will first study basic ideas in statistical inference using our HANES data set.
# Load the package RCurl
library(RCurl)
# Import the HANES data set from GitHub; break the string into two for readability
# (splitting the long URL keeps the code lines at a readable width)
URL_text_1 <- "https://raw.githubusercontent.com/kannan-kasthuri/kannan-kasthuri.github.io"
URL_text_2 <- "/master/Datasets/HANES/NYC_HANES_DIAB.csv"
# Paste it to constitute a single URL
URL <- paste(URL_text_1,URL_text_2, sep="")
HANES <- read.csv(text=getURL(URL))
# Rename the GENDER factor for identification
HANES$GENDER <- factor(HANES$GENDER, labels=c("M","F"))
# Rename the AGEGROUP factor for identification
HANES$AGEGROUP <- factor(HANES$AGEGROUP, labels=c("20-39","40-59","60+"))
# Rename the HSQ_1 factor for identification
HANES$HSQ_1 <- factor(HANES$HSQ_1, labels=c("Excellent","Very Good","Good", "Fair", "Poor"))
# Rename the DX_DBTS as a factor
HANES$DX_DBTS <- factor(HANES$DX_DBTS, labels=c("DIAB","DIAB NO_DX","NO DIAB"))
# Omit all NA from the data frame
HANES <- na.omit(HANES)
# Observe the structure
str(HANES)
# Load the tidyverse library
library(tidyverse)
# Set the seed for reproducibility
set.seed(23960671)
## 'data.frame': 1112 obs. of 23 variables:
## $ KEY : Factor w/ 1527 levels "133370A","133370B",..: 28 43 44 53 55 70 84 90 100 107 ...
## $ GENDER : Factor w/ 2 levels "M","F": 1 1 1 1 1 1 1 1 1 1 ...
## $ SPAGE : int 29 28 27 24 30 26 31 32 34 32 ...
## $ AGEGROUP : Factor w/ 3 levels "20-39","40-59",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ HSQ_1 : Factor w/ 5 levels "Excellent","Very Good",..: 2 2 2 1 1 3 1 2 1 3 ...
## $ UCREATININE : int 105 53 314 105 163 150 46 36 177 156 ...
## $ UALBUMIN : num 0.707 1 8 4 3 2 2 0.707 4 3 ...
## $ UACR : num 0.00673 2 3 4 2 ...
## $ MERCURYU : num 0.37 0.106 0.487 2.205 0.979 ...
## $ DX_DBTS : Factor w/ 3 levels "DIAB","DIAB NO_DX",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ A1C : num 5 5.2 4.8 5.1 4.3 5.2 4.8 5.2 4.8 5.2 ...
## $ CADMIUM : num 0.2412 0.1732 0.0644 0.0929 0.1202 ...
## $ LEAD : num 1.454 1.019 0.863 1.243 0.612 ...
## $ MERCURYTOTALBLOOD: num 2.34 2.57 1.32 14.66 2.13 ...
## $ HDL : int 42 51 42 61 52 50 57 56 42 44 ...
## $ CHOLESTEROLTOTAL : int 184 157 145 206 120 155 156 235 156 120 ...
## $ GLUCOSESI : num 4.61 4.77 5.16 5 5.11 ...
## $ CREATININESI : num 74.3 73 80 84.9 66 ...
## $ CREATININE : num 0.84 0.83 0.91 0.96 0.75 0.99 0.9 0.84 0.93 1.09 ...
## $ TRIGLYCERIDE : int 156 43 108 65 51 29 31 220 82 35 ...
## $ GLUCOSE : int 83 86 93 90 92 85 72 87 96 92 ...
## $ COTININE : num 31.5918 0.0635 0.035 0.0514 0.035 ...
## $ LDLESTIMATE : int 111 97 81 132 58 99 93 135 98 69 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:415] 2 15 16 24 26 28 33 34 35 39 ...
## .. ..- attr(*, "names")= chr [1:415] "2" "15" "16" "24" ...
Also, we need the package "rafalib" for these materials. So, let's go ahead and install that package and load the library.
# Install the rafalib package
install.packages("rafalib")
# Load rafalib library
library(rafalib)
According to the central limit theorem, if \(V\) and \(Z\) are random variables representing two populations with
means \(\mu_{V}\) and \(\mu_{Z}\) and
standard deviations \(\sigma_{V}\) and \(\sigma_{Z}\), and we draw independent random samples of size \(N\) from each with sample means \(\bar{V}\) and \(\bar{Z}\), then under the null hypothesis \(\mu_{V} = \mu_{Z}\) the ratio
\[ R = \frac{\bar{Z}-\bar{V}}{\sqrt{\frac{\sigma_Z^2}{N} + \frac{\sigma_V^2}{N}}} \] approaches a normal distribution centered at \(0\) with standard deviation \(1\) as \(N\) becomes large.
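A quick simulation illustrates this convergence; the populations below are made up for illustration (normal with equal means) and are not the HANES data:
# Simulation sketch: two hypothetical populations with equal means,
# so R should look standard normal (illustrative assumptions only)
N <- 50
R <- replicate(10000, {
  v <- rnorm(N, mean = 5, sd = 2)   # sample of size N from population V
  z <- rnorm(N, mean = 5, sd = 3)   # sample of size N from population Z
  (mean(z) - mean(v)) / sqrt(3^2/N + 2^2/N)
})
c(mean(R), sd(R))   # should be close to 0 and 1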
In general, we will not know these population means and standard deviations; we only observe random samples \(X\) and \(Y\) from the two populations.
If \(V\) and \(Z\) are normally distributed, the estimates of \(\mu_{V}\) and \(\mu_{Z}\) given by the sample means \(\bar{X}\) and \(\bar{Y}\) are also normally distributed, and so is their difference.
Therefore, the ratio \(\hat{R}\), described by
\[ \hat{R} = \frac{\bar{Y}-\bar{X}}{\sqrt{\frac{\sigma_X^2}{N} + \frac{\sigma_Y^2}{N}}} \]
can be used to compute p-values, simply because we know the proportion of the standard normal distribution beyond any value.
For example, only about 7% of these values are larger than 1.8 in absolute value:
pnorm(-1.8) + (1 - pnorm(1.8))
## [1] 0.07186064
Thus, if we know that the populations are normally distributed (or \(N\) is large) and we know the population standard deviations, we can use this approximation. This is known as the \(Z\)-test.
Most of the time, these conditions are not met.
For the first condition, that the population is normally distributed, we cannot know the population distribution in practice, but we can check whether the samples look approximately normal.
This is done by plotting qq-plots,
which confirm whether the distributions are relatively close to normal.
We will work with the glycated hemoglobin variable, A1C.
We will consider people without diabetes as the control group, representing the (sample) random variable \(Y\), and
people with diabetes as the disease group, representing the (sample) random variable \(X\).
# Extract only the A1C variable for the control group
control <- filter(HANES, DX_DBTS == "NO DIAB") %>% select(A1C)
# and the disease group
disease <- filter(HANES, DX_DBTS == "DIAB") %>% select(A1C)
as.tibble(control)
as.tibble(disease)
## # A tibble: 977 x 1
## A1C
## <dbl>
## 1 5.0
## 2 5.2
## 3 4.8
## 4 5.1
## 5 4.3
## 6 5.2
## 7 4.8
## 8 5.2
## 9 4.8
## 10 5.2
## # ... with 967 more rows
## # A tibble: 104 x 1
## A1C
## <dbl>
## 1 8.3
## 2 5.3
## 3 8.7
## 4 5.1
## 5 10.3
## 6 6.1
## 7 10.3
## 8 7.7
## 9 6.7
## 10 7.5
## # ... with 94 more rows
We can now make a qqplot for the control variable:
# Find the 1st and 3rd quartiles
y <- quantile(unlist(control), c(0.25, 0.75))
# Find the matching normal values on the x-axis
x <- qnorm( c(0.25, 0.75))
# Compute the line slope
slope <- diff(y) / diff(x)
# Compute the line intercept
int <- y[1] - slope * x[1]
# Generate normal q-q plot
ggplot() + aes(sample=unlist(control)) + stat_qq(distribution=qnorm) +
geom_abline(intercept=int, slope=slope) + ylab("Sample") +
labs(title = "QQ Plot for Control Variable")
Similarly we can make a qqplot for the disease variable.
# Find the 1st and 3rd quartiles
y <- quantile(unlist(disease), c(0.25, 0.75))
# Find the matching normal values on the x-axis
x <- qnorm( c(0.25, 0.75))
# Compute the line slope
slope <- diff(y) / diff(x)
# Compute the line intercept
int <- y[1] - slope * x[1]
# Generate normal q-q plot
ggplot() + aes(sample=unlist(disease)) + stat_qq(distribution=qnorm) +
geom_abline(intercept=int, slope=slope) + ylab("Sample") +
labs(title = "QQ Plot for Disease Variable")
We note that the control variable is more normal-like than the disease variable.
Thus, a \(Z\)-test may not be appropriate.
Even if the disease variable were normally distributed, a major disadvantage of a \(Z\)-test is that it requires the population standard deviations, which we will not know in reality.
Classwork/Homework: Check if the GLUCOSE values for the non-diabetic male and female population are close to normal by constructing qqplots.
Even if we don't know the population standard deviations, we can estimate them with the sample standard deviations:
\[ s_X^2 = \frac{1}{N-1} \sum_{i=1}^N (x_i - \hat{X})^2 \mbox{ and } s_Y^2 = \frac{1}{N-1} \sum_{i=1}^N (y_i - \hat{Y})^2 \]
to calculate the distribution of:
\[ t = \sqrt{N} \, \frac{\bar{Y}-\bar{X}}{\sqrt{s_{X}^{2}+s_{Y}^{2}}} \] which is approximately normal for large \(N\). This is called the \(t\)-statistic and its distribution is called Student's t-distribution. (The formula above assumes both samples have the same size \(N\).)
Although the disease variable in our case is not really normally distributed, let us suppose it were; we can then apply the t-test to find the p-value:
# Apply t-test to disease vs. control population
t.test(disease,control)
##
## Welch Two Sample t-test
##
## data: disease and control
## t = 12.989, df = 104.07, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.016100 2.742633
## sample estimates:
## mean of x mean of y
## 7.664423 5.285056
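As a sanity check, the Welch t-statistic can be reproduced by hand from the sample means and variances (a sketch reusing the control and disease objects defined above):
# Reproduce the Welch t-statistic by hand
a1c_control <- unlist(control)
a1c_disease <- unlist(disease)
(mean(a1c_disease) - mean(a1c_control)) /
  sqrt(var(a1c_disease)/length(a1c_disease) + var(a1c_control)/length(a1c_control))
# should match the t value reported by t.test() above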
Classwork/Homework: Perform a t-test between the male and female populations in the non-diabetic subpopulation based on the GLUCOSE variable.
p-values are everywhere in life sciences. But it is not good to report only p-values for the simple reason that statistical significance doesn’t mean causality or scientific significance.
A better alternative is to report a confidence interval.
Confidence intervals include information about the uncertainty associated with this estimate.
A 95% confidence interval (we can use percentages other than 95%) is a random interval with a 95% probability of falling on the parameter we are estimating.
To construct it, we note that the central limit theorem tells us that \(\sqrt{N} (\bar{X}-\mu_X) / s_X\) follows a normal distribution with mean 0 and standard deviation 1.
This implies that the event
\[ -p \leq \sqrt{N} (\bar{X}-\mu_X)/s_X \leq p \]
(where \(p \approx 2\) is chosen so that the interval \([-p, p]\) covers 95% of the standard normal distribution) has probability about 95%, because:
# Calculate the probability of the above variable
pnorm(2) - pnorm(-2)
## [1] 0.9544997
Doing some algebra on the above inequality, we see that the event
\[ \bar{X}-p \, s_X/\sqrt{N} \leq \mu_X \leq \bar{X}+p \, s_X/\sqrt{N} \]
also has probability about 95%; that is, the random interval \(\bar{X} \pm p \, s_X/\sqrt{N}\) covers the population mean \(\mu_X\) with 95% probability.
Thus, we can compute this interval for the disease group:
# Find the estimate of s_{X}/sqrt(N)
se <- sd(disease$A1C)/sqrt(length(disease$A1C))
# Find the normal quantile p for a 95% interval
p <- qnorm(1- 0.05/2)
# Compute the confidence interval
interval <- c(mean(disease$A1C)-p*se, mean(disease$A1C)+p*se)
# Check that the sample mean lies in the interval (true by construction,
# since the interval is centered at it; the unknown population mean is what
# the interval covers with 95% probability)
interval[1] < mean(disease$A1C) & interval[2] > mean(disease$A1C)
## [1] TRUE
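For comparison, t.test() reports a t-based interval for the same mean; it uses t quantiles instead of the normal quantile and is slightly wider:
# A t-based 95% confidence interval for the mean A1C in the disease group
t.test(disease$A1C)$conf.int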
Classwork/Homework: Report the confidence interval for the mean of the GLUCOSE variable for both the female and male populations in the non-diabetic subpopulation. Do they overlap?
Several life science and biomedical data sets come not just with continuous variables, but with categorical, binary and ordinal variables.
For example, whether or not a locus in the genome carries a mutation is a binary variable over a set of loci.
Association tests are extremely useful in determining whether such variables exhibit association between the categories.
For example, let's consider the MIMIC3 admissions data:
# Load the package RCurl
library(RCurl)
# Import the admissions data set in MIMIC3 from GitHub;
URL_text_1 <- "https://raw.githubusercontent.com/kannan-kasthuri/kannan-kasthuri.github.io"
URL_text_2 <- "/master/Datasets/MIMIC3/admissions.csv"
URL <- paste(URL_text_1,URL_text_2, sep="")
MIMIC3_ADM <- read.csv(text=getURL(URL))
# Observe the structure
str(MIMIC3_ADM)
## 'data.frame': 5770 obs. of 19 variables:
## $ row_id : int 83 84 85 86 87 88 89 90 91 92 ...
## $ subject_id : int 82 83 84 84 85 85 86 87 88 89 ...
## $ hadm_id : int 110641 158569 120969 166401 116630 112077 190243 190659 123010 188646 ...
## $ admittime : Factor w/ 5310 levels "2100-07-04","2100-07-09",..: 2665 2240 5030 5037 3331 3607 2458 4778 569 4458 ...
## $ dischtime : Factor w/ 5767 levels "2100-07-17 15:00:00",..: 2878 2415 5462 5469 3603 3901 2651 5204 609 4834 ...
## $ deathtime : Factor w/ 569 levels "2100-09-13 21:20:00",..: NA NA NA 538 NA NA NA NA NA NA ...
## $ admission_type : Factor w/ 4 levels "ELECTIVE","EMERGENCY",..: 3 4 1 2 2 2 1 3 2 3 ...
## $ admission_location : Factor w/ 9 levels "** INFO NOT AVAILABLE **",..: 5 6 5 3 2 2 5 5 3 5 ...
## $ discharge_location : Factor w/ 16 levels "DEAD/EXPIRED",..: 5 6 5 1 14 16 6 15 5 15 ...
## $ insurance : Factor w/ 5 levels "Government","Medicaid",..: 4 3 4 4 3 3 4 4 4 2 ...
## $ language : Factor w/ 28 levels "*ARM","*CDI",..: NA NA NA NA 11 11 NA NA NA NA ...
## $ religion : Factor w/ 17 levels "7TH DAY ADVENTIST",..: 17 17 13 13 3 3 12 17 NA 17 ...
## $ marital_status : Factor w/ 7 levels "DIVORCED","LIFE PARTNER",..: NA 3 3 3 3 3 5 NA NA NA ...
## $ ethnicity : Factor w/ 37 levels "AMERICAN INDIAN/ALASKA NATIVE",..: 27 32 33 33 33 33 33 32 10 32 ...
## $ edregtime : Factor w/ 3065 levels "2100-07-09 05:56:00",..: NA NA NA 2904 NA 2075 NA NA 316 NA ...
## $ edouttime : Factor w/ 3065 levels "2100-07-09 15:55:00",..: NA NA NA 2904 NA 2075 NA NA 316 NA ...
## $ diagnosis : Factor w/ 2313 levels " AORTIC ABDOMINAL ANEURYSM/SDA",..: 1505 450 1337 985 242 1642 330 1505 1920 1505 ...
## $ hospital_expire_flag: int 0 0 0 1 0 0 0 0 0 0 ...
## $ has_chartevents_data: int 1 1 0 1 1 1 1 1 1 1 ...
Let us say we want to find whether the odds of finding separated black people with Medicaid are about the same as the odds of finding them with Government insurance.
To do this, we first need to find the number of people who are separated and have Medicaid or Government insurance, and find their ethnicity.
Let’s do that:
# Find the count of insurance and marital status
insurance_marital.status <- MIMIC3_ADM %>%
group_by(insurance, marital_status) %>% count()
insurance_marital.status
# Find the number of people with medicaid
folks_with_medicaid <- MIMIC3_ADM %>%
filter(insurance == "Medicaid") %>% count()
folks_with_medicaid$n
# Find the ethnicities for the separated Medicaid group
MIMIC3_ADM %>%
filter(insurance =="Medicaid" & marital_status == "SEPARATED") %>%
group_by(ethnicity) %>% count()
# Find the ethnicities for the separated Government group
MIMIC3_ADM %>%
filter(insurance =="Government" & marital_status == "SEPARATED") %>%
group_by(ethnicity) %>% count()
## # A tibble: 34 x 3
## # Groups: insurance, marital_status [34]
## insurance marital_status n
## <fctr> <fctr> <int>
## 1 Government DIVORCED 13
## 2 Government MARRIED 33
## 3 Government SEPARATED 3
## 4 Government SINGLE 93
## 5 Government WIDOWED 8
## 6 Government NA 31
## 7 Medicaid DIVORCED 38
## 8 Medicaid MARRIED 100
## 9 Medicaid SEPARATED 12
## 10 Medicaid SINGLE 309
## # ... with 24 more rows
## [1] 622
## # A tibble: 4 x 2
## # Groups: ethnicity [4]
## ethnicity n
## <fctr> <int>
## 1 BLACK/AFRICAN AMERICAN 3
## 2 HISPANIC OR LATINO 1
## 3 OTHER 3
## 4 WHITE 5
## # A tibble: 2 x 2
## # Groups: ethnicity [2]
## ethnicity n
## <fctr> <int>
## 1 BLACK/AFRICAN AMERICAN 2
## 2 HISPANIC OR LATINO 1
In particular, the question we ask about the data is the following:
knowing that \(3\) out of \(5\) separated black patients have Medicaid insurance, and
\(2\) out of \(5\) have Government insurance,
can we REJECT the null hypothesis (at the \(0.05\) level) that the odds of finding separated black patients with Medicaid are the same as the odds of finding them with Government insurance?
To answer this question we can perform Fisher’s exact test:
Fisher’s exact test
We set up the contingency table as follows:
# Make a data table and define row names and column names
tab <- matrix(c(3,2,9,1),2,2)
colnames(tab)<-c("BLACK","NON-BLACK")
rownames(tab)<-c("MEDICAID","GOVERNMENT")
tab
# Run Fisher's exact test
fisher.test(tab)
## BLACK NON-BLACK
## MEDICAID 3 9
## GOVERNMENT 2 1
##
## Fisher's Exact Test for Count Data
##
## data: tab
## p-value = 0.2418
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.002558237 4.897124851
## sample estimates:
## odds ratio
## 0.1920886
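Note that the odds ratio reported by fisher.test() is the conditional maximum likelihood estimate; the simple cross-product ratio can be computed directly from the table:
# The simple cross-product odds ratio from the contingency table
(tab[1,1]/tab[1,2]) / (tab[2,1]/tab[2,2])
# fisher.test() reports the conditional maximum likelihood estimate,
# which differs somewhat from this cross-product ratio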
The p-value of \(0.24\) means we cannot reject the null hypothesis at the \(0.05\) level: the data are consistent with the odds of finding separated black people with Medicaid being about the same as with Government insurance.
Chi-square test
In the above example we had a sample size of \(15\).
If the sample sizes are large, the chi-square test gives a good approximation and can be used to test for association.
Let's use the chi-square test to see if there is an association between ethnicity and the diagnosis of a disease.
We will consider two ethnicities - WHITE and BLACK and the diagnosis of DIABETIC KETOACIDOSIS.
# Find the diagnosis and ethnicity info by grouping
diagnosis_ethnicity <- MIMIC3_ADM %>%
group_by(diagnosis, ethnicity) %>% count() %>% arrange(desc(n))
# Display the table
diagnosis_ethnicity
# Select only the required ethnicities
MIMIC3_ADM %>%
filter((ethnicity == "BLACK/AFRICAN AMERICAN" | ethnicity == "WHITE") &
diagnosis == "DIABETIC KETOACIDOSIS") %>%
group_by(ethnicity,diagnosis) %>% count()
# Find the count of the population
black_pop <- MIMIC3_ADM %>% filter(ethnicity == "BLACK/AFRICAN AMERICAN") %>% count()
white_pop <- MIMIC3_ADM %>% filter(ethnicity == "WHITE") %>% count()
black_pop$n
white_pop$n
## # A tibble: 2,861 x 3
## # Groups: diagnosis, ethnicity [2,861]
## diagnosis
## <fctr>
## 1 NEWBORN
## 2 PNEUMONIA
## 3 SEPSIS
## 4 NEWBORN
## 5 NEWBORN
## 6 CORONARY ARTERY DISEASE\CORONARY ARTERY BYPASS GRAFT /SDA
## 7 INTRACRANIAL HEMORRHAGE
## 8 CORONARY ARTERY DISEASE
## 9 GASTROINTESTINAL BLEED
## 10 ALTERED MENTAL STATUS
## # ... with 2,851 more rows, and 2 more variables: ethnicity <fctr>,
## # n <int>
## # A tibble: 2 x 3
## # Groups: ethnicity, diagnosis [2]
## ethnicity diagnosis n
## <fctr> <fctr> <int>
## 1 BLACK/AFRICAN AMERICAN DIABETIC KETOACIDOSIS 33
## 2 WHITE DIABETIC KETOACIDOSIS 18
## [1] 543
## [1] 4000
From the above analysis, we see that \(33\) black and \(18\) white patients are affected by DIABETIC KETOACIDOSIS.
Subtracting from the total counts (543 and 4000), we have 510 and 3982 patients, respectively, unaffected by the disease.
This information would allow us to perform the chi-square test by forming a contingency table:
# Make a data table and define row names and column names
tab <- matrix(c(510,33,3982,18),2,2)
colnames(tab)<-c("BLACK","WHITE")
rownames(tab)<-c("Unaffected","DIAB.KET")
tab
# Run chi-square test
chisq.test(tab)
## BLACK WHITE
## Unaffected 510 3982
## DIAB.KET 33 18
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tab
## X-squared = 131.37, df = 1, p-value < 2.2e-16
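To see what drives this result, we can compare the proportion diagnosed in each group directly from the table:
# Proportion of each group admitted with diabetic ketoacidosis
tab["DIAB.KET", ] / colSums(tab)
# roughly 6% of the black group versus about 0.5% of the white group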
Thus, from the small p-value we infer that there is a significant association between DIABETIC KETOACIDOSIS and ethnicity in this data set.
Apparently, this disease is more common among the black and hispanic populations, as one can see from the reference.
Classwork/Homework: Perform chi-square test and report association (or not) between the variables of your choice in your favorite data set.
For several data sets, the normal distribution may be too much of an imposition to run tests such as t-tests.
In such cases, the Wilcoxon rank-sum test (also known as the Mann-Whitney test) can be an alternative to a t-test.
It is a non-parametric method that can be used to test whether two samples are drawn from the same distribution.
Wilcoxon rank-sum test
The basic idea is to
combine all the data,
turn the values into ranks,
separate them back into their groups, and
compute the sum or average rank and perform a test.
In R, the wilcox.test()
function can be used to perform this test.
Let's perform this test on the MERCURYU variable for the diabetic and non-diabetic populations.
# Get the MERCURYU variable for the diabetic population
mercuryu_diab <- HANES %>% filter(DX_DBTS == "DIAB") %>% select(MERCURYU)
# Get the MERCURYU variable for the non-diabetic population
mercuryu_non_diab <- HANES %>% filter(DX_DBTS == "NO DIAB") %>% select(MERCURYU)
# Run the Wilcoxon rank sum test
wilcox.test(unlist(mercuryu_diab), unlist(mercuryu_non_diab))
##
## Wilcoxon rank sum test with continuity correction
##
## data: unlist(mercuryu_diab) and unlist(mercuryu_non_diab)
## W = 45818, p-value = 0.09662
## alternative hypothesis: true location shift is not equal to 0
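Following the recipe above, the W statistic can be reproduced by hand from the ranks (a sketch reusing the mercuryu vectors):
# Reproduce the rank-sum W statistic by hand
xw <- unlist(mercuryu_diab)
yw <- unlist(mercuryu_non_diab)
r <- rank(c(xw, yw))
sum(r[seq_along(xw)]) - length(xw) * (length(xw) + 1) / 2
# should match the W reported by wilcox.test() above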
Classwork/Homework: Perform the Wilcoxon rank-sum test on any two age groups for the CHOLESTEROLTOTAL variable. Do you see any significant difference?
There are other tests (than qqplots) to check if a sample/variable is drawn from a normal distribution or if two samples/variables are drawn from the same distribution.
Shapiro-Wilk test
The Shapiro-Wilk test can be used to check whether a variable is drawn from a normal distribution.
The null hypothesis, \(H_{0}\), here is that the data are drawn from a normal distribution, so a low p-value means we should reject \(H_{0}\).
We will check if the TRIGLYCERIDE variable has a normal distribution.
# Get the TRIGLYCERIDE variable in the HANES data set
HANES_TRIGLYCERIDE <- HANES %>% select(TRIGLYCERIDE)
# Perform Shapiro test
shapiro.test(unlist(HANES_TRIGLYCERIDE))
##
## Shapiro-Wilk normality test
##
## data: unlist(HANES_TRIGLYCERIDE)
## W = 0.85149, p-value < 2.2e-16
The extremely low p-value suggests that the TRIGLYCERIDE variable does not have a normal distribution.
Kolmogorov-Smirnov test
The Kolmogorov-Smirnov test can be applied to test whether two samples/variables are drawn from the same distribution.
The null hypothesis here is that the two variables are drawn from the same distribution.
We can check whether the two variables TRIGLYCERIDE and COTININE have the same distribution.
# Get the TRIGLYCERIDE variable in the HANES data set
HANES_TRIGLYCERIDE <- HANES %>% select(TRIGLYCERIDE)
# Get the COTININE variable in the HANES data set
HANES_COTININE <- HANES %>% select(COTININE)
# Perform the Kolmogorov-Smirnov test
ks.test(unlist(HANES_TRIGLYCERIDE), unlist(HANES_COTININE))
##
## Two-sample Kolmogorov-Smirnov test
##
## data: unlist(HANES_TRIGLYCERIDE) and unlist(HANES_COTININE)
## D = 0.80845, p-value < 2.2e-16
## alternative hypothesis: two-sided
According to this test, the difference between the distributions of TRIGLYCERIDE and COTININE is significant (p < 0.05), so we conclude that they have different distributions.
Classwork/Homework: Perform ks-test on three different variable-pairs in the HANES dataset. Report the findings.
Correlation tests for a linear relationship between variables.
The function cor.test(x,y)
tests for correlation between two continuous variables \(x\) and \(y\).
The null hypothesis is that there is no correlation between the variables.
We saw in the exploratory data analysis section that the variables LDLESTIMATE and CHOLESTEROLTOTAL seemed to exhibit a linear relationship.
We will verify that here:
# Get the LDLESTIMATE variable in the HANES data set
HANES_LDLESTIMATE <- HANES %>% select(LDLESTIMATE)
# Get the CHOLESTEROLTOTAL variable in the HANES data set
HANES_CHOLESTEROLTOTAL <- HANES %>% select(CHOLESTEROLTOTAL)
# Perform correlation test on the variables using cor.test
cor.test(unlist(HANES_LDLESTIMATE), unlist(HANES_CHOLESTEROLTOTAL))
##
## Pearson's product-moment correlation
##
## data: unlist(HANES_LDLESTIMATE) and unlist(HANES_CHOLESTEROLTOTAL)
## t = 65.849, df = 1110, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8796454 0.9036752
## sample estimates:
## cor
## 0.8922906
We see that there is a high degree of correlation between LDLESTIMATE and CHOLESTEROLTOTAL.
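Since Pearson's correlation is sensitive to non-normal data, a rank-based Spearman correlation is a useful robustness check (a sketch; with ties the p-value is approximate):
# Spearman (rank-based) correlation as a robustness check
cor.test(unlist(HANES_LDLESTIMATE), unlist(HANES_CHOLESTEROLTOTAL),
         method = "spearman")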
There are other important tests. The packages lawstat
and outliers
have good collections of tests and of methods for detecting outliers, respectively.
Classwork/Homework: Perform correlation tests on three different variable-pairs in the HANES dataset. Report the correlations.
The aim of linear regression is to establish a linear relationship (in the parameters) between the predictor variable(s) and the response variable.
With such a relationship, we can use the fitted formula to estimate the value of the response \(Y\) when only the values of the predictors \(X\) are known.
We saw that there is a good correlation between LDLESTIMATE and CHOLESTEROLTOTAL suggesting there is possibly a linear relationship between them.
Assuming there is a linear relationship, is it possible to predict the values of CHOLESTEROLTOTAL when we have the values for LDLESTIMATE?
We will try to do this in what follows.
First we will get these two variables from the data set.
# Get the LDLESTIMATE and CHOLESTEROLTOTAL variables from the HANES data set
mydata <- HANES %>% select(LDLESTIMATE,CHOLESTEROLTOTAL)
# Take 30 random points (subsample) for illustration purposes
mysample <- mydata[sample(1:nrow(mydata), 30, replace=FALSE),]
as.tibble(mysample)
## # A tibble: 30 x 2
## LDLESTIMATE CHOLESTEROLTOTAL
## * <int> <int>
## 1 246 335
## 2 194 263
## 3 68 143
## 4 85 162
## 5 53 132
## 6 174 248
## 7 120 188
## 8 88 156
## 9 133 238
## 10 196 258
## # ... with 20 more rows
Before fitting a linear model, typically, for each of the independent variables, the following three plots are drawn to visualize its behavior:
Scatter plot: helps to visualize the linear relationship between the predictor and response variables.
Box plot: helps detect outliers. Outliers in a predictor can drastically affect the predictions, as they can easily change the slope of the line of best fit.
Density plot: helps us see the distribution of the predictor variable. Although normality of the variables is not needed for ordinary least squares (OLS) regression (it is only needed for the residuals), this is a sanity check that the distributions are not utterly skewed.
We will check these plots now.
# Make a scatter plot of the samples between LDL and Cholesterol
ggplot(mysample, aes(LDLESTIMATE, CHOLESTEROLTOTAL)) + geom_point() + geom_smooth()
The scatter plot along with the smoothing line above suggests a linearly increasing relationship between LDL and Cholesterol variables.
This is a good thing because one of the underlying assumptions in linear regression is that the relationship between the response and predictor variables is linear and additive.
Now let’s do a box plot to determine if there are any major outliers.
# We first gather the two variables to make a box plot in one plot
LDL_CHOL_gather <- mysample %>%
gather(`LDLESTIMATE`, `CHOLESTEROLTOTAL`, key = "Variables", value = "Values")
# View as a tibble
as.tibble(LDL_CHOL_gather)
# Make a box plot
ggplot(LDL_CHOL_gather, aes(Variables, Values)) +
geom_boxplot(fill='cyan', color="darkred", alpha = 0.5)
## # A tibble: 60 x 2
## Variables Values
## <chr> <int>
## 1 LDLESTIMATE 246
## 2 LDLESTIMATE 194
## 3 LDLESTIMATE 68
## 4 LDLESTIMATE 85
## 5 LDLESTIMATE 53
## 6 LDLESTIMATE 174
## 7 LDLESTIMATE 120
## 8 LDLESTIMATE 88
## 9 LDLESTIMATE 133
## 10 LDLESTIMATE 196
## # ... with 50 more rows
Thus, we see there aren’t any outliers.
Next we will study the density plots for these variables.
# Make a density plot for sanity check
ggplot(LDL_CHOL_gather, aes(x=Values, fill=Variables)) +
geom_density(alpha=0.5) + xlim(0,350)
The distributions do not appear badly skewed, so this check is reasonable.
Thus, we may proceed with fitting the linear model, since the correlation and the visual/graphical analysis look okay.
In R, the function lm()
builds a linear model.
The lm()
function takes two main arguments, namely a formula and data.
The data is typically a data frame, and the formula is an object of class formula.
# build linear regression model on the sampled data
linearMod <- lm(CHOLESTEROLTOTAL ~ LDLESTIMATE, data=mysample)
print(linearMod)
##
## Call:
## lm(formula = CHOLESTEROLTOTAL ~ LDLESTIMATE, data = mysample)
##
## Coefficients:
## (Intercept) LDLESTIMATE
## 76.225 1.001
The above model establishes the linear relationship between the predictor variable LDLESTIMATE and the response variable CHOLESTEROLTOTAL in the form:
\[ CHOLESTEROLTOTAL = Intercept+Coefficient[LDLESTIMATE]* LDLESTIMATE \]
And thus here CHOLESTEROLTOTAL = 76.23 + 1.00 * LDLESTIMATE.
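With the fitted model, we can also predict CHOLESTEROLTOTAL for new LDL values; a short sketch using hypothetical LDL values of 100 and 150:
# Predict CHOLESTEROLTOTAL for hypothetical LDL values of 100 and 150,
# together with 95% prediction intervals
predict(linearMod, newdata = data.frame(LDLESTIMATE = c(100, 150)),
        interval = "prediction")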
We can plot the relationship by making the size of the point to reflect the residual:
# We first get the predicted values and residuals
mysample$predicted <- predict(linearMod)
mysample$residuals <- residuals(linearMod)
# Use the lm function to smooth and plot the segment
ggplot(mysample, aes(x = LDLESTIMATE, y = CHOLESTEROLTOTAL)) +
geom_smooth(method = "lm", se = FALSE, color = "lightgrey") +
geom_segment(aes(xend = LDLESTIMATE, yend = predicted), alpha = .2) +
# Color the residuals and adjust their size and remove the size legend
geom_point(aes(color = abs(residuals), size = abs(residuals))) +
scale_color_continuous(low = "black", high = "red") +
guides(color = FALSE, size = FALSE) +
# and finally plot the predicted values
geom_point(aes(y = predicted), shape = 1) +
theme_bw()
Now that we have a formula to predict cholesterol values given LDL values, before we can use it we need to check that the fit is statistically sound.
This can be done with the summary()
function.
# Summarize the predicted linear model
summary(linearMod)
##
## Call:
## lm(formula = CHOLESTEROLTOTAL ~ LDLESTIMATE, data = mysample)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.342 -7.123 2.644 7.426 28.660
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 76.22516 6.40138 11.91 1.79e-12 ***
## LDLESTIMATE 1.00087 0.04836 20.70 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.58 on 28 degrees of freedom
## Multiple R-squared: 0.9386, Adjusted R-squared: 0.9364
## F-statistic: 428.3 on 1 and 28 DF, p-value: < 2.2e-16
The summary gives us several statistics.
The following describes what they mean.
Formula Call
This is a straight-forward description of the call we issued.
Residuals
Residuals are the difference between the actual observed response values and the response values that the model predicted (distance to CHOLESTEROLTOTAL values from the predicted line in our case).
The Residuals section of the summary breaks it down into 5 summary points, Min, 1Q, Median, 3Q and Max.
In OLS the mean of the residuals is exactly zero (see the assumptions section below), so the median should also be close to zero if the residuals are symmetric.
The closer it is to zero, the better.
This is the case in our example, where the median is 2.64, small relative to the residual range.
Coefficients
This section talks about the coefficients of the model.
In a simple linear model, we have some number of unknown coefficients (here two) that need to be determined.
The coefficient that is independent of the predictor variable is the intercept.
The estimates of each of these coefficients come with statistics.
Coefficient - Estimate
This gives the estimated values of the coefficients.
The intercept estimate is the predicted CHOLESTEROLTOTAL when LDLESTIMATE is zero; since the slope here is essentially 1, it also says that on average CHOLESTEROLTOTAL exceeds LDLESTIMATE by about 76.23.
The LDLESTIMATE coefficient estimate gives the slope.
Coefficient - Standard Error
This measures the average amount by which the coefficient estimates would vary if the model were fit to repeated samples.
We would ideally want to see standard errors that are low relative to the coefficients.
The standard errors can be used to construct confidence intervals.
Coefficient - t value
The t-statistic for a coefficient is defined by,
\[ \text{t−statistic} = \frac{\text{Estimated β−coefficient}}{\text{Std.Error}} \]
Thus, we want the standard error to be low relative to the estimate; hence, a high t-value indicates a good fit, as in our case.
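As a quick check, this ratio can be recovered from the coefficient table returned by summary():
# Recover the t values as Estimate / Std. Error
coefs <- summary(linearMod)$coefficients
coefs[, "Estimate"] / coefs[, "Std. Error"]   # matches the "t value" column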
Coefficient - \(Pr(>|t|)\)
\(Pr(>|t|)\) or \(p\)-value is the probability that we get a \(t\)-value as high or higher than the observed value when the null hypothesis (that there is no relationship) is true.
So if the \(Pr(>|t|)\) is low, the coefficients are significant (significantly different from zero).
If the \(Pr(>|t|)\) is high, the coefficients are not significant.
In our case, they are extremely low suggesting that we can reject the null hypothesis that there is no relationship.
Residual Standard Error
The Residual Standard Error is the average amount that the response, here total cholesterol, will deviate from the true regression line.
Mathematically,
\[ \text{Residual Standard Error} = \sqrt{ \frac{\sum_i{e_i^2}}{d.f.} } \]
where \(d.f.\) is the degrees of freedom and the \(e_{i}\)'s are the residuals.
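A quick sanity check, computing this quantity by hand from the fitted model:
# Residual standard error by hand: sqrt(SSE / degrees of freedom)
sqrt(sum(residuals(linearMod)^2) / df.residual(linearMod))
# should match the residual standard error in the summary above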
Before we look at the other summary statistics, we define the following:
If \(\hat{y}_{i}\) is the fitted value for observation \(i\) and \(\bar{y}\) is the mean of \(Y\), the Sum of Squared Errors, SSE, is defined by
\[ \text{SSE} = \sum_{i=1}^{n} \left( y_{i} - \hat{y}_{i} \right) ^{2}, \]
and the Total Sum of Squares, SST, is given by
\[ \text{SST} = \sum_{i=1}^{n} \left( y_{i} - \bar{y} \right) ^{2}. \]
Similarly, the Mean Squared Error, MSE, is defined as,
\[ \text{MSE} = \frac{SSE}{n-q} \]
where \(n\) is the number of observations and \(q\) is the number of coefficients.
Multiple R-squared, Adjusted R-squared
The R-squared statistic provides a measure of how well the model is fitting the actual data.
It is between 0 and 1 and represents how much of the variance is explained by the model.
In our case, it is 0.94, meaning 94% of the variance is explained by the model.
Mathematically,
\[ \text{R}^{2} = 1 - \frac{SSE}{SST} \]
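As a sanity check, we can compute this from the fitted model:
# R-squared by hand from SSE and SST
sse <- sum(residuals(linearMod)^2)
sst <- sum((mysample$CHOLESTEROLTOTAL - mean(mysample$CHOLESTEROLTOTAL))^2)
1 - sse/sst   # should match the Multiple R-squared in the summary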
When we add more predictor variables, \(R^2\) will naturally increase; Adjusted \(R^2\) takes into account the number of predictor variables.
Adjusted \(R^2\) is defined by,
\[ \text{R}^{2}_{adj} = 1 - \frac{(n-1)MSE}{SST} \]
where \(n\) is the number of observations.
F-Statistic
Finally the F-statistic is defined by,
\[ \text{F-statistic} = \frac{SST-SSE}{(q-1)MSE} \]
the further the F-statistic is from 1, the better it is.
Although linear regression is the simplest form of regression, there are several assumptions that go into it.
If any of the assumptions are violated, linear regression may not be the best model for the data.
Although there are several other assumptions, such as no autocorrelation of the residuals, no multicollinearity, etc., we list the main ones here.
The linearity is with respect to the parameters and not with respect to the predictor variables. Thus,
\[ Y = \beta_0+\beta_1X+\beta_2X^2 \] is still a linear regression.
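For instance, such a model, quadratic in the predictor but linear in the parameters, can still be fit with lm(); a sketch reusing our mysample data:
# A model quadratic in the predictor, fit with lm(); still linear in the parameters
quadMod <- lm(CHOLESTEROLTOTAL ~ LDLESTIMATE + I(LDLESTIMATE^2), data = mysample)
quadMod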
The mean of the residuals is theoretically zero and practically should be close to zero.
The residuals should be normally distributed.
This is one of the key assumptions behind linear regression and can be verified by making a qqplot or by drawing a normal distribution with mean 0 and standard deviation equal to the standard deviation of the residuals, as shown below:
# Draw a normal distribution with the standard deviation
# of the residuals
ggplot(data = data.frame(x = c(-70, 70)), aes(x)) +
stat_function(fun = dnorm, n = 101,
args = list(mean = 0, sd = sd(linearMod$residuals)),
color="blue") + ylab("") +
scale_y_continuous(breaks = NULL) +
# Also plot the density of the residuals
geom_density(data = linearMod, aes(x=linearMod$residuals),fill = "red", alpha=0.3)
Thus we have near-normality of the residuals, supporting the linear regression model.
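A qq-plot of the residuals is an equivalent check; a sketch, assuming a ggplot2 version (3.0 or later) that provides stat_qq_line():
# QQ plot of the residuals against the normal distribution
ggplot(data.frame(r = residuals(linearMod)), aes(sample = r)) +
  stat_qq() + stat_qq_line() +
  labs(title = "QQ Plot of the Residuals")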
Homoscedasticity of residuals, or equal variance: the residuals should not show different variance for different values of the predictors.
Although each one of these assumptions can be checked individually, there is a package,
gvlma,
that can be used to check the assumptions of the linear model all at once. Here we use that package to validate our model.
Note: you may have to install the package.
# Install the package
install.packages("gvlma")
# Load the library
library(gvlma)
# Check the assumptions of the linear model
gvlma(linearMod)
##
## Call:
## lm(formula = CHOLESTEROLTOTAL ~ LDLESTIMATE, data = mysample)
##
## Coefficients:
## (Intercept) LDLESTIMATE
## 76.225 1.001
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = linearMod)
##
## Value p-value Decision
## Global Stat 0.15713 0.9971 Assumptions acceptable.
## Skewness 0.01956 0.8888 Assumptions acceptable.
## Kurtosis 0.06241 0.8027 Assumptions acceptable.
## Link Function 0.04692 0.8285 Assumptions acceptable.
## Heteroscedasticity 0.02824 0.8665 Assumptions acceptable.
We see that the model assumptions are acceptable. Yay!
Classwork/Homework: Fit a linear model to any two variables of your choice in the HANES/MIMIC3 data set, other than the ones described above.
Make a qqplot
for the residuals and check the assumptions.
If the model fails, explain why. If the model succeeds, report on the validity of the assumptions.
References
R for Data Science - Introduction to Modeling
PH525x series Biomedical Data Science