Data science fundamentals 06: Basic inference and linear regression (Oct. 31, 2017)





Basic inference

We will first study basic ideas in statistical inference using our HANES data set.

  # Load the package RCurl
  library(RCurl)
  # Import the HANES data set from GitHub; break the string into two for readability
  # (Please note this readability aspect very carefully)
  URL_text_1 <- "https://raw.githubusercontent.com/kannan-kasthuri/kannan-kasthuri.github.io"
  URL_text_2 <- "/master/Datasets/HANES/NYC_HANES_DIAB.csv"
  # Paste it to constitute a single URL 
  URL <- paste(URL_text_1,URL_text_2, sep="")
  HANES <- read.csv(text=getURL(URL))
  # Rename the GENDER factor for identification
  HANES$GENDER <- factor(HANES$GENDER, labels=c("M","F"))
  # Rename the AGEGROUP factor for identification
  HANES$AGEGROUP <- factor(HANES$AGEGROUP, labels=c("20-39","40-59","60+"))
  # Rename the HSQ_1 factor for identification
  HANES$HSQ_1 <- factor(HANES$HSQ_1, labels=c("Excellent","Very Good","Good", "Fair", "Poor"))
  # Rename the DX_DBTS as a factor
  HANES$DX_DBTS <- factor(HANES$DX_DBTS, labels=c("DIAB","DIAB NO_DX","NO DIAB"))
  # Omit all NA from the data frame
  HANES <- na.omit(HANES)
  # Observe the structure
  str(HANES)
## 'data.frame':    1112 obs. of  23 variables:
##  $ KEY              : Factor w/ 1527 levels "133370A","133370B",..: 28 43 44 53 55 70 84 90 100 107 ...
##  $ GENDER           : Factor w/ 2 levels "M","F": 1 1 1 1 1 1 1 1 1 1 ...
##  $ SPAGE            : int  29 28 27 24 30 26 31 32 34 32 ...
##  $ AGEGROUP         : Factor w/ 3 levels "20-39","40-59",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ HSQ_1            : Factor w/ 5 levels "Excellent","Very Good",..: 2 2 2 1 1 3 1 2 1 3 ...
##  $ UCREATININE      : int  105 53 314 105 163 150 46 36 177 156 ...
##  $ UALBUMIN         : num  0.707 1 8 4 3 2 2 0.707 4 3 ...
##  $ UACR             : num  0.00673 2 3 4 2 ...
##  $ MERCURYU         : num  0.37 0.106 0.487 2.205 0.979 ...
##  $ DX_DBTS          : Factor w/ 3 levels "DIAB","DIAB NO_DX",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ A1C              : num  5 5.2 4.8 5.1 4.3 5.2 4.8 5.2 4.8 5.2 ...
##  $ CADMIUM          : num  0.2412 0.1732 0.0644 0.0929 0.1202 ...
##  $ LEAD             : num  1.454 1.019 0.863 1.243 0.612 ...
##  $ MERCURYTOTALBLOOD: num  2.34 2.57 1.32 14.66 2.13 ...
##  $ HDL              : int  42 51 42 61 52 50 57 56 42 44 ...
##  $ CHOLESTEROLTOTAL : int  184 157 145 206 120 155 156 235 156 120 ...
##  $ GLUCOSESI        : num  4.61 4.77 5.16 5 5.11 ...
##  $ CREATININESI     : num  74.3 73 80 84.9 66 ...
##  $ CREATININE       : num  0.84 0.83 0.91 0.96 0.75 0.99 0.9 0.84 0.93 1.09 ...
##  $ TRIGLYCERIDE     : int  156 43 108 65 51 29 31 220 82 35 ...
##  $ GLUCOSE          : int  83 86 93 90 92 85 72 87 96 92 ...
##  $ COTININE         : num  31.5918 0.0635 0.035 0.0514 0.035 ...
##  $ LDLESTIMATE      : int  111 97 81 132 58 99 93 135 98 69 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:415] 2 15 16 24 26 28 33 34 35 39 ...
##   .. ..- attr(*, "names")= chr [1:415] "2" "15" "16" "24" ...
  # Load the tidyverse library
  library(tidyverse)
  # Set the seed for reproducibility
  set.seed(23960400)

Also, we need the package “rafalib” for these materials. So, let’s go ahead and install that package and load the library.

  # Install the rafalib package
  install.packages("rafalib")
  # Load rafalib library
  library(rafalib)
Z-Test

According to the central limit theorem, if \(V\) and \(Z\) are random variables representing two populations, with means \(\mu_{V}\) and \(\mu_{Z}\) and standard deviations \(\sigma_{V}\) and \(\sigma_{Z}\), respectively, and we draw independent samples of size \(N\) from each, then the ratio,

\[ R = \frac{\bar{Z}-\bar{V}}{\sqrt{\frac{\sigma_Z^2}{N} + \frac{\sigma_V^2}{N}}} \] (where \(\bar{V}\) and \(\bar{Z}\) are the sample averages) approaches a normal distribution centered at \(0\) with standard deviation \(1\) as \(N\) becomes large, provided the two population means are equal (the null hypothesis). In general, we will not know the population means and standard deviations, only those of random samples \(X\) and \(Y\) from the populations. If \(V\) and \(Z\) are normally distributed, the estimates of \(\mu_{V}\) and \(\mu_{Z}\) given by the sample averages \(\hat{X}\) and \(\hat{Y}\) are also normally distributed, and so is their difference. Therefore, the ratio \(\hat{R}\), described by,

\[ \hat{R} = \frac{\hat{Y}-\hat{X}}{\sqrt{\frac{\sigma_X^2}{N} + \frac{\sigma_Y^2}{N}}} \]

can be used to compute p-values, simply because we know the proportion of the distribution under any value. For example, only about 7% of these values are larger than 1.8 in absolute value:

  pnorm(-1.8) + (1 - pnorm(1.8))
## [1] 0.07186064

Thus, if we know:

  1. the population is normally distributed,
  2. the standard deviation of the population and
  3. the sample size is large,

we can use this approximation. This is known as a \(Z\)-test.
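
Before applying it to real data, here is a minimal sketch of the computation with made-up summary numbers (the means, standard deviations and sample size below are hypothetical, purely for illustration):

  # Hypothetical example: two samples of size N = 100 with observed means
  # 5.2 and 5.0, and known population standard deviations of 0.5 each
  N <- 100
  Z <- (5.2 - 5.0) / sqrt(0.5^2/N + 0.5^2/N)
  # Two-sided p-value under the standard normal
  2 * pnorm(-abs(Z))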

In practice, these conditions are rarely all met. For the first condition, although we may not know the population distribution, we can check how close the sample distribution is to a normal distribution by plotting qq-plots.

We will work with the blood hemoglobin variable, A1C. We will consider people without diabetes as the control group, which represents the (sample) random variable \(Y\), and people with diabetes as the disease group, which represents the (sample) random variable \(X\).

  # Extract only the A1C variable for the control group 
  control <- filter(HANES, DX_DBTS == "NO DIAB") %>% select(A1C)
  # and the disease group
  disease <- filter(HANES, DX_DBTS == "DIAB") %>% select(A1C)

We can now make a qqplot for the control variable.

  # Find the 1st and 3rd quartiles
  y <- quantile(unlist(control), c(0.25, 0.75))
  # Find the matching normal values on the x-axis
  x <- qnorm( c(0.25, 0.75))
  # Compute the line slope
  slope <- diff(y) / diff(x) 
  # Compute the line intercept
  int <- y[1] - slope * x[1]
  # Generate normal q-q plot
  ggplot() + aes(sample=unlist(control)) + stat_qq(distribution=qnorm) + 
        geom_abline(intercept=int, slope=slope) + ylab("Sample") +
        labs(title = "QQ Plot for Control Variable")

Similarly we can make a qqplot for the disease variable.

  # Find the 1st and 3rd quartiles
  y <- quantile(unlist(disease), c(0.25, 0.75))
  # Find the matching normal values on the x-axis
  x <- qnorm( c(0.25, 0.75))
  # Compute the line slope
  slope <- diff(y) / diff(x) 
  # Compute the line intercept
  int <- y[1] - slope * x[1]
  # Generate normal q-q plot
  ggplot() + aes(sample=unlist(disease)) + stat_qq(distribution=qnorm) + 
        geom_abline(intercept=int, slope=slope) + ylab("Sample") +
        labs(title = "QQ Plot for Disease Variable")

We note that the control variable is more normal-like than the disease variable. Thus, a \(Z\)-test may not be appropriate. Even if the disease variable were normally distributed, a major disadvantage of a \(Z\)-test is that it requires the standard deviation of the population, which we will not know in reality.


Classwork/Homework: Check if the GLUCOSE values for the non-diabetic male and female population are close to normal by constructing qqplots.


t-Test

Even if we don’t know the population standard deviations, we can estimate them from the sample standard deviations:

\[ s_X^2 = \frac{1}{N-1} \sum_{i=1}^N (x_i - \hat{X})^2 \mbox{ and } s_Y^2 = \frac{1}{N-1} \sum_{i=1}^N (y_i - \hat{Y})^2 \]

to calculate the distribution of:

\[ t = \sqrt{N} \frac{\hat{Y}-\hat{X}}{\sqrt{s_{X}^{2}+s_{Y}^{2}}} \] which is approximately normal for large \(N\). This quantity is called the \(t\)-statistic, and its exact distribution (when the underlying populations are normal) is called Student’s t-distribution.

Although the disease variable in our case is not really normally distributed, had it been, we could apply the t-test to find the p-value:

  # Apply t-test to disease vs. control population
  t.test(disease,control)
## 
##  Welch Two Sample t-test
## 
## data:  disease and control
## t = 12.989, df = 104.07, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.016100 2.742633
## sample estimates:
## mean of x mean of y 
##  7.664423  5.285056
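
To see where this number comes from, we can compute the Welch statistic by hand; this is a sketch, noting that t.test() uses the Welch formula, which allows unequal sample sizes and variances, unlike the simplified equal-\(N\) formula above:

  # Welch t statistic: difference of sample means over its standard error
  diab <- unlist(disease)
  ctrl <- unlist(control)
  se_welch <- sqrt(var(diab)/length(diab) + var(ctrl)/length(ctrl))
  (mean(diab) - mean(ctrl)) / se_welch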


Classwork/Homework: Perform t-test between male and female populations in non-diabetic sub population based on the GLUCOSE variable.


Confidence intervals

p-values are everywhere in the life sciences. But it is not good to report only p-values, for the simple reason that statistical significance does not imply causality or scientific significance. A better alternative is to also report confidence intervals, which convey the uncertainty associated with the estimate.

A 95% confidence interval (we can use percentages other than 95%) is a random interval with a 95% probability of covering the parameter we are estimating. To construct it, we note that the central limit theorem tells us that \(\sqrt{N} (\bar{X}-\mu_X) / s_X\) follows a normal distribution with mean 0 and standard deviation 1. This implies that the probability of the following event (where \((-p, p)\) is the central 95% interval of the standard normal):

\[ -p \leq \sqrt{N} (\bar{X}-\mu_X)/s_X \leq p \]

is about 95% because:

  # Calculate the probability of the above variable
  pnorm(2) - pnorm(-2)
## [1] 0.9544997

Now doing some algebra on the above inequality we see that:

\[ \bar{X}- p \, s_X/\sqrt{N} \leq \mu_X \leq \bar{X}+ p \, s_X/\sqrt{N} \]

that is, the random interval \(\bar{X} \pm p \, s_X/\sqrt{N}\) contains the population mean \(\mu_X\) with probability of about 95%. Thus, we can compute this interval:

  # Find the estimate of s_{X}/sqrt(N)
  se <- sd(disease$A1C)/sqrt(length(disease$A1C))
  # Find the 95% semi-interval around the normal distribution
  p <- qnorm(1- 0.05/2)
  # Compute the confidence interval
  interval <- c(mean(disease$A1C)-p*se, mean(disease$A1C)+p*se)
  # Note that the interval is centered at the sample mean, so the check below
  # is trivially TRUE; the real guarantee is that intervals built this way
  # cover the unknown population mean about 95% of the time
  interval[1] < mean(disease$A1C) & interval[2] > mean(disease$A1C)
## [1] TRUE
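
As a cross-check, the one-sample t.test() reports a t-based confidence interval for the same mean, which should be very close to the normal-approximation interval computed above:

  # t-based 95% confidence interval for the mean A1C of the disease group
  t.test(disease$A1C)$conf.int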


Classwork/Homework: Report the confidence interval for the mean GLUCOSE variable for both female and male population for the non-diabetic sub population. Do they overlap?


Association tests

Several life science and biomedical data sets come not just with continuous variables, but with categorical, binary and ordinal variables. For example, whether or not a locus in the genome carries a mutation is a binary variable over a set of loci. Association tests are extremely useful in determining whether such variables exhibit association between the categories. For example, let’s consider the MIMIC3 admissions data:

  # Load the package RCurl
  library(RCurl)
  # Import the admissions data set in MIMIC3 from GitHub; 
  URL_text_1 <- "https://raw.githubusercontent.com/kannan-kasthuri/kannan-kasthuri.github.io"
  URL_text_2 <- "/master/Datasets/MIMIC3/admissions.csv"
  URL <- paste(URL_text_1,URL_text_2, sep="")
  MIMIC3_ADM <- read.csv(text=getURL(URL))
  # Observe the structure
  str(MIMIC3_ADM)
## 'data.frame':    5770 obs. of  19 variables:
##  $ row_id              : int  83 84 85 86 87 88 89 90 91 92 ...
##  $ subject_id          : int  82 83 84 84 85 85 86 87 88 89 ...
##  $ hadm_id             : int  110641 158569 120969 166401 116630 112077 190243 190659 123010 188646 ...
##  $ admittime           : Factor w/ 5310 levels "2100-07-04","2100-07-09",..: 2665 2240 5030 5037 3331 3607 2458 4778 569 4458 ...
##  $ dischtime           : Factor w/ 5767 levels "2100-07-17 15:00:00",..: 2878 2415 5462 5469 3603 3901 2651 5204 609 4834 ...
##  $ deathtime           : Factor w/ 569 levels "2100-09-13 21:20:00",..: NA NA NA 538 NA NA NA NA NA NA ...
##  $ admission_type      : Factor w/ 4 levels "ELECTIVE","EMERGENCY",..: 3 4 1 2 2 2 1 3 2 3 ...
##  $ admission_location  : Factor w/ 9 levels "** INFO NOT AVAILABLE **",..: 5 6 5 3 2 2 5 5 3 5 ...
##  $ discharge_location  : Factor w/ 16 levels "DEAD/EXPIRED",..: 5 6 5 1 14 16 6 15 5 15 ...
##  $ insurance           : Factor w/ 5 levels "Government","Medicaid",..: 4 3 4 4 3 3 4 4 4 2 ...
##  $ language            : Factor w/ 28 levels "*ARM","*CDI",..: NA NA NA NA 11 11 NA NA NA NA ...
##  $ religion            : Factor w/ 17 levels "7TH DAY ADVENTIST",..: 17 17 13 13 3 3 12 17 NA 17 ...
##  $ marital_status      : Factor w/ 7 levels "DIVORCED","LIFE PARTNER",..: NA 3 3 3 3 3 5 NA NA NA ...
##  $ ethnicity           : Factor w/ 37 levels "AMERICAN INDIAN/ALASKA NATIVE",..: 27 32 33 33 33 33 33 32 10 32 ...
##  $ edregtime           : Factor w/ 3065 levels "2100-07-09 05:56:00",..: NA NA NA 2904 NA 2075 NA NA 316 NA ...
##  $ edouttime           : Factor w/ 3065 levels "2100-07-09 15:55:00",..: NA NA NA 2904 NA 2075 NA NA 316 NA ...
##  $ diagnosis           : Factor w/ 2313 levels " AORTIC ABDOMINAL ANEURYSM/SDA",..: 1505 450 1337 985 242 1642 330 1505 1920 1505 ...
##  $ hospital_expire_flag: int  0 0 0 1 0 0 0 0 0 0 ...
##  $ has_chartevents_data: int  1 1 0 1 1 1 1 1 1 1 ...

Let us say we want to find out whether the odds of finding separated Black patients with Medicaid are about the same as the odds of finding them with Government insurance. To do this, we first need to find the number of people who are separated and have Medicaid or Government insurance, and find their ethnicity. Let’s do that:

  # Find the count of insurance and marital status
  insurance_marital.status <- MIMIC3_ADM %>% 
    group_by(insurance, marital_status) %>% count()
  insurance_marital.status
  # Find the number of people with medicaid
  folks_with_medicaid <- MIMIC3_ADM %>% 
    filter(insurance == "Medicaid") %>% count() 
  folks_with_medicaid$n
  # Find the ethnicities for the separated Medicaid group
  MIMIC3_ADM %>% 
    filter(insurance =="Medicaid" & marital_status == "SEPARATED") %>% 
    group_by(ethnicity) %>% count()
  # Find the ethnicities for the separated Government group
  MIMIC3_ADM %>% 
    filter(insurance =="Government" & marital_status == "SEPARATED") %>% 
    group_by(ethnicity) %>% count()
## # A tibble: 34 x 3
## # Groups:   insurance, marital_status [34]
##     insurance marital_status     n
##        <fctr>         <fctr> <int>
##  1 Government       DIVORCED    13
##  2 Government        MARRIED    33
##  3 Government      SEPARATED     3
##  4 Government         SINGLE    93
##  5 Government        WIDOWED     8
##  6 Government             NA    31
##  7   Medicaid       DIVORCED    38
##  8   Medicaid        MARRIED   100
##  9   Medicaid      SEPARATED    12
## 10   Medicaid         SINGLE   309
## # ... with 24 more rows
## [1] 622
## # A tibble: 4 x 2
## # Groups:   ethnicity [4]
##                ethnicity     n
##                   <fctr> <int>
## 1 BLACK/AFRICAN AMERICAN     3
## 2     HISPANIC OR LATINO     1
## 3                  OTHER     3
## 4                  WHITE     5
## # A tibble: 2 x 2
## # Groups:   ethnicity [2]
##                ethnicity     n
##                   <fctr> <int>
## 1 BLACK/AFRICAN AMERICAN     2
## 2     HISPANIC OR LATINO     1

In particular, the question we ask about the data is the following: knowing that \(3\) out of \(5\) separated Black patients have Medicaid insurance and \(2\) out of \(5\) have Government insurance, can we REJECT the null hypothesis (at the \(0.05\) level) that the odds of finding separated Black patients with Medicaid are the same as the odds of finding them with Government insurance? To answer this question we can perform Fisher’s exact test:

Fisher’s exact test

We set up the contingency table as follows:

  # Make a data table and define row names and column names
  tab <- matrix(c(3,2,9,1),2,2)
  colnames(tab)<-c("BLACK","NON-BLACK")
  rownames(tab)<-c("MEDICAID","GOVERNMENT")
  tab
  # Run Fisher's exact test
  fisher.test(tab)
##            BLACK NON-BLACK
## MEDICAID       3         9
## GOVERNMENT     2         1
## 
##  Fisher's Exact Test for Count Data
## 
## data:  tab
## p-value = 0.2418
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.002558237 4.897124851
## sample estimates:
## odds ratio 
##  0.1920886

The p-value of \(0.24\) means we cannot reject the null hypothesis at the \(0.05\) level: the data are consistent with the odds of finding separated Black patients with Medicaid being about the same as the odds of finding them with Government insurance.
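
For intuition, the sample odds ratio can be read off the table directly; note that fisher.test() reports a conditional maximum likelihood estimate of the odds ratio, so the two numbers differ somewhat:

  # Sample odds ratio from the 2x2 table: (3/9) / (2/1)
  (tab["MEDICAID","BLACK"] / tab["MEDICAID","NON-BLACK"]) /
    (tab["GOVERNMENT","BLACK"] / tab["GOVERNMENT","NON-BLACK"])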

Chi-square test

In the above example we had a sample size of \(15\). When the sample sizes are large, the chi-square test gives a good approximation for hypothesis testing and confidence intervals, allowing us to assess association. Let’s use the chi-square test to see if there is an association between ethnicity and the diagnosis of a disease. We will consider two ethnicities - WHITE and BLACK - and the diagnosis of DIABETIC KETOACIDOSIS.

  # Find the diagnosis and ethnicity info by grouping 
  diagnosis_ethnicity <- MIMIC3_ADM %>% 
    group_by(diagnosis, ethnicity) %>% count() %>% arrange(desc(n))
  # Display the table
  diagnosis_ethnicity
  # Select only the required ethnicities
  MIMIC3_ADM %>% 
    filter((ethnicity == "BLACK/AFRICAN AMERICAN" | ethnicity == "WHITE") & 
             diagnosis == "DIABETIC KETOACIDOSIS") %>% 
              group_by(ethnicity,diagnosis) %>% count()
  # Find the count of the population
  black_pop <- MIMIC3_ADM %>% filter(ethnicity == "BLACK/AFRICAN AMERICAN") %>% count()
  white_pop <- MIMIC3_ADM %>% filter(ethnicity == "WHITE") %>% count()
  black_pop$n
  white_pop$n
## # A tibble: 2,861 x 3
## # Groups:   diagnosis, ethnicity [2,861]
##                                                    diagnosis
##                                                       <fctr>
##  1                                                   NEWBORN
##  2                                                 PNEUMONIA
##  3                                                    SEPSIS
##  4                                                   NEWBORN
##  5                                                   NEWBORN
##  6 CORONARY ARTERY DISEASE\CORONARY ARTERY BYPASS GRAFT /SDA
##  7                                   INTRACRANIAL HEMORRHAGE
##  8                                   CORONARY ARTERY DISEASE
##  9                                    GASTROINTESTINAL BLEED
## 10                                     ALTERED MENTAL STATUS
## # ... with 2,851 more rows, and 2 more variables: ethnicity <fctr>,
## #   n <int>
## # A tibble: 2 x 3
## # Groups:   ethnicity, diagnosis [2]
##                ethnicity             diagnosis     n
##                   <fctr>                <fctr> <int>
## 1 BLACK/AFRICAN AMERICAN DIABETIC KETOACIDOSIS    33
## 2                  WHITE DIABETIC KETOACIDOSIS    18
## [1] 543
## [1] 4000

From the above analysis we see that \(33\) black and \(18\) white patients are affected by DIABETIC KETOACIDOSIS. From the total counts, \(510\) and \(3982\) patients in these two groups, respectively, are unaffected by the disease. This information allows us to perform the chi-square test by forming a contingency table:

  # Make a data table and define row names and column names
  tab <- matrix(c(510,33,3982,18),2,2)
  colnames(tab)<-c("BLACK","WHITE")
  rownames(tab)<-c("Unaffected","DIAB.KET")
  tab
  # Run chi-square test
  chisq.test(tab)
##            BLACK WHITE
## Unaffected   510  3982
## DIAB.KET      33    18
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  tab
## X-squared = 131.37, df = 1, p-value < 2.2e-16

Thus, from the small p-value we infer that there is a strong association between ethnicity and DIABETIC KETOACIDOSIS in this data, with the diagnosis being relatively more common in the black group. Apparently, this disease is more common among the black and Hispanic populations, as one can see from the reference.
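
For a sense of the effect size, we can also look at the raw diagnosis rates implied by the counts above (a quick sketch):

  # Observed rate of DIABETIC KETOACIDOSIS in each ethnic group
  c(BLACK = 33/543, WHITE = 18/4000)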


Classwork/Homework: Perform chi-square test and report association (or not) between the variables of your choice in your favorite data set.


Other statistical tests

For several data sets, the normal distribution may be too strong an assumption for tests such as the t-test. In such cases, the Wilcoxon test can be an alternative to a t-test. It is a non-parametric method that can be used to test whether two samples differ in location (or, in its one-sample signed rank form, whether an estimate differs from a hypothesized value).

Wilcoxon signed rank test

The basic idea is to 1) combine all the data, 2) turn the values into ranks, 3) separate them back into their groups, and 4) compute the sum or average rank and perform a test. In R, the wilcox.test() function performs this test; for two independent samples, as here, it carries out the Wilcoxon rank sum (Mann-Whitney) test, and we reproduce the rank-sum statistic by hand after running it. Let’s perform this test on the MERCURYU variable for the diabetic and non-diabetic populations.

  # Get the MERCURYU variable for the diabetic population
  mercuryu_diab <- HANES %>% filter(DX_DBTS == "DIAB") %>% select(MERCURYU)
  # Get the MERCURYU variable for the non-diabetic population
  mercuryu_non_diab <- HANES %>% filter(DX_DBTS == "NO DIAB") %>% select(MERCURYU)
  # Run the Wilcoxon rank sum test
  wilcox.test(unlist(mercuryu_diab), unlist(mercuryu_non_diab))
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  unlist(mercuryu_diab) and unlist(mercuryu_non_diab)
## W = 45818, p-value = 0.09662
## alternative hypothesis: true location shift is not equal to 0
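
To connect this with the ranking idea described above, here is a sketch that reproduces the W statistic by hand (R’s W is the sum of the ranks of the first group in the pooled sample, minus a constant offset):

  # Pool the values, rank them, and sum the ranks of the diabetic group
  pooled <- c(unlist(mercuryu_diab), unlist(mercuryu_non_diab))
  grp <- rep(c("diab", "no_diab"), c(nrow(mercuryu_diab), nrow(mercuryu_non_diab)))
  rank_sum <- sum(rank(pooled)[grp == "diab"])
  # Subtracting n1*(n1+1)/2 gives the W reported by wilcox.test()
  n1 <- nrow(mercuryu_diab)
  rank_sum - n1*(n1 + 1)/2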


Classwork/Homework: Perform Wilcoxon signed rank test on any two age groups for the CHOLESTEROLTOTAL variable. Do you see any significant difference?


There are tests, other than qq-plots, to check whether a sample/variable is drawn from a normal distribution or whether two samples/variables are drawn from the same distribution.

Shapiro test

The Shapiro-Wilk test can be used to test whether a variable is drawn from a normal distribution. The null hypothesis, \(H_{0}\), is that the data are drawn from a normal distribution, so a low p-value means we should reject \(H_{0}\). We will check if the TRIGLYCERIDE variable has a normal distribution.

  # Get the TRIGLYCERIDE variable in the HANES data set
  HANES_TRIGLYCERIDE <- HANES %>% select(TRIGLYCERIDE)
  # Perform Shapiro test
  shapiro.test(unlist(HANES_TRIGLYCERIDE))
## 
##  Shapiro-Wilk normality test
## 
## data:  unlist(HANES_TRIGLYCERIDE)
## W = 0.85149, p-value < 2.2e-16

The extremely low p-value suggests that the TRIGLYCERIDE variable does not have a normal distribution.

Kolmogorov-Smirnov test

The Kolmogorov-Smirnov test can be applied to test whether two samples/variables are drawn from the same distribution. The null hypothesis here is that the two variables are drawn from the same distribution. We can check whether the two variables TRIGLYCERIDE and COTININE have the same distribution.

  # Get the TRIGLYCERIDE variable in the HANES data set
  HANES_TRIGLYCERIDE <- HANES %>% select(TRIGLYCERIDE)
  # Get the COTININE variable in the HANES data set
  HANES_COTININE <- HANES %>% select(COTININE)
  # Perform Kolmogorov And Smirnov test
  ks.test(unlist(HANES_TRIGLYCERIDE), unlist(HANES_COTININE))
## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  unlist(HANES_TRIGLYCERIDE) and unlist(HANES_COTININE)
## D = 0.80845, p-value < 2.2e-16
## alternative hypothesis: two-sided

According to this test, the difference between the distributions of the two variables, TRIGLYCERIDE and COTININE, is significant (p < 0.05), so we conclude that they have different distributions.


Classwork/Homework: Perform ks-test on three different variable-pairs in the HANES dataset. Report the findings.


Correlation

Correlation tests for a linear relationship between variables. The function cor.test(x,y) computes the correlation between two continuous variables \(x\) and \(y\); the null hypothesis is that there is no correlation between them. We saw in the exploratory data analysis section that the variables LDLESTIMATE and CHOLESTEROLTOTAL seemed to exhibit a linear relationship. We will verify that here -

  # Get the LDLESTIMATE variable in the HANES data set
  HANES_LDLESTIMATE <- HANES %>% select(LDLESTIMATE)
  # Get the CHOLESTEROLTOTAL variable in the HANES data set
  HANES_CHOLESTEROLTOTAL <- HANES %>% select(CHOLESTEROLTOTAL)
  # Perform correlation test on the variables using cor.test
  cor.test(unlist(HANES_LDLESTIMATE), unlist(HANES_CHOLESTEROLTOTAL))
## 
##  Pearson's product-moment correlation
## 
## data:  unlist(HANES_LDLESTIMATE) and unlist(HANES_CHOLESTEROLTOTAL)
## t = 65.849, df = 1110, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8796454 0.9036752
## sample estimates:
##       cor 
## 0.8922906

We see that there is a high degree of correlation between LDLESTIMATE and CHOLESTEROLTOTAL. There are other important tests; the packages lawstat and outliers have good collections for statistical testing and for detecting outliers, respectively.


Classwork/Homework: Perform correlation tests on three different variable-pairs in the HANES dataset. Report the correlations.


Linear Regression

The aim of linear regression is to establish a linear relationship (in the parameters) between the predictor variable(s) and the response variable, so that we can use this formula to estimate the value of the response \(Y\) when only the values of the predictors \(X\) are known.

We saw that there is a good correlation between LDLESTIMATE and CHOLESTEROLTOTAL suggesting there is possibly a linear relationship between them. Assuming there is a linear relationship, is it possible to predict the values of CHOLESTEROLTOTAL when we have the values for LDLESTIMATE? We will try to do this in what follows. First we will get these two variables from the data set.

  # Get the LDLESTIMATE and CHOLESTEROLTOTAL variables from the HANES data set
  mydata <- HANES %>% select(LDLESTIMATE,CHOLESTEROLTOTAL)
  # Take 30 random points (subsample) for illustration purposes
  mysample <- mydata[sample(1:nrow(mydata), 30, replace=FALSE),]

Before fitting a linear model, typically, for each of the independent variables, the following three plots are drawn to visualize the behavior:

  1. Scatter plot: This will help to visualize linear relationship between the predictor and response variables.

  2. Box plot: Helps determine outliers. Outliers in a predictor can drastically affect the predictions, as they can easily change the slope of the line of best fit.

  3. Density plot: To help us see the distribution of the predictor variable. Although normality of the variables is not needed for ordinary least squares (OLS) regression (it is only needed for the residuals), this is a sanity check to see that the distributions are not utterly skewed.

We will check these plots now.

  # Make a scatter plot of the samples between LDL and Cholesterol
  ggplot(mysample, aes(LDLESTIMATE, CHOLESTEROLTOTAL)) + geom_point() + geom_smooth()

The scatter plot, along with the smoothing line above, suggests a linearly increasing relationship between the LDL and Cholesterol variables. This is a good thing, because one of the underlying assumptions in linear regression is that the relationship between the response and predictor variables is linear and additive.

Now let’s do a box plot to determine if there are any major outliers.

  # We first gather the two variables to make a box plot in one plot
  LDL_CHOL_gather <- mysample %>% 
    gather(`LDLESTIMATE`, `CHOLESTEROLTOTAL`, key = "Variables", value = "Values")
  # View as a tibble
  as.tibble(LDL_CHOL_gather)
  # Make a box plot
  ggplot(LDL_CHOL_gather, aes(Variables, Values)) + 
    geom_boxplot(fill='cyan', color="darkred", alpha = 0.5) 
## # A tibble: 60 x 2
##      Variables Values
##          <chr>  <int>
##  1 LDLESTIMATE    246
##  2 LDLESTIMATE    194
##  3 LDLESTIMATE     68
##  4 LDLESTIMATE     85
##  5 LDLESTIMATE     53
##  6 LDLESTIMATE    174
##  7 LDLESTIMATE    120
##  8 LDLESTIMATE     88
##  9 LDLESTIMATE    133
## 10 LDLESTIMATE    196
## # ... with 50 more rows

Thus, we see there aren’t any outliers. Next we will study the density plots for these variables.

  # Make a density plot for sanity check
 ggplot(LDL_CHOL_gather, aes(x=Values, fill=Variables)) +
  geom_density(alpha=0.5) + xlim(0,350) 

The distributions don’t seem to be heavily skewed and look reasonable. Thus, we may proceed with the linear model fitting, since the correlation and the visual/graphical analysis appear to be okay. In R, the lm() function builds a linear model. It takes two main arguments, namely a formula and data. The data is typically a data frame and the formula is an object of class formula.

  # build linear regression model on the sampled data
  linearMod <- lm(CHOLESTEROLTOTAL ~ LDLESTIMATE, data=mysample)  
  print(linearMod)
## 
## Call:
## lm(formula = CHOLESTEROLTOTAL ~ LDLESTIMATE, data = mysample)
## 
## Coefficients:
## (Intercept)  LDLESTIMATE  
##      76.225        1.001

The above model establishes the linear relationship between the predictor variable LDLESTIMATE and the response variable CHOLESTEROLTOTAL in the form: \[ CHOLESTEROLTOTAL = Intercept+Coefficient[LDLESTIMATE] \times LDLESTIMATE \] Thus, here CHOLESTEROLTOTAL = 76.23 + 1.00 * LDLESTIMATE. We can plot the relationship, making the size of each point reflect its residual:

  # We first get the predicted values and residuals
  mysample$predicted <- predict(linearMod)
  mysample$residuals <- residuals(linearMod)

  # Use the lm function to smooth and plot the segment
  ggplot(mysample, aes(x = LDLESTIMATE, y = CHOLESTEROLTOTAL)) +
    geom_smooth(method = "lm", se = FALSE, color = "lightgrey") +
    geom_segment(aes(xend = LDLESTIMATE, yend = predicted), alpha = .2) +
    # Color the residuals and adjust their size and remove the size legend
    geom_point(aes(color = abs(residuals), size = abs(residuals))) + 
    scale_color_continuous(low = "black", high = "red") +
    guides(color = FALSE, size = FALSE) + 
    # and finally plot the predicted values
    geom_point(aes(y = predicted), shape = 1) +
    theme_bw()
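
Looking ahead, predictions for new LDL values can be obtained with predict(); this is a minimal sketch, where the new LDL values are hypothetical and the exact predictions depend on the random subsample:

  # Predict total cholesterol for a few hypothetical LDL values
  new_ldl <- data.frame(LDLESTIMATE = c(80, 120, 160))
  predict(linearMod, newdata = new_ldl)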

Now that we have a formula to predict the cholesterol values given the LDL values, before we can use it we need to ensure the statistics are okay. This can be checked using the summary() function.

  # Summarize the predicted linear model
  summary(linearMod)
## 
## Call:
## lm(formula = CHOLESTEROLTOTAL ~ LDLESTIMATE, data = mysample)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.342  -7.123   2.644   7.426  28.660 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 76.22516    6.40138   11.91 1.79e-12 ***
## LDLESTIMATE  1.00087    0.04836   20.70  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.58 on 28 degrees of freedom
## Multiple R-squared:  0.9386, Adjusted R-squared:  0.9364 
## F-statistic: 428.3 on 1 and 28 DF,  p-value: < 2.2e-16

The summary statistics give us several things. The following describes what they mean.

Formula Call

This is a straight-forward description of the call we issued.

Residuals

Residuals are the differences between the actual observed response values and the response values that the model predicts (the distances of the CHOLESTEROLTOTAL values from the fitted line in our case). The Residuals section of the summary breaks them down into 5 summary points: Min, 1Q, Median, 3Q and Max. The mean of the residuals is exactly zero for ordinary least squares (see the assumptions section below), so the Median should be close to zero; the closer it is to zero, the more symmetric the residuals. In our example it is 2.64.

Coefficients

This section describes the coefficients of the model. In a simple linear model, we have a number of unknown coefficients (here there are two) that need to be determined. The coefficient that is independent of the predictor variable is the intercept. The estimates of each of these coefficients come with statistics.

Coefficient - Estimate

This gives the estimated values of the coefficients. The intercept estimate (about 76.23) is the predicted CHOLESTEROLTOTAL when LDLESTIMATE is zero; since the slope here is close to 1, it is also roughly the average difference between the cholesterol and LDL values. The LDLESTIMATE coefficient estimate gives the slope.

Coefficient - Standard Error

This measures how much the coefficient estimates would vary from sample to sample. We would ideally want standard errors that are small relative to the coefficients. They can be used to construct confidence intervals, as shown below.
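
For example, confint() uses these standard errors to build 95% confidence intervals for the coefficients:

  # 95% confidence intervals for the intercept and slope
  confint(linearMod)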

Coefficient - t value

The t-statistic for a coefficient is defined by,

\[ \text{t−statistic} = \frac{\text{Estimated β−coefficient}}{\text{Std.Error}} \]

Thus, we want to keep the standard error low, and hence the t-value is expected to be high for a good fit, as it is in our case.
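
We can recover the t values directly from the coefficient table (a quick sketch):

  # t values are the estimates divided by their standard errors
  cf <- coef(summary(linearMod))
  cf[, "Estimate"] / cf[, "Std. Error"]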

Coefficient - \(Pr(>t)\)

\(Pr(>|t|)\), or the \(p\)-value, is the probability of obtaining a \(t\)-value as large as, or larger than, the observed one in absolute value when the null hypothesis (that there is no relationship) is true. So if \(Pr(>|t|)\) is low, the coefficient is significant (significantly different from zero); if it is high, the coefficient is not significant. In our case both are extremely low, suggesting that we can reject the null hypothesis that there is no relationship.

Residual Standard Error

The Residual Standard Error is the average amount by which the response (here total cholesterol) deviates from the true regression line. Mathematically,

\[ \text{Residual Standard Error} = \sqrt{ \frac{\sum_i{e_i^2}}{d.f.} } \] where \(d.f\) is the degree of freedom and \(e_{i}\)’s are residuals.
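
As a check, this quantity can be computed directly from the residuals; it should match the value of about 11.58 reported by summary():

  # Residual standard error: sqrt of the sum of squared residuals over the
  # residual degrees of freedom
  sqrt(sum(residuals(linearMod)^2) / df.residual(linearMod))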

Before we look at the other summary quantities, we define the following:

If \(\hat{y}_{i}\) is the fitted value for observation \(i\) and \(\bar{y}\) is the mean of \(Y\), the Sum of Squared Errors, SSE, is defined by,

\[ \text{SSE} = \sum_{i}^{n} \left( y_{i} - \hat{y_{i}} \right) ^{2}, \]

and the Sum of Squared Total, SST, is given by,

\[ \text{SST} = \sum_{i}^{n} \left( y_{i} - \bar{y} \right) ^{2}. \]

Similarly, the Mean Squared Error, MSE, is defined as,

\[ \text{MSE} = \frac{SSE}{n-q} \] where \(n\) is the number of observations and \(q\) is the number of coefficients.

Multiple R-squared, Adjusted R-squared

The R-squared statistic provides a measure of how well the model fits the actual data. It lies between 0 and 1 and represents the proportion of the variance explained by the model. In our case it is about 0.94, meaning 94% of the variance is explained by the model. Mathematically,

\[ \text{R}^{2} = 1 - \frac{SSE}{SST} \] When we add more predictor variables, \(R^2\) will naturally increase, so Adjusted \(R^2\) takes the number of predictor variables into account. It is defined by,

\[ \text{R}^{2}_{adj} = 1 - \frac{(n-1)MSE}{SST} \] where \(n\) is the number of observations.

F-Statistic

Finally the F-statistic is defined by,

\[ \text{F-statistic} = \frac{SST-SSE}{(q-1)MSE} \]

The further the F-statistic is from 1, the better.
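
To connect these definitions to the summary() output above, here is a small sketch computing \(R^2\), adjusted \(R^2\), and the F-statistic by hand from linearMod and mysample (the exact values depend on the random subsample):

  # Sum of squared errors and total sum of squares
  sse <- sum(residuals(linearMod)^2)
  sst <- sum((mysample$CHOLESTEROLTOTAL - mean(mysample$CHOLESTEROLTOTAL))^2)
  # Multiple R-squared
  1 - sse/sst
  # Adjusted R-squared (n observations, q = 2 coefficients)
  n <- nrow(mysample); q <- 2
  mse <- sse/(n - q)
  1 - (n - 1)*mse/sst
  # F-statistic
  (sst - sse)/((q - 1)*mse)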


Assumptions on linear regression

Although linear regression is the simplest form of regression, several assumptions go into it. If any of the assumptions are violated, linear regression may not be the best model for the data. Although there are other assumptions, such as no autocorrelation of the residuals and no multicollinearity, we list the main ones here.

Assumption 1

The linearity is with respect to the parameters and not with respect to the predictor variables. Thus,

\[ Y = \beta_0+\beta_1X+\beta_2X^2 \] is still a linear regression.
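
For instance, using the same data, a model that is quadratic in the predictor is still linear in its parameters; a small sketch (quadMod is a hypothetical name, added only for illustration):

  # Quadratic in LDLESTIMATE, but linear in the coefficients
  quadMod <- lm(CHOLESTEROLTOTAL ~ LDLESTIMATE + I(LDLESTIMATE^2), data = mysample)
  coef(quadMod)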

Assumption 2

The mean of the residuals is theoretically zero and, in practice, should be close to zero.
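
For our fitted model this is easy to check directly:

  # The mean of the residuals should be numerically close to zero
  mean(linearMod$residuals)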

Assumption 3

The residuals should be normally distributed. This is one of the key assumptions behind linear regression and can be verified by making a qq-plot, or by drawing a normal distribution with mean 0 and standard deviation equal to the standard deviation of the residuals, as shown below:

  # Draw a normal density with mean 0 and the standard deviation of the residuals
  ggplot(data = data.frame(x = c(-70, 70)), aes(x)) +
  stat_function(fun = dnorm, n = 101, 
                args = list(mean = 0, sd = sd(linearMod$residuals)), 
                color="blue") + ylab("") +
  scale_y_continuous(breaks = NULL) +
  # Also plot the density of the residuals
  geom_density(data = linearMod, aes(x=linearMod$residuals),fill = "red", alpha=0.3) 

Thus, we have near normality of the residuals, validating the linear regression model.
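
As mentioned above, the same check can also be done with a qq-plot of the residuals; here is a sketch that reuses the quartile-based reference line from the earlier qq-plots:

  # Find the quartiles of the residuals and the matching normal quantiles
  res <- residuals(linearMod)
  y <- quantile(res, c(0.25, 0.75))
  x <- qnorm(c(0.25, 0.75))
  slope <- diff(y) / diff(x)
  int <- y[1] - slope * x[1]
  # Generate the normal q-q plot of the residuals
  ggplot() + aes(sample = res) + stat_qq(distribution = qnorm) +
    geom_abline(intercept = int, slope = slope) + ylab("Residuals") +
    labs(title = "QQ Plot for Residuals")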

Assumption 4

Homoscedasticity of residuals, or equal variance: the residuals should not show different variance for different values of the predictors. A quick visual check is sketched below.
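
A common visual check is to plot the residuals against the fitted values and look for a fan or funnel shape; this sketch reuses the predicted and residuals columns added to mysample earlier:

  # Residuals against fitted values; no systematic widening suggests
  # roughly constant variance
  ggplot(mysample, aes(x = predicted, y = residuals)) +
    geom_point() + geom_hline(yintercept = 0, linetype = "dashed")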

Although each of these assumptions can be checked individually, there is a package called Global Validation of Linear Model Assumptions, or gvlma, that can be used to check the assumptions of a linear model. Here we use that package to validate our model.

Note You may have to install the package.

  # Install the package
  install.packages("gvlma")
  # Load the library
  library(gvlma)
  # Check the assumptions of the linear model
  gvlma(linearMod)
## 
## Call:
## lm(formula = CHOLESTEROLTOTAL ~ LDLESTIMATE, data = mysample)
## 
## Coefficients:
## (Intercept)  LDLESTIMATE  
##      76.225        1.001  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = linearMod) 
## 
##                      Value p-value                Decision
## Global Stat        0.15713  0.9971 Assumptions acceptable.
## Skewness           0.01956  0.8888 Assumptions acceptable.
## Kurtosis           0.06241  0.8027 Assumptions acceptable.
## Link Function      0.04692  0.8285 Assumptions acceptable.
## Heteroscedasticity 0.02824  0.8665 Assumptions acceptable.

We see that the model assumptions are acceptable. Yay!


Classwork/Homework: Fit a linear model to any two variables of choice in the HANES/MIMIC3 data set, other than the one described above. Make a qqplot for the residuals and check the assumptions. If the model fails explain why it is the case. If the model succeeds report the validity of the assumptions.