In the previous lectures we saw basic inference and linear regression. Linear regression as we saw comes with several assumptions, one of which is that the errors (or residuals) should be normally distributed. Also, the dependent/response variable is a continous variable. In several applications and in biomedicine the response variable is a categorical or a binary variable. For instance, a patient in an emergency ward could be dead or alive. Like wise a person in the HANES dataset could be diabetic, or not. In such cases, we cannot use simple linear regression to predict the outcomes. We thus have to use a special type of regression known as logistic regression. These models are captured under Generalized Linear Model (GLM) which is a flexible framework to handle such response variables. Before we see a specific example of how we apply logistic regression to our HANES dataset in R, we will watch this video that explains this regression technique:
Previously in the HANES data set, we saw that there were two clusters when we log transform the A1C and UACR variables (Data science fundamentals 01).
# Load the package RCurl
library(RCurl)
# Import the HANES data set from GitHub; break the string into two for readability
# (Please note this readability aspect very carefully)
URL_text_1 <- "https://raw.githubusercontent.com/kannan-kasthuri/kannan-kasthuri.github.io"
URL_text_2 <- "/master/Datasets/HANES/NYC_HANES_DIAB.csv"
# Paste it to constitute a single URL
URL <- paste(URL_text_1,URL_text_2, sep="")
HANES <- read.csv(text=getURL(URL))
# Rename the GENDER factor for identification
HANES$GENDER <- factor(HANES$GENDER, labels=c("M","F"))
# Rename the AGEGROUP factor for identification
HANES$AGEGROUP <- factor(HANES$AGEGROUP, labels=c("20-39","40-59","60+"))
# Rename the HSQ_1 factor for identification
HANES$HSQ_1 <- factor(HANES$HSQ_1, labels=c("Excellent","Very Good","Good", "Fair", "Poor"))
# Rename the DX_DBTS as a factor
HANES$DX_DBTS <- factor(HANES$DX_DBTS, labels=c("DIAB","DIAB NO_DX","NO DIAB"))
# Omit all NA from the data frame
HANES <- na.omit(HANES)
# Observe the structure
str(HANES)
# Load the tidyverse library
library(tidyverse)
# Make a ggplot for the log(A1C) and log(UACR) variables with asthetic color for the variable DX_DBTS
ggplot(data = HANES) +
geom_point(mapping = aes(x = log(A1C), y = log(UACR), color=DX_DBTS))
## 'data.frame': 1112 obs. of 23 variables:
## $ KEY : Factor w/ 1527 levels "133370A","133370B",..: 28 43 44 53 55 70 84 90 100 107 ...
## $ GENDER : Factor w/ 2 levels "M","F": 1 1 1 1 1 1 1 1 1 1 ...
## $ SPAGE : int 29 28 27 24 30 26 31 32 34 32 ...
## $ AGEGROUP : Factor w/ 3 levels "20-39","40-59",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ HSQ_1 : Factor w/ 5 levels "Excellent","Very Good",..: 2 2 2 1 1 3 1 2 1 3 ...
## $ UCREATININE : int 105 53 314 105 163 150 46 36 177 156 ...
## $ UALBUMIN : num 0.707 1 8 4 3 2 2 0.707 4 3 ...
## $ UACR : num 0.00673 2 3 4 2 ...
## $ MERCURYU : num 0.37 0.106 0.487 2.205 0.979 ...
## $ DX_DBTS : Factor w/ 3 levels "DIAB","DIAB NO_DX",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ A1C : num 5 5.2 4.8 5.1 4.3 5.2 4.8 5.2 4.8 5.2 ...
## $ CADMIUM : num 0.2412 0.1732 0.0644 0.0929 0.1202 ...
## $ LEAD : num 1.454 1.019 0.863 1.243 0.612 ...
## $ MERCURYTOTALBLOOD: num 2.34 2.57 1.32 14.66 2.13 ...
## $ HDL : int 42 51 42 61 52 50 57 56 42 44 ...
## $ CHOLESTEROLTOTAL : int 184 157 145 206 120 155 156 235 156 120 ...
## $ GLUCOSESI : num 4.61 4.77 5.16 5 5.11 ...
## $ CREATININESI : num 74.3 73 80 84.9 66 ...
## $ CREATININE : num 0.84 0.83 0.91 0.96 0.75 0.99 0.9 0.84 0.93 1.09 ...
## $ TRIGLYCERIDE : int 156 43 108 65 51 29 31 220 82 35 ...
## $ GLUCOSE : int 83 86 93 90 92 85 72 87 96 92 ...
## $ COTININE : num 31.5918 0.0635 0.035 0.0514 0.035 ...
## $ LDLESTIMATE : int 111 97 81 132 58 99 93 135 98 69 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:415] 2 15 16 24 26 28 33 34 35 39 ...
## .. ..- attr(*, "names")= chr [1:415] "2" "15" "16" "24" ...
These two clusters are primarily composed of non-diabetic people. In the set of all non-diabetic people, if we call the lower cluster (i.e, log(UACR) <= -2
) as 0
and the upper cluster (i.e, log(UACR) >= -2
) as 1
(right now we will manually distingush the clusters, but in machine learning course we will see algorthms such as k-means will do this for us), out of UCREATININE and UALBUMIN variables that makes this ratio UACR which variable distingushes these two clusters? We can answer this question through logistic regression. But to do this, we will first extract this data using data transformation techniques we learnt.
# Extract only non-diabetic people, name the ones with log(UACR) <= -2 as 0
# and others 1 and exclude all other factor variables
mydata <- HANES %>% filter(DX_DBTS == "NO DIAB") %>%
mutate(Cluster = ifelse(log(UACR) <= -2, 0, 1)) %>%
select(everything(), -KEY, -GENDER, -AGEGROUP, -HSQ_1, -DX_DBTS)
# Make a ggplot for the log(A1C) and log(UACR) variables with asthetic color for the variable Cluster
ggplot(data = mydata) +
geom_point(mapping = aes(x = log(A1C), y = log(UACR), color=Cluster)) +
theme(legend.position="none")
We can now perform a logistic regression where the predictor variables will be UCREATININE and UALBUMIN and the response variable would be the cluster.
# Build a logistic regression model for the predictor variable UCREATININE
# with the response variable Cluster
logistic_model.UCREATININE <- glm(Cluster ~ UCREATININE, family=binomial(link='logit'),data=mydata)
# And summarize the model
summary(logistic_model.UCREATININE)
# Build a logistic regression model for the predictor variable UALBUMIN
# with the response variable Cluster
logistic_model.UALBUMIN <- glm(Cluster ~ UALBUMIN, family=binomial(link='logit'),data=mydata)
# and summarize the model
summary(logistic_model.UALBUMIN)
##
## Call:
## glm(formula = Cluster ~ UCREATININE, family = binomial(link = "logit"),
## data = mydata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0619 0.1999 0.3777 0.5927 1.0440
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.307339 0.185862 1.654 0.0982 .
## UCREATININE 0.014918 0.001835 8.132 4.23e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 747.32 on 976 degrees of freedom
## Residual deviance: 652.31 on 975 degrees of freedom
## AIC: 656.31
##
## Number of Fisher Scoring iterations: 6
##
## Call:
## glm(formula = Cluster ~ UALBUMIN, family = binomial(link = "logit"),
## data = mydata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9025 0.0000 0.0000 0.0000 0.1728
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -66.81 754.47 -0.089 0.929
## UALBUMIN 71.01 754.47 0.094 0.925
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 747.321 on 976 degrees of freedom
## Residual deviance: 20.819 on 975 degrees of freedom
## AIC: 24.819
##
## Number of Fisher Scoring iterations: 25
The results of the summary is interpreted as follows:
The intercept of the UCREATININE fit = 0.31 is interpreted as the log odds of a patient with a UCREATININE value of zero being in the class where log(UACR) <= -2
(that is in cluster 0).
The coefficient for UCREATININE = 0.01 which is interpreted as the expected change in log odds for a one-unit increase in the UCREATININE variable. The odds ratio can be calculated by exponentiating this value to get 1.02 which means we expect to see about 2% increase in the odds of being in cluster 0, for a one-unit increase in UCREATININE variable.
The negative intercept in UALBUMIN means that the log odds are negative and so the probability of separation is non-optimal. Coupled with the statistically significant coeffcient of the UCREATININE variable (which rejects the null hypothesis that the UCREATININE variable is not related to the clusters), we can say UCREATININE distingushes the cluster better.
Classwork/Homework: Perform logistic regression on diabetic vs. non-diabetic populations and report which variable(s) distingush the populations.
Like noted in the introduction, logistic regression is usually modeled using Generalized Linear Model (GLM) in R. GLM’s are usually solved using the method of maximum likelihood which in-turn uses optimization techniques like the gradient descent to find the parameters. We will study these modeling and optimization techniques now.
Likelihood is the probablity of observing the data given a model. In the univariate case, let us say that we have a set of data \(X = \{x_1,x_2,\dots,x_n\}\) and model with parameter \(\gamma\), then the likelihood is defined as,
\[ L(\gamma \mid X) = P(X \mid \gamma)\]
If our observations are independent and identically distributed, then we can treat this as product of probabilities of observing each individual data point. That is,
\[ L(\gamma \mid X) = \prod_{i=1}^{n} P(x_{i} \mid \gamma)\]
This may seem really abstract and so we can look at concrete example.
Suppose we draw two hundred people randomly and identify their gender:
# Randomly sample 200 people and identify their gender
random_gender <- sample(HANES$GENDER, 200, replace = FALSE)
random_gender
no_of_females <- sum(sapply(gregexpr("F", random_gender, fixed = TRUE), function(x) sum(x > -1)))
no_of_females
## [1] F M M M F M F M F M F F F M F F F F F M F M M F F M F M M M F F F F F
## [36] F M F M F M F M F M M M F F M F M F F F M F F F F F F F F F F M M F M
## [71] F M M F F F F F M M M F M M F M M M F F M F F F M F M F M M F M F F F
## [106] M F F F F M M F F F M F F F F F F F F F M F F F F F F F M F M F F F M
## [141] M M M F M M M M M M F M M F F F F M M F F M F M M F F F F F M M M F F
## [176] F F M F F M M F M F M F F M F M M F F M M M F F M
## Levels: M F
## [1] 118
The first thing we can do is assume that we know the proportion of males and females in the population, and ask ourselves how likely it would have been to draw this sequence if we were right about the proportion. We will call the proportion of females \(\gamma\) in this case.
Since there are only two categories, male and the female and each person in sampled independently the distribution being identical, we can calculate the probability of getting that sequence as the product of the probabilities of sampling each individual:
\[ L(\gamma \mid X) = \prod_{i=1}^{200} \begin{cases} \gamma, & \text{if } x_{i} = F \\ 1-\gamma, & \text{if } x_{i} = M \end{cases} \] Now we can use the above equation to calculate the probability of getting the specific sequence if the gender were equally distributed:
# Assign probablities to the above seqence assuming men and women are equally distributed
# in the general population
probabilities = ifelse(random_gender == "F", 0.5, 1-0.5)
# Likelihood is the product of the probablities when samples are IID
likelihood = prod(probabilities)
# Report the probability
cat('Probability of observation when men and women are equally distributed:',likelihood)
## Probability of observation when men and women are equally distributed: 6.223015e-61
If men are over-represented in the population, say, by 80%, then:
# Assign probablities to the above seqence assuming men represent 80% of
# the general population
probabilities = ifelse(random_gender == "F", 0.2, 1-0.2)
# Likelihood is the product of the probablities when samples are IID
likelihood = prod(probabilities)
# Report the probability
cat('Probability of observation when men are represented 80%:',likelihood)
## Probability of observation when men are represented 80%: 3.757668e-91
Thus we can easily generalize this as a function of \(\gamma\), and graph the likelihood as a function of any value of the proportion of females ranging from 0 to 1:
# Make a function that returns likelihoods based on the above sequence
# and the gamma parameter
likelihoodMF = function(gamma, data = random_gender) {
# Calculate the probabilities
probabilities = ifelse(random_gender == "F", gamma, 1-gamma)
# Calculate the product of probabilities, aka, likelihoods
likelihood = prod(probabilities)
# Return the likelihood
return(likelihood)
}
# Make a vector of gammas
gammas = seq(0.01, 0.99, by=0.01)
# Compute likelihood through the function
likelihoods = sapply(gammas, likelihoodMF)
# Plot gammas vs. likelihoods
ggplot(mapping = aes(x = gammas, y = likelihoods)) + geom_line(se=FALSE)
The same idea can be used to estimate the proportion of people with diabetes, instead of males and females:
# Select only people with diabetes and no diabetes
HANES_DIAB <- filter(HANES, DX_DBTS == "DIAB" | DX_DBTS == "NO DIAB") %>% select(DX_DBTS)
# Randomly sample 200 people and identify their diabetes status
HANES_DIAB_sampled_200 <- sample_n(HANES_DIAB,200)
# Count the number of people with diabetes and without diabetes
find_count <- HANES_DIAB_sampled_200 %>% group_by(DX_DBTS) %>% count()
# Find the number of people with diabetes
DIAB_count <- find_count$n[1]
# Make a function that returns likelihoods based on the above sequence
# and the gamma parameter
likelihoodDB = function(gamma, data = HANES_DIAB_sampled_200) {
# Calculate the probabilities
probabilities = ifelse(HANES_DIAB_sampled_200 == "DIAB", gamma, 1-gamma)
# Calculate the product of probabilities, aka, likelihoods
likelihood = prod(probabilities)
# Return the likelihood
return(likelihood)
}
# Make a vector of gammas
gammas = seq(0.01, 0.99, by=0.01)
# Compute likelihood through the function
likelihoods = sapply(gammas, likelihoodDB)
# Plot gammas vs. likelihoods
ggplot(mapping = aes(x = gammas, y = likelihoods)) + geom_line(se=FALSE)
Likelihood is not a probability function. Also, given this function it can be used to identify the parameters (proportions or means) of the underlying population that is likely to maximize the chance of observing our data. This is the maximum likelihood estimate. Thus, maximum likelihood estimate is the value of the parameters of a given model that maximizes the likelihood function for the data. Mathematically,
\[ \gamma_{MLE} = \max_{\gamma}[L(\gamma \mid X)] \]
The following code in R evaluates maximum likelihood parameter for the male/female distribution:
# Compute maximum likelihood parameter for male/female distribution
mle.resultsMF <- optimize(function(gammas) {likelihoodMF(gammas,likelihoodMF)},
interval = c(0, 1), maximum = TRUE)
mle.resultsMF
## $maximum
## [1] 0.5899919
##
## $objective
## [1] 1.617386e-59
Similarly the following code in R evaluates maximum likelihood parameter for the diabetes distribution:
# Compute maximum likelihood parameter for the diabetes distribution
mle.resultsDB <- optimize(function(gammas) {likelihoodDB(gammas,likelihoodDB)},
interval = c(0, 1), maximum = TRUE)
mle.resultsDB
## $maximum
## [1] 0.08499249
##
## $objective
## [1] 5.497711e-26
In our example of males and females, the maximum happens at 0.59, suggesting there are 59% of females in the population. This is unlike the diabetes observation where the maximum happens at 0.08 (which is 8%). This is not far from the official estimate of about 9.3%. Thus, maximum likelihood can be extremely useful to determine the parameters of the underlying distribution.
Determining maximum likelihood can become difficult as the complexity of the models increase. That is why we need cost functions.
Cost functions are functions that maps a set of events into a number that represents the cost of that event occurring. Mathematically it is defined as the negative log of likelihood.
\[ C(\gamma,X)=−\log[L(\gamma \mid X)] \] But why?
Lets watch this video for a good explanation:
There are other cost functions in areas such as motion planning and traveling salesman where different types of modeling techniques are used. For this tutorial we will use this cost function.
With this cost function as a negative log likelihood we can reformulate the cost of the given parameterization of the population and plot it. In the case of male/female observations here is the cost:
# Compute the -log of the likelihood function
MF.cost = function(gamma, data = random_gender) {
return(-log(likelihoodMF(gamma, data)))
}
# Find the costs
costs = sapply(gammas, MF.cost)
# Plot the costs
qplot(x = gammas, y = costs, geom = 'line', xlab = expression(gamma), ylab = "Male/Female Cost")
For the diabetes population, we have:
# Compute the -log of the likelihood function
DB.cost = function(gamma, data = HANES_DIAB_sampled_200) {
return(-log(likelihoodDB(gamma, data)))
}
# Find the costs
costs = sapply(gammas, DB.cost)
# Plot the costs
qplot(x = gammas, y = costs, geom = 'line', xlab = expression(gamma), ylab = "Diabetes Pop. Cost")
We note that the maximum of likelihood function is the minimum of the cost function because log likelihood is an increasing function of the likelihood and so negative log function will decrease as the likelihood function increases. Now once we have this cost function we can study modeling techniques to find its minimum.
Grid search is the simplest of the modeling technique. In this approach we break the interval of \(\gamma\) and evaluate the cost at every break point. We then pick the value of \(\gamma\) that gives the minimum cost. It is called grid search because it can be visualized by placing a grid over the cost function and evaluating that at each point of the grid:
This can be implemented for the diabetes example as follows:
# Make a grid of potential gammas
potential.gammas = seq(.001, .999, by=.001)
# Calculate the cost of the DB cost function at each gamma
calc.costs = sapply(potential.gammas, DB.cost)
# Find the minimum costs
idx.gamma = which(calc.costs == min(calc.costs))
# Output the results
writeLines(paste("Grid search value:", potential.gammas[idx.gamma], sep=' '))
## Grid search value: 0.085
We see the result is extremely accurate although the precision of this value is limited by how fine our grid is. Why is this approach bad? Although this is a good method for one parameter and low precision requirements, the computational complexity is enormous when we have multi-parameters. Here we had \(1000\) values for a precision upto two degrees. If we had one more parameter, we need to construct \(1000 \times 1000\) grid and for 3 parameters, the grid becomes intractable, a whopping \(1000 \times 1000 \times 1000\) computations, that too for a small precision.
We can certainly do smarter by making use of the information from the function and this is done by gradient descent algorithm.
Gradient descent is a method where we slide along the function in the direction that decreases the cost function, and keep repeating until there is only the tiniest decrease in cost with each step. Mathematically, it is described by:
\[ \gamma_{t+1} = \gamma_t - \theta \text{ } \nabla C(\gamma_t, X) \]
where \(\gamma_{0}\) is the initial value of the parameter and \(\theta\) is a step size parameter that determines how fast (or slow) we descend. It is also called the learning rate. We should also have a threshold \(\tau\) that represents how precise we want the cost function to be estimated. In single variable calculus, \(\nabla C\) is given by the derivative,
\[ \nabla C = \frac{dC}{d\gamma} \]
and hence in our case of diabetic population, it is extremely easy to compute this analytically - we can apply the rules of calculus to calculate the derivative:
\[ \nabla C(\gamma, X) = \frac{200-\text{DIAB_count}}{1-\gamma} - \frac{\text{DIAB_count}}{\gamma} \]
where DIAB_count
was determined to be 17.
We can then run the gradient descent. Here we fix the initial parameters: starting value \(\gamma_{0} = 0.95\), \(\theta=0.0001\) and the threshold \(τ=0.000001\).
# Initial conditions/parameters
gamma = 0.95
theta = 0.0001
tau = 0.000001
# Compute the gradient
DB.gradient = function(gamma) {
return(((200-DIAB_count)/(1-gamma)) - (DIAB_count/gamma))
}
# Compute the cost
prior.cost = DB.cost(gamma)
# plot.gammas or plot.costs are for graphing purposes
plot.gammas = gamma
plot.costs = prior.cost
running = TRUE
while(running) {
# Compute gradient
grad = DB.gradient(gamma)
# Compute the next gamma
gamma = gamma - theta * grad
# Compute the new costs
new.cost = DB.cost(gamma)
# For plotting purposes
plot.gammas = c(plot.gammas, gamma)
plot.costs = c(plot.costs, new.cost)
# See if the sequence converges upto a a certain precision
if(abs(new.cost - prior.cost) < tau) {
# If so stop the iteration
running = FALSE
} else {
# Else update the cost function to new cost
prior.cost = new.cost
}
}
# This is our solution
writeLines(paste("Gradient descent value:",gamma,sep=' '))
## Gradient descent value: 0.0850289188156921
We can now see the convergence:
# Make a vector of gammas
all.gammas = seq(.01,.99,by=.01)
# Find the costs for all gammas
all.costs = sapply(all.gammas,DB.cost)
# Make a data frame from the gammas and the costs computed above
plot.df = data.frame(gammas=plot.gammas,cost=plot.costs)
# Plot the cost function and add only the converging points
qplot(x=all.gammas, y=all.costs, geom='line', xlab=expression(gamma), ylab="Cost") +
geom_point(data=plot.df, aes(x=gammas,y=cost),color='blue')
Several times we might not have an analytical form for the derivative. Or we out of luck then? We can still apply numerical approximation for the derivative:
\[ \nabla C(\gamma, X) \approx \frac{C(\gamma + \epsilon) - C(\gamma - \epsilon)}{2\epsilon} \] Here is the code with the numerical approximation for the derivative:
# Initial conditions/parameters
gamma = 0.95
theta = 0.0001
tau = 0.000001
# Compute the gradient using approximation
DB.gradient = function(gamma, epsilon = .001) {
return((DB.cost(gamma+epsilon) - DB.cost(gamma-epsilon)) / (2*epsilon))
}
# Compute the cost
prior.cost = DB.cost(gamma)
# plot.gammas or plot.costs are for graphing purposes
plot.gammas = gamma
plot.costs = prior.cost
running = TRUE
while(running) {
# Compute gradient
grad = DB.gradient(gamma)
# Compute the next gamma
gamma = gamma - theta * grad
# Compute the new costs
new.cost = DB.cost(gamma)
# For plotting purposes
plot.gammas = c(plot.gammas, gamma)
plot.costs = c(plot.costs, new.cost)
# See if the sequence converges upto a a certain precision
if(abs(new.cost - prior.cost) < tau) {
# If so stop the iteration
running = FALSE
} else {
# Else update the cost function to new cost
prior.cost = new.cost
}
}
# This is our solution
writeLines(paste("Gradient descent value:", gamma, sep=' '))
## Gradient descent value: 0.0850324517189959
We see the answer is almost the same! Wonderful.
So far we saw modeling in single variable. But in general this is not adequate and we may need to model on several variables. Let us consider an example.
Glucose test and hemoglobin test (A1C) are two common tests to identify diabetes in a population. If a person has (fasting) glucose levels greater than or equal to 126 mg/dL or A1C levels greater than or equal to 6.5% then they are diagnosed diabetic. Suppose we want to estimate the probability of someone being diabetic given these two variables. We can work backwards from our observation to determine this probability.
We define \(P(Diab \mid A1C \geq 6.5 \%) = \theta_{A1C}\) and \(P(Diab \mid GLUCOSE \geq 126 \ mg/dL) = \theta_{GLU}\). From the laws of probability we have:
\[ P(Diab) = \begin{cases} p & if \ A1C < 6.5\% \ \& \ GLUCOSE < 126 \ mg/dL \\ \theta_{A1C} & if\ A1C \geq 6.5\% \ \& \ GLUCOSE < 126 \ mg/dL \\ \theta_{GLU} & if \ A1C < 6.5\% \ \& \ GLUCOSE \geq 126 \ mg/dL \\ \theta_{A1C}+\theta_{GLU}-\theta_{A1C}*\theta_{GLU} & if \ A1C \geq 6.5\% \ \& \ GLUCOSE \geq 126 \ mg/dL \end{cases} \]
Thus, we have these two hidden variables \(\theta_{A1C}\) and \(\theta_{GLU}\) that determines the probability of someone being diabetic. Also we have the variable \(p\) that defines the population that is unexplained by the hidden variables. This value can be directly computed from the observations. However for the hidden variables, we have to figure out the values from the observations using optimization. To do this we will extract the observations from the data table:
# Make logicals based on diagnostic levels of A1C, GLUCOSE and diabetes status
DB.obs <- select(HANES,A1C, GLUCOSE, DX_DBTS) %>%
mutate(A1C_L = as.logical(ifelse(A1C >= 6.5,1,0)),
GLUCOSE_L= as.logical(ifelse(GLUCOSE >= 126,1,0)),
DX_DBTS_L = as.logical(ifelse(DX_DBTS == "DIAB",1,0))) %>%
select(A1C_L,GLUCOSE_L,DX_DBTS_L)
# Print the first few lines to see if the data looks okay
print(head(DB.obs))
## A1C_L GLUCOSE_L DX_DBTS_L
## 1 FALSE FALSE FALSE
## 2 FALSE FALSE FALSE
## 3 FALSE FALSE FALSE
## 4 FALSE FALSE FALSE
## 5 FALSE FALSE FALSE
## 6 FALSE FALSE FALSE
We can list the data matrix corresponding to the observations and determine the variable \(p\) from it.
# Print the data matrix corresponding to the observations
print(with(DB.obs,table(A1C_L,GLUCOSE_L,DX_DBTS_L)))
# Find the number of people with diabetes
DBTS_T <- table(DB.obs$DX_DBTS_L == T )
# Find the number of people with diabetes where
# A1C_L <= 6.5 and GLUCOSE_L <= 126
# It is this population that is unexplained by both variables
A1C_GLU_unexplained <- with(DB.obs,table(A1C_L == F & GLUCOSE_L == F & DX_DBTS_L == T))
# Find the proportion of people with diabetes that are unexplained, the variable $p$
DBTS_U <- c(proportion = A1C_GLU_unexplained[2]/DBTS_T[2])
# Name the column and print the proportion
names(DBTS_U) <- c("DBTS_U")
DBTS_U
## , , DX_DBTS_L = FALSE
##
## GLUCOSE_L
## A1C_L FALSE TRUE
## FALSE 971 10
## TRUE 13 14
##
## , , DX_DBTS_L = TRUE
##
## GLUCOSE_L
## A1C_L FALSE TRUE
## FALSE 21 6
## TRUE 24 53
## DBTS_U
## 0.2019231
We then define our likelihood function. This function takes the parameters in a single argument and returns the cost vector as negative log likelihood function.
# Define the cost function
DB.MG.cost = function(params, data=DB.obs) {
# Receive the parameters in theta variables
theta.A1C = params[1]
theta.GLU = params[2]
# Define the probability model as given by the equation above
predicted.diab.A1C = ifelse(data$A1C_L, theta.A1C, 0)
predicted.diab.GLU = ifelse(data$GLUCOSE_L, theta.GLU, 0)
predicted.diab = predicted.diab.A1C + predicted.diab.GLU -
(predicted.diab.A1C * predicted.diab.GLU)
# Define the likelihood function
likelihood_explained = ifelse(data$DX_DBTS_L, predicted.diab, 1-predicted.diab)
likelihood = ifelse(likelihood_explained == 0, DBTS_U, likelihood_explained)
# Find the cost function as the sum of negative log likelihoods
cost.DB = -sum(log(likelihood))
# Return the cost
return(cost.DB)
}
We can now see what the cost function looks like by varying the conditional probabilities. The lower the cost is, more red will be the point.
# Define the grid by varing the conditional probabilities between 0.01 to 0.99 by
# increments of 0.01
DB.grid = data.frame(A1C = rep(seq(.01,.99,by=.01),times=99),
GLU = rep(seq(.01,.99,by=.01),each=99))
# Find the cost by evaluating the above cost function using the grid just defined
DB.grid$cost.DB = mapply(function(w,h){DB.MG.cost(c(w,h))},
DB.grid$A1C, DB.grid$GLU)
# Create a plot the "cost grid" and fill in the values
DB.plot = ggplot(DB.grid, aes(x=A1C, y=GLU)) +
geom_raster(aes(fill=cost.DB)) +
scale_fill_gradient(low='red', high='yellow', trans = 'log') +
stat_contour(aes(z=cost.DB),binwidth=5, color = 'black', linetype = 'dashed') +
xlab('P(Diabetes | A1C)') + ylab('P(Diabetes | Glucose)')
# Make a plot
print(DB.plot)
We see that the cost is minimized somewhere around 0.6 and 0.5. We hope our optimization techniques will give us the values of the hidden variables around this value. The optimization for multiple variables \(\gamma_{0}, \gamma_{1}, \dots \gamma_{N}\), using gradient descent is just like the optimization for a single variable. Here the gradient function is given by:
\[ \nabla C(\gamma, X) = \bigg[ \frac{\partial C}{\partial \gamma_0}, \frac{\partial C}{\partial \gamma_1}, ..., \frac{\partial C}{\partial \gamma_N} \bigg] \] in particular in our case,
\[ \nabla C(\gamma, X) = \bigg[ \frac{\partial C}{\partial \theta_{A1C}}, \frac{\partial C}{\partial \theta_{GLU}} \bigg] \]
which can be approximated like we did for the single variable, however we have to do it for each variable here.
# Find gradient through approximation
DB.gradient = function(params, data=DB.obs, epsilon = .001) {
# Find A1C gradient by perturbing in both directions
A1C.plus = DB.MG.cost(params + c(epsilon,0), DB.obs)
A1C.minus = DB.MG.cost(params + c(-epsilon,0), DB.obs)
df.dA1C = (A1C.plus - A1C.minus) / (2*epsilon)
# Find GLU gradient by perturbing in both directions
GLU.plus = DB.MG.cost(params + c(0,epsilon), DB.obs)
GLU.minus = DB.MG.cost(params + c(0, -epsilon), DB.obs)
df.dGLU = (GLU.plus - GLU.minus) / (2*epsilon)
# Return the gradient
return(c(df.dA1C, df.dGLU))
}
This helps us to see how the gradient descent works in two dimensions. The gradient points towards the direction of lowering cost function. If we are too much away from the bottom we need to take larger steps, otherwise smaller steps are sufficient for convergence.
# Demostrate the gradient descent in two dimensions using arrows
# Make an arrow using given gradient
make.gradient.arrow = function(params) {
grad = DB.gradient(params)
end.x = params[1] - grad[1]*.001
end.y = params[2] - grad[2]*.001
return(geom_segment(x=params[1],y=params[2],xend=end.x,yend=end.y, arrow = arrow(length=unit(.02, 'npc')),color='blue'))
}
# Make the arrow + cost grid plot
DB.grad.plot = DB.plot +
make.gradient.arrow(c(.2,.2)) + make.gradient.arrow(c(.8,.8)) +
make.gradient.arrow(c(.2,.8)) + make.gradient.arrow(c(.8,.2))
# Plot the result
print(DB.grad.plot)
We are now ready to see gradient descent in action. It works exactly as in one dimensional case, except that we need to work with vectors. We need to initalize the conditional probablities \(\theta_{A1C}\) and \(\theta_{GLU}\). We will assume they have a great predictive power of about 90%. Lets fix the convergence threshold, \(\tau\) as 0.01 and the learning parameter, now denoted by \(\alpha\) as 0.0001.
# Initialize the parameters
theta = c(.9, .9)
alpha = 0.0001
tau = .01
# Compute the prior run's cost
prior.cost = DB.MG.cost(theta)
running = TRUE
# Plots for showing the descent
plt.xs = theta[1]
plt.ys = theta[2]
while(running) {
# Compute the gradient
grad = DB.gradient(theta)
# Update the probabilities
theta = theta - alpha * grad
# For plotting purposes
plt.xs = c(plt.xs, theta[1])
plt.ys = c(plt.ys, theta[2])
# Compute the new cost
new.cost = DB.MG.cost(theta)
# Check if the cost sequence converges upto the given threshold
if(abs(new.cost - prior.cost) < tau) {
# If the cost gets converged, stop the iteration
running = FALSE
} else {
# Else update the cost
prior.cost = new.cost
}
}
# Code to plot the descent path
grad.plot = DB.plot
for (i in 2:length(plt.xs)) {
grad.plot = grad.plot + geom_segment(x = plt.xs[i-1], xend = plt.xs[i],
y = plt.ys[i-1], yend = plt.ys[i],
color = 'blue')
}
# Make the plot
grad.plot = grad.plot + annotate('text', x = plt.xs[1], y = plt.ys[1], color = 'blue', label = 'O') +
annotate('text', x = theta[1], y = theta[2], color = 'blue', label = 'X')
# Plot the results
print(grad.plot)
We can now see the values of the hidden variables \(\theta_{A1C}\) and \(\theta_{GLU}\).
# Print the values of the conditional probabilites P(diab|A1C) and P(diab|GLU)
writeLines(paste("Theta_A1C: ",theta[1],"\nTheta_GLU: ",theta[2],sep=''))
## Theta_A1C: 0.620294470407629
## Theta_GLU: 0.48994822007435
Thus, the probability of observing someone with diabetes given A1C value is 62% and the probablity given GLUCOSE value is 49% and given both values the probablity is 81%. From this analysis we can conclude it is best to use both variables, than just either A1C or GLUCOSE.
Classwork/Homework: Follow the above analysis and estimate the probability of someone being diabetic using HDL and LDLESTIMATE variables, instead of A1C and GLUCOSE used in the analysis.
Several of these optimization techniques are already implemented in R. The package/function is called optim()
. All this function needs are two inputs - the initial guess for the parameters and the cost function:
# Use the optim package to evaluate the parameters
o = optim(c(.95,.95), DB.MG.cost)
print(o)
## $par
## [1] 0.6546962 0.3833888
##
## $value
## [1] 102.5192
##
## $counts
## function gradient
## 67 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
Note that the result varies but kinda compares to our results we obtained through gradient descent (62% vs 65% for A1C and 48% vs 38% for GLUCOSE). The difference (especially for the glucose variable) is most likely due to the existence of multiple minimas (we see “red” regions throughout) and the default method option()
uses - the Nelder-Mead method although is computationally more efficient, can get struck in a minimum. The gradient descent too will suffer from this problem depending on the initial conditions, as we will see shortly.
There are other algorithms for optimization which can be easily used with the optim()
function. The BFGS algorithm is one such algorithm. To use this, we just need to specify method='BGFS'
argument of the package.
# Use the optim package with BGFS method to evaluate the parameters
o = optim(c(.95,.95), DB.MG.cost, method = 'BFGS')
print(o)
## $par
## [1] 0.6547071 0.3831999
##
## $value
## [1] 102.5192
##
## $counts
## function gradient
## 31 10
##
## $convergence
## [1] 0
##
## $message
## NULL
Further, the optim()
function allows us to specify the gradiant function if we know them analytically. This can be done by specifying the argument gr
. In our case,
# Use the optim package with BGFS method to evaluate the parameters
# by specifying the gradient function
o = optim(c(.95,.95), DB.MG.cost, method = 'BFGS', gr = DB.gradient)
print(o)
## $par
## [1] 0.6547071 0.3831999
##
## $value
## [1] 102.5192
##
## $counts
## function gradient
## 31 10
##
## $convergence
## [1] 0
##
## $message
## NULL
And finally, the L-BFSG-B algorithm will allow us to perform constrained optimizaion where we can specify the lower and upper bounds on the parameters. Here since they are probabilities they vary between 0 and 1. And so we can specify:
# Use the optim package with L-BFGS-B method and specifying the upper and lower
# limits of the parameters
o = optim(c(.95,.95), DB.MG.cost, method = 'L-BFGS-B',
lower = c(.00001, .00001), upper = c(.99999, .99999))
print(o)
## $par
## [1] 0.6547060 0.3832008
##
## $value
## [1] 102.5192
##
## $counts
## function gradient
## 10 10
##
## $convergence
## [1] 0
##
## $message
## [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
Note that in the previous algorithm we described, every step of the descent required a call to compute the cost. In the case of HANES data, it was managable since we had 1112 observations. What if there are a several millions of observations? For instance MIMIC3 has tens of millions of observations. The total number of calls would be formidable. In such cases, we can make use of stochastic gradient descent. Stochastic gradient descent works on the principle that when the observations are independent the cost function can be decomposed - the cost of errors on one observation is independent of any other observation and randomly choosing some observations we can get a good estimate of the cost function. Mathematically, \[ C(\gamma, X) \approx \sum_{i=1}^{N}C(\gamma, x_i) \] where \(x_i\)’s are individual observations chosen randomly.
Thus, we can perform stochastic gradient descent choosing only 100 observations at each time step to estimate the cost in our demostration of finding the probability of person being diabetic given A1C and GLUCOSE variables.
# Initialize the parameters
theta = c(.9, .9)
alpha_init= .0001
tau = .01
# Choose the number of observations we want to use
n.obs = 100
# Define prior cost
prior.cost = DB.MG.cost(theta, DB.obs)
running = TRUE
nth.iter = 0
# Plots for showing the descent
plt.xs = theta[1]
plt.ys = theta[2]
plt.ns = NULL
while(running) {
# Iterate until the total cost converges
nth.iter = nth.iter + 1
alpha = alpha_init / nth.iter
# Choose random sample from the data set to estimate the cost
obs.order = sample(1:n.obs, n.obs)
tot.cost = 0
# For those observations estimate the total cost
for(d.i in obs.order) {
# Find the gradient and theta
grad = DB.gradient(theta, DB.obs[d.i,])
theta = theta-alpha*grad
# For plotting purposes, append the thetas
plt.xs = c(plt.xs, theta[1])
plt.ys = c(plt.ys, theta[2])
plt.ns = c(plt.ns, nth.iter)
# Find the total cost for only those the random sample of observations
tot.cost = tot.cost + DB.MG.cost(theta, DB.obs[d.i,])
}
# If the sequence converges within our threshold, stop else continue
if(abs(tot.cost - prior.cost) < tau) {
running = FALSE
} else {
prior.cost = tot.cost
}
}
# Print the final theta values
writeLines(paste("Theta_A1C: ",theta[1],"\nTheta_GLU: ",theta[2],sep=''))
## Theta_A1C: 0.636875692246907
## Theta_GLU: 0.429515626545685
We see the values are really close to what we estimated by multi-variable gradient descent. We can infact plot the convergence of the iterations by giving different colors - first iteration purple, second yellow etc.
In fact, if we zoon in, we can see how the convergence behaves:
Thus, we only need three iterations for convergence by choosing only 100 observations. Thus, stochastic gradient descent may save us substantially if we work with big data.
In our example, it is established that A1C values greater than or equal to 6.5 and glucose levels greater than or equal to 126 are indicators of diabetes. In case if we don’t know this information, one can iterate through pairs of numbers (A1C value, GLUCOSE value) to estimate the pair of values that gives maximum probability of person having diabetes. Since these computations are independent on a number pair-level, one can parallelize such computations through frameworks such as Map-Reduce/High performance computing and stochastic gradient can be used to detrmine the minimal cost for each such pair and eventually chooing the cost (and the pair) that is minimum of all minimums. Such analysis will give us the thresholds for classifying diabetic populations. This type of analysis can also be extended for other variables.
Consider this cost function:
\[ C(\theta) = \theta * (\theta + 1) * (\theta + 1.5) * (\theta * 2) \] Let us compute the minimum for this function through gradient descent.
# Define the function
arbitrary.cost = function(theta) {
return(theta*(theta+1)*(theta+1.5)*(theta-2))
}
# Find its gradient
arbitrary.gradient = function(theta, epsilon = 0.001) {
return( (arbitrary.cost(theta+epsilon) - arbitrary.cost(theta-epsilon)) / (2*epsilon) )
}
# Initalize parameters
theta = -2.5
gamma = .0001
tau = .000001
prior.cost = arbitrary.cost(theta)
running = TRUE
# Run the descent algorithm
while(running) {
grad = arbitrary.gradient(theta)
theta = theta - gamma * grad
new.cost = arbitrary.cost(theta)
if(abs(new.cost - prior.cost)< tau) {
running = FALSE
} else {
prior.cost = new.cost
}
}
# Write the output
writeLines(paste("Gradient descent value:",theta,sep=' '))
## Gradient descent value: -1.29429022807873
Great! Are we done? Let’s look at the function:
# Plot function vs costs
thetas = seq(-3,3,by=.01)
costs = sapply(thetas,arbitrary.cost)
qplot(thetas,costs,geom='line',xlab=expression(theta),ylab='Cost') + coord_cartesian(ylim=c(-8,10))
We see the minimum is actually 1.3, however we got -1.29 as the answer. Why is that? Because there are two types of minimum.
Global minimum: a point \(x^{*}\) is a global minimum if it is the lowest value of the function across all allowable values of the parameter.
Local minimum: a point \(x^{*}\) is a local minimum if it is the lowest value of the function within some range centered on \(x^{*}\).
Here 1.29 is the local minimum and 1.3 is the global minimum and gradient descent is highly sensitive to initial conditions. That is why it is important to try several initial conditions. Gradient descent doesn’t give any claims to giving global optimal results. The result could equally be locally optimal or sub-optimal.
Therefore are we out of luck? Not really! There are some class of functions where there can be no different types of minima. In other words, all local minimim are global minimums. These are called convex functions and there are lots of methods to do optimization for this nice class of functions. This sub-field of optmization is called convex optimization. Loosely speaking, a function is convex if for every line segment drawn between two points on the function, that line segment passes above the graph. A function that is not convex is called a non-convex function.
For example, the above function is non-convex, the reason why it has multiple minimums.
Here is a rigorous proof of why convex functions have only global minimums. Mathematically, convex functions are defined by:
\[ f(λy + (1 − λ)x) ≤ λf(y) + (1 − λ)f(x) \]
Theorem: For a convex function defined on a convex set, any local minimum is also a global minimum.
Proof: Let \(f:C \rightarrow \mathbb{R}\), be a convex function, on a convex set \(C\). Suppose that \(x\) is a local minimum of \(f\) but not a global one. This means \(\exists y \in C\) such that \(f(y) < f(x)\). By definition of convexity for any \(\lambda \in [0,1]\), we have,
\[ f(x+\lambda (y-x)) \leq f(x) + \lambda [f(y) - f(x)] \\ < f(x) \] since \(f(y) < f(x)\). Hence, for any small ball \(B(x;\epsilon)\) around \(x\), there exists \(z\) such that \(f(z) < f(x)\). Therefore, \(x\) is not a local minimum - a contradiction. QED.