An Introduction to Optimization (Oct. 23, 2017)

Optimization is a graduate level course for a semester or two, typically in engineering and mathematics departments.

Here we will be giving an overview of the subject, in particular on how the concepts fit into other areas such as machine learning, image processing etc.

We will see the following:

  1. Different optimization concepts.

  2. Commonly used techniques for optimization.

  3. Optimization with R.

based on an overview of the following concepts:

  1. Likelihood and maximum likelihood estimates
  2. Cost functions
  3. Optimization in one variable and gradient descent
  4. Multi-variable optimization
  5. Introduction to big data optimization
  6. Local minima and convex functions

Likelihood and Maximum Likelihood Estimates

Likelihood is the probablity of observing the data given a model.

In the univariate case, let us say that we have a set of data \(X = \{x_1,x_2,\dots,x_n\}\) and model with parameter \(\gamma\), then the likelihood is defined as,

\[ L(\gamma \mid X) = P(X \mid \gamma)\]

If our observations are independent and identically distributed, then we can treat this as product of probabilities of observing each individual data point. That is,

\[ L(\gamma \mid X) = \prod_{i=1}^{n} P(x_{i} \mid \gamma)\]

This may seem really abstract and so we can look at concrete example.

Let us work with the HANES data set.

  # Load the package RCurl
  library(RCurl)
  # Import the HANES data set from GitHub; break the string into two for readability
  # (Please note this readability aspect very carefully)
  URL_text_1 <- "https://raw.githubusercontent.com/kannan-kasthuri/kannan-kasthuri.github.io"
  URL_text_2 <- "/master/Datasets/HANES/NYC_HANES_DIAB.csv"
  # Paste it to constitute a single URL 
  URL <- paste(URL_text_1,URL_text_2, sep="")
  HANES <- read.csv(text=getURL(URL))
  # Rename the GENDER factor for identification
  HANES$GENDER <- factor(HANES$GENDER, labels=c("M","F"))
  # Rename the AGEGROUP factor for identification
  HANES$AGEGROUP <- factor(HANES$AGEGROUP, labels=c("20-39","40-59","60+"))
  # Rename the HSQ_1 factor for identification
  HANES$HSQ_1 <- factor(HANES$HSQ_1, labels=c("Excellent","Very Good","Good", "Fair", "Poor"))
  # Rename the DX_DBTS as a factor
  HANES$DX_DBTS <- factor(HANES$DX_DBTS, labels=c("DIAB","DIAB NO_DX","NO DIAB"))
  # Omit all NA from the data frame
  HANES <- na.omit(HANES)
  # Observe the structure
  str(HANES)
  # Load the tidyverse library
  library(tidyverse)
  # Set the seed for reproducibility
  set.seed(23960671)

## 'data.frame':    1112 obs. of  23 variables:
##  $ KEY              : Factor w/ 1527 levels "133370A","133370B",..: 28 43 44 53 55 70 84 90 100 107 ...
##  $ GENDER           : Factor w/ 2 levels "M","F": 1 1 1 1 1 1 1 1 1 1 ...
##  $ SPAGE            : int  29 28 27 24 30 26 31 32 34 32 ...
##  $ AGEGROUP         : Factor w/ 3 levels "20-39","40-59",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ HSQ_1            : Factor w/ 5 levels "Excellent","Very Good",..: 2 2 2 1 1 3 1 2 1 3 ...
##  $ UCREATININE      : int  105 53 314 105 163 150 46 36 177 156 ...
##  $ UALBUMIN         : num  0.707 1 8 4 3 2 2 0.707 4 3 ...
##  $ UACR             : num  0.00673 2 3 4 2 ...
##  $ MERCURYU         : num  0.37 0.106 0.487 2.205 0.979 ...
##  $ DX_DBTS          : Factor w/ 3 levels "DIAB","DIAB NO_DX",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ A1C              : num  5 5.2 4.8 5.1 4.3 5.2 4.8 5.2 4.8 5.2 ...
##  $ CADMIUM          : num  0.2412 0.1732 0.0644 0.0929 0.1202 ...
##  $ LEAD             : num  1.454 1.019 0.863 1.243 0.612 ...
##  $ MERCURYTOTALBLOOD: num  2.34 2.57 1.32 14.66 2.13 ...
##  $ HDL              : int  42 51 42 61 52 50 57 56 42 44 ...
##  $ CHOLESTEROLTOTAL : int  184 157 145 206 120 155 156 235 156 120 ...
##  $ GLUCOSESI        : num  4.61 4.77 5.16 5 5.11 ...
##  $ CREATININESI     : num  74.3 73 80 84.9 66 ...
##  $ CREATININE       : num  0.84 0.83 0.91 0.96 0.75 0.99 0.9 0.84 0.93 1.09 ...
##  $ TRIGLYCERIDE     : int  156 43 108 65 51 29 31 220 82 35 ...
##  $ GLUCOSE          : int  83 86 93 90 92 85 72 87 96 92 ...
##  $ COTININE         : num  31.5918 0.0635 0.035 0.0514 0.035 ...
##  $ LDLESTIMATE      : int  111 97 81 132 58 99 93 135 98 69 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:415] 2 15 16 24 26 28 33 34 35 39 ...
##   .. ..- attr(*, "names")= chr [1:415] "2" "15" "16" "24" ...

Suppose we draw two hundred people randomly and identify their gender:

  # Randomly sample 200 people and identify their gender
  random_gender <- sample(HANES$GENDER, 200, replace = FALSE)
  random_gender
  no_of_females <- sum(sapply(gregexpr("F", random_gender, fixed = TRUE), function(x) sum(x > -1)))
  no_of_females

##   [1] F F F F F F M M M M M F F M F F F M F M M M F F F F F F M F F F F M F
##  [36] M F F F M F M M M F M F M M F F F F F F M M M M F M F M M M F M M M F
##  [71] M F F F M F F M F M M F F F F M M F F M F M M M F M F M M F F M F M M
## [106] M F F M F F F M M F M M M M M F M M F M M M F M F M F F F M M F F F M
## [141] F F F M F M F F F F M F F M F F M F M F M F F M F F F F F F F F F F F
## [176] M M M M M F M M F F M M M F M F M M M F M F M F M
## Levels: M F
## [1] 108

The first thing we can do is assume that we know the proportion of males and females in the population,

Ask ourselves how likely it would have been to draw this sequence if we were right about the proportion.

We will call the proportion of females \(\gamma\) in this case.

Note: there are only two categories, male and the female and each person in sampled independently the distribution being identical.

Thus, we can calculate the probability of getting that sequence as the product of the probabilities of sampling each individual:

\[ L(\gamma \mid X) = \prod_{i=1}^{200} \begin{cases} \gamma, & \text{if } x_{i} = F \\ 1-\gamma, & \text{if } x_{i} = M \end{cases} \]

Now we can use the above equation to calculate the probability of getting the specific sequence if the gender were equally distributed:

  # Assign probablities to the above seqence assuming men and women are equally distributed
  # in the general population
  probabilities = ifelse(random_gender == "F", 0.5, 1-0.5)
  # Likelihood is the product  of the probablities when samples are IID 
  likelihood = prod(probabilities)
  # Report the probability
  cat('Probability of observation when men and women are equally distributed:',likelihood)

## Probability of observation when men and women are equally distributed: 6.223015e-61

If men are over-represented in the population, say, by 80%, then:

  # Assign probablities to the above seqence assuming men represent 80% of
  # the general population
  probabilities = ifelse(random_gender == "F", 0.2, 1-0.2)
  # Likelihood is the product  of the probablities when samples are IID 
  likelihood = prod(probabilities)
  # Report the probability
  cat('Probability of observation when men are represented 80%:',likelihood)

## Probability of observation when men are represented 80%: 3.940201e-85

Thus we can easily generalize this as a function of \(\gamma\), and graph the likelihood as a function of any value of the proportion of females ranging from 0 to 1:

  # Make a function that returns likelihoods based on the above sequence
  # and the gamma parameter
  likelihoodMF = function(gamma, data = random_gender) {
    # Calculate the probabilities
    probabilities = ifelse(random_gender == "F", gamma, 1-gamma)
    # Calculate the product of probabilities, aka, likelihoods
    likelihood = prod(probabilities)
    # Return the likelihood
    return(likelihood)
  }
  # Make a vector of gammas
  gammas = seq(0.01, 0.99, by=0.01)
  # Compute likelihood through the function 
  likelihoods = sapply(gammas, likelihoodMF)
  # Plot gammas vs. likelihoods
  ggplot(mapping = aes(x = gammas, y = likelihoods)) + geom_line(se=FALSE)

The same idea can be used to estimate the proportion of people with diabetes, instead of males and females:

  # Select only people with diabetes and no diabetes
  HANES_DIAB <- filter(HANES, DX_DBTS == "DIAB" | DX_DBTS == "NO DIAB") %>% select(DX_DBTS)
  # Randomly sample 200 people and identify their diabetes status
  HANES_DIAB_sampled_200 <- sample_n(HANES_DIAB,200) 
  # Count the number of people with diabetes and without diabetes
  find_count <- HANES_DIAB_sampled_200 %>% group_by(DX_DBTS) %>% count()
  # Find the number of people with diabetes
  DIAB_count <- find_count$n[1]
  # Make a function that returns likelihoods based on the above sequence
  # and the gamma parameter
  likelihoodDB = function(gamma, data = HANES_DIAB_sampled_200) {
    # Calculate the probabilities
    probabilities = ifelse(HANES_DIAB_sampled_200 == "DIAB", gamma, 1-gamma)
    # Calculate the product of probabilities, aka, likelihoods
    likelihood = prod(probabilities)
    # Return the likelihood
    return(likelihood)
  }
  # Make a vector of gammas
  gammas = seq(0.01, 0.99, by=0.01)
  # Compute likelihood through the function 
  likelihoods = sapply(gammas, likelihoodDB)
  # Plot gammas vs. likelihoods
  ggplot(mapping = aes(x = gammas, y = likelihoods)) + geom_line(se=FALSE)

Likelihood is not a probability function.

Also, given this function it can be used to identify the parameters (proportions or means) of the underlying population that is likely to maximize the chance of observing our data.

This is the maximum likelihood estimate.

Thus, maximum likelihood estimate is the value of the parameters of a given model that maximizes the likelihood function for the data.

Mathematically,

\[ \gamma_{MLE} = \max_{\gamma}[L(\gamma \mid X)] \]

The following code in R evaluates maximum likelihood parameter for the male/female distribution:

  # Compute maximum likelihood parameter for male/female distribution
  mle.resultsMF <- optimize(function(gammas) {likelihoodMF(gammas,likelihoodMF)},
                        interval = c(0, 1), maximum = TRUE)
  mle.resultsMF

## $maximum
## [1] 0.5400031
## 
## $objective
## [1] 1.180991e-60

Similarly the following code in R evaluates maximum likelihood parameter for the diabetes distribution:

  # Compute maximum likelihood parameter for the diabetes distribution
  mle.resultsDB <- optimize(function(gammas) {likelihoodDB(gammas,likelihoodDB)},
                        interval = c(0, 1), maximum = TRUE)
  mle.resultsDB

## $maximum
## [1] 0.0750193
## 
## $objective
## [1] 7.280132e-24

In our example of males and females, the maximum happens at 0.54, suggesting there are 54% of females in the population.

This is unlike the diabetes observation where the maximum happens at 0.08 (which is 8%).

This is not far from the official estimate of about 9.3%.

Thus, maximum likelihood can be extremely useful to determine the parameters of the underlying distribution.

Determining maximum likelihood can become difficult as the complexity of the models increase. That is why we need optimization techniques and what are known as cost functions.

Cost Functions

Cost functions are functions that maps a set of events into a number that represents the cost of that event occurring. Mathematically it is defined as the negative log of likelihood.

\[ C(\gamma,X)=−\log[L(\gamma \mid X)] \] But why?

Lets watch this video for a good explanation:

There are other cost functions in areas such as motion planning and traveling salesman where different types of optimization techniques are used. For this tutorial we will use this cost function.

With this cost function as a negative log likelihood we can reformulate the cost of the given parameterization of the population and plot it. In the case of male/female observations here is the cost:

  # Compute the -log of the likelihood function
  MF.cost = function(gamma, data = random_gender) {
    return(-log(likelihoodMF(gamma, data)))
  }
  # Find the costs
  costs = sapply(gammas, MF.cost)
  # Plot the costs
  qplot(x = gammas, y = costs, geom = 'line', xlab = expression(gamma), ylab = "Male/Female Cost")

For the diabetes population, we have:

  # Compute the -log of the likelihood function
  DB.cost = function(gamma, data = HANES_DIAB_sampled_200) {
    return(-log(likelihoodDB(gamma, data)))
  }
  # Find the costs
  costs = sapply(gammas, DB.cost)
  # Plot the costs
  qplot(x = gammas, y = costs, geom = 'line', xlab = expression(gamma), ylab = "Diabetes Pop. Cost")

Note: The maximum of likelihood function is the minimum of the cost function because log likelihood is an increasing function of the likelihood

Therefore, negative log function will decrease as the likelihood function increases.

Now once we have this cost function we can study optimization techniques to find its minimum.

Optimization in One variable and Gradient Descent

Grid search is the simplest of the optimization technique.

In this approach we break the interval of \(\gamma\) and evaluate the cost at every break point.

We then pick the value of \(\gamma\) that gives the minimum cost.

It is called grid search because it can be visualized by placing a grid over the cost function and evaluating that at each point of the grid:

This can be implemented for the diabetes example as follows:

  # Make a grid of potential gammas
  potential.gammas = seq(.001, .999, by=.001)
  # Calculate the cost of the DB cost function at each gamma
  calc.costs = sapply(potential.gammas, DB.cost)
  # Find the minimum costs
  idx.gamma = which(calc.costs == min(calc.costs))
  # Output the results
  writeLines(paste("Grid search value:", potential.gammas[idx.gamma], sep=' '))

## Grid search value: 0.075

We see the result is extremely accurate although the precision of this value is limited by how fine our grid is.

Why is this approach bad?

Although this is a good method for one parameter and low precision requirements, the computational complexity is enormous when we have multi-parameters.

Here we had \(1000\) values for a precision upto two degrees.

If we had one more parameter, we need to construct \(1000 \times 1000\) grid and for 3 parameters, the grid becomes intractable, a whopping \(1000 \times 1000 \times 1000\) computations, that too for a small precision.

We can certainly do smarter by making use of the information from the function and this is done by gradient descent algorithm.

Gradient descent and variations

Gradient descent is a method where we slide along the function in the direction that decreases the cost function, and keep repeating until there is only the tiniest decrease in cost with each step.

Mathematically, it is described by:

\[ \gamma_{t+1} = \gamma_t - \theta \text{ } \nabla C(\gamma_t, X) \]

where \(\gamma_{0}\) is the initial value of the parameter and \(\theta\) is a step size parameter that determines how fast (or slow) we descend.

It is also called the learning rate. We should also have a threshold \(\tau\) that represents how precise we want the cost function to be estimated.

In single variable calculus, \(\nabla C\) is given by the derivative,

\[ \nabla C = \frac{dC}{d\gamma} \]

and hence in our case of diabetic population, it is extremely easy to compute this analytically - we can apply the rules of calculus to calculate the derivative:

\[ \nabla C(\gamma, X) = \frac{200-\text{DIAB_count}}{1-\gamma} - \frac{\text{DIAB_count}}{\gamma} \]

where DIAB_count was determined to be 15.

We can then run the gradient descent. Here we fix the initial parameters: starting value \(\gamma_{0} = 0.95\), \(\theta=0.0001\) and the threshold \(τ=0.000001\).

  # Initial conditions/parameters
  gamma = 0.95
  theta = 0.0001
  tau = 0.000001
  
  # Compute the gradient
  DB.gradient = function(gamma) {
    return(((200-DIAB_count)/(1-gamma)) - (DIAB_count/gamma))
  }
  
  # Compute the cost
  prior.cost = DB.cost(gamma)

  # plot.gammas or plot.costs are for graphing purposes
  plot.gammas = gamma
  plot.costs = prior.cost
  
  running = TRUE
  while(running) {
    # Compute gradient
    grad = DB.gradient(gamma)
    # Compute the next gamma
    gamma = gamma - theta * grad
    # Compute the new costs
    new.cost = DB.cost(gamma)
    # For plotting purposes
    plot.gammas = c(plot.gammas, gamma)
    plot.costs = c(plot.costs, new.cost)
    # See if the sequence converges upto a a certain precision
    if(abs(new.cost - prior.cost) < tau) {
      # If so stop the iteration
      running = FALSE
   } else {
      # Else update the cost function to new cost
      prior.cost = new.cost
    }
  }
  # This is our solution
  writeLines(paste("Gradient descent value:",gamma,sep=' '))

## Gradient descent value: 0.0750250685742153

We can now see the convergence:

  # Make a vector of gammas
  all.gammas = seq(.01,.99,by=.01)
  # Find the costs for all gammas
  all.costs = sapply(all.gammas,DB.cost)
  # Make a data frame from the gammas and the costs computed above
  plot.df = data.frame(gammas=plot.gammas,cost=plot.costs)
  # Plot the cost function and add only the converging points
  qplot(x=all.gammas, y=all.costs, geom='line', xlab=expression(gamma), ylab="Cost") +
    geom_point(data=plot.df, aes(x=gammas,y=cost),color='blue')

Several times we might not have an analytical form for the derivative. Or we out of luck then? We can still apply numerical approximation for the derivative:

\[ \nabla C(\gamma, X) \approx \frac{C(\gamma + \epsilon) - C(\gamma - \epsilon)}{2\epsilon} \]

Here is the code with the numerical approximation for the derivative:

   # Initial conditions/parameters
  gamma = 0.95
  theta = 0.0001
  tau = 0.000001
  
  # Compute the gradient using approximation
  DB.gradient = function(gamma, epsilon = .001) {
    return((DB.cost(gamma+epsilon) - DB.cost(gamma-epsilon)) / (2*epsilon))
  }
  
  # Compute the cost
  prior.cost = DB.cost(gamma)

  # plot.gammas or plot.costs are for graphing purposes
  plot.gammas = gamma
  plot.costs = prior.cost
  
  running = TRUE
  while(running) {
    # Compute gradient
    grad = DB.gradient(gamma)
    # Compute the next gamma
    gamma = gamma - theta * grad
    # Compute the new costs
    new.cost = DB.cost(gamma)
    # For plotting purposes
    plot.gammas = c(plot.gammas, gamma)
    plot.costs = c(plot.costs, new.cost)
    # See if the sequence converges upto a a certain precision
    if(abs(new.cost - prior.cost) < tau) {
      # If so stop the iteration
      running = FALSE
   } else {
      # Else update the cost function to new cost
      prior.cost = new.cost
    }
  }
  # This is our solution
  writeLines(paste("Gradient descent value:", gamma, sep=' '))

## Gradient descent value: 0.0750291264680477

We see the answer is almost the same! Wonderful.

Multi-variable Optimization

So far we saw single variable optimization.

But in general this is not adequate and we may need to optimize on several variables. Let us consider an example.

Glucose test and hemoglobin test (A1C) are two common tests to identify diabetes in a population.

If a person has (fasting) glucose levels greater than or equal to 126 mg/dL or A1C levels greater than or equal to 6.5% then they are diagnosed diabetic.

Suppose we want to estimate the probability of someone being diabetic given these two variables. We can work backwards from our observation to determine this probability.

We define \(P(Diab \mid A1C \geq 6.5 \%) = \theta_{A1C}\) and \(P(Diab \mid GLUCOSE \geq 126 \ mg/dL) = \theta_{GLU}\).

From the laws of probability we have:

\[ P(Diab) = \begin{cases} p & if \ A1C < 6.5\% \ \& \ GLUCOSE < 126 \ mg/dL \\ \theta_{A1C} & if\ A1C \geq 6.5\% \ \& \ GLUCOSE < 126 \ mg/dL \\ \theta_{GLU} & if \ A1C < 6.5\% \ \& \ GLUCOSE \geq 126 \ mg/dL \\ \theta_{A1C}+\theta_{GLU}-\theta_{A1C}*\theta_{GLU} & if \ A1C \geq 6.5\% \ \& \ GLUCOSE \geq 126 \ mg/dL \end{cases} \]

Thus, we have these two hidden variables \(\theta_{A1C}\) and \(\theta_{GLU}\) that determines the probability of someone being diabetic.

Also we have the variable \(p\) that defines the population that is unexplained by the hidden variables.

This value can be directly computed from the observations.

However for the hidden variables, we have to figure out the values from the observations using optimization.

To do this we will extract the observations from the data table:

  # Make logicals based on diagnostic levels of A1C, GLUCOSE and diabetes status
  DB.obs <- select(HANES,A1C, GLUCOSE, DX_DBTS) %>%
            mutate(A1C_L = as.logical(ifelse(A1C >= 6.5,1,0)),
                   GLUCOSE_L= as.logical(ifelse(GLUCOSE >= 126,1,0)),
                   DX_DBTS_L = as.logical(ifelse(DX_DBTS == "DIAB",1,0))) %>%
            select(A1C_L,GLUCOSE_L,DX_DBTS_L)

  # Print the first few lines to see if the data looks okay
  print(head(DB.obs))

##   A1C_L GLUCOSE_L DX_DBTS_L
## 1 FALSE     FALSE     FALSE
## 2 FALSE     FALSE     FALSE
## 3 FALSE     FALSE     FALSE
## 4 FALSE     FALSE     FALSE
## 5 FALSE     FALSE     FALSE
## 6 FALSE     FALSE     FALSE

We can list the data matrix corresponding to the observations and determine the variable \(p\) from it.

  # Print the data matrix corresponding to the observations
  print(with(DB.obs,table(A1C_L,GLUCOSE_L,DX_DBTS_L)))

   # Find the number of people with diabetes
  DBTS_T <- table(DB.obs$DX_DBTS_L == T )
  # Find the number of people with diabetes where
  # A1C_L <= 6.5 and GLUCOSE_L <= 126
  # It is this population that is unexplained by both variables
  A1C_GLU_unexplained <- with(DB.obs,table(A1C_L == F & GLUCOSE_L == F & DX_DBTS_L == T))
  # Find the proportion of people with diabetes that are unexplained, the variable $p$
  DBTS_U <- c(proportion = A1C_GLU_unexplained[2]/DBTS_T[2])
  # Name the column and print the proportion
  names(DBTS_U) <- c("DBTS_U")
  DBTS_U

## , , DX_DBTS_L = FALSE
## 
##        GLUCOSE_L
## A1C_L   FALSE TRUE
##   FALSE   971   10
##   TRUE     13   14
## 
## , , DX_DBTS_L = TRUE
## 
##        GLUCOSE_L
## A1C_L   FALSE TRUE
##   FALSE    21    6
##   TRUE     24   53
##    DBTS_U 
## 0.2019231

We then define our likelihood function. This function takes the parameters in a single argument and returns the cost vector as negative log likelihood function.

  # Define the cost function
  DB.MG.cost = function(params, data=DB.obs) {
    # Receive the parameters in theta variables
    theta.A1C = params[1]
    theta.GLU = params[2]
    
    # Define the probability model as given by the equation above

    predicted.diab.A1C = ifelse(data$A1C_L, theta.A1C, 0)
    predicted.diab.GLU = ifelse(data$GLUCOSE_L, theta.GLU, 0)
    predicted.diab = predicted.diab.A1C + predicted.diab.GLU - 
                      (predicted.diab.A1C * predicted.diab.GLU) 
    # Define the likelihood function
    likelihood_explained = ifelse(data$DX_DBTS_L, predicted.diab, 1-predicted.diab)
    
    likelihood  = ifelse(likelihood_explained == 0, DBTS_U, likelihood_explained)
    
    # Find the cost function as the sum of negative log likelihoods
    cost.DB = -sum(log(likelihood)) 
    # Return the cost
    return(cost.DB)
  }

We can now see what the cost function looks like by varying the conditional probabilities. The lower the cost is, more red will be the point.

  # Define the grid by varing the conditional probabilities between 0.01 to 0.99 by
  # increments of 0.01
  DB.grid = data.frame(A1C = rep(seq(.01,.99,by=.01),times=99),
                        GLU = rep(seq(.01,.99,by=.01),each=99))

  # Find the cost by evaluating the above cost function using the grid just defined
  DB.grid$cost.DB = mapply(function(w,h){DB.MG.cost(c(w,h))}, 
                               DB.grid$A1C, DB.grid$GLU)
  
  # Create a plot the "cost grid" and fill in the values
  DB.plot = ggplot(DB.grid, aes(x=A1C, y=GLU)) +
            geom_raster(aes(fill=cost.DB)) + 
            scale_fill_gradient(low='red', high='yellow', trans = 'log') +
            stat_contour(aes(z=cost.DB),binwidth=5, color = 'black', linetype = 'dashed') +
            xlab('P(Diabetes | A1C)') + ylab('P(Diabetes | Glucose)')
  # Make a plot
  print(DB.plot)

We see that the cost is minimized somewhere around 0.6 and 0.5.

We hope our optimization techniques will give us the values of the hidden variables around this value.

The optimization for multiple variables \(\gamma_{0}, \gamma_{1}, \dots \gamma_{N}\), using gradient descent is just like the optimization for a single variable.

Here the gradient function is given by:

\[ \nabla C(\gamma, X) = \bigg[ \frac{\partial C}{\partial \gamma_0}, \frac{\partial C}{\partial \gamma_1}, ..., \frac{\partial C}{\partial \gamma_N} \bigg] \]

in particular in our case,

\[ \nabla C(\gamma, X) = \bigg[ \frac{\partial C}{\partial \theta_{A1C}}, \frac{\partial C}{\partial \theta_{GLU}} \bigg] \]

which can be approximated like we did for the single variable, however we have to do it for each variable here.

  # Find gradient through approximation
  DB.gradient = function(params, data=DB.obs, epsilon = .001) {

    # Find A1C gradient by perturbing in both directions
    A1C.plus  = DB.MG.cost(params + c(epsilon,0), DB.obs)
    A1C.minus = DB.MG.cost(params + c(-epsilon,0), DB.obs)
    df.dA1C = (A1C.plus - A1C.minus) / (2*epsilon)

    # Find GLU gradient by perturbing in both directions
    GLU.plus  = DB.MG.cost(params + c(0,epsilon), DB.obs)
    GLU.minus = DB.MG.cost(params + c(0, -epsilon), DB.obs)
    df.dGLU = (GLU.plus - GLU.minus) / (2*epsilon)
  
    # Return the gradient
    return(c(df.dA1C, df.dGLU))
  }

This helps us to see how the gradient descent works in two dimensions.

The gradient points towards the direction of lowering cost function. If we are too much away from the bottom we need to take larger steps, otherwise smaller steps are sufficient for convergence.

  # Demostrate the gradient descent in two dimensions using arrows
  # Make an arrow using given gradient
  make.gradient.arrow = function(params) {
    grad = DB.gradient(params)
    end.x = params[1] - grad[1]*.001
    end.y = params[2] - grad[2]*.001
    return(geom_segment(x=params[1],y=params[2],xend=end.x,yend=end.y, arrow = arrow(length=unit(.02, 'npc')),color='blue'))
  }

  # Make the arrow + cost grid plot
  DB.grad.plot = DB.plot +
        make.gradient.arrow(c(.2,.2)) + make.gradient.arrow(c(.8,.8)) + 
        make.gradient.arrow(c(.2,.8)) + make.gradient.arrow(c(.8,.2))

  # Plot the result
  print(DB.grad.plot)

We are now ready to see gradient descent in action.

It works exactly as in one dimensional case, except that we need to work with vectors.

We need to initalize the conditional probablities \(\theta_{A1C}\) and \(\theta_{GLU}\).

We will assume they have a great predictive power of about 90%.

Lets fix the convergence threshold, \(\tau\) as 0.01 and the learning parameter, now denoted by \(\alpha\) as 0.0001.

  # Initialize the parameters
  theta = c(.9, .9)
  alpha = 0.0001
  tau = .01

  # Compute the prior run's cost
  prior.cost = DB.MG.cost(theta)
  running = TRUE

  # Plots for showing the descent
  plt.xs = theta[1]
  plt.ys = theta[2]

  while(running) {
    # Compute the gradient
    grad = DB.gradient(theta)
    # Update the probabilities
    theta = theta - alpha * grad
    # For plotting purposes
    plt.xs = c(plt.xs, theta[1])
    plt.ys = c(plt.ys, theta[2])
    # Compute the new cost
    new.cost = DB.MG.cost(theta)
    # Check if the cost sequence converges upto the given threshold
    if(abs(new.cost - prior.cost) < tau) {
      # If the cost gets converged, stop the iteration
      running = FALSE
    } else {
      # Else update the cost
      prior.cost = new.cost
    }
  }

  # Code to plot the descent path
  grad.plot = DB.plot
  for (i in 2:length(plt.xs)) {
    grad.plot = grad.plot + geom_segment(x = plt.xs[i-1], xend = plt.xs[i],
                                         y = plt.ys[i-1], yend = plt.ys[i],
                                         color = 'blue')
  }
  # Make the plot
  grad.plot = grad.plot + annotate('text', x = plt.xs[1], y = plt.ys[1], color = 'blue', label = 'O') +
                                    annotate('text', x = theta[1], y = theta[2], color = 'blue', label = 'X')

  # Plot the results
  print(grad.plot)

We can now see the values of the hidden variables \(\theta_{A1C}\) and \(\theta_{GLU}\).

  # Print the values of the conditional probabilites P(diab|A1C) and P(diab|GLU)
  writeLines(paste("Theta_A1C: ",theta[1],"\nTheta_GLU: ",theta[2],sep=''))

## Theta_A1C: 0.620294470407629
## Theta_GLU: 0.48994822007435

Thus, the probability of observing someone with diabetes given A1C value is 62% and the probablity given GLUCOSE value is 49% and given both values the probablity is 81%.

From this analysis we can conclude it is best to use both variables, than just either A1C or GLUCOSE.

Implementation

Several of these optimization techniques are already implemented in R. The package/function is called optim(). All this function needs are two inputs - the initial guess for the parameters and the cost function:

  # Use the optim package to evaluate the parameters
  o = optim(c(.95,.95), DB.MG.cost)
  print(o)

## $par
## [1] 0.6546962 0.3833888
## 
## $value
## [1] 102.5192
## 
## $counts
## function gradient 
##       67       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL

Note that the result varies but kinda compares to our results we obtained through gradient descent (62% vs 65% for A1C and 48% vs 38% for GLUCOSE).

The difference (especially for the glucose variable) is most likely due to the existence of multiple minimas (we see “red” regions throughout) and the default method option() uses - the Nelder-Mead method although is computationally more efficient, can get struck in a minimum.

The gradient descent too will suffer from this problem depending on the initial conditions, as we will see shortly.

There are other algorithms for optimization which can be easily used with the optim() function.

The BFGS algorithm is one such algorithm.

To use this, we just need to specify method='BGFS' argument of the package.

  # Use the optim package with BGFS method to evaluate the parameters
  o = optim(c(.95,.95), DB.MG.cost, method = 'BFGS')
  print(o)

## $par
## [1] 0.6547071 0.3831999
## 
## $value
## [1] 102.5192
## 
## $counts
## function gradient 
##       31       10 
## 
## $convergence
## [1] 0
## 
## $message
## NULL

Further, the optim() function allows us to specify the gradiant function if we know them analytically. This can be done by specifying the argument gr. In our case,

  # Use the optim package with BGFS method to evaluate the parameters
  # by specifying the gradient function
  o = optim(c(.95,.95), DB.MG.cost, method = 'BFGS', gr = DB.gradient)
  print(o)

## $par
## [1] 0.6547071 0.3831999
## 
## $value
## [1] 102.5192
## 
## $counts
## function gradient 
##       31       10 
## 
## $convergence
## [1] 0
## 
## $message
## NULL

And finally, the L-BFSG-B algorithm will allow us to perform constrained optimizaion where we can specify the lower and upper bounds on the parameters. Here since they are probabilities they vary between 0 and 1. And so we can specify:

  # Use the optim package with L-BFGS-B method and specifying the upper and lower
  # limits of the parameters
  o = optim(c(.95,.95), DB.MG.cost, method = 'L-BFGS-B', 
            lower = c(.00001, .00001), upper = c(.99999, .99999))
  print(o)

## $par
## [1] 0.6547060 0.3832008
## 
## $value
## [1] 102.5192
## 
## $counts
## function gradient 
##       10       10 
## 
## $convergence
## [1] 0
## 
## $message
## [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"

Big Data Optimization (Stochastic Gradient Descent)

Note that in the previous algorithm we described, every step of the descent required a call to compute the cost.

In the case of HANES data, it was managable since we had 1112 observations.

What if there are a several millions of observations? For instance MIMIC3 has tens of millions of observations.

The total number of calls would be formidable. In such cases, we can make use of stochastic gradient descent.

Stochastic gradient descent works on the principle that when the observations are independent the cost function can be decomposed

  • the cost of errors on one observation is independent of any other observation and

randomly choosing some observations we can get a good estimate of the cost function.

Mathematically, \[ C(\gamma, X) \approx \sum_{i=1}^{N}C(\gamma, x_i) \] where \(x_i\)’s are individual observations chosen randomly.

Thus, we can perform stochastic gradient descent choosing only 100 observations at each time step to estimate the cost in our demostration of finding the probability of person being diabetic given A1C and GLUCOSE variables.

  # Initialize the parameters  
  theta = c(.9, .9)
  alpha_init= .0001
  tau = .01

  # Choose the number of observations we want to use
  n.obs = 100 
  
  # Define prior cost
  prior.cost = DB.MG.cost(theta, DB.obs)
  running = TRUE
  nth.iter = 0

  # Plots for showing the descent
  plt.xs = theta[1]
  plt.ys = theta[2]
  plt.ns = NULL

  while(running) {
    # Iterate until the total cost converges
    nth.iter = nth.iter + 1
    alpha = alpha_init / nth.iter
  
    # Choose random sample from the data set to estimate the cost
    obs.order = sample(1:n.obs, n.obs) 
    tot.cost = 0
  
    # For those observations estimate the total cost
    for(d.i in obs.order) {
      # Find the gradient and theta
      grad = DB.gradient(theta, DB.obs[d.i,])
      theta = theta-alpha*grad
      # For plotting purposes, append the thetas
      plt.xs = c(plt.xs, theta[1])
      plt.ys = c(plt.ys, theta[2])
      plt.ns = c(plt.ns, nth.iter)
      # Find the total cost for only those the random sample of observations
      tot.cost = tot.cost + DB.MG.cost(theta, DB.obs[d.i,])
    }
    # If the sequence converges within our threshold, stop else continue
    if(abs(tot.cost - prior.cost) < tau) {
      running = FALSE
    } else {
      prior.cost = tot.cost
    }
  }
  # Print the final theta values
  writeLines(paste("Theta_A1C: ",theta[1],"\nTheta_GLU: ",theta[2],sep=''))

## Theta_A1C: 0.632888573236608
## Theta_GLU: 0.44020137329145

We see the values are really close to what we estimated by multi-variable gradient descent. We can infact plot the convergence of the iterations by giving different colors - first iteration purple, second yellow etc.

In fact, if we zoon in, we can see how the convergence behaves:

Thus, we only need three iterations for convergence by choosing only 100 observations. Thus, stochastic gradient descent may save us substantially if we work with big data.

Using stochastic gradient with parallel frameworks

In our example, it is established that A1C values greater than or equal to 6.5 and glucose levels greater than or equal to 126 are indicators of diabetes.

In case if we don’t know this information, one can iterate through pairs of numbers (A1C value, GLUCOSE value) to estimate the pair of values that gives maximum probability of person having diabetes.

Note: these computations are independent on a number pair-level

Thus, one can parallelize such computations through frameworks such as Map-Reduce/High performance computing and

stochastic gradient can be used to detrmine the minimal cost for each such pair

Thus we can eventually chooing the cost (and the pair) that is minimum of all minimums.

Such analysis will give us the thresholds for classifying diabetic populations.

This type of analysis can also be extended for other variables.

Local Minima and Convex Functions

Consider this cost function:

\[ C(\theta) = \theta * (\theta + 1) * (\theta + 1.5) * (\theta * 2) \] Let us compute the minimum for this function through gradient descent.

  # Define the function
  arbitrary.cost = function(theta) {
    return(theta*(theta+1)*(theta+1.5)*(theta-2))
  }
  
  # Find its gradient
  arbitrary.gradient = function(theta, epsilon = 0.001) {
    return( (arbitrary.cost(theta+epsilon) - arbitrary.cost(theta-epsilon)) / (2*epsilon) )
  }

  # Initalize parameters
  theta = -2.5
  gamma = .0001
  tau = .000001

  prior.cost = arbitrary.cost(theta)

  running = TRUE
  
  # Run the descent algorithm
  while(running) {
    grad = arbitrary.gradient(theta)
    theta = theta - gamma * grad
    new.cost = arbitrary.cost(theta)
  
    if(abs(new.cost - prior.cost)< tau) {
      running = FALSE
    } else {
      prior.cost = new.cost
    }
  }
  # Write the output
  writeLines(paste("Gradient descent value:",theta,sep=' '))

## Gradient descent value: -1.29429022807873

Great! Are we done? Let’s look at the function:

  # Plot function vs costs
  thetas = seq(-3,3,by=.01)
  costs = sapply(thetas,arbitrary.cost)
  qplot(thetas,costs,geom='line',xlab=expression(theta),ylab='Cost') + coord_cartesian(ylim=c(-8,10))

We see the minimum is actually 1.3, however we got -1.29 as the answer. Why is that? Because there are two types of minimum.

Global minimum: a point \(x^{*}\) is a global minimum if it is the lowest value of the function across all allowable values of the parameter.

Local minimum: a point \(x^{*}\) is a local minimum if it is the lowest value of the function within some range centered on \(x^{*}\).

Here 1.29 is the local minimum and 1.3 is the global minimum and gradient descent is highly sensitive to initial conditions.

That is why it is important to try several initial conditions.

Gradient descent doesn’t give any claims to giving global optimal results.

The result could equally be locally optimal or sub-optimal.

Therefore are we out of luck? Not really!

There are some class of functions where there can be no different types of minima. In other words, all local minimim are global minimums.

These are called convex functions and there are lots of methods to do optimization for this nice class of functions.

This sub-field of optmization is called convex optimization.

Loosely speaking, a function is convex if for every line segment drawn between two points on the function, that line segment passes above the graph.

A function that is not convex is called a non-convex function.

For example, the above function is non-convex, the reason why it has multiple minimums.

Here is a rigorous proof of why convex functions have only global minimums. Mathematically, convex functions are defined by:

\[ f(λy + (1 − λ)x) ≤ λf(y) + (1 − λ)f(x) \]

Theorem: For a convex function defined on a convex set, any local minimum is also a global minimum.

Proof: Let \(f:C \rightarrow \mathbb{R}\), be a convex function, on a convex set \(C\). Suppose that \(x\) is a local minimum of \(f\) but not a global one. This means \(\exists y \in C\) such that \(f(y) < f(x)\). By definition of convexity for any \(\lambda \in [0,1]\), we have,

\[ f(x+\lambda (y-x)) \leq f(x) + \lambda [f(y) - f(x)] \\ < f(x) \] since \(f(y) < f(x)\). Hence, for any small ball \(B(x;\epsilon)\) around \(x\), there exists \(z\) such that \(f(z) < f(x)\). Therefore, \(x\) is not a local minimum - a contradiction. QED.

References

  1. An excellent tutorial here: Introduction to Optimization