Markov and Hidden Markov Models

Recorded Stream

A multinomial model of DNA sequence evolution

The simplest model of DNA sequence evolution.
Assumes that the sequence has been produced by a random process that randomly chose any of the four nucleotides at each position in the sequence.
The probability of choosing any one of the four nucleotides depends on a predetermined probability distribution. That is, the four nucleotides are chosen with probabilities pA, pC, pG, and pT respectively. This is known as the multinomial sequence model.

Generating a DNA sequence using multinomial model

We will generate a DNA sequence with A, C, T and G appearing with probabilities pA=0.2, pC=0.3, pT=0.2 and pG=0.3

 # Define the alphabet of nucleotides
 nucleotides <- c("A", "C", "G", "T")
 # Set the values of the probabilities
 probabilities1 <- c(0.2, 0.3, 0.3, 0.2)
 # Set the length of the sequence
 seqlength <- 30
 # Generate a sequence
 sample(nucleotides, seqlength, rep=TRUE, prob=probabilities1)

##  [1] "A" "G" "T" "A" "G" "A" "G" "G" "G" "C" "T" "G" "G" "C" "T" "G" "G"
## [18] "T" "G" "A" "A" "A" "C" "C" "G" "G" "C" "C" "T" "G"

The multinomial model is a good model in several cases but also has limitations.

It assumes the frequency of the nucleotides is the same for all streches of DNA, i.e., pA, pC, PT and pG is the same throughout.
The probability of appearence of a nuclotide is independent of what nucleotide appears before. This is mostly true but there are several regions of the genome, such as CpG islands, where the nucleotides appear in a dependent manner.

A Markov sequence model is a better representation.

The transition matrix for a Markov model

We can construct transition matrix for such Markov sequence model.

 # Define the alphabet of nucleotides
 nucleotides <- c("A", "C", "G", "T") 
 # Set the values of the probabilities, where the previous nucleotide was "A"
 afterAprobs <- c(0.2, 0.3, 0.3, 0.2)        
 # Set the values of the probabilities, where the previous nucleotide was "C"
 afterCprobs <- c(0.1, 0.41, 0.39, 0.1)      
 # Set the values of the probabilities, where the previous nucleotide was "G"
 afterGprobs <- c(0.25, 0.25, 0.25, 0.25)     
 # Set the values of the probabilities, where the previous nucleotide was "T"
 afterTprobs <- c(0.5, 0.17, 0.17, 0.17)      
 # Create a 4 x 4 matrix
 mytransitionmatrix <- matrix(c(afterAprobs, afterCprobs, afterGprobs, afterTprobs), 4, 4, byrow = TRUE) 
 rownames(mytransitionmatrix) <- nucleotides
 colnames(mytransitionmatrix) <- nucleotides
 # Print out the transition matrix
 mytransitionmatrix

##      A    C    G    T
## A 0.20 0.30 0.30 0.20
## C 0.10 0.41 0.39 0.10
## G 0.25 0.25 0.25 0.25
## T 0.50 0.17 0.17 0.17

Rows 1, 2, 3 and 4 of the transition matrix give the probabilities pA, pC, pG, and pT for the cases where the previous nucleotide was “A”, “C”, “G”, or “T”, respectively. That is, the element in a particular row and column of the transition matrix (eg. the row for “A”, column for “C”) holds the probability (pAC) of choosing a particular nucleotide (“C”) at the current position in the sequence, given that was a particular nucleotide (“A”) at the previous position in the sequence.

Generating a DNA sequence using a Markov model

Just like we generated a DNA sequence using a particular multinomial model, we can generate a DNA sequence using a particular Markov model. When we are generating a DNA sequence using a Markov model, the nucleotide chosen at each position at the sequence depends on the nucleotide chosen at the previous position. As there is no previous nucleotide at the first position in the new sequence, we need to define the probabilities of choosing “A”, “C”, “G” or “T” for the first position.

  # Function to generate the sequences using the Markov model
  generatemarkovseq <- function(transitionmatrix, initialprobs, seqlength) {
    # Define the alphabet of nucleotides
    nucleotides <- c("A", "C", "G", "T")
    # Create a vector for storing the new sequence
    mysequence <- character()
    # Choose the nucleotide for the first position in the sequence:
    firstnucleotide <- sample(nucleotides, 1, rep=TRUE, prob=initialprobs)
    # Store the nucleotide for the first position of the sequence
    mysequence[1] <- firstnucleotide
    for (i in 2:seqlength) {
      # Get the previous nucleotide in the new sequence
      prevnucleotide <- mysequence[i-1]
      # Get the probabilities of the current nucleotide, given previous nucleotide "prevnucleotide":
      probabilities <- transitionmatrix[prevnucleotide,]
      # Choose the nucleotide at the current position of the sequence:
      nucleotide <- sample(nucleotides, 1, rep=TRUE, prob=probabilities)
      # Store the nucleotide for the current position of the sequence
      mysequence[i] <- nucleotide
    }
    return(mysequence)
  }

We assume the initial probablities are equally likely for the four nucleotides:

  # Define initial probablities
  myinitialprobs <- c(0.25, 0.25, 0.25, 0.25)
  # Generate the sequence
  generatemarkovseq(mytransitionmatrix, myinitialprobs, 30)

##  [1] "C" "C" "G" "C" "G" "A" "T" "G" "T" "C" "G" "G" "A" "T" "T" "C" "C"
## [18] "C" "G" "A" "C" "G" "G" "C" "C" "C" "C" "G" "A" "G"

As we can see, there are many “A”s after “T”s in the sequence. This is because pTA has a high value (0.5) in the Markov transition matrix mytransitionmatrix. Similarly, there are few “A”s or “T”s after “C”s, which is because pCA and pCT have low values (0.1) in this transition matrix.

A Hidden Markov Model of DNA sequence evolution

In a Markov model, the nucleotide at a particular position in a sequence depends on the nucleotide found at the previous position. In contrast, in a Hidden Markov model (HMM), the nucleotide found at a particular position in a sequence depends on the state at the previous nucleotide position in the sequence. The state at a sequence position is a property of that position of the sequence, for example, a particular HMM may model the positions along a sequence as belonging to either one of two states, GC-rich or AT-rich. A more complex HMM may model the positions along a sequence as belonging to many different possible states, such as “promoter”, “exon”, “intron”, and “intergenic DNA”.

We will use GC-rich and AT-rich as hidden states and create a transition and emission matrices. First, we make the transition matrix that shows the probabilities when the states change.

  # Define the names of the hidden states
  states <- c("AT-rich", "GC-rich")
  # Set the probabilities of switching states, where the previous state was "AT-rich"
  ATrichprobs <- c(0.7, 0.3)
  # Set the probabilities of switching states, where the previous state was "GC-rich"
  GCrichprobs <- c(0.1, 0.9)
  # Create a 2 x 2 matrix
  thetransitionmatrix <- matrix(c(ATrichprobs, GCrichprobs), 2, 2, byrow = TRUE)
  rownames(thetransitionmatrix) <- states
  colnames(thetransitionmatrix) <- states
  # Print out the transition matrix
  thetransitionmatrix

##         AT-rich GC-rich
## AT-rich     0.7     0.3
## GC-rich     0.1     0.9

The second important matrix is the HMM emission matrix, which holds the probabilities of choosing the four nucleotides “A”, “C”, “G”, and “T”, in each of the states.

In a HMM with an AT-rich state and a GC-rich state, the emission matrix will hold the probabilities of choosing each of the four nucleotides “A”, “C”, “G” and “T” in the AT-rich state (for example, pA=0.39, pC=0.1, pG=0.1, and pT=0.41 for the AT-rich state), and the probabilities of choosing “A”, “C”, “G”, and “T” in the GC-rich state (for example, pA=0.1, pC=0.41, pG=0.39, and pT=0.1 for the GC-rich state).

  # Define the alphabet of nucleotides
  nucleotides <- c("A", "C", "G", "T")
  # Set the values of the probabilities, for the AT-rich state
  ATrichstateprobs <- c(0.39, 0.1, 0.1, 0.41)
  # Set the values of the probabilities, for the GC-rich state
  GCrichstateprobs <- c(0.1, 0.41, 0.39, 0.1)
  # Create a 2 x 4 matrix
  theemissionmatrix <- matrix(c(ATrichstateprobs, GCrichstateprobs), 2, 4, byrow = TRUE)
  rownames(theemissionmatrix) <- states
  colnames(theemissionmatrix) <- nucleotides
  # Print out the emission matrix
  theemissionmatrix

##            A    C    G    T
## AT-rich 0.39 0.10 0.10 0.41
## GC-rich 0.10 0.41 0.39 0.10

Once the transition and emission matrices are given we can generate a DNA sequence using HMM.

# Function to generate a DNA sequence, given a HMM and the length of the sequence to be generated.
  generatehmmseq <- function(transitionmatrix, emissionmatrix, initialprobs, seqlength)
  {
     nucleotides     <- c("A", "C", "G", "T")   # Define the alphabet of nucleotides
     states          <- c("AT-rich", "GC-rich") # Define the names of the states
     mysequence      <- character()             # Create a vector for storing the new sequence
     mystates        <- character()             # Create a vector for storing the state that each position in the new sequence
                                                # was generated by
     # Choose the state for the first position in the sequence:
     firststate      <- sample(states, 1, rep=TRUE, prob=initialprobs)
     # Get the probabilities of the current nucleotide, given that we are in the state "firststate":
     probabilities   <- emissionmatrix[firststate,]
     # Choose the nucleotide for the first position in the sequence:
     firstnucleotide <- sample(nucleotides, 1, rep=TRUE, prob=probabilities)
     mysequence[1]   <- firstnucleotide         # Store the nucleotide for the first position of the sequence
     mystates[1]     <- firststate              # Store the state that the first position in the sequence was generated by

     for (i in 2:seqlength)
     {
        prevstate    <- mystates[i-1]           # Get the state that the previous nucleotide in the sequence was generated by
        # Get the probabilities of the current state, given that the previous nucleotide was generated by state "prevstate"
        stateprobs   <- transitionmatrix[prevstate,]
        # Choose the state for the ith position in the sequence:
        state        <- sample(states, 1, rep=TRUE, prob=stateprobs)
        # Get the probabilities of the current nucleotide, given that we are in the state "state":
        probabilities <- emissionmatrix[state,]
        # Choose the nucleotide for the ith position in the sequence:
        nucleotide   <- sample(nucleotides, 1, rep=TRUE, prob=probabilities)
        mysequence[i] <- nucleotide             # Store the nucleotide for the current position of the sequence
        mystates[i]  <- state                   # Store the state that the current position in the sequence was generated by
     }

     for (i in 1:length(mysequence))
     {
        nucleotide   <- mysequence[i]
        state        <- mystates[i]
        print(paste("Position", i, ", State", state, ", Nucleotide = ", nucleotide))
     }
  }

When we are generating a DNA sequence using a HMM, the nucleotide is chosen at each position depending on the state at the previous position in the sequence. As there is no previous nucleotide at the first position in the sequence, the function generatehmmseq() also requires the probabilities of the choosing each of the states at the first position.

  # Define the initial probabilities
  theinitialprobs <- c(0.5, 0.5)
  # Generate a HMM DNA sequence
  generatehmmseq(thetransitionmatrix, theemissionmatrix, theinitialprobs, 30)

## [1] "Position 1 , State AT-rich , Nucleotide =  A"
## [1] "Position 2 , State AT-rich , Nucleotide =  T"
## [1] "Position 3 , State GC-rich , Nucleotide =  C"
## [1] "Position 4 , State GC-rich , Nucleotide =  C"
## [1] "Position 5 , State GC-rich , Nucleotide =  G"
## [1] "Position 6 , State GC-rich , Nucleotide =  G"
## [1] "Position 7 , State GC-rich , Nucleotide =  C"
## [1] "Position 8 , State GC-rich , Nucleotide =  C"
## [1] "Position 9 , State AT-rich , Nucleotide =  T"
## [1] "Position 10 , State AT-rich , Nucleotide =  A"
## [1] "Position 11 , State AT-rich , Nucleotide =  T"
## [1] "Position 12 , State AT-rich , Nucleotide =  T"
## [1] "Position 13 , State GC-rich , Nucleotide =  G"
## [1] "Position 14 , State GC-rich , Nucleotide =  A"
## [1] "Position 15 , State GC-rich , Nucleotide =  G"
## [1] "Position 16 , State GC-rich , Nucleotide =  G"
## [1] "Position 17 , State GC-rich , Nucleotide =  A"
## [1] "Position 18 , State GC-rich , Nucleotide =  C"
## [1] "Position 19 , State GC-rich , Nucleotide =  G"
## [1] "Position 20 , State GC-rich , Nucleotide =  C"
## [1] "Position 21 , State GC-rich , Nucleotide =  C"
## [1] "Position 22 , State GC-rich , Nucleotide =  G"
## [1] "Position 23 , State GC-rich , Nucleotide =  A"
## [1] "Position 24 , State GC-rich , Nucleotide =  C"
## [1] "Position 25 , State GC-rich , Nucleotide =  C"
## [1] "Position 26 , State GC-rich , Nucleotide =  G"
## [1] "Position 27 , State GC-rich , Nucleotide =  T"
## [1] "Position 28 , State GC-rich , Nucleotide =  C"
## [1] "Position 29 , State GC-rich , Nucleotide =  T"
## [1] "Position 30 , State GC-rich , Nucleotide =  C"

As we can see, the nucleotides generated by the GC-rich state are mostly but not all “G”s and “C”s (because of the high values of pG and pC for the GC-rich state in the HMM emission matrix), while the nucleotides generated by the AT-rich state are mostly but not all “A”s and “T”s (because of the high values of pT and pA for the AT-rics state in the HMM emission matrix).

Furthermore, there tends to be runs of nucleotides that are either all in the GC-rich state or all in the AT-rich state, as the transition matrix specifies that the probabilities of switching from the AT-rich to GC-rich state (probability 0.3), or GC-rich to AT-rich state (probability 0.1) are relatively low.

Inferring the hidden states of HMM - Viterbi Algorithm

If we have a HMM with two states, “GC-rich” and “AT-rich”, and we know the transmission and emission matrices of the HMM, can we take some new DNA sequence, and figure out which state (GC-rich or AT-rich) is the most likely to have generated each nucleotide position in that DNA sequence?

This is done using Viterbi’s algorithm

viterbi <- function(sequence, transitionmatrix, emissionmatrix)
  # This carries out the Viterbi algorithm.
  # Adapted from "Applied Statistics for Bioinformatics using R" by Wim P. Krijnen, page 209
  # ( cran.r-project.org/doc/contrib/Krijnen-IntroBioInfStatistics.pdf )
  {
     # Get the names of the states in the HMM:
     states <- rownames(theemissionmatrix)

     # Make the Viterbi matrix v:
     v <- makeViterbimat(sequence, transitionmatrix, emissionmatrix)

     # Go through each of the rows of the matrix v (where each row represents
     # a position in the DNA sequence), and find out which column has the
     # maximum value for that row (where each column represents one state of
     # the HMM):
     mostprobablestatepath <- apply(v, 1, function(x) which.max(x))

     # Print out the most probable state path:
     prevnucleotide <- sequence[1]
     prevmostprobablestate <- mostprobablestatepath[1]
     prevmostprobablestatename <- states[prevmostprobablestate]
     startpos <- 1
     for (i in 2:length(sequence))
     {
        nucleotide <- sequence[i]
        mostprobablestate <- mostprobablestatepath[i]
        mostprobablestatename <- states[mostprobablestate]
        if (mostprobablestatename != prevmostprobablestatename)
        {
           print(paste("Positions",startpos,"-",(i-1), "Most probable state = ", prevmostprobablestatename))
           startpos <- i
        }
        prevnucleotide <- nucleotide
        prevmostprobablestatename <- mostprobablestatename
     }
     print(paste("Positions",startpos,"-",i, "Most probable state = ", prevmostprobablestatename))
}

makeViterbimat <- function(sequence, transitionmatrix, emissionmatrix)
  # This makes the matrix v using the Viterbi algorithm.
  # Adapted from "Applied Statistics for Bioinformatics using R" by Wim P. Krijnen, page 209
  # ( cran.r-project.org/doc/contrib/Krijnen-IntroBioInfStatistics.pdf )
  {
     # Change the sequence to uppercase
     sequence <- toupper(sequence)
     # Find out how many states are in the HMM
     numstates <- dim(transitionmatrix)[1]
     # Make a matrix with as many rows as positions in the sequence, and as many
     # columns as states in the HMM
     v <- matrix(NA, nrow = length(sequence), ncol = dim(transitionmatrix)[1])
     # Set the values in the first row of matrix v (representing the first position of the sequence) to 0
     v[1, ] <- 0
     # Set the value in the first row of matrix v, first column to 1
     v[1,1] <- 1
     # Fill in the matrix v:
     for (i in 2:length(sequence)) # For each position in the DNA sequence:
     {
        for (l in 1:numstates) # For each of the states of in the HMM:
        {
           # Find the probabilility, if we are in state l, 
           # of choosing the nucleotide at position in the sequence
           statelprobnucleotidei <- emissionmatrix[l,sequence[i]]

           # v[(i-1),] gives the values of v for the (i-1)th row of v, 
           # ie. the (i-1)th position in the sequence.
           # In v[(i-1),] there are values of v at the (i-1)th row of the sequence for each possible state k.
           # v[(i-1),k] gives the value of v at the (i-1)th row of the sequence for a particular state k.

           # transitionmatrix[l,] gives the values in the lth row of the transition matrix, 
           # xx should not be transitionmatrix[,l]?
           # probabilities of changing from a previous state k to a current state l.

           # max(v[(i-1),] * transitionmatrix[l,]) is the maximum probability for the nucleotide observed
           # at the previous position in the sequence in state k, followed by a transition from previous
           # state k to current state l at the current nucleotide position.

           # Set the value in matrix v for row i (nucleotide position i), column l (state l) to be:
           v[i,l] <-  statelprobnucleotidei * max(v[(i-1),] * transitionmatrix[,l])
        }
    }
    return(v)
}

Given a HMM, and a particular DNA sequence, we can use the above Viterbi function to find the state of that HMM that was most likely to have generated the nucleotide at each position in the DNA sequence. For example,

  # Define a sequence
  myseq <- c("A", "A", "G", "C", "G", "T", "G", "G", "G", "G", "C", "C", "C", "C", "G", "G", "C", "G", "A", "C", "A", "T", "G", "G", "G", "G", "T", "G", "T", "C")
  # We call the Viterbi's algorithm we just defined to identify the states
  viterbi(myseq, thetransitionmatrix, theemissionmatrix)

## [1] "Positions 1 - 2 Most probable state =  AT-rich"
## [1] "Positions 3 - 21 Most probable state =  GC-rich"
## [1] "Positions 22 - 22 Most probable state =  AT-rich"
## [1] "Positions 23 - 30 Most probable state =  GC-rich"

Selected materials and references

Hidden Markov Models