We will generate a DNA sequence with A, C, T and G appearing with probabilities pA=0.2, pC=0.3, pT=0.2 and pG=0.3
# Define the alphabet of nucleotides
nucleotides <- c("A", "C", "G", "T")
# Set the values of the probabilities
probabilities1 <- c(0.2, 0.3, 0.3, 0.2)
# Set the length of the sequence
seqlength <- 30
# Generate a sequence
sample(nucleotides, seqlength, rep=TRUE, prob=probabilities1)
## [1] "A" "G" "T" "A" "G" "A" "G" "G" "G" "C" "T" "G" "G" "C" "T" "G" "G"
## [18] "T" "G" "A" "A" "A" "C" "C" "G" "G" "C" "C" "T" "G"
The multinomial model is a good model in several cases but also has limitations.
A Markov sequence model is a better representation.
We can construct transition matrix for such Markov sequence model.
# Define the alphabet of nucleotides
nucleotides <- c("A", "C", "G", "T")
# Set the values of the probabilities, where the previous nucleotide was "A"
afterAprobs <- c(0.2, 0.3, 0.3, 0.2)
# Set the values of the probabilities, where the previous nucleotide was "C"
afterCprobs <- c(0.1, 0.41, 0.39, 0.1)
# Set the values of the probabilities, where the previous nucleotide was "G"
afterGprobs <- c(0.25, 0.25, 0.25, 0.25)
# Set the values of the probabilities, where the previous nucleotide was "T"
afterTprobs <- c(0.5, 0.17, 0.17, 0.17)
# Create a 4 x 4 matrix
mytransitionmatrix <- matrix(c(afterAprobs, afterCprobs, afterGprobs, afterTprobs), 4, 4, byrow = TRUE)
rownames(mytransitionmatrix) <- nucleotides
colnames(mytransitionmatrix) <- nucleotides
# Print out the transition matrix
mytransitionmatrix
## A C G T
## A 0.20 0.30 0.30 0.20
## C 0.10 0.41 0.39 0.10
## G 0.25 0.25 0.25 0.25
## T 0.50 0.17 0.17 0.17
Rows 1, 2, 3 and 4 of the transition matrix give the probabilities pA, pC, pG, and pT for the cases where the previous nucleotide was “A”, “C”, “G”, or “T”, respectively. That is, the element in a particular row and column of the transition matrix (eg. the row for “A”, column for “C”) holds the probability (pAC) of choosing a particular nucleotide (“C”) at the current position in the sequence, given that was a particular nucleotide (“A”) at the previous position in the sequence.
Just like we generated a DNA sequence using a particular multinomial model, we can generate a DNA sequence using a particular Markov model. When we are generating a DNA sequence using a Markov model, the nucleotide chosen at each position at the sequence depends on the nucleotide chosen at the previous position. As there is no previous nucleotide at the first position in the new sequence, we need to define the probabilities of choosing “A”, “C”, “G” or “T” for the first position.
# Function to generate the sequences using the Markov model
generatemarkovseq <- function(transitionmatrix, initialprobs, seqlength) {
# Define the alphabet of nucleotides
nucleotides <- c("A", "C", "G", "T")
# Create a vector for storing the new sequence
mysequence <- character()
# Choose the nucleotide for the first position in the sequence:
firstnucleotide <- sample(nucleotides, 1, rep=TRUE, prob=initialprobs)
# Store the nucleotide for the first position of the sequence
mysequence[1] <- firstnucleotide
for (i in 2:seqlength) {
# Get the previous nucleotide in the new sequence
prevnucleotide <- mysequence[i-1]
# Get the probabilities of the current nucleotide, given previous nucleotide "prevnucleotide":
probabilities <- transitionmatrix[prevnucleotide,]
# Choose the nucleotide at the current position of the sequence:
nucleotide <- sample(nucleotides, 1, rep=TRUE, prob=probabilities)
# Store the nucleotide for the current position of the sequence
mysequence[i] <- nucleotide
}
return(mysequence)
}
We assume the initial probablities are equally likely for the four nucleotides:
# Define initial probablities
myinitialprobs <- c(0.25, 0.25, 0.25, 0.25)
# Generate the sequence
generatemarkovseq(mytransitionmatrix, myinitialprobs, 30)
## [1] "C" "C" "G" "C" "G" "A" "T" "G" "T" "C" "G" "G" "A" "T" "T" "C" "C"
## [18] "C" "G" "A" "C" "G" "G" "C" "C" "C" "C" "G" "A" "G"
As we can see, there are many “A”s after “T”s in the sequence. This is because pTA has a high value (0.5) in the Markov transition matrix mytransitionmatrix. Similarly, there are few “A”s or “T”s after “C”s, which is because pCA and pCT have low values (0.1) in this transition matrix.
In a Markov model, the nucleotide at a particular position in a sequence depends on the nucleotide found at the previous position. In contrast, in a Hidden Markov model (HMM), the nucleotide found at a particular position in a sequence depends on the state at the previous nucleotide position in the sequence. The state at a sequence position is a property of that position of the sequence, for example, a particular HMM may model the positions along a sequence as belonging to either one of two states, GC-rich or AT-rich. A more complex HMM may model the positions along a sequence as belonging to many different possible states, such as “promoter”, “exon”, “intron”, and “intergenic DNA”.
We will use GC-rich and AT-rich as hidden states and create a transition and emission matrices. First, we make the transition matrix that shows the probabilities when the states change.
# Define the names of the hidden states
states <- c("AT-rich", "GC-rich")
# Set the probabilities of switching states, where the previous state was "AT-rich"
ATrichprobs <- c(0.7, 0.3)
# Set the probabilities of switching states, where the previous state was "GC-rich"
GCrichprobs <- c(0.1, 0.9)
# Create a 2 x 2 matrix
thetransitionmatrix <- matrix(c(ATrichprobs, GCrichprobs), 2, 2, byrow = TRUE)
rownames(thetransitionmatrix) <- states
colnames(thetransitionmatrix) <- states
# Print out the transition matrix
thetransitionmatrix
## AT-rich GC-rich
## AT-rich 0.7 0.3
## GC-rich 0.1 0.9
The second important matrix is the HMM emission matrix, which holds the probabilities of choosing the four nucleotides “A”, “C”, “G”, and “T”, in each of the states.
In a HMM with an AT-rich state and a GC-rich state, the emission matrix will hold the probabilities of choosing each of the four nucleotides “A”, “C”, “G” and “T” in the AT-rich state (for example, pA=0.39, pC=0.1, pG=0.1, and pT=0.41 for the AT-rich state), and the probabilities of choosing “A”, “C”, “G”, and “T” in the GC-rich state (for example, pA=0.1, pC=0.41, pG=0.39, and pT=0.1 for the GC-rich state).
# Define the alphabet of nucleotides
nucleotides <- c("A", "C", "G", "T")
# Set the values of the probabilities, for the AT-rich state
ATrichstateprobs <- c(0.39, 0.1, 0.1, 0.41)
# Set the values of the probabilities, for the GC-rich state
GCrichstateprobs <- c(0.1, 0.41, 0.39, 0.1)
# Create a 2 x 4 matrix
theemissionmatrix <- matrix(c(ATrichstateprobs, GCrichstateprobs), 2, 4, byrow = TRUE)
rownames(theemissionmatrix) <- states
colnames(theemissionmatrix) <- nucleotides
# Print out the emission matrix
theemissionmatrix
## A C G T
## AT-rich 0.39 0.10 0.10 0.41
## GC-rich 0.10 0.41 0.39 0.10
Once the transition and emission matrices are given we can generate a DNA sequence using HMM.
# Function to generate a DNA sequence, given a HMM and the length of the sequence to be generated.
generatehmmseq <- function(transitionmatrix, emissionmatrix, initialprobs, seqlength)
{
nucleotides <- c("A", "C", "G", "T") # Define the alphabet of nucleotides
states <- c("AT-rich", "GC-rich") # Define the names of the states
mysequence <- character() # Create a vector for storing the new sequence
mystates <- character() # Create a vector for storing the state that each position in the new sequence
# was generated by
# Choose the state for the first position in the sequence:
firststate <- sample(states, 1, rep=TRUE, prob=initialprobs)
# Get the probabilities of the current nucleotide, given that we are in the state "firststate":
probabilities <- emissionmatrix[firststate,]
# Choose the nucleotide for the first position in the sequence:
firstnucleotide <- sample(nucleotides, 1, rep=TRUE, prob=probabilities)
mysequence[1] <- firstnucleotide # Store the nucleotide for the first position of the sequence
mystates[1] <- firststate # Store the state that the first position in the sequence was generated by
for (i in 2:seqlength)
{
prevstate <- mystates[i-1] # Get the state that the previous nucleotide in the sequence was generated by
# Get the probabilities of the current state, given that the previous nucleotide was generated by state "prevstate"
stateprobs <- transitionmatrix[prevstate,]
# Choose the state for the ith position in the sequence:
state <- sample(states, 1, rep=TRUE, prob=stateprobs)
# Get the probabilities of the current nucleotide, given that we are in the state "state":
probabilities <- emissionmatrix[state,]
# Choose the nucleotide for the ith position in the sequence:
nucleotide <- sample(nucleotides, 1, rep=TRUE, prob=probabilities)
mysequence[i] <- nucleotide # Store the nucleotide for the current position of the sequence
mystates[i] <- state # Store the state that the current position in the sequence was generated by
}
for (i in 1:length(mysequence))
{
nucleotide <- mysequence[i]
state <- mystates[i]
print(paste("Position", i, ", State", state, ", Nucleotide = ", nucleotide))
}
}
When we are generating a DNA sequence using a HMM, the nucleotide is chosen at each position depending on the state at the previous position in the sequence. As there is no previous nucleotide at the first position in the sequence, the function generatehmmseq()
also requires the probabilities of the choosing each of the states at the first position.
# Define the initial probabilities
theinitialprobs <- c(0.5, 0.5)
# Generate a HMM DNA sequence
generatehmmseq(thetransitionmatrix, theemissionmatrix, theinitialprobs, 30)
## [1] "Position 1 , State AT-rich , Nucleotide = A"
## [1] "Position 2 , State AT-rich , Nucleotide = T"
## [1] "Position 3 , State GC-rich , Nucleotide = C"
## [1] "Position 4 , State GC-rich , Nucleotide = C"
## [1] "Position 5 , State GC-rich , Nucleotide = G"
## [1] "Position 6 , State GC-rich , Nucleotide = G"
## [1] "Position 7 , State GC-rich , Nucleotide = C"
## [1] "Position 8 , State GC-rich , Nucleotide = C"
## [1] "Position 9 , State AT-rich , Nucleotide = T"
## [1] "Position 10 , State AT-rich , Nucleotide = A"
## [1] "Position 11 , State AT-rich , Nucleotide = T"
## [1] "Position 12 , State AT-rich , Nucleotide = T"
## [1] "Position 13 , State GC-rich , Nucleotide = G"
## [1] "Position 14 , State GC-rich , Nucleotide = A"
## [1] "Position 15 , State GC-rich , Nucleotide = G"
## [1] "Position 16 , State GC-rich , Nucleotide = G"
## [1] "Position 17 , State GC-rich , Nucleotide = A"
## [1] "Position 18 , State GC-rich , Nucleotide = C"
## [1] "Position 19 , State GC-rich , Nucleotide = G"
## [1] "Position 20 , State GC-rich , Nucleotide = C"
## [1] "Position 21 , State GC-rich , Nucleotide = C"
## [1] "Position 22 , State GC-rich , Nucleotide = G"
## [1] "Position 23 , State GC-rich , Nucleotide = A"
## [1] "Position 24 , State GC-rich , Nucleotide = C"
## [1] "Position 25 , State GC-rich , Nucleotide = C"
## [1] "Position 26 , State GC-rich , Nucleotide = G"
## [1] "Position 27 , State GC-rich , Nucleotide = T"
## [1] "Position 28 , State GC-rich , Nucleotide = C"
## [1] "Position 29 , State GC-rich , Nucleotide = T"
## [1] "Position 30 , State GC-rich , Nucleotide = C"
As we can see, the nucleotides generated by the GC-rich state are mostly but not all “G”s and “C”s (because of the high values of pG and pC for the GC-rich state in the HMM emission matrix), while the nucleotides generated by the AT-rich state are mostly but not all “A”s and “T”s (because of the high values of pT and pA for the AT-rics state in the HMM emission matrix).
Furthermore, there tends to be runs of nucleotides that are either all in the GC-rich state or all in the AT-rich state, as the transition matrix specifies that the probabilities of switching from the AT-rich to GC-rich state (probability 0.3), or GC-rich to AT-rich state (probability 0.1) are relatively low.
If we have a HMM with two states, “GC-rich” and “AT-rich”, and we know the transmission and emission matrices of the HMM, can we take some new DNA sequence, and figure out which state (GC-rich or AT-rich) is the most likely to have generated each nucleotide position in that DNA sequence?
This is done using Viterbi’s algorithm
viterbi <- function(sequence, transitionmatrix, emissionmatrix)
# This carries out the Viterbi algorithm.
# Adapted from "Applied Statistics for Bioinformatics using R" by Wim P. Krijnen, page 209
# ( cran.r-project.org/doc/contrib/Krijnen-IntroBioInfStatistics.pdf )
{
# Get the names of the states in the HMM:
states <- rownames(theemissionmatrix)
# Make the Viterbi matrix v:
v <- makeViterbimat(sequence, transitionmatrix, emissionmatrix)
# Go through each of the rows of the matrix v (where each row represents
# a position in the DNA sequence), and find out which column has the
# maximum value for that row (where each column represents one state of
# the HMM):
mostprobablestatepath <- apply(v, 1, function(x) which.max(x))
# Print out the most probable state path:
prevnucleotide <- sequence[1]
prevmostprobablestate <- mostprobablestatepath[1]
prevmostprobablestatename <- states[prevmostprobablestate]
startpos <- 1
for (i in 2:length(sequence))
{
nucleotide <- sequence[i]
mostprobablestate <- mostprobablestatepath[i]
mostprobablestatename <- states[mostprobablestate]
if (mostprobablestatename != prevmostprobablestatename)
{
print(paste("Positions",startpos,"-",(i-1), "Most probable state = ", prevmostprobablestatename))
startpos <- i
}
prevnucleotide <- nucleotide
prevmostprobablestatename <- mostprobablestatename
}
print(paste("Positions",startpos,"-",i, "Most probable state = ", prevmostprobablestatename))
}
makeViterbimat <- function(sequence, transitionmatrix, emissionmatrix)
# This makes the matrix v using the Viterbi algorithm.
# Adapted from "Applied Statistics for Bioinformatics using R" by Wim P. Krijnen, page 209
# ( cran.r-project.org/doc/contrib/Krijnen-IntroBioInfStatistics.pdf )
{
# Change the sequence to uppercase
sequence <- toupper(sequence)
# Find out how many states are in the HMM
numstates <- dim(transitionmatrix)[1]
# Make a matrix with as many rows as positions in the sequence, and as many
# columns as states in the HMM
v <- matrix(NA, nrow = length(sequence), ncol = dim(transitionmatrix)[1])
# Set the values in the first row of matrix v (representing the first position of the sequence) to 0
v[1, ] <- 0
# Set the value in the first row of matrix v, first column to 1
v[1,1] <- 1
# Fill in the matrix v:
for (i in 2:length(sequence)) # For each position in the DNA sequence:
{
for (l in 1:numstates) # For each of the states of in the HMM:
{
# Find the probabilility, if we are in state l,
# of choosing the nucleotide at position in the sequence
statelprobnucleotidei <- emissionmatrix[l,sequence[i]]
# v[(i-1),] gives the values of v for the (i-1)th row of v,
# ie. the (i-1)th position in the sequence.
# In v[(i-1),] there are values of v at the (i-1)th row of the sequence for each possible state k.
# v[(i-1),k] gives the value of v at the (i-1)th row of the sequence for a particular state k.
# transitionmatrix[l,] gives the values in the lth row of the transition matrix,
# xx should not be transitionmatrix[,l]?
# probabilities of changing from a previous state k to a current state l.
# max(v[(i-1),] * transitionmatrix[l,]) is the maximum probability for the nucleotide observed
# at the previous position in the sequence in state k, followed by a transition from previous
# state k to current state l at the current nucleotide position.
# Set the value in matrix v for row i (nucleotide position i), column l (state l) to be:
v[i,l] <- statelprobnucleotidei * max(v[(i-1),] * transitionmatrix[,l])
}
}
return(v)
}
Given a HMM, and a particular DNA sequence, we can use the above Viterbi function to find the state of that HMM that was most likely to have generated the nucleotide at each position in the DNA sequence. For example,
# Define a sequence
myseq <- c("A", "A", "G", "C", "G", "T", "G", "G", "G", "G", "C", "C", "C", "C", "G", "G", "C", "G", "A", "C", "A", "T", "G", "G", "G", "G", "T", "G", "T", "C")
# We call the Viterbi's algorithm we just defined to identify the states
viterbi(myseq, thetransitionmatrix, theemissionmatrix)
## [1] "Positions 1 - 2 Most probable state = AT-rich"
## [1] "Positions 3 - 21 Most probable state = GC-rich"
## [1] "Positions 22 - 22 Most probable state = AT-rich"
## [1] "Positions 23 - 30 Most probable state = GC-rich"