17 Markov and hidden Markov models
17.1 Introduction
In this chapter, we discuss probabilistic models for sequences of observations, X_1, ..., X_T, of arbitrary length T. Such models have applications in computational biology, natural language processing, time series forecasting, etc. We focus on the case where the observations occur at discrete time steps, although "time" may also refer to locations within a sequence.
17.2 Markov models
Recall from Section 10.2.2 that the basic idea behind a Markov chain is to assume that X_t captures all the relevant information for predicting the future (i.e., we assume it is a sufficient statistic). If we assume discrete time steps, we can write the joint distribution as follows:

p(X_{1:T}) = p(X_1) p(X_2|X_1) p(X_3|X_2) \ldots = p(X_1) \prod_{t=2}^T p(X_t|X_{t-1}) \qquad (17.1)
This is called a Markov chain or Markov model.
If we assume the transition function p(X_t|X_{t-1}) is independent of time, then the chain is called homogeneous, stationary, or time-invariant. This is an example of parameter tying, since the same parameter is shared by multiple variables. This assumption allows us to model an arbitrary number of variables using a fixed number of parameters; such models are called stochastic processes.

If we assume that the observed variables are discrete, so X_t ∈ {1, ..., K}, this is called a discrete-state or finite-state Markov chain. We will make this assumption throughout the rest of this section.
17.2.1 Transition matrix
When X_t is discrete, so X_t ∈ {1, ..., K}, the conditional distribution p(X_t|X_{t-1}) can be written as a K × K matrix, known as the transition matrix A, where A_ij = p(X_t = j|X_{t-1} = i) is the probability of going from state i to state j. Each row of the matrix sums to one, \sum_j A_{ij} = 1, so this is called a stochastic matrix.
Figure 17.1 State transition diagrams for some simple Markov chains. Left: a 2-state chain. Right: a 3-state left-to-right chain.
A stationary, finite-state Markov chain is equivalent to a stochastic automaton. It is common to visualize such automata by drawing a directed graph, where nodes represent states and arrows represent legal transitions, i.e., non-zero elements of A. This is known as a state transition diagram. The weights associated with the arcs are the probabilities. For example, the following 2-state chain

A = \begin{pmatrix} 1-\alpha & \alpha \\ \beta & 1-\beta \end{pmatrix} \qquad (17.2)

is illustrated in Figure 17.1(left). The following 3-state chain

A = \begin{pmatrix} A_{11} & A_{12} & 0 \\ 0 & A_{22} & A_{23} \\ 0 & 0 & 1 \end{pmatrix} \qquad (17.3)

is illustrated in Figure 17.1(right). This is called a left-to-right transition matrix, and is commonly used in speech recognition (Section 17.6.2).
The A_ij element of the transition matrix specifies the probability of getting from i to j in one step. The n-step transition matrix A(n) is defined as

A_{ij}(n) \triangleq p(X_{t+n} = j|X_t = i) \qquad (17.4)

which is the probability of getting from i to j in exactly n steps. Obviously A(1) = A. The Chapman-Kolmogorov equations state that

A_{ij}(m+n) = \sum_{k=1}^K A_{ik}(m) A_{kj}(n) \qquad (17.5)

In words, the probability of getting from i to j in m + n steps is just the probability of getting from i to k in m steps, and then from k to j in n steps, summed up over all k. We can write the above as a matrix multiplication

A(m+n) = A(m) A(n) \qquad (17.6)

Hence

A(n) = A\,A(n-1) = A\,A\,A(n-2) = \cdots = A^n \qquad (17.7)

Thus we can simulate multiple steps of a Markov chain by powering up the transition matrix.
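To make this concrete, here is a minimal NumPy sketch (the 3-state chain is a made-up example, not one from the text) that computes A(n) by powering up the transition matrix and checks the Chapman-Kolmogorov property of Equation 17.6:

import numpy as np

# a made-up 3-state transition matrix; each row sums to one
A = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.5, 0.5],
              [0.3, 0.0, 0.7]])

def n_step(A, n):
    # A(n)[i, j] = p(X_{t+n} = j | X_t = i), Equation 17.4
    return np.linalg.matrix_power(A, n)

# Chapman-Kolmogorov: A(m + n) = A(m) A(n), Equations 17.5-17.6
m, n = 2, 3
assert np.allclose(n_step(A, m + n), n_step(A, m) @ n_step(A, n))
print(n_step(A, 5))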
SAYS ITS NOT IN THE CARDS LEGENDARY RECONNAISSANCE BY ROLLIE
DEMOCRACIES UNSUSTAINABLE COULD STRIKE REDLINING VISITS TO PROFIT
BOOKING WAIT HERE AT MADISON SQUARE GARDEN COUNTY COURTHOUSE WHERE HE
HAD BEEN DONE IN THREE ALREADY IN ANY WAY IN WHICH A TEACHER
Table 17.1 Example output from a 4-gram word model, trained using backoff smoothing on the Broadcast News corpus. The first 4 words are specified by hand, the model generates the 5th word, and then the results are fed back into the model. Source: http://www.fit.vutbr.cz/~imikolov/rnnlm/gen-4gram.txt.
17.2.2 Application: Language modeling
One important application of Markov models is to make statistical language models, which are probability distributions over sequences of words. We define the state space to be all the words in English (or some other language). The marginal probabilities p(X_t = k) are called unigram statistics. If we use a first-order Markov model, then p(X_t = k|X_{t-1} = j) is called a bigram model. If we use a second-order Markov model, then p(X_t = k|X_{t-1} = j, X_{t-2} = i) is called a trigram model. And so on. In general these are called n-gram models. For example, Figure 17.2 shows 1-gram and 2-gram counts for the letters {a, ..., z, −} (where − represents space), estimated from Darwin's On The Origin Of Species.
Language models can be used for several things, such as the following:
Sentence completion. A language model can predict the next word given the previous words in a sentence. This can be used to reduce the amount of typing required, which is particularly important for disabled users (see e.g., David MacKay's Dasher system1), or users of mobile devices.
Data compression Any density model can be used to define an encoding scheme, by assigning short codewords to more probable strings. The more accurate the predictive model, the fewer the number of bits it requires to store the data.
Text classification. Any density model can be used as a class-conditional density and hence turned into a generative classifier. Note that using a 0-gram class-conditional density (i.e., only unigram statistics) would be equivalent to a naive Bayes classifier (see Section 3.5).
Automatic essay writing. One can sample from p(x_{1:T}) to generate artificial text. This is one way of assessing the quality of the model. In Table 17.1, we give an example of text generated from a 4-gram model, trained on a corpus with 400 million words. (Tomas et al. 2011) describes a much better language model, based on recurrent neural networks, which generates much more semantically plausible text.
1. http://www.inference.phy.cam.ac.uk/dasher/
Figure 17.2 Unigram and bigram counts from Darwin's On The Origin Of Species. The 2D picture on the right is a Hinton diagram of the joint distribution. The size of the white squares is proportional to the value of the entry in the corresponding vector/matrix. Based on (MacKay 2003, p22). Figure generated by ngramPlot.
17.2.2.1 MLE for Markov language models
We now discuss a simple way to estimate the transition matrix from training data. The probability of any particular sequence of length T is given by

p(x_{1:T}) = \pi(x_1) A(x_1, x_2) \ldots A(x_{T-1}, x_T) \qquad (17.8)
= \prod_{j=1}^K \pi_j^{I(x_1 = j)} \prod_{t=2}^T \prod_{j=1}^K \prod_{k=1}^K A_{jk}^{I(x_t = k, x_{t-1} = j)} \qquad (17.9)

Hence the log-likelihood of a set of sequences D = (x_1, ..., x_N), where x_i = (x_{i1}, ..., x_{i,T_i}) is a sequence of length T_i, is given by

\log p(D) = \sum_{i=1}^N \log p(x_i) = \sum_j N_j^1 \log \pi_j + \sum_j \sum_k N_{jk} \log A_{jk} \qquad (17.10)

where we define the following counts:

N_j^1 \triangleq \sum_{i=1}^N I(x_{i1} = j), \qquad N_{jk} \triangleq \sum_{i=1}^N \sum_{t=1}^{T_i - 1} I(x_{i,t} = j, x_{i,t+1} = k) \qquad (17.11)
Hence we can write the MLE as the normalized counts:

\hat{\pi}_j = \frac{N_j^1}{\sum_{j'} N_{j'}^1}, \qquad \hat{A}_{jk} = \frac{N_{jk}}{\sum_{k'} N_{jk'}} \qquad (17.12)
These results can be extended in a straightforward way to higher-order Markov models. However, the problem of zero-counts becomes very acute whenever the number of states K, and/or the order of the chain, n, is large. An n-gram model has O(K^n) parameters. If we have K ≈ 50,000 words in our vocabulary, then a bigram model will have about 2.5 billion free parameters, corresponding to all possible word pairs. It is very unlikely we will see all of these in our training data. However, we do not want to predict that a particular word string is totally impossible just because we happen not to have seen it in our training text; that would be a severe form of overfitting.2
A simple solution to this is to use add-one smoothing, where we simply add one to all the empirical counts before normalizing. The Bayesian justification for this is given in Section 3.3.4.1. However, add-one smoothing assumes all n-grams are equally likely, which is not very realistic. A more sophisticated Bayesian approach is discussed in Section 17.2.2.2.
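As a concrete sketch of Equations 17.11-17.12 on hypothetical toy sequences, the MLE reduces to counting and normalizing; the last line shows add-one smoothing:

import numpy as np

K = 3
seqs = [[0, 1, 1, 2, 0], [1, 2, 2, 0]]   # hypothetical training sequences over K=3 states

N1 = np.zeros(K)         # N_j^1: how often each state starts a sequence
N = np.zeros((K, K))     # N_jk: how often state j is followed by state k
for x in seqs:
    N1[x[0]] += 1
    for t in range(len(x) - 1):
        N[x[t], x[t + 1]] += 1

pi_hat = N1 / N1.sum()                                    # Equation 17.12
A_hat = N / np.maximum(N.sum(axis=1, keepdims=True), 1)   # rows never visited stay zero
A_smooth = (N + 1) / (N + 1).sum(axis=1, keepdims=True)   # add-one smoothing
print(A_hat); print(A_smooth)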
An alternative to using smart priors is to gather lots and lots of data. For example, Google has fit n-gram models (for n = 1:5) based on one trillion words extracted from the web. Their data, which is over 100GB when uncompressed, is publicly available.3 An example of their data, for a set of 4-grams, is shown below.
serve as the incoming 92
serve as the incubator 99
serve as the independent 794
serve as the index 223
serve as the indication 72
serve as the indicator 120
serve as the indicators 45
serve as the indispensable 111
serve as the indispensible 40
serve as the individual 234
Although such an approach, based on brute force and ignorance, can be successful, it is rather unsatisfying, since it is clear that this is not how humans learn (see e.g., Tenenbaum and Xu 2000). A more refined Bayesian approach, that needs much less data, is described in Section 17.2.2.2.

2. A famous example of an improbable, but syntactically valid, English word string, due to Noam Chomsky, is "colourless green ideas sleep furiously". We would not want our model to predict that this string is impossible. Even ungrammatical constructs should be allowed by our model with a certain probability, since people frequently violate grammatical rules, especially in spoken language.
3. See http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html for details.

17.2.2.2 Empirical Bayes version of deleted interpolation

A common heuristic used to fix the sparse data problem is called deleted interpolation (Chen and Goodman 1996). This defines the transition matrix as a convex combination of the bigram
frequencies f_jk = N_jk/N_j and the unigram frequencies f_k = N_k/N:

A_{jk} = (1 - \lambda) f_{jk} + \lambda f_k \qquad (17.13)

The term λ is usually set by cross-validation. There is also a closely related technique called backoff smoothing; the idea is that if f_jk is too small, we back off to a more reliable estimate, namely f_k.
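A minimal sketch of Equation 17.13 with hypothetical bigram counts (here λ is fixed by hand; in practice it would be set by cross-validation):

import numpy as np

N = np.array([[2., 1., 0.],     # hypothetical bigram counts N_jk
              [0., 1., 3.],
              [1., 0., 1.]])
lam = 0.2                                 # fixed here; normally chosen by cross-validation
f_jk = N / N.sum(axis=1, keepdims=True)   # bigram frequencies f_jk = N_jk / N_j
f_k = N.sum(axis=0) / N.sum()             # unigram frequencies f_k = N_k / N
A = (1 - lam) * f_jk + lam * f_k          # Equation 17.13: deleted interpolation
assert np.allclose(A.sum(axis=1), 1.0)    # each row is still a distribution
print(A)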
We will now show that the deleted interpolation heuristic is an approximation to the predictions made by a simple hierarchical Bayesian model. Our presentation follows (MacKay and Peto 1995). First, let us use an independent Dirichlet prior on each row of the transition matrix:
A_j \sim \text{Dir}(\alpha_0 m_1, \ldots, \alpha_0 m_K) = \text{Dir}(\alpha_0 \mathbf{m}) = \text{Dir}(\boldsymbol{\alpha}) \qquad (17.14)

where A_j is row j of the transition matrix, m is the prior mean (satisfying Σ_k m_k = 1) and α_0 is the prior strength. We will use the same prior for each row: see Figure 17.3.

The posterior is given by A_j ∼ Dir(α + N_j), where N_j = (N_j1, ..., N_jK) is the vector that records the number of times we have transitioned out of state j to each of the other states.
From Equation 3.51, the posterior predictive density is

p(X_{t+1} = k|X_t = j, D) = \bar{A}_{jk} = \frac{N_{jk} + \alpha_0 m_k}{N_j + \alpha_0} = \frac{f_{jk} N_j + \alpha_0 m_k}{N_j + \alpha_0} = (1 - \lambda_j) f_{jk} + \lambda_j m_k \qquad (17.15)

where \bar{A}_{jk} = E[A_{jk}|D], N_j = \sum_k N_{jk}, and

\lambda_j = \frac{\alpha_0}{N_j + \alpha_0} \qquad (17.16)

This is very similar to Equation 17.13 but not identical. The main difference is that the Bayesian model uses a context-dependent weight λ_j to combine m_k with the empirical frequency f_jk, rather than a fixed weight λ. This is like adaptive deleted interpolation. Furthermore, rather than backing off to the empirical marginal frequencies f_k, we back off to the model parameter m_k.
The only remaining question is: what values should we use for α and m? Let's use empirical Bayes. Since we assume each row of the transition matrix is a priori independent given α, the marginal likelihood for our Markov model is found by applying Equation 5.24 to each row:

p(D|\alpha) = \prod_j \frac{B(N_j + \alpha)}{B(\alpha)} \qquad (17.17)

where N_j = (N_{j1}, ..., N_{jK}) are the counts for leaving state j and B(α) is the generalized beta function.
We can fit this using the methods discussed in (Minka 2000e). However, we can also use the following approximation (MacKay and Peto 1995, p12):

m_k \propto |\{j : N_{jk} > 0\}| \qquad (17.18)

This says that the prior probability of word k is given by the number of different contexts in which it occurs, rather than the number of times it occurs. To justify the reasonableness of this result, MacKay and Peto (1995) give the following example.
Figure 17.3 A Markov chain in which we put a different Dirichlet prior on every row of the transition matrix A, but the hyperparameters of the Dirichlet are shared.
Imagine, you see, that the language, you see, has, you see, a frequently occurring couplet, "you see", you see, in which the second word of the couplet, "see", follows the first word, "you", with very high probability, you see. Then the marginal statistics, you see, are going to become hugely dominated, you see, by the words "you" and "see", with equal frequency, you see.
If we use the standard smoothing formula, Equation 17.13, then P(you|novel) and P(see|novel), for some novel context word "novel" not seen before, would turn out to be the same, since the marginal frequencies of "you" and "see" are the same (11 times each). However, this seems unreasonable. "You" appears in many contexts, so P(you|novel) should be high, but "see" only follows "you", so P(see|novel) should be low. If we use the Bayesian formula, Equation 17.15, we will get this effect for free, since we back off to m_k, not f_k, and m_k will be large for "you" and small for "see" by Equation 17.18.
Unfortunately, although elegant, this Bayesian model does not beat the state-of-the-art language model, known as interpolated Kneser-Ney (Kneser and Ney 1995; Chen and Goodman 1998). However, in (Teh 2006), it was shown how one can build a non-parametric Bayesian model which outperforms interpolated Kneser-Ney, by using variable-length contexts. In (Wood et al. 2009), this method was extended to create the sequence memoizer, which is currently (2010) the best-performing language model.4
17.2.2.3 Handling out-of-vocabulary words
While the above smoothing methods handle the case where the counts are small or even zero, none of them deal with the case where the test set may contain a completely novel word. In particular, they all assume that the vocabulary (i.e., the state space of X_t) is fixed and known (typically it is the set of unique words in the training data, or in some dictionary).
4. Interestingly, these non-parametric methods are based on posterior inference using MCMC (Section 24.1) and/or particle filtering (Section 23.5), rather than optimization methods such as EB. Despite this, they are quite efficient.
Figure 17.4 Some Markov chains. (a) A 3-state aperiodic chain. (b) A reducible 4-state chain.
Even if all the A_jk's are non-zero, none of these models will predict a novel word outside of this set, and hence will assign zero probability to a test sentence with an unfamiliar word. Unfamiliar words are bound to occur, because the set of words is an open class. For example, the set of proper nouns (names of people and places) is unbounded.
A standard heuristic to solve this problem is to replace all novel words with the special symbol unk, which stands for "unknown". A certain amount of probability mass is held aside for this event.

A more principled solution would be to use a Dirichlet process, which can generate a countably infinite state space as the amount of data increases (see Section 25.2.2). If all novel words are accepted as genuine words, then the system has no predictive power, since any misspelling will be considered a new word. So the novel word has to be seen frequently enough to warrant being added to the vocabulary. See e.g., (Friedman and Singer 1999; Griffiths and Tenenbaum 2001) for details.
17.2.3 Stationary distribution of a Markov chain
We have been focusing on Markov models as a way of defining joint probability distributions over sequences. However, we can also interpret them as stochastic dynamical systems, where we hop from one state to another at each time step. In this case, we are often interested in the long-term distribution over states, which is known as the stationary distribution of the chain. In this section, we discuss some of the relevant theory. Later we will consider two important applications: Google's PageRank algorithm for ranking web pages (Section 17.2.4), and the MCMC algorithm for generating samples from hard-to-normalize probability distributions (Chapter 24).
17.2.3.1 What is a stationary distribution?
Let A_ij = p(X_t = j|X_{t-1} = i) be the one-step transition matrix, and let π_t(j) = p(X_t = j) be the probability of being in state j at time t. It is conventional in this context to assume that π is a row vector. If we have an initial distribution over states of π_0, then at time 1 we have

\pi_1(j) = \sum_i \pi_0(i) A_{ij} \qquad (17.19)

or, in matrix notation,

\pi_1 = \pi_0 A \qquad (17.20)
We can imagine iterating these equations. If we ever reach a stage where

\pi = \pi A \qquad (17.21)

then we say we have reached the stationary distribution (also called the invariant distribution or equilibrium distribution). Once we enter the stationary distribution, we will never leave.
For example, consider the chain in Figure 17.4(a). To find its stationary distribution, we write

(\pi_1\ \pi_2\ \pi_3) = (\pi_1\ \pi_2\ \pi_3) \begin{pmatrix} 1 - A_{12} - A_{13} & A_{12} & A_{13} \\ A_{21} & 1 - A_{21} - A_{23} & A_{23} \\ A_{31} & A_{32} & 1 - A_{31} - A_{32} \end{pmatrix} \qquad (17.22)

so

\pi_1 = \pi_1 (1 - A_{12} - A_{13}) + \pi_2 A_{21} + \pi_3 A_{31} \qquad (17.23)

or

\pi_1 (A_{12} + A_{13}) = \pi_2 A_{21} + \pi_3 A_{31} \qquad (17.24)

In general, we have

\pi_i \sum_{j \neq i} A_{ij} = \sum_{j \neq i} \pi_j A_{ji} \qquad (17.25)
In other words, the probability of being in state i times the net flow out of state i must equal the probability of being in each other state j times the net flow from that state into i. These are called the global balance equations. We can then solve these equations, subject to the constraint that \sum_j \pi_j = 1.
17.2.3.2 Computing the stationary distribution
To find the stationary distribution, we can just solve the eigenvector equation A^T v = v, and then set π = v^T, where v is an eigenvector with eigenvalue 1. (We can be sure such an eigenvector exists, since A is a row-stochastic matrix, so A1 = 1; also recall that the eigenvalues of A and A^T are the same.) Of course, since eigenvectors are unique only up to constants of proportionality, we must normalize v at the end to ensure it sums to one.

Note, however, that the eigenvectors are only guaranteed to be real-valued if all entries in the matrix are positive, A_ij > 0 (and hence A_ij < 1, due to the sum-to-one constraint). A more general approach, which can handle chains where some transition probabilities are 0 or 1 (such as Figure 17.4(a)), is as follows (Resnick 1992, p138). We have K constraints from π(I − A) = 0 and 1 constraint from π1 = 1. Since we only have K unknowns, this is overconstrained. So let us replace any column (e.g., the last) of I − A with 1, to get a new matrix, call it M. Next we define r = [0, 0, ..., 1], where the 1 in the last position corresponds to the column of all 1s in M. We then solve πM = r. For example, for a 3-state chain we have to solve this linear system:
(\pi_1\ \pi_2\ \pi_3) \begin{pmatrix} 1 - A_{11} & -A_{12} & 1 \\ -A_{21} & 1 - A_{22} & 1 \\ -A_{31} & -A_{32} & 1 \end{pmatrix} = (0\ 0\ 1) \qquad (17.26)
For the chain in Figure 17.4(a) we find π = (0.4, 0.4, 0.2). We can easily verify this is correct, since π = π A. See mcStatDist for some Matlab code.
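Both routes are easy to sketch in NumPy (a hypothetical stand-in for the mcStatDist code mentioned above): the eigenvector method, and the replace-a-column linear system of Equation 17.26. The chain below is made up:

import numpy as np

def stat_dist_eig(A):
    # pi = v^T, where v is the eigenvector of A^T with eigenvalue 1
    evals, evecs = np.linalg.eig(A.T)
    v = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    return v / v.sum()

def stat_dist_linear(A):
    # replace the last column of I - A with ones, then solve pi M = [0, ..., 0, 1]
    K = A.shape[0]
    M = np.eye(K) - A
    M[:, -1] = 1.0
    r = np.zeros(K)
    r[-1] = 1.0
    return np.linalg.solve(M.T, r)   # pi M = r  <=>  M^T pi^T = r^T

A = np.array([[0.1, 0.9, 0.0],      # a made-up 3-state chain
              [0.0, 0.1, 0.9],
              [0.9, 0.0, 0.1]])
pi = stat_dist_linear(A)
assert np.allclose(pi, pi @ A)      # pi = pi A, Equation 17.21
print(pi, stat_dist_eig(A))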
Unfortunately, not all chains have a stationary distribution, as we explain below.
17.2.3.3 When does a stationary distribution exist?
Consider the 4-state chain in Figure 17.4(b). If we start in state 4, we will stay there forever, since 4 is an absorbing state. Thus π = (0, 0, 0, 1) is one possible stationary distribution. However, if we start in 1 or 2, we will oscillate between those two states forever. So π = (0.5, 0.5, 0, 0) is another possible stationary distribution. If we start in state 3, we could end up in either of the above stationary distributions.
We see from this example that a necessary condition to have a unique stationary distribution is that the state transition diagram be a singly connected component, i.e., we can get from any state to any other state. Such chains are called irreducible.
Now consider the 2-state chain in Figure 17.1(a). This is irreducible provided α, β > 0. Suppose α = β = 0.9. It is clear by symmetry that this chain will spend 50% of its time in each state. Thus π = (0.5, 0.5). But now suppose α = β = 1. In this case, the chain will oscillate between the two states, but the long-term distribution on states depends on where you start from. If we start in state 1, then on every odd time step (1, 3, 5, ...) we will be in state 1; but if we start in state 2, then on every odd time step we will be in state 2.
This example motivates the following definition. Let us say that a chain has a limiting distribution if \pi_j = \lim_{n \to \infty} A^n_{ij} exists and is independent of i, for all j. If this holds, then the long-run distribution over states will be independent of the starting state:

P(X_t = j) = \sum_i P(X_0 = i) A_{ij}(t) \to \pi_j \text{ as } t \to \infty \qquad (17.27)
Let us now characterize when a limiting distribution exists. Define the period of state i to be

d(i) \triangleq \gcd\{t : A_{ii}(t) > 0\} \qquad (17.28)

where gcd stands for greatest common divisor, i.e., the largest integer that divides all the members of the set. For example, in Figure 17.4(a), we have d(1) = d(2) = gcd(2, 3, 4, 6, ...) = 1 and d(3) = gcd(3, 5, 6, ...) = 1. We say a state i is aperiodic if d(i) = 1. (A sufficient condition to ensure this is if state i has a self-loop, but this is not a necessary condition.) We say a chain is aperiodic if all its states are aperiodic. One can show the following important result:
Theorem 17.2.1. Every irreducible (singly connected), aperiodic finite state Markov chain has a limiting distribution, which is equal to π, its unique stationary distribution.

A special case of this result says that every regular finite state chain has a unique stationary distribution, where a regular chain is one whose transition matrix satisfies A^n_{ij} > 0 for some integer n and all i, j, i.e., it is possible to get from any state to any other state in n steps. Consequently, after n steps, the chain could be in any state, no matter where it started. One can show that sufficient conditions to ensure regularity are that the chain be irreducible (singly connected) and that every state have a self-transition.
To handle the case of Markov chains whose state-space is not finite (e.g., the countable set of all integers, or the uncountable set of all reals), we need to generalize some of the earlier
definitions. Since the details are rather technical, we just briefly state the main results without proof. See e.g., Grimmett and Stirzaker 1992 for details.
For a stationary distribution to exist, we require irreducibility (singly connected) and aperiodicity, as before. But we also require that each state is recurrent. (A chain in which all states are recurrent is called a recurrent chain.) Recurrent means that you will return to that state with probability 1. As a simple example of a non-recurrent state (i.e., a transient state), consider Figure 17.4(b): state 3 is transient because one immediately leaves it and either spins around state 4 forever, or oscillates between states 1 and 2 forever. There is no way to return to state 3.

It is clear that any finite-state irreducible chain is recurrent, since you can always get back to where you started from. But now consider an example with an infinite state space. Suppose we perform a random walk on the integers, X_t ∈ {..., −2, −1, 0, 1, 2, ...}. Let A_{i,i+1} = p be the probability of moving right, and A_{i,i−1} = 1 − p be the probability of moving left. Suppose we start at X_1 = 0. If p > 0.5, we will shoot off to +∞; we are not guaranteed to return. Similarly, if p < 0.5, we will shoot off to −∞. So in both cases, the chain is not recurrent, even though it is irreducible.

It should be intuitively obvious that we require all states to be recurrent for a stationary distribution to exist. However, this is not sufficient. To see this, consider the random walk on the integers again, and suppose p = 0.5. In this case, we can return to the origin an infinite number of times, so the chain is recurrent. However, it takes infinitely long (in expectation) to do so. This prohibits it from having a stationary distribution. The intuitive reason is that the distribution keeps spreading out over a larger and larger set of the integers, and never converges to a stationary distribution. More formally, we define a state to be non-null recurrent if the expected time to return to this state is finite. A chain in which all states are non-null is called a non-null chain.

For brevity, we say that a state is ergodic if it is aperiodic, recurrent and non-null, and we say a chain is ergodic if all its states are ergodic.
We can now state our main theorem:
Theorem 17.2.2. Every irreducible (singly connected), ergodic Markov chain has a limiting distribution, which is equal to π, its unique stationary distribution.

This generalizes Theorem 17.2.1, since for irreducible finite-state chains, all states are recurrent and non-null.
17.2.3.4 Detailed balance
Establishing ergodicity can be difficult. We now give an alternative condition that is easier to verify.
We say that a Markov chain A is time reversible if there exists a distribution π such that

\pi_i A_{ij} = \pi_j A_{ji} \qquad (17.29)

These are called the detailed balance equations. This says that the flow from i to j must equal the flow from j to i, weighted by the appropriate source probabilities.
We have the following important result.
Theorem 17.2.3. If a Markov chain with transition matrix A is regular and satisfies detailed balance wrt distribution π, then π is a stationary distribution of the chain.
Figure 17.5 A very small world wide web. Figure generated by pagerankDemo, written by Tim Davis.
Proof. To see this, note that

\sum_i \pi_i A_{ij} = \sum_i \pi_j A_{ji} = \pi_j \sum_i A_{ji} = \pi_j \qquad (17.30)

and hence π = π A.
Note that this condition is sufficient but not necessary (see Figure 17.4(a) for an example of a chain with a stationary distribution which does not satisfy detailed balance).
In Section 24.1, we will discuss Markov chain Monte Carlo or MCMC methods. These take as input a desired distribution π and construct a transition matrix (or in general, a transition kernel) A which satisfies detailed balance wrt π. Thus by sampling states from such a chain, we will eventually enter the stationary distribution, and will visit states with probabilities given by π.
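Detailed balance is also easy to check numerically. A small sketch, using the fact that any symmetric stochastic matrix is reversible with respect to the uniform distribution:

import numpy as np

def satisfies_detailed_balance(pi, A, tol=1e-12):
    # Equation 17.29: pi_i A_ij = pi_j A_ji for all i, j
    F = pi[:, None] * A      # F[i, j] is the probability flow from i to j
    return np.allclose(F, F.T, atol=tol)

A = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])
pi = np.ones(3) / 3
print(satisfies_detailed_balance(pi, A))   # True, and indeed pi A = pi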
17.2.4 Application: Google's PageRank algorithm for web page ranking
The results in Section 17.2.3 form the theoretical underpinnings to Google's PageRank algorithm, which is used for information retrieval on the world wide web. We sketch the basic idea below; see (Bryan and Leise 2006) for a more detailed explanation.

We will treat the web as a giant directed graph, where nodes represent web pages (documents) and edges represent hyperlinks.5 We then perform a process called web crawling. We start at a few designated root nodes, such as dmoz.org, the home of the Open Directory Project, and then follow the links, storing all the pages that we encounter, until we run out of time.
Next, all of the words in each web page are entered into a data structure called an inverted index. That is, for each word, we store a list of the documents where this word occurs. (In practice, we store a list of hash codes representing the URLs.) At test time, when a user enters a query, we can just look up all the documents containing each word, and intersect these lists (since queries are defined by a conjunction of search terms). We can get a refined search by storing the location of each word in each document. We can then test if the words in a document occur in the same order as in the query.

5. In 2008, Google said it had indexed 1 trillion (10^12) unique URLs. If we assume there are about 10 URLs per page on average, this means there were about 100 billion unique web pages. Estimates for 2010 are about 121 billion unique web pages. Source: thenextweb.com/shareables/2011/01/11/infographic-how-big-is-the-internet.
Let us give an example, from http://en.wikipedia.org/wiki/Inverted_index. We have 3 documents, T0 = "it is what it is", T1 = "what is it" and T2 = "it is a banana". Then we can create the following inverted index, where each pair represents a (document id, word location):
a: {(2, 2)}
banana: {(2, 3)}
is: {(0, 1), (0, 4), (1, 1), (2, 1)}
it: {(0, 0), (0, 3), (1, 2), (2, 0)}
what: {(0, 2), (1, 0)}
For example, we see that the word "what" occurs at location 2 (counting from 0) in document 0, and at location 0 in document 1. Suppose we search for "what is it". If we ignore word order, we retrieve the following documents:
\{T_0, T_1\} \cap \{T_0, T_1, T_2\} \cap \{T_0, T_1, T_2\} = \{T_0, T_1\} \qquad (17.31)
If we require that the word order matches, only document T1 would be returned. More generally, we can allow out-of-order matches, but can give bonus points to documents whose word order matches the query's word order, or to other features, such as if the words occur in the title of a document. We can then return the matching documents in decreasing order of their score (relevance). This is called document ranking.
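A minimal sketch of such an inverted index and a conjunctive query in Python (the ranking refinements described above are omitted):

def build_inverted_index(docs):
    # map each word to a list of (document id, word position) pairs
    index = {}
    for d, doc in enumerate(docs):
        for pos, word in enumerate(doc.split()):
            index.setdefault(word, []).append((d, pos))
    return index

docs = ["it is what it is", "what is it", "it is a banana"]
index = build_inverted_index(docs)

# conjunctive query: intersect the sets of documents containing each word
query = "what is it".split()
hits = set.intersection(*(set(d for d, _ in index[w]) for w in query))
print(sorted(hits))   # [0, 1], i.e. {T0, T1}, matching Equation 17.31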
So far, we have described the standard process of information retrieval. But the link structure of the web provides an additional source of information. The basic idea is that some web pages are more authoritative than others, so these should be ranked higher (assuming they match the query). A web page is an authority if it is linked to by many other pages. But to protect against the effect of so-called link farms, which are dummy pages which just link to a given site to boost its apparent relevance, we will weight each incoming link by the source's authority. Thus we get the following recursive definition for the authoritativeness of page j, also called its PageRank:

\pi_j = \sum_i A_{ij} \pi_i \qquad (17.32)

where A_ij is the probability of following a link from i to j. We recognize Equation 17.32 as the stationary distribution of a Markov chain.
In the simplest setting, we define A_{i·} as a uniform distribution over all states that i is connected to. However, to ensure the distribution is unique, we need to make the chain into a regular chain. This can be done by allowing each state i to jump to any other state (including itself) with some small probability. This effectively makes the transition matrix aperiodic and fully connected (although the adjacency matrix G_ij of the web itself is highly sparse).
We discuss efficient methods for computing the leading eigenvector of this giant matrix below. But first, let us give an example of the PageRank algorithm. Consider the small web in Figure 17.5.
We find that the stationary distribution is
\pi = (0.3209, 0.1706, 0.1065, 0.1368, 0.0643, 0.2008) \qquad (17.33)
So a random surfer will visit site 1 about 32% of the time. We see that node 1 has a higher PageRank than nodes 4 or 6, even though they all have the same number of in-links. This is because being linked to from an influential node helps increase your PageRank score more than being linked to by a less influential node.
As a slightly larger example, Figure 17.6a shows a web graph, derived from the root of harvard.edu. Figure 17.6b shows the corresponding PageRank vector.
17.2.4.1 Efficiently computing the PageRank vector
Let G_ij = 1 iff there is a link from j to i. Now imagine performing a random walk on this graph, where at every time step, with probability p = 0.85 you follow one of the out-links uniformly at random, and with probability 1 − p you jump to a random node, again chosen uniformly at random. If there are no out-links, you just jump to a random page. (These random jumps, including self-transitions, ensure the chain is irreducible (singly connected) and regular. Hence we can solve for its unique stationary distribution using eigenvector methods.) This defines the following transition matrix:
Figure 17.6 (a) Web graph of 500 sites rooted at www.harvard.edu. (b) Corresponding PageRank vector. Figure generated by pagerankDemoPmtk, based on code by Cleve Moler (Moler 2004).
M_{ij} = \begin{cases} pG_{ij}/c_j + \delta & \text{if } c_j \neq 0 \\ 1/n & \text{if } c_j = 0 \end{cases} \qquad (17.34)

where n is the number of nodes, δ = (1 − p)/n is the probability of jumping from one page to another without following a link, and c_j = \sum_i G_{ij} represents the out-degree of page j. (If n = 4 × 10^9 and p = 0.85, then δ = 3.75 × 10^{−11}.) Here M is a stochastic matrix in which columns sum to one. Note that M = A^T in our earlier notation.

We can represent the transition matrix compactly as follows. Define the diagonal matrix D with entries

d_{jj} = \begin{cases} 1/c_j & \text{if } c_j \neq 0 \\ 0 & \text{if } c_j = 0 \end{cases} \qquad (17.35)
Define the vector z with components

z_j = \begin{cases} \delta & \text{if } c_j \neq 0 \\ 1/n & \text{if } c_j = 0 \end{cases} \qquad (17.36)

Then we can rewrite Equation 17.34 as follows:

M = pGD + \mathbf{1}z^T \qquad (17.37)
The matrix M is not sparse, but it is a rank one modification of a sparse matrix. Most of the elements of M are equal to the small constant δ. Obviously these do not need to be stored explicitly.

Our goal is to solve v = Mv, where v = π^T. One efficient method to find the leading eigenvector of a large matrix is known as the power method. This simply consists of repeated matrix-vector multiplication, followed by normalization:
v \propto Mv = pGDv + \mathbf{1}(z^T v) \qquad (17.38)
It is possible to implement the power method without using any matrix multiplications, by simply sampling from the transition matrix and counting how often you visit each state. This is essentially a Monte Carlo approximation to the sum implied by vMv. Applying this to the data in Figure 17.6a yields the stationary distribution in Figure 17.6b. This took 13 iterations to converge, starting from a uniform distribution. See also the function pagerankDemo, by Tim Davis, for an animation of the algorithm in action, applied to the small web example. To handle changing web structure, we can rerun this algorithm every day or every week, starting v off at the old distribution Langville and Meyer 2006.
For details on how to perform this Monte Carlo power method in a parallel distributed computing environment, see e.g., Rajaraman and Ullman 2010.
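A dense-matrix sketch of the power method of Equation 17.38, on a made-up 6-node graph (a real implementation would exploit the sparsity of G, as discussed above):

import numpy as np

def pagerank(G, p=0.85, n_iter=50):
    # G[i, j] = 1 iff there is a link from page j to page i
    n = G.shape[0]
    c = G.sum(axis=0)                                           # out-degree of each page
    D = np.diag(np.where(c > 0, 1.0 / np.maximum(c, 1), 0.0))   # Equation 17.35
    z = np.where(c > 0, (1.0 - p) / n, 1.0 / n)                 # Equation 17.36
    v = np.ones(n) / n
    for _ in range(n_iter):
        v = p * (G @ (D @ v)) + z @ v    # Equation 17.38: v <- Mv = pGDv + 1(z^T v)
        v /= v.sum()
    return v

G = np.zeros((6, 6))                     # made-up link structure
for src, dst in [(0, 1), (0, 2), (1, 3), (2, 0), (2, 3), (3, 0), (4, 3), (4, 5), (5, 4)]:
    G[dst, src] = 1.0
print(pagerank(G))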
17.2.4.2 Web spam
PageRank is not foolproof. For example, consider the strategy adopted by JC Penney, a department store in the USA. During the Christmas season of 2010, it planted many links to its home page on thousands of irrelevant web pages, thus increasing its ranking on Google's search engine (Segal 2011). Even though each of these source pages has low PageRank, there were so many of them that their effect added up. Businesses call this "search engine optimization"; Google calls it web spam. When Google was notified of this scam (by the New York Times), it manually downweighted JC Penney, since such behavior violates Google's code of conduct. The result was that JC Penney dropped from rank 1 to rank 65, essentially making it disappear from view. Automatically detecting such scams relies on various techniques which are beyond the scope of this chapter.
17.3 Hidden Markov models
As we mentioned in Section 10.2.2, a hidden Markov model or HMM consists of a discrete-time, discrete-state Markov chain, with hidden states z_t ∈ {1, ..., K}, plus an observation model
Figure 17.7 (a) Some 2d data sampled from a 3-state HMM. Each state emits from a 2d Gaussian. (b) The hidden state sequence. Based on Figure 13.8 of Bishop (2006b). Figure generated by hmmLillypadDemo.
p(x_t|z_t). The corresponding joint distribution has the form

p(z_{1:T}, x_{1:T}) = p(z_{1:T}) p(x_{1:T}|z_{1:T}) = \left[ p(z_1) \prod_{t=2}^T p(z_t|z_{t-1}) \right] \left[ \prod_{t=1}^T p(x_t|z_t) \right] \qquad (17.39)
The observations in an HMM can be discrete or continuous. If they are discrete, it is common for the observation model to be an observation matrix:

p(x_t = l|z_t = k, \theta) = B(k, l) \qquad (17.40)

If the observations are continuous, it is common for the observation model to be a conditional Gaussian:

p(x_t|z_t = k, \theta) = \mathcal{N}(x_t|\mu_k, \Sigma_k) \qquad (17.41)
Figure 17.7 shows an example where we have 3 states, each of which emits a different Gaussian. The resulting model is similar to a Gaussian mixture model, except the cluster membership has Markovian dynamics. (Indeed, HMMs are sometimes called Markov switching models (Fruhwirth-Schnatter 2007).) We see that we tend to get multiple observations in the same location, and then a sudden jump to a new cluster.
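A short sketch of sampling from such a model, with made-up parameters in the spirit of Figure 17.7; the sticky self-transitions are what produce runs within one cluster:

import numpy as np

rng = np.random.default_rng(0)
pi = np.array([1/3, 1/3, 1/3])
A = np.array([[0.90, 0.05, 0.05],    # large self-transition probabilities
              [0.05, 0.90, 0.05],
              [0.05, 0.05, 0.90]])
mus = np.array([[0., 0.], [5., 5.], [-5., 5.]])   # one 2d Gaussian mean per state

def sample_hmm(T):
    z = np.zeros(T, dtype=int)
    x = np.zeros((T, 2))
    z[0] = rng.choice(3, p=pi)
    x[0] = rng.normal(mus[z[0]], 1.0)             # spherical unit covariance for simplicity
    for t in range(1, T):
        z[t] = rng.choice(3, p=A[z[t - 1]])
        x[t] = rng.normal(mus[z[t]], 1.0)
    return z, x

z, x = sample_hmm(20)
print(z)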
17.3.1 Applications of HMMs
HMMs can be used as black-box density models on sequences. They have the advantage over Markov models in that they can represent long-range dependencies between observations, mediated via the latent variables. In particular, note that they do not assume the Markov property holds for the observations themselves. Such black-box models are useful for time-series prediction (Fraser 2008). They can also be used to define class-conditional densities inside a generative classifier.
However, it is more common to imbue the hidden states with some desired meaning, and to then try to estimate the hidden states from the observations, i.e., to compute p(z_t|x_{1:t}) if we are
Figure 17.8 (a) Some DNA sequences. (b) State transition diagram for a profile HMM. Source: Figure 5.7 of (Durbin et al. 1998). Used with kind permission of Richard Durbin.
in an online scenario, or p(z_t|x_{1:T}) if we are in an offline scenario (see Section 17.4.1 for further discussion of the differences between these two approaches). Below we give some examples of applications which use HMMs in this way:

Automatic speech recognition. Here x_t represents features extracted from the speech signal, and z_t represents the word that is being spoken. The transition model p(z_t|z_{t-1}) represents the language model, and the observation model p(x_t|z_t) represents the acoustic model. See e.g., (Jelinek 1997; Jurafsky and Martin 2008) for details.

Activity recognition. Here x_t represents features extracted from a video frame, and z_t is the class of activity the person is engaged in (e.g., running, walking, sitting, etc.) See e.g., (Szeliski 2010) for details.

Part of speech tagging. Here x_t represents a word, and z_t represents its part of speech (noun, verb, adjective, etc.) See Section 19.6.2.1 for more information on POS tagging and
related tasks.
Gene finding. Here x_t represents the DNA nucleotides (A, C, G, T), and z_t represents whether we are inside a gene-coding region or not. See e.g., (Schweikerta et al. 2009) for details.

Protein sequence alignment. Here x_t represents an amino acid, and z_t represents whether this matches the latent consensus sequence at this location. This model is called a profile HMM and is illustrated in Figure 17.8. The HMM has 3 states, called match, insert and delete. If z_t is a match state, then x_t is equal to the t'th value of the consensus. If z_t is an insert state, then x_t is generated from a uniform distribution that is unrelated to the consensus sequence. If z_t is a delete state, then x_t = −, the gap symbol. In this way, we can generate noisy copies of the consensus sequence of different lengths. In Figure 17.8(a), the consensus is AGC, and we see various versions of this below. A path through the state transition diagram, shown in Figure 17.8(b), specifies how to align a sequence to the consensus, e.g., for the gnat, the most probable path is D, D, I, I, I, M. This means we delete the A and G parts of the consensus sequence, we insert 3 As, and then we match the final C. We can estimate the model parameters by counting the number of such transitions, and the number of emissions from each kind of state, as shown in Figure 17.8(c). See Section 17.5 for more information on training an HMM, and Durbin et al. (1998) for details on profile HMMs.
Note that for some of these tasks, conditional random fields, which are essentially discriminative versions of HMMs, may be more suitable; see Chapter 19 for details.
17.4 Inference in HMMs
We now discuss how to infer the hidden state sequence of an HMM, assuming the parameters are known. Exactly the same algorithms apply to other chain-structured graphical models, such as chain CRFs (see Section 19.6.1). In Chapter 20, we generalize these methods to arbitrary graphs. And in Section 17.5.2, we show how we can use the output of inference in the context of parameter estimation.
17.4.1 Types of inference problems for temporal models
There are several different kinds of inferential tasks for an HMM (and SSM in general). To illustrate the differences, we will consider an example called the occasionally dishonest casino, from (Durbin et al. 1998). In this model, x_t ∈ {1, 2, ..., 6} represents which dice face shows up, and z_t represents the identity of the dice that is being used. Most of the time the casino uses a fair dice, z = 1, but occasionally it switches to a loaded dice, z = 2, for a short period. If z = 1 the observation distribution is a uniform multinoulli over the symbols {1, ..., 6}. If z = 2, the observation distribution is skewed towards face 6 (see Figure 17.9). If we sample from this model, we may observe data such as the following:
Listing 17.1 Example output of casinoDemo
Rolls: 664153216162115234653214356634261655234232315142464156663246 Die: LLLLLLLLLLLLLLFFFFFFLLLLLLLLLLLLLLFFFFFFFFFFFFFFFFFFLLLLLLLL
Here "Rolls" refers to the observed symbols and "Die" refers to the hidden state (L is loaded and F is fair). Thus we see that the model generates a sequence of symbols, but the statistics of the
Figure 17.9 An HMM for the occasionally dishonest casino. The blue arrows visualize the state transition diagram A. Based on Durbin et al. 1998, p54.
Figure 17.10 Inference in the dishonest casino. Vertical gray bars denote the samples that we generated using a loaded die. (a) Filtered estimate of probability of using a loaded die. (b) Smoothed estimates. (c) MAP trajectory. Figure generated by casinoDemo.
distribution changes abruptly every now and then. In a typical application, we just see the rolls and want to infer which dice is being used. But there are different kinds of inference, which we summarize below.
Filtering means to compute the belief state p(z_t|x_{1:t}) online, or recursively, as the data streams in. This is called "filtering" because it reduces the noise more than simply estimating the hidden state using just the current observation, p(z_t|x_t). We will see below that we can perform filtering by simply applying Bayes rule in a sequential fashion. See Figure 17.10(a) for an example.

Smoothing means to compute p(z_t|x_{1:T}) offline, given all the evidence. See Figure 17.10(b) for an example. By conditioning on past and future data, our uncertainty will be significantly reduced. To understand this intuitively, consider a detective trying to figure out who committed a crime. As he moves through the crime scene, his uncertainty is high until he finds the key clue; then he has an "aha" moment, his uncertainty is reduced, and all the previously confusing observations are, in hindsight, easy to explain.
Figure 17.11 The main kinds of inference for state-space models. The shaded region is the interval for which we have data. The arrow represents the time step at which we want to perform inference. t is the current time, T is the sequence length, l is the lag and h is the prediction horizon. See text for details.
Fixed lag smoothing is an interesting compromise between online and offline estimation; it involves computing p(z_{t−l}|x_{1:t}), where l > 0 is called the lag. This gives better performance than filtering, but incurs a slight delay. By changing the size of the lag, one can trade off accuracy vs delay.
Prediction. Instead of predicting the past given the future, as in fixed lag smoothing, we might want to predict the future given the past, i.e., to compute p(z_{t+h}|x_{1:t}), where h > 0 is called the prediction horizon. For example, suppose h = 2; then we have

p(z_{t+2}|x_{1:t}) = \sum_{z_{t+1}} \sum_{z_t} p(z_{t+2}|z_{t+1}) p(z_{t+1}|z_t) p(z_t|x_{1:t}) \qquad (17.42)

It is straightforward to perform this computation: we just power up the transition matrix and apply it to the current belief state. The quantity p(z_{t+h}|x_{1:t}) is a prediction about future hidden states; it can be converted into a prediction about future observations using

p(x_{t+h}|x_{1:t}) = \sum_{z_{t+h}} p(x_{t+h}|z_{t+h}) p(z_{t+h}|x_{1:t}) \qquad (17.43)
This is the posterior predictive density, and can be used for time-series forecasting (see Fraser 2008 for details). See Figure 17.11 for a sketch of the relationship between filtering, smoothing, and prediction.
MAP estimation. This means computing arg max_{z_{1:T}} p(z_{1:T}|x_{1:T}), which is a most probable state sequence. In the context of HMMs, this is known as Viterbi decoding (see
Section 17.4.4). Figure 17.10 illustrates the difference between filtering, smoothing and MAP decoding for the occasionally dishonest casino HMM. We see that the smoothed (offline) estimate is indeed smoother than the filtered (online) estimate. If we threshold the estimates at 0.5 and compare to the true sequence, we find that the filtered method makes 71 errors out of 300, the smoothed method makes 49/300, and the MAP path makes 60/300 errors. It is not surprising that smoothing makes fewer errors than Viterbi, since the optimal way to minimize bit-error rate is to threshold the posterior marginals (see Section 5.7.1.1). Nevertheless, for some applications, we may prefer the Viterbi decoding, as we discuss in Section 17.4.4.
Posterior samples. If there is more than one plausible interpretation of the data, it can be useful to sample from the posterior, z_{1:T} ∼ p(z_{1:T}|x_{1:T}). These sample paths contain much more information than the sequence of marginals computed by smoothing.
Probability of the evidence. We can compute the probability of the evidence, p(x_{1:T}), by summing up over all hidden paths, p(x_{1:T}) = \sum_{z_{1:T}} p(z_{1:T}, x_{1:T}). This can be used to classify sequences (e.g., if the HMM is used as a class conditional density), for model-based clustering, for anomaly detection, etc.

17.4.2 The forwards algorithm

We now describe how to recursively compute the filtered marginals, p(z_t|x_{1:t}), in an HMM. The algorithm has two steps. First comes the prediction step, in which we compute the one-step-ahead predictive density; this acts as the new prior for time t:

p(z_t = j|x_{1:t-1}) = \sum_i p(z_t = j|z_{t-1} = i) p(z_{t-1} = i|x_{1:t-1}) \qquad (17.44)

Next comes the update step, in which we absorb the observed data from time t using Bayes rule:

\alpha_t(j) \triangleq p(z_t = j|x_{1:t}) = p(z_t = j|x_t, x_{1:t-1}) \qquad (17.45)
= \frac{1}{Z_t} p(x_t|z_t = j, x_{1:t-1}) p(z_t = j|x_{1:t-1}) \qquad (17.46)

where the normalization constant is given by

Z_t \triangleq p(x_t|x_{1:t-1}) = \sum_j p(z_t = j|x_{1:t-1}) p(x_t|z_t = j) \qquad (17.47)

This process is known as the predict-update cycle. The distribution p(z_t|x_{1:t}) is called the (filtered) belief state at time t, and is a vector of K numbers, often denoted by α_t. In matrix-vector notation, we can write the update in the following simple form:

\alpha_t \propto \psi_t \odot (\Psi^T \alpha_{t-1}) \qquad (17.48)

where ψ_t(j) = p(x_t|z_t = j) is the local evidence at time t, Ψ(i, j) = p(z_t = j|z_{t-1} = i) is the transition matrix, and ⊙ is the Hadamard product, representing elementwise vector multiplication. See Algorithm 17.1 for the pseudocode, and hmmFilter for some Matlab code.
In addition to computing the hidden states, we can use this algorithm to compute the log probability of the evidence:

\log p(x_{1:T}) = \sum_{t=1}^T \log p(x_t|x_{1:t-1}) = \sum_{t=1}^T \log Z_t \qquad (17.49)

We need to work in the log domain to avoid numerical underflow.
Algorithm 17.1: Forwards algorithm

1 Input: Transition matrix Ψ(i, j) = p(z_t = j|z_{t-1} = i), local evidence vectors ψ_t(j) = p(x_t|z_t = j), initial state distribution π(j) = p(z_1 = j);
2 [α_1, Z_1] = normalize(ψ_1 ⊙ π);
3 for t = 2 : T do
4     [α_t, Z_t] = normalize(ψ_t ⊙ (Ψ^T α_{t-1}));
5 Return α_{1:T} and log p(x_{1:T}) = Σ_t log Z_t;
6 Subroutine: [v, Z] = normalize(u): Z = Σ_j u_j; v_j = u_j / Z;
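A direct NumPy transcription of Algorithm 17.1 (a sketch in the spirit of the hmmFilter Matlab code mentioned above), demonstrated on the occasionally dishonest casino of Section 17.4.1; the casino parameter values below are illustrative assumptions:

import numpy as np

def normalize(u):
    Z = u.sum()
    return u / Z, Z

def hmm_filter(init, Psi, psi):
    # init[j] = p(z_1 = j); Psi[i, j] = p(z_t = j | z_{t-1} = i)
    # psi[t, j] = p(x_t | z_t = j), the local evidence
    T, K = psi.shape
    alpha = np.zeros((T, K))
    alpha[0], Z = normalize(psi[0] * init)
    loglik = np.log(Z)
    for t in range(1, T):
        alpha[t], Z = normalize(psi[t] * (Psi.T @ alpha[t - 1]))   # predict-update, Eq. 17.48
        loglik += np.log(Z)                                        # Equation 17.49
    return alpha, loglik

# dishonest casino: state 0 = fair, state 1 = loaded (assumed parameter values)
Psi = np.array([[0.95, 0.05],
                [0.10, 0.90]])
B = np.vstack([np.ones(6) / 6,                   # fair die: uniform over faces
               [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]])  # loaded die: skewed towards 6
rolls = np.array([0, 5, 5, 5, 1, 2, 5, 5])       # faces 1..6 encoded as 0..5
alpha, ll = hmm_filter(np.array([0.5, 0.5]), Psi, B[:, rolls].T)
print(alpha[:, 1])   # filtered probability that the loaded die is in use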
17.4.3 The forwards-backwards algorithm
In Section 17.4.2, we explained how to compute the filtered marginals p(z_t = j|x_{1:t}) using online inference. We now discuss how to compute the smoothed marginals, p(z_t = j|x_{1:T}), using offline inference.
17.4.3.1 Basic idea
The key decomposition relies on the fact that we can break the chain into two parts, the past and the future, by conditioning on z_t:

p(z_t = j|x_{1:T}) \propto p(z_t = j, x_{t+1:T}|x_{1:t}) = p(z_t = j|x_{1:t}) p(x_{t+1:T}|z_t = j, x_{1:t}) \qquad (17.50)

where the last term simplifies to p(x_{t+1:T}|z_t = j), since the future is conditionally independent of the past given the present state. Let α_t(j) ≜ p(z_t = j|x_{1:t}) be the filtered belief state as before. Also, define

\beta_t(j) \triangleq p(x_{t+1:T}|z_t = j) \qquad (17.51)

as the conditional likelihood of future evidence given that the hidden state at time t is j. (Note that this is not a probability distribution over states, since it does not need to satisfy \sum_j \beta_t(j) = 1.) Finally, define

\gamma_t(j) \triangleq p(z_t = j|x_{1:T}) \qquad (17.52)

as the desired smoothed posterior marginal. From Equation 17.50, we have

\gamma_t(j) \propto \alpha_t(j) \beta_t(j) \qquad (17.53)
We have already described how to recursively compute the α's in a left-to-right fashion in Section 17.4.2. We now describe how to recursively compute the β's in a right-to-left fashion. If we have already computed β_t, we can compute β_{t-1} as follows:

\beta_{t-1}(i) = p(x_{t:T}|z_{t-1} = i) \qquad (17.54)
= \sum_j p(z_t = j, x_t, x_{t+1:T}|z_{t-1} = i) \qquad (17.55)
= \sum_j p(x_{t+1:T}|z_t = j, z_{t-1} = i, x_t) p(z_t = j, x_t|z_{t-1} = i) \qquad (17.56)
= \sum_j p(x_{t+1:T}|z_t = j) p(x_t|z_t = j) p(z_t = j|z_{t-1} = i) \qquad (17.57)
= \sum_j \beta_t(j) \psi_t(j) \Psi(i, j) \qquad (17.58)

We can write the resulting equation in matrix-vector form as

\beta_{t-1} = \Psi(\psi_t \odot \beta_t) \qquad (17.59)

The base case is

\beta_T(i) = p(x_{T+1:T}|z_T = i) = p(\emptyset|z_T = i) = 1 \qquad (17.60)

which is the probability of a non-event.

Having computed the forwards and backwards messages, we can combine them to compute γ_t(j) ∝ α_t(j) β_t(j). The overall algorithm is known as the forwards-backwards algorithm. The pseudo-code is very similar to the forwards case; see hmmFwdBack for an implementation.

We can think of this algorithm as passing messages from left to right, and then from right to left, and then combining them at each node. We will generalize this intuition in Section 20.2, when we discuss belief propagation.

17.4.3.2 Two-slice smoothed marginals

When we estimate the parameters of the transition matrix using EM (see Section 17.5), we will need to compute the expected number of transitions from state i to state j:

N_{ij} = \sum_{t=1}^{T-1} E[I(z_t = i, z_{t+1} = j)|x_{1:T}] = \sum_{t=1}^{T-1} p(z_t = i, z_{t+1} = j|x_{1:T}) \qquad (17.61)

The term p(z_t = i, z_{t+1} = j|x_{1:T}) is called a (smoothed) two-slice marginal, and can be computed as follows:

\xi_{t,t+1}(i, j) \triangleq p(z_t = i, z_{t+1} = j|x_{1:T}) \qquad (17.62)
\propto p(z_t|x_{1:t}) p(z_{t+1}|z_t, x_{t+1:T}) \qquad (17.63)
\propto p(z_t|x_{1:t}) p(x_{t+1:T}|z_t, z_{t+1}) p(z_{t+1}|z_t) \qquad (17.64)
= p(z_t|x_{1:t}) p(x_{t+1}|z_{t+1}) p(x_{t+2:T}|z_{t+1}) p(z_{t+1}|z_t) \qquad (17.65)
= \alpha_t(i) \psi_{t+1}(j) \beta_{t+1}(j) \Psi(i, j) \qquad (17.66)
In matrix-vector form, we have

\xi_{t,t+1} \propto \Psi \odot \left( \alpha_t (\psi_{t+1} \odot \beta_{t+1})^T \right) \qquad (17.67)

For another interpretation of these equations, see Section 20.2.4.3.
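Continuing the filtering sketch from Section 17.4.2, here is a minimal forwards-backwards pass computing the smoothed marginals γ and the two-slice marginals ξ; the inputs are made up, and β is rescaled per step to avoid underflow (γ and ξ are renormalized anyway):

import numpy as np

def hmm_fwd_back(init, Psi, psi):
    T, K = psi.shape
    alpha = np.zeros((T, K))
    beta = np.ones((T, K))                   # base case: beta_T = 1, Equation 17.60
    alpha[0] = psi[0] * init
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):                    # forwards pass (Algorithm 17.1)
        alpha[t] = psi[t] * (Psi.T @ alpha[t - 1])
        alpha[t] /= alpha[t].sum()
    for t in range(T - 2, -1, -1):           # backwards pass, Equation 17.59
        beta[t] = Psi @ (psi[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()             # rescale to avoid underflow
    gamma = alpha * beta                     # Equation 17.53
    gamma /= gamma.sum(axis=1, keepdims=True)
    return alpha, beta, gamma

def two_slice(alpha, beta, Psi, psi, t):
    # Equations 17.66-17.67: xi(i, j) proportional to alpha_t(i) Psi(i, j) psi_{t+1}(j) beta_{t+1}(j)
    xi = Psi * np.outer(alpha[t], psi[t + 1] * beta[t + 1])
    return xi / xi.sum()

Psi = np.array([[0.7, 0.3], [0.4, 0.6]])              # made-up 2-state chain
psi = np.array([[0.9, 0.2], [0.1, 0.8], [0.5, 0.5]])  # made-up local evidence
alpha, beta, gamma = hmm_fwd_back(np.array([0.5, 0.5]), Psi, psi)
print(gamma); print(two_slice(alpha, beta, Psi, psi, 0))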
17.4.3.3 Time and space complexity
It is clear that a straightforward implementation of FB takes O(K^2 T) time, since we must perform a K × K matrix multiplication at each step. For some applications, such as speech recognition, K is very large, so the O(K^2) term becomes prohibitive. Fortunately, if the transition matrix is sparse, we can reduce this substantially. For example, in a left-to-right transition matrix, the algorithm takes O(TK) time.

In some cases, we can exploit special properties of the state space, even if the transition matrix is not sparse. In particular, suppose the states represent a discretization of an underlying continuous state-space, and the transition matrix has the form Ψ(i, j) ∝ exp(−σ^{−2} ||z_i − z_j||^2), where z_i is the continuous vector represented by state i. Then one can implement the forwards-backwards algorithm in O(TK log K) time. This is very useful for models with large state spaces. See Section 22.2.6.1 for details.

In some cases, the bottleneck is memory, not time. The expected sufficient statistics needed by EM are \sum_t \xi_{t,t+1}(i, j); this takes constant space (independent of T); however, to compute them, we need O(KT) working space, since we must store α_t for t = 1, ..., T until we do the backwards pass. It is possible to devise a simple divide-and-conquer algorithm that reduces the space complexity from O(KT) to O(K log T), at the cost of increasing the running time from O(K^2 T) to O(K^2 T log T): see (Binder et al. 1997; Zweig and Padmanabhan 2000) for details.
The Viterbi algorithm
The Viterbi algorithm Viterbi 1967 can be used to compute the most probable sequence of
states in a chainstructured graphical model, i.e., it can compute
zarg max pz1:T x1:T17.68
z1:T
This is equivalent to computing a shortest path through the trellis diagram in Figure 17.12, where the nodes are possible states at each time step, and the node and edge weights are log probabilities. That is, the weight of a path z1 , z2 , . . . , zT is given by
T t2
Before discussing how the algorithm works, let us make one important remark: the jointly most probable sequence of states is not necessarily the same as the sequence of marginally most probable states. The former is given by Equation 17.68, and is what Viterbi computes, whereas the latter is given by the maximizer of the posterior marginals or MPM:
zarg max pz1 x1:T , . . . , arg max pzT x1:T17.70 z1 zT
by EM are
log 1z1log 1z1MAP vs MPE
log zt1, ztlog tzt 17.69
Figure 17.12 The trellis of states vs time for a Markov chain. Based on Rabiner 1989.
As a simple example of the difference, consider a chain with two time steps, defining the following joint (the final row and column show the marginals):

              X_1 = 0   X_1 = 1
    X_2 = 0     0.04      0.30   |  0.34
    X_2 = 1     0.36      0.30   |  0.66
                0.40      0.60

The joint MAP estimate is (0, 1), whereas the sequence of marginal MPMs is (1, 1).

The advantage of the joint MAP estimate is that it is always globally consistent. For example, suppose we are performing speech recognition and someone says "recognize speech". This could be misheard as "wreck a nice beach". Locally it may appear that "beach" is the most probable interpretation of that particular window of sound, but when we add the requirement that the data be explained by a single linguistically plausible path, this interpretation becomes less likely.

On the other hand, the MPM estimates can be more robust (Marroquin et al. 1987). To see why, note that in Viterbi, when we estimate z_t, we max out the other variables:

\hat{z}_t = \arg\max_{z_t} \max_{z_{1:t-1}, z_{t+1:T}} p(z_{1:t-1}, z_t, z_{t+1:T} | x_{1:T})    (17.71)

whereas when we use forwards-backwards, we sum out the other variables:

p(z_t | x_{1:T}) \propto \sum_{z_{1:t-1}, z_{t+1:T}} p(z_{1:t-1}, z_t, z_{t+1:T} | x_{1:T})    (17.72)

This makes the MPM in Equation 17.70 more robust, since we estimate each node averaging over its neighbors, rather than conditioning on a specific value of its neighbors.^6

[Footnote 6: In general, we may want to mix max and sum. For example, consider a joint distribution where we observe v and we want to query q; let n be the remaining nuisance variables. We define the MAP estimate as x_q^* = \arg\max_{x_q} \sum_{x_n} p(x_q, x_n | x_v), where we max over x_q and sum over x_n. By contrast, we define the MPE or most probable explanation as (x_q^*, x_n^*) = \arg\max_{x_q, x_n} p(x_q, x_n | x_v), where we max over both x_q and x_n. This terminology is due to Pearl (1988), although it is not widely used outside the Bayes net literature. Obviously MAP = MPE if n = \emptyset. However, if n \neq \emptyset, then summing out the nuisance variables can give different results than maxing them out. Summing out nuisance variables is more sensible, but computationally harder, because of the need to combine max and sum operations (Lerner and Parr 2001).]
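To make the distinction concrete, here is a minimal numerical check of the table above (plain numpy; the variable names are ours):

    import numpy as np

    # Joint p(X1, X2): rows index X1, columns index X2.
    joint = np.array([[0.04, 0.36],   # X1 = 0
                      [0.30, 0.30]])  # X1 = 1

    # Joint MAP: the single largest entry of the joint table.
    x1_map, x2_map = np.unravel_index(joint.argmax(), joint.shape)
    print("joint MAP:", (x1_map, x2_map))            # (0, 1)

    # MPM: maximize each posterior marginal separately.
    print("MPM:", (joint.sum(axis=1).argmax(),       # p(X1) = [0.40, 0.60] -> 1
                   joint.sum(axis=0).argmax()))      # p(X2) = [0.34, 0.66] -> 1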
17.4.4.2 Details of the algorithm
It is tempting to think that we can implement Viterbi by just replacing the sum-operator in forwards-backwards with a max-operator. The former is called the sum-product algorithm, and the latter the max-product algorithm. If there is a unique mode, running max-product and then computing using Equation 17.70 will give the same result as using Equation 17.68 (Weiss and Freeman 2001b), but in general it can lead to incorrect results if there are multiple equally probable joint assignments. The reason is that each node breaks ties independently, and hence may do so in a manner that is inconsistent with its neighbors. The Viterbi algorithm is therefore not quite as simple as replacing sum with max. In particular, the forwards pass does use max-product, but the backwards pass uses a traceback procedure to recover the most probable path through the trellis of states. Essentially, once z_t picks its most probable state, the previous nodes condition on this event, and therefore they will break ties consistently.
In more detail, define

\delta_t(j) \triangleq \max_{z_1, ..., z_{t-1}} p(z_{1:t-1}, z_t = j | x_{1:t})    (17.73)

This is the probability of ending up in state j at time t, given that we take the most probable path. The key insight is that the most probable path to state j at time t must consist of the most probable path to some other state i at time t-1, followed by a transition from i to j. Hence

\delta_t(j) = \max_i \delta_{t-1}(i) \psi(i, j) \phi_t(j)    (17.74)

We also keep track of the most likely previous state, for each possible state that we end up in:

a_t(j) = \arg\max_i \delta_{t-1}(i) \psi(i, j) \phi_t(j)    (17.75)

That is, a_t(j) tells us the most likely previous state on the most probable path to z_t = j. We initialize by setting

\delta_1(j) = \pi(j) \phi_1(j)    (17.76)

and we terminate by computing the most probable final state z_T^*:

z_T^* = \arg\max_i \delta_T(i)    (17.77)

We can then compute the most probable sequence of states using traceback:

z_t^* = a_{t+1}(z_{t+1}^*)    (17.78)
As usual, we have to worry about numerical underflow. We are free to normalize the \delta_t terms at each step; this will not affect the maximum. However, unlike the forwards-backwards case, we can also easily work in the log domain. The key difference is that \log \max = \max \log, whereas \log \sum \neq \sum \log. Hence we can use

\log \delta_t(j) \triangleq \max_{z_{1:t-1}} \log p(z_{1:t-1}, z_t = j | x_{1:t})    (17.79)
  = \max_i [\log \delta_{t-1}(i) + \log \psi(i, j) + \log \phi_t(j)]    (17.80)

In the case of Gaussian observation models, this can result in a significant (constant factor) speedup, since computing \log p(x_t | z_t) can be much faster than computing p(x_t | z_t) for a high-dimensional Gaussian. This is one reason why the Viterbi algorithm is widely used in the E step of EM (Section 17.5.2) when training large speech recognition systems based on HMMs.

Figure 17.13 Illustration of Viterbi decoding in a simple HMM for speech recognition. (a) A 3-state HMM for a single phone. We are visualizing the state transition diagram. We assume the observations have been vector quantized into 7 possible symbols, C_1, ..., C_7. Each state z_1, z_2, z_3 has a different distribution over these symbols. Based on Figure 15.20 of (Russell and Norvig 2002). (b) Illustration of the Viterbi algorithm applied to this model, with data sequence C_1, C_3, C_4, C_6. The columns represent time, and the rows represent states. An arrow from state i at t-1 to state j at t is annotated with two numbers: the first is the probability of the i -> j transition, and the second is the probability of generating observation x_t from state j. The bold lines/circles represent the most probable sequence of states. Based on Figure 24.27 of (Russell and Norvig 1995).
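As a minimal sketch of the updates above (our own numpy illustration, not the book's pmtk code), a log-domain Viterbi decoder might look as follows; log_pi, log_psi, and log_phi are the log initial distribution, log transition matrix, and T x K log local evidence:

    import numpy as np

    def viterbi(log_pi, log_psi, log_phi):
        """Most probable state path. log_pi: (K,), log_psi: (K,K), log_phi: (T,K)."""
        T, K = log_phi.shape
        delta = np.zeros((T, K))          # log delta_t(j), Eqs. 17.79-17.80
        a = np.zeros((T, K), dtype=int)   # traceback pointers, Eq. 17.75
        delta[0] = log_pi + log_phi[0]
        for t in range(1, T):
            scores = delta[t-1][:, None] + log_psi + log_phi[t][None, :]  # (K,K)
            a[t] = scores.argmax(axis=0)  # best previous state i, for each j
            delta[t] = scores.max(axis=0)
        z = np.zeros(T, dtype=int)
        z[-1] = delta[-1].argmax()        # Eq. 17.77
        for t in range(T - 2, -1, -1):    # traceback, Eq. 17.78
            z[t] = a[t + 1][z[t + 1]]
        return z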
17.4.4.3 Example

Figure 17.13 gives a worked example of the Viterbi algorithm, based on (Russell et al. 1995). Suppose we observe the discrete sequence of observations x_{1:4} = (C_1, C_3, C_4, C_6), representing codebook entries in a vector-quantized version of a speech signal. The model starts in state z_1. The probability of generating C_1 in z_1 is 0.5, so we have \delta_1(1) = 0.5, and \delta_1(i) = 0 for all other states. Next we can self-transition to z_1 with probability 0.3, or transition to z_2 with probability 0.7. If we end up in z_1, the probability of generating C_3 is 0.3; if we end up in z_2, the probability of generating C_3 is 0.2. Hence we have

\delta_2(1) = \delta_1(1) \psi(1, 1) \phi_2(1) = 0.5 x 0.3 x 0.3 = 0.045    (17.81)
\delta_2(2) = \delta_1(1) \psi(1, 2) \phi_2(2) = 0.5 x 0.7 x 0.2 = 0.07    (17.82)

Thus state 2 is more probable at t = 2; see the second column of Figure 17.13(b). In time step 3, we see that there are two paths into z_2, from z_1 and from z_2. The bold arrow indicates that the latter is more probable, so this is the only one we have to remember. The algorithm continues in this way until we have reached the end of the sequence. Once we have reached the end, we can follow the black arrows back to recover the MAP path, which is (1, 2, 2, 3).
17.4.4.4 Time and space complexity
The time complexity of Viterbi is clearly O(K^2 T) in general, and the space complexity is O(KT), both the same as forwards-backwards. If the transition matrix has the form \psi(i, j) \propto \exp(-\sigma^2 ||z_i - z_j||^2), where z_i is the continuous vector represented by state i, we can implement Viterbi in O(TK) time, instead of the O(TK \log K) needed by forwards-backwards. See Section 22.2.6.1 for details.
17.4.4.5 N-best list
The Viterbi algorithm returns one of the most probable paths. It can be extended to return the top N paths (Schwarz and Chow 1990; Nilsson and Goldberger 2001). This is called the N-best list. One can then use a discriminative method to rerank the paths, based on global features derived from the fully observed state sequence as well as the visible features. This technique is widely used in speech recognition. For example, consider the sentence "recognize speech". It is possible that the most probable interpretation by the system of this acoustic signal is "wreck a nice speech", or maybe "wreck a nice beach". Maybe the correct interpretation is much lower down on the list. However, by using a reranking system, we may be able to improve the score of the correct interpretation based on a more global context.
One problem with the Nbest list is that often the top N paths are very similar to each other, rather than representing qualitatively different interpretations of the data. Instead we might want to generate a more diverse set of paths to more accurately represent posterior uncertainty. One way to do this is to sample paths from the posterior, as we discuss below. For some other ways to generate diverse MAP estimates, see e.g., Yadollahpour et al. 2011; Kulesza and Taskar 2011.
17.4.5 Forwards filtering, backwards sampling
It is often useful to sample paths from the posterior:

z_{1:T}^s \sim p(z_{1:T} | x_{1:T})    (17.83)

We can do this as follows: run forwards-backwards to compute the two-slice smoothed posteriors, p(z_{t-1,t} | x_{1:T}); next compute the conditionals p(z_t | z_{t-1}, x_{1:T}) by normalizing; sample from the initial pair of states, z_{1,2}^* \sim p(z_{1,2} | x_{1:T}); finally, recursively sample z_t^* \sim p(z_t | z_{t-1}^*, x_{1:T}).

Note that the above solution requires a forwards-backwards pass, and then an additional forwards sampling pass. An alternative is to do the forwards pass, and then perform sampling
in the backwards pass. The key insight into how to do this is that we can write the joint from right to left using

p(z_{1:T} | x_{1:T}) = p(z_T | x_{1:T}) \prod_{t=T-1}^{1} p(z_t | z_{t+1}, x_{1:T})    (17.84)

We can then sample z_t given future sampled states using

z_t^s \sim p(z_t | z_{t+1:T}^s, x_{1:T}) = p(z_t | z_{t+1}^s, z_{t+2:T}^s, x_{1:t}, x_{t+1:T}) = p(z_t | z_{t+1}^s, x_{1:t})    (17.85)

The sampling distribution is given by

p(z_t = i | z_{t+1} = j, x_{1:t}) = p(z_t | z_{t+1}, x_{1:t}, x_{t+1})    (17.86)
  = p(z_{t+1}, z_t | x_{1:t+1}) / p(z_{t+1} | x_{1:t+1})    (17.87)
  = p(x_{t+1} | z_{t+1}, z_t, x_{1:t}) p(z_{t+1} | z_t, x_{1:t}) p(z_t | x_{1:t}) / [p(x_{t+1} | x_{1:t}) p(z_{t+1} | x_{1:t+1})]    (17.88)
  = \phi_{t+1}(j) \psi(i, j) \alpha_t(i) / [p(x_{t+1} | x_{1:t}) \alpha_{t+1}(j)]    (17.89)
  \propto \psi(i, j) \alpha_t(i)    (17.90)

where the final proportionality is over i, since the remaining factors do not depend on i. The base case is

z_T^s \sim p(z_T = i | x_{1:T}) = \alpha_T(i)    (17.91)

This algorithm forms the basis of blocked-Gibbs sampling methods for parameter inference, as we will see below.
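A compact sketch of forwards filtering, backwards sampling (our own numpy illustration, not the book's code): alpha holds the normalized filtered marginals \alpha_t(i) from a forwards pass, and psi is the transition matrix:

    import numpy as np

    rng = np.random.default_rng(0)

    def ffbs(alpha, psi):
        """Draw one path z_{1:T} ~ p(z_{1:T}|x_{1:T}). alpha: (T,K), psi: (K,K)."""
        T, K = alpha.shape
        z = np.zeros(T, dtype=int)
        z[-1] = rng.choice(K, p=alpha[-1])          # base case, Eq. 17.91
        for t in range(T - 2, -1, -1):
            # p(z_t = i | z_{t+1} = j, x_{1:t}) ∝ psi[i, j] * alpha[t, i], Eq. 17.90
            w = psi[:, z[t + 1]] * alpha[t]
            z[t] = rng.choice(K, p=w / w.sum())
        return z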
17.5 Learning for HMMs
We now discuss how to estimate the parameters \theta = (\pi, A, B), where \pi_i = p(z_1 = i) is the initial state distribution, A_{ij} = p(z_t = j | z_{t-1} = i) is the transition matrix, and B are the parameters of the class-conditional densities p(x_t | z_t = j). We first consider the case where z_{1:T} is observed in the training set, and then the harder case where z_{1:T} is hidden.
17.5.1 Training with fully observed data
If we observe the hidden state sequences, we can compute the MLEs for A and \pi exactly as in Section 17.2.2.1. If we use a conjugate prior, we can also easily compute the posterior.
The details on how to estimate B depend on the form of the observation model. The situation is identical to fitting a generative classifier. For example, if each state has a multinoulli distribution associated with it, with parameters B_{jl} = p(X_t = l | z_t = j), where l \in {1, ..., L} represents the observed symbol, the MLE is given by

\hat{B}_{jl} = N_{jl} / N_j,   N_{jl} = \sum_{i=1}^{N} \sum_{t=1}^{T_i} I(z_{i,t} = j, x_{i,t} = l)    (17.92)

where N_j = \sum_{i=1}^{N} \sum_{t=1}^{T_i} I(z_{i,t} = j) is the number of times we visit state j.
This result is quite intuitive: we simply add up the number of times we are in state j and we see a symbol l, and divide by the number of times we are in state j.
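As a minimal sketch of this count-and-normalize recipe (our notation, assuming integer-coded states and symbols):

    import numpy as np

    def hmm_mle_observed(state_seqs, obs_seqs, K, L):
        """Count-based MLEs for pi, A, B from fully observed sequences (Eq. 17.92)."""
        pi = np.zeros(K); A = np.zeros((K, K)); B = np.zeros((K, L))
        for z, x in zip(state_seqs, obs_seqs):
            pi[z[0]] += 1
            for t in range(1, len(z)):
                A[z[t - 1], z[t]] += 1               # transition counts
            for zt, xt in zip(z, x):
                B[zt, xt] += 1                       # emission counts
        # Normalize each distribution; a tiny pseudocount avoids 0/0 for unseen states.
        pi /= pi.sum()
        A = (A + 1e-10) / (A + 1e-10).sum(axis=1, keepdims=True)
        B = (B + 1e-10) / (B + 1e-10).sum(axis=1, keepdims=True)
        return pi, A, B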
Similarly, if each state has a Gaussian distribution associated with it, we have from Section 4.2.4 the following MLEs:

\hat{\mu}_k = \bar{x}_k / N_k,   \hat{\Sigma}_k = [(\overline{x x^T})_k - N_k \hat{\mu}_k \hat{\mu}_k^T] / N_k    (17.93)

where the sufficient statistics are given by

\bar{x}_k = \sum_{i=1}^{N} \sum_{t=1}^{T_i} I(z_{i,t} = k) x_{i,t}    (17.94)
(\overline{x x^T})_k = \sum_{i=1}^{N} \sum_{t=1}^{T_i} I(z_{i,t} = k) x_{i,t} x_{i,t}^T    (17.95)
Analogous results can be derived for other kinds of distributions. One can also easily extend all
of these results to compute MAP estimates, or even full posteriors over the parameters.
17.5.2 EM for HMMs the BaumWelch algorithm
If the zt variables are not observed, we are in a situation analogous to fitting a mixture model. The most common approach is to use the EM algorithm to find the MLE or MAP parameters, although of course one could use other gradientbased methods see e.g., Baldi and Chauvin 1994. In this Section, we derive the EM algorithm. When applied to HMMs, this is also known as the BaumWelch algorithm Baum et al. 1970.
17.5.2.1 E step
It is straightforward to show that the expected complete data log likelihood is given by
Q(\theta, \theta^{old}) = \sum_{k=1}^{K} E[N_k^1] \log \pi_k + \sum_{j=1}^{K} \sum_{k=1}^{K} E[N_{jk}] \log A_{jk}    (17.96)
  + \sum_{i=1}^{N} \sum_{t=1}^{T_i} \sum_{k=1}^{K} p(z_t = k | x_i, \theta^{old}) \log p(x_{i,t} | \phi_k)    (17.97)

where the expected counts are given by

E[N_k^1] = \sum_{i=1}^{N} p(z_{i1} = k | x_i, \theta^{old})    (17.98)
E[N_{jk}] = \sum_{i=1}^{N} \sum_{t=2}^{T_i} p(z_{i,t-1} = j, z_{i,t} = k | x_i, \theta^{old})    (17.99)
E[N_j] = \sum_{i=1}^{N} \sum_{t=1}^{T_i} p(z_{i,t} = j | x_i, \theta^{old})    (17.100)
These expected sufficient statistics can be computed by running the forwards-backwards algorithm on each sequence. In particular, this algorithm computes the following smoothed node and edge marginals:

\gamma_{i,t}(j) \triangleq p(z_t = j | x_{i,1:T_i}, \theta)    (17.101)
\xi_{i,t}(j, k) \triangleq p(z_{t-1} = j, z_t = k | x_{i,1:T_i}, \theta)    (17.102)

17.5.2.2 M step

Based on Section 11.3, we have that the M step for A and \pi is to just normalize the expected counts:

\hat{A}_{jk} = E[N_{jk}] / \sum_{k'} E[N_{jk'}],   \hat{\pi}_k = E[N_k^1] / N    (17.103)

This result is quite intuitive: we simply add up the expected number of transitions from j to k, and divide by the expected number of times we transition from j to anything else.

For a multinoulli observation model, the expected sufficient statistics are

E[M_{jl}] = \sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j) I(x_{i,t} = l) = \sum_{i=1}^{N} \sum_{t: x_{i,t} = l} \gamma_{i,t}(j)    (17.104)

The M step has the form

\hat{B}_{jl} = E[M_{jl}] / E[N_j]    (17.105)

This result is quite intuitive: we simply add up the expected number of times we are in state j and we see a symbol l, and divide by the expected number of times we are in state j.

For a Gaussian observation model, the expected sufficient statistics are given by

E[\bar{x}_k] = \sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(k) x_{i,t}    (17.106)
E[(\overline{x x^T})_k] = \sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(k) x_{i,t} x_{i,t}^T    (17.107)

The M step becomes

\hat{\mu}_k = E[\bar{x}_k] / E[N_k],   \hat{\Sigma}_k = [E[(\overline{x x^T})_k] - E[N_k] \hat{\mu}_k \hat{\mu}_k^T] / E[N_k]    (17.108)

This can and should be regularized in the same way we regularize GMMs.
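As a rough sketch of the M step for a discrete observation model (our own numpy illustration; we assume gammas and xis are the smoothed marginals of Eqs. 17.101-17.102, as produced by a forwards-backwards routine such as the hmmFwdBack mentioned above):

    import numpy as np

    def m_step_discrete(gammas, xis, X, K, L):
        """One M step (Eqs. 17.103-17.105). gammas[i]: (T_i,K); xis[i]: (T_i-1,K,K)."""
        pi = sum(g[0] for g in gammas)                    # E[N_k^1]
        A = sum(x.sum(axis=0) for x in xis)               # expected transition counts
        B = np.zeros((K, L))
        for g, x in zip(gammas, X):
            for t, xt in enumerate(x):
                B[:, xt] += g[t]                          # expected emission counts
        return (pi / pi.sum(),
                A / A.sum(axis=1, keepdims=True),
                B / B.sum(axis=1, keepdims=True))

Alternating this with an E step (forwards-backwards on each sequence under the current parameters) gives one full Baum-Welch iteration.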
17.5.2.3 Initialization

As usual with EM, we must take care to ensure that we initialize the parameters carefully, to minimize the chance of getting stuck in poor local optima. There are several ways to do this, such as the following:
- Use some fully labeled data to initialize the parameters.
- Initially ignore the Markov dependencies, and estimate the observation parameters using the standard mixture model estimation methods, such as K-means or EM.
- Randomly initialize the parameters, use multiple restarts, and pick the best solution.
Techniques such as deterministic annealing Ueda and Nakano 1998; Rao and Rose 2001 can help mitigate the effect of local minima. Also, just as Kmeans is often used to initialize EM for GMMs, so it is common to initialize EM for HMMs using Viterbi training, which means approximating the posterior over paths with the single most probable path. This is not necessarily a good idea, since initially the parameters are often poorly estimated, so the Viterbi path will be fairly arbitrary. A safer option is to start training using forwardsbackwards, and to switch to Viterbi near convergence.
17.5.3 Bayesian methods for fitting HMMs
EM returns a MAP estimate of the parameters. In this section, we briefly discuss some methods for Bayesian parameter estimation in HMMs. These methods rely on material that we will cover later in the book.
One approach is to use variational Bayes EM VBEM, which we discuss in general terms in Section 21.6. The details for the HMM case can be found in MacKay 1997; Beal 2003, but the basic idea is this: The E step uses forwardsbackwards, but where roughly speaking we plug in the posterior mean parameters instead of the MAP estimates. The M step updates the parameters of the conjugate posteriors, instead of updating the parameters themselves.
An alternative to VBEM is to use MCMC. A particularly appealing algorithm is block Gibbs sampling, which we discuss in general terms in Section 24.2.8. The details for the HMM case can be found in FruhwirthSchnatter 2007, but the basic idea is this: we sample z1:T given the data and parameters using forwardsfiltering, backwardssampling, and we then sample the parameters from their posteriors, conditional on the sampled latent paths. This is simple to implement, but one does need to take care of unidentifiability label switching, just as with mixture models see Section 11.3.1.
17.5.4 Discriminative training
Sometimes HMMs are used as the class-conditional density inside a generative classifier. In this case, p(x | y = c, \theta) can be computed using the forwards algorithm. We can easily maximize the joint likelihood \prod_{i=1}^{N} p(x_i, y_i | \theta) by using EM (or some other method) to fit the HMM for each class-conditional density separately.
However, we might like to find the parameters that maximize the conditional likelihood
\prod_{i=1}^{N} p(y_i | x_i, \theta) = \prod_{i=1}^{N} \frac{p(y_i) p(x_i | y_i, \theta)}{\sum_c p(y_i = c) p(x_i | c, \theta)}    (17.109)
This is more expensive than maximizing the joint likelihood, since the denominator couples all C
classconditional HMMs together. Furthermore, EM can no longer be used, and one must resort
to generic gradient based methods. Nevertheless, discriminative training can result in improved accuracies. The standard practice in speech recognition is to initially train the generative models separately using EM, and then to fine tune them discriminatively Jelinek 1997.
17.5.5 Model selection
In HMMs, the two main model selection issues are: how many states, and what topology to use for the state transition diagram. We discuss both of these issues below.
17.5.5.1 Choosing the number of hidden states
Choosing the number of hidden states K in an HMM is analogous to the problem of choosing the number of mixture components. Here are some possible solutions:
- Use grid-search over a range of K's, using as an objective function the cross-validated likelihood, the BIC score, or a variational lower bound to the log marginal likelihood.
- Use reversible jump MCMC. See (Fruhwirth-Schnatter 2007) for details. Note that this is very slow and is not widely used.
- Use variational Bayes to "extinguish" unwanted components, by analogy to the GMM case discussed in Section 21.6.1.6. See (MacKay 1997; Beal 2003) for details.
- Use an "infinite HMM", which is based on the hierarchical Dirichlet process. See e.g., (Beal et al. 2002; Teh et al. 2006) for details.
17.5.5.2 Structure learning
The term structure learning in the context of HMMs refers to learning a sparse transition matrix. That is, we want to learn the structure of the state transition diagram, not the structure of the graphical model which is fixed. A large number of heuristic methods have been proposed. Most alternate between parameter estimation and some kind of heuristic split merge method see e.g., Stolcke and Omohundro 1992.
Alternatively, one can pose the problem as MAP estimation using a minimum entropy prior, of the form
p(A_{i,:}) \propto \exp(-H(A_{i,:}))    (17.110)
This prior prefers states whose outgoing distribution is nearly deterministic, and hence has low entropy Brand 1999. The corresponding M step cannot be solved in closed form, but numerical methods can be used. The trouble with this is that we might prune out all incoming transitions to a state, creating isolated islands in statespace. The infinite HMM presents an interesting alternative to these methods. See e.g., Beal et al. 2002; Teh et al. 2006 for details.
17.6 Generalizations of HMMs
Many variants of the basic HMM model have been proposed. We briefly discuss some of them below.
Figure 17.14 Encoding a hidden semi-Markov model as a DGM. The D_t are deterministic duration counters.
17.6.1 Variable duration (semi-Markov) HMMs
In a standard HMM, the probability that we remain in state i for exactly d steps is

p(t_i = d) = (1 - A_{ii}) A_{ii}^d \propto \exp(d \log A_{ii})    (17.111)
where Aii is the selfloop probability. This is called the geometric distribution. However, this kind of exponentially decaying function of d is sometimes unrealistic.
To allow for more general durations, one can use a semiMarkov model. It is called semi Markov because to predict the next state, it is not sufficient to condition on the past state: we also need to know how long weve been in that state. When the state space is not observed directly, the result is called a hidden semiMarkov model HSMM, a variable duration HMM, or an explicit duration HMM.
HSMMs are widely used in many gene finding programs, since the length distribution of exons and introns is not geometric see e.g., Schweikerta et al. 2009, and in some chipSeq data analysis programs see e.g., Kuan et al. 2009.
HSMMs are useful not only because they can model the waiting time of each state more accurately, but also because they can model the distribution of a whole batch of observations at once, instead of assuming all observations are conditionally iid. That is, they can use likelihood models of the form p(x_{t:t+l} | z_t = k, d_t = l), which generate l correlated observations when the duration in state k is l time steps. This is useful for modeling data that is piecewise linear, or shows other local trends (Ostendorf et al. 1996).
17.6.1.1 HSMM as augmented HMMs
One way to represent an HSMM is to use the graphical model shown in Figure 17.14. (In this figure, we have assumed the observations are iid within each state, but this is not required, as mentioned above.) The D_t \in {0, 1, ..., D} node is a state duration counter, where D is the maximum duration of any state. When we first enter state j, we sample D_t from the duration distribution for that state, D_t \sim p_j(.). Thereafter, D_t deterministically counts down
until D_t = 0. While D_t > 0, the state z_t is not allowed to change. When D_t = 0, we make a stochastic transition to a new state. More precisely, we define the CPDs as follows:

p(D_t = d' | D_{t-1} = d, z_t = j) = { p_j(d') if d = 0;  1 if d' = d - 1 and d >= 1;  0 otherwise }    (17.112)

p(z_t = k | z_{t-1} = j, D_{t-1} = d) = { A_{jk} if d = 0 and j != k;  1 if d > 0 and j = k;  0 otherwise }    (17.113)

Note that p_j(d) could be represented as a table (a non-parametric approach) or as some kind of parametric distribution, such as a Gamma distribution. If p_j(d) is a geometric distribution, this emulates a standard HMM.

One can perform inference in this model by defining a mega-variable Y_t = (D_t, z_t). However, this is rather inefficient, since D_t is deterministic. It is possible to marginalize D_t out, and derive special purpose inference procedures. See (Guedon 2003; Yu and Kobayashi 2006) for details. Unfortunately, all these methods take O(TK^2 D) time, where T is the sequence length, K is the number of states, and D is the maximum duration of any state.

17.6.1.2 Approximations to semi-Markov models

A more efficient, but less flexible, way to model non-geometric waiting times is to replace each state with n new states, each with the same emission probabilities as the original state. For example, consider the model in Figure 17.15(a). Obviously the smallest sequence this can generate is of length n = 4. Any path of length d through the model has probability p^{d-n} (1-p)^n; multiplying by the number of possible paths, we find that the total probability of a path of length d is

p(d) = \binom{d-1}{n-1} p^{d-n} (1-p)^n    (17.114)

Figure 17.15 (a) A Markov chain with n = 4 repeated states and self loops. (b) The resulting distribution over sequence lengths, for p = 0.99 and various n. Figure generated by hmmSelfLoopDist.
Figure 17.16 An example of an HHMM for an ASR system which can recognize 3 words. The top level represents bigram word probabilities. The middle level represents the phonetic spelling of each word. The bottom level represents the subphones of each phone. (It is traditional to represent a phone as a 3-state HMM, representing the beginning, middle and end.) Based on Figure 7.5 of (Jurafsky and Martin 2000).
This is equivalent to the negative binomial distribution. By adjusting n and the selfloop probabilities p of each state, we can model a wide range of waiting times: see Figure 17.15b.
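For example, the induced duration distribution of Equation 17.114 can be computed directly (a sketch in the spirit of the hmmSelfLoopDist figure; the names are ours):

    from math import comb

    def duration_dist(n, p, d_max):
        """p(d) = C(d-1, n-1) p^(d-n) (1-p)^n for d >= n (Eq. 17.114)."""
        return [comb(d - 1, n - 1) * p**(d - n) * (1 - p)**n if d >= n else 0.0
                for d in range(1, d_max + 1)]

    for n in (1, 2, 5):
        pd = duration_dist(n, p=0.99, d_max=600)
        # The mode shifts right and the distribution becomes less skewed as n grows.
        print(n, max(range(len(pd)), key=lambda i: pd[i]) + 1)

This is the negative binomial form: with n = 1 we recover the geometric distribution, while larger n gives a peaked, non-monotonic duration distribution.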
Let E be the number of expansions of each state needed to approximate p_j(d). Forwards-backwards on this model takes O(TKE F_in) time, where F_in is the average number of predecessor states, compared to O(TK F_in D) for the HSMM. For typical speech recognition applications, F_in ~ 3, D ~ 50, K ~ 10^6, T ~ 10^5. Similar figures apply to problems such as gene finding, which also often uses HSMMs. Since F_in D >> E F_in, the expanded state method is much faster than an HSMM. See (Johnson 2005) for details.
17.6.2 Hierarchical HMMs
A hierarchical HMM (HHMM) (Fine et al. 1998) is an extension of the HMM that is designed to model domains with hierarchical structure. Figure 17.16 gives an example of an HHMM used in automatic speech recognition. The phone and subphone models can be called from different higher level contexts. We can always flatten an HHMM to a regular HMM, but a factored representation is often easier to interpret, and allows for more efficient inference and model fitting.

HHMMs have been used in many application domains, e.g., speech recognition (Bilmes 2001), gene finding (Hu et al. 2000), plan recognition (Bui et al. 2002), monitoring transportation patterns (Liao et al. 2007), indoor robot localization (Theocharous et al. 2004), etc. HHMMs are less expressive than stochastic context free grammars (SCFGs), since they only allow hierarchies of bounded depth, but they support more efficient inference. In particular, inference in SCFGs (using the inside-outside algorithm, (Jurafsky and Martin 2008)) takes O(T^3) time, whereas inference in an HHMM takes O(T) time (Murphy and Paskin 2001).
We can represent an HHMM as a directed graphical model, as shown in Figure 17.17. Q_t^l represents the state at time t and level l. A state transition at level l is only allowed if the chain at the level below has finished, as determined by the F_t^{l+1} node. The chain below finishes when it chooses to enter its end state. This mechanism ensures that higher level chains evolve more slowly than lower level chains, i.e., lower levels are nested within higher levels.

Figure 17.17 An HHMM represented as a DGM. Q_t^l is the state at time t, level l; F_t^l = 1 if the HMM at level l has finished (entered its exit state), otherwise F_t^l = 0. Shaded nodes are observed; the remaining nodes are hidden. We may optionally clamp F_T^l = 1, where T is the length of the observation sequence, to ensure all models have finished by the end of the sequence. Source: Figure 2 of (Murphy and Paskin 2001).
A variable duration HMM can be thought of as a special case of an HHMM, where the top level is a deterministic counter, and the bottom level is a regular HMM, which can only change states once the counter has timed out. See Murphy and Paskin 2001 for further details.
17.6.3 Input-output HMMs

It is straightforward to extend an HMM to handle inputs, as shown in Figure 17.18(a). This defines a conditional density model for sequences of the form

p(y_{1:T}, z_{1:T} | u_{1:T}, \theta)    (17.115)

where u_t is the input at time t; this is sometimes called a control signal. If the inputs and outputs are continuous, a typical parameterization would be

p(z_t | u_t, z_{t-1} = i, \theta) = Cat(z_t | S(W_i u_t))    (17.116)
p(y_t | u_t, z_t = j, \theta) = N(y_t | V_j u_t, \Sigma_j)    (17.117)

Thus the transition matrix is a logistic regression model whose parameters depend on the
previous state. The observation model is a Gaussian whose parameters depend on the current state.
Figure 17.18 (a) Input-output HMM. (b) First-order autoregressive HMM. (c) A second-order buried Markov model. Depending on the value of the hidden variables, the effective graph structure between the components of the observed variables (i.e., the non-zero elements of the regression matrix and the precision matrix) can change, although this is not shown.
The whole model can be thought of as a hidden version of a maximum entropy Markov model (Section 19.6.1).
Conditional on the inputs u_{1:T} and the parameters \theta, one can apply the standard forwards-backwards algorithm to estimate the hidden states. It is also straightforward to derive an EM algorithm to estimate the parameters; see (Bengio and Frasconi 1996) for details.
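As a small sketch of the transition model in Equation 17.116 (our own illustration; S is the softmax function, and the shapes are assumptions):

    import numpy as np

    def softmax(a):
        a = a - a.max()          # subtract max for numerical stability
        e = np.exp(a)
        return e / e.sum()

    def transition_probs(W, u, i):
        """p(z_t | u_t, z_{t-1} = i) = Cat(S(W_i u_t)); W: (K,K,D), u: (D,)."""
        return softmax(W[i] @ u)

    K, D = 3, 2
    W = np.random.randn(K, K, D)  # one K x D weight matrix per previous state
    print(transition_probs(W, np.array([1.0, -0.5]), i=0))  # a distribution over K states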
17.6.4 Autoregressive and buried HMMs
The standard HMM assumes the observations are conditionally independent given the hidden state. In practice this is often not the case. However, it is straightforward to have direct arcs from x_{t-1} to x_t as well as from z_t to x_t, as in Figure 17.18(b). This is known as an autoregressive HMM, or a regime switching Markov model. For continuous data, the observation model becomes

p(x_t | x_{t-1}, z_t = j, \theta) = N(x_t | W_j x_{t-1} + \mu_j, \Sigma_j)    (17.118)
This is a linear regression model, where the parameters are chosen according to the current hidden state. We can also consider higher-order extensions, where we condition on the last L observations:

p(x_t | x_{t-L:t-1}, z_t = j, \theta) = N(x_t | \sum_{l=1}^{L} W_{j,l} x_{t-l} + \mu_j, \Sigma_j)    (17.119)

Such models are widely used in econometrics (Hamilton 1990). Similar models can be defined for discrete observations.

The AR-HMM essentially combines two Markov chains, one on the hidden variables, to capture long range dependencies, and one on the observed variables, to capture short range dependencies (Berchtold 1999).
Figure 17.19 (a) A factorial HMM with 3 chains. (b) A coupled HMM with 3 chains.
Since the X nodes are observed, the connections between them only change the computation of the local evidence; inference can still be performed using the standard forwards-backwards algorithm. Parameter estimation using EM is also straightforward: the E step is unchanged, as is the M step for the transition matrix. If we assume scalar observations for notational simplicity, the M step involves minimizing
E[\sum_t \frac{1}{2\sigma_{s_t}^2} (y_t - \bar{y}_{t-L:t-1}^T w_{s_t})^2 + \frac{1}{2} \log(2\pi \sigma_{s_t}^2)]    (17.120)

Focusing on the w terms, we see that this requires solving K weighted least squares problems:

J(w_{1:K}) = \sum_j \sum_t \gamma_t(j) (y_t - \bar{y}_{t-L:t-1}^T w_j)^2    (17.121)

where \gamma_t(j) = p(z_t = j | x_{1:T}) is the smoothed posterior marginal. This is a weighted linear regression problem, where the design matrix has a Toeplitz form. This subproblem can be solved efficiently using the Levinson-Durbin method (Durbin and Koopman 2001).
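For instance, each w_j in Equation 17.121 can be obtained in closed form by weighted least squares (a generic numpy sketch that ignores the Toeplitz speedup; the argument names are ours):

    import numpy as np

    def weighted_ls(Y_lag, y, gamma_j):
        """Solve min_w sum_t gamma_t(j) * (y_t - Y_lag[t] @ w)^2.
        Y_lag: (T,L) lagged observations, y: (T,), gamma_j: (T,) weights."""
        W = gamma_j[:, None] * Y_lag
        # Normal equations: (X^T Gamma X) w = X^T Gamma y
        return np.linalg.solve(Y_lag.T @ W, W.T @ y)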
Buried Markov models generalize AR-HMMs by allowing the dependency structure between the observable nodes to change based on the hidden state, as in Figure 17.18(c). Such a model is called a dynamic Bayesian multi-net, since it is a mixture of different networks. In the linear-Gaussian setting, we can change the structure of the x_{t-1} -> x_t arcs by using sparse regression matrices W_j, and we can change the structure of the connections within the components of x_t by using sparse Gaussian graphical models, either directed or undirected. See (Bilmes 2000) for details.
17.6.5 Factorial HMM
An HMM represents the hidden state using a single discrete random variable z_t \in {1, ..., K}. To represent 10 bits of information would require K = 2^{10} = 1024 states. By contrast, consider a distributed representation of the hidden state, where each z_{c,t} \in {0, 1} represents the c'th
bit of the tth hidden state. Now we can represent 10 bits using just 10 binary variables, as illustrated in Figure 17.19a. This model is called a factorial HMM Ghahramani and Jordan 1997. The hope is that this kind of model could capture different aspects of a signal, e.g., one chain would represent speaking style, another the words that are being spoken.
Unfortunately, conditioned on x_t, all the hidden variables are correlated, due to explaining away (the common observed child x_t). This makes exact state estimation intractable. However, we can derive efficient approximate inference algorithms, as we discuss in Section 21.4.1.
17.6.6 Coupled HMM and the influence model
If we have multiple related data streams, we can use a coupled HMM Brand 1996, as illustrated
in Figure 17.19b. This is a series of HMMs where the state transitions depend on the states of
neighboring chains. That is, we represent the joint conditional distribution as
p(z_t | z_{t-1}) = \prod_c p(z_t^c | z_{t-1})    (17.122)
p(z_t^c | z_{t-1}) = p(z_t^c | z_{t-1}^c, z_{t-1}^{c-1}, z_{t-1}^{c+1})    (17.123)
This has been used for various tasks, such as audiovisual speech recognition Nefian et al. 2002 and modeling freeway traffic flows Kwon and Murphy 2000.
The trouble with the above model is that it requires O(C K^4) parameters to specify, if there are C chains with K states per chain, because each state depends on its own past plus the past of its two neighbors. There is a closely related model, known as the influence model (Asavathiratham 2000), which uses fewer parameters. It models the joint conditional distribution as

p(z_t^c | z_{t-1}) = \sum_{c'=1}^{C} \mu(c', c) p(z_t^c | z_{t-1}^{c'})    (17.124)

where \sum_{c'} \mu(c', c) = 1 for each c. That is, we use a convex combination of pairwise transition matrices. The \mu(c', c) parameter specifies how much influence chain c' has on chain c. This model only takes O(C^2 + C K^2) parameters to specify. Furthermore, it allows each chain to be influenced by all the other chains, not just its nearest neighbors. Hence the corresponding graphical model is similar to Figure 17.19(b), except that each node has incoming edges from all the previous nodes. This has been used for various tasks, such as modeling conversational interactions between people (Basu et al. 2001).
Unfortunately, inference in both of these models takes O(T K^{2C}) time, since all the chains become fully correlated, even if the interaction graph is sparse. Various approximate inference methods can be applied, as we discuss later.
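A tiny sketch of the influence-model transition in Equation 17.124 (our own illustration, with hypothetical names):

    import numpy as np

    def chain_transition(mu, A, z_prev, c):
        """p(z_t^c | z_{t-1}) = sum_{c'} mu[c', c] * A[c'][z_prev[c'], :]  (Eq. 17.124).
        mu: (C,C) with columns summing to 1; A[c']: (K,K) pairwise transition matrix."""
        C = len(z_prev)
        return sum(mu[cp, c] * A[cp][z_prev[cp]] for cp in range(C))

    C, K = 3, 4
    rng = np.random.default_rng(1)
    A = [rng.dirichlet(np.ones(K), size=K) for _ in range(C)]  # each row sums to 1
    mu = rng.dirichlet(np.ones(C), size=C).T                   # columns sum to 1
    print(chain_transition(mu, A, z_prev=[0, 2, 1], c=0).sum())  # 1.0: a valid distribution

Because the result is a convex combination of probability vectors, it is automatically a valid distribution over the K states of chain c.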
17.6.7 Dynamic Bayesian networks (DBNs)

A dynamic Bayesian network (DBN) is just a way to represent a stochastic process using a directed graphical model.^7 Note that the network is not dynamic (the structure and parameters are fixed); rather, it is a network representation of a dynamical system.

[Footnote 7: The acronym DBN can stand for either "dynamic Bayesian network" or "deep belief network" (Section 28.1), depending on the context. Geoff Hinton, who invented the term "deep belief network", has suggested the acronyms DyBN and DeeBN to avoid this ambiguity.]
Figure 17.20 The BATnet DBN. The transient nodes are only shown for the second slice, to minimize clutter. The dotted lines can be ignored. Used with kind permission of Daphne Koller.
All of the HMM variants we have seen above could be considered to be DBNs. However, we prefer to reserve the term "DBN" for graph structures that are more irregular and problem-specific. An example is shown in Figure 17.20, which is a DBN designed to monitor the state of a simulated autonomous car, known as the Bayesian Automated Taxi, or "BATmobile" (Forbes et al. 1995).
Defining DBNs is straightforward: you just need to specify the structure of the first time-slice, the structure between two time-slices, and the form of the CPDs. Learning is also easy. The main problem is that exact inference can be computationally expensive, because all the hidden variables become correlated over time; this is known as entanglement (see e.g., (Koller and Friedman 2009, Sec. 15.2.4) for details). Thus a sparse graph does not necessarily result in tractable exact inference. However, later we will see algorithms that can exploit the graph structure for efficient approximate inference.
Exercises
Exercise 17.1 Derivation of Q function for HMM
Derive Equation 17.97.
Exercise 17.2 Two filter approach to smoothing in HMMs
Assuming that \pi_t(i) = p(S_t = i) > 0 for all i and t, derive a recursive algorithm for updating r_t(i) \triangleq p(S_t = i | x_{t+1:T}). Hint: it should be very similar to the standard forwards algorithm, but using a time-reversed transition matrix. Then show how to compute the posterior marginals \gamma_t(i) = p(S_t = i | x_{1:T}) from the backwards filtered messages r_t(i), the forwards filtered messages \alpha_t(i), and the stationary distribution \pi_t(i).
Exercise 17.3 EM for HMMs with mixture of Gaussian observations

Consider an HMM where the observation model has the form

p(x_t | z_t = j, \theta) = \sum_k w_{jk} N(x_t | \mu_{jk}, \Sigma_{jk})    (17.125)

- Draw the DGM.
- Derive the E step.
- Derive the M step.
Exercise 17.4 EM for HMMs with tied mixtures

In many applications, it is common that the observations are high-dimensional vectors (e.g., in speech recognition, x_t is often a vector of cepstral coefficients and their derivatives, so x_t \in R^{39}), so estimating a full covariance matrix for each of the KM mixture components (where M is the number of mixture components per hidden state, as in Exercise 17.3) requires a lot of data. An alternative is to use just M Gaussians, rather than MK Gaussians, and to let the state influence the mixing weights but not the means and covariances. This is called a semi-continuous HMM or tied-mixture HMM.
- Draw the corresponding graphical model.
- Derive the E step.
- Derive the M step.