Gas station without pumps

2010 November 26

Lagrangian Multipliers and HMMs

Filed under: Uncategorized — gasstationwithoutpumps @ 00:00
Tags: , , , ,

Where in an undergraduate STEM major is the technique of Lagrangian multipliers for solving constrained optimization problems taught?

Doing a search for it on my campus found a few classes, all of which are upper-division or graduate classes: MATH 140 (Industrial Math), Econ 100 A(Intermediate Microeconomics), Econ 205C (Advanced Macroeconomic Theory III), ISM 206 (Optimization Theory and Applications), Physics 105(Classical Mechanics), CMPS 242 (Machine Learning), CMPS 290C (Advanced Topics in Machine Learning), BME 205 (Bioinformatics: Models and Algorithms). Students in my class had a vague recollection of having heard of Lagrangian multipliers, but had never actually used them (except for a couple of exchange students from Europe).

My personal belief is that such a fundamental, useful technique should be included in the lower-division multi-variable differential calculus class. Now, this isn’t a curmudgeonly complaint about how students nowadays don’t learn things the way we did, back in the good old days of slide rules and tables of logarithms (and, yes, I am just barely old enough to have learned with those tools). I didn’t learn about Lagrangian multipliers until I was a professor (having completed a B.S. and M.S. in math and a Ph.D. in computer science).  I was introduced to the method by an engineering grad student (from India, if I remember correctly), who had had a much more practical math education than mine.  In fact, most of my practical math education came either from my father or was self-taught.  Very little of my math education in school or college turned out to have much lasting value to me, other than allowing me the confidence to learn on my own the math I actually needed.

For those of you completely unfamiliar with Lagrangian multipliers, I recommend the Wikipedia page, Lagrangian multipliers. For bioinformatics students, I will try to present one example, using the technique to derive the Baum-Welch algorithm, which is the expectation-maximization algorithm used for training hidden Markov models (HMMs). These are the notes for one lecture in my graduate class in bioinformatics, coming after we have developed the forward and backward algorithms for HMMs. For those who want to learn this stuff on their own, I recommend Chapter 3 of Biological Sequence Analysis by Durbin, Eddy, Krogh, and Mitchison, though I don’t remember them using Lagrangian multipliers.

[I’m presenting these notes in my blog for two reasons: 1) I wanted to plead for more teachers to teach Lagrangian multipliers, and 2) my presentation of it in class sucked.  In my defense, I had only had four hours of sleep the night before, and live-action math requires more alertness than I had with so much sleep debt.]

Our goal is to choose the parameters of the hidden Markov model to maximize the probability of the HMM generating a particular observed set of data.  The EM (Expectation Maximization) algorithm is an iterative approximation algorithm.  It starts with a setting of the parameters, then chooses a new setting of parameters that increases the probability, and repeats.  What we are working on here is doing a single step: given a setting of the parameters, how do we find a better setting?

The parameters are e_S(c), the probability of emitting character c in state S, and a_{ST}, the probability of making a transition to state T from state S (I’m using the notation of the book here, though it is awkward to mix subscript and function notation, and would have been cleaner to use e_{Sc}). It is common to refer to the parameters as a vector \Theta, so that we can concisely refer to functions of $\latex \Theta$. We have the constraints that \sum_c e_S(c) = 1 and $\latex \sum_T a_{ST} = 1$, that is that emission probabilities for a state S sum to one, and so do the transition probabilities to a new state.  (There are also constraints that all probabilities are non-negative, but those constraints don’t need to be explicitly represented in solving the problem, as the solution we’ll come up with automatically satisfies that.)

The probability of a sequence X_1, \ldots , X_n is the sum over all paths of length n through the HMM of the probability of taking that path and emitting X, that is P_\Theta(X) = \sum_{\pi_1, \ldots, \pi_n} \prod_{i=1}^n a_{\pi_{i-1}\pi{i}} e_{\pi_i}(X_i). The products are  a bit messy to take derivatives of, because some of the parameters may be repeated,  but we can simplify them by grouping together repeated factors:

P_\Theta(X) = \sum_{\pi_1, \ldots, \pi_n} \prod_{c,S,T} a_{ST}^{n_{ST}} (e_{T}(c))^{m_{Tc}},

where n_{ST} is the number of times \pi_{i-1}=S and \pi_{i}=T along a particular path and m_{Tc} is the number of times \pi_{i}=T and X_i=c.  Both of these new count variables are specific to a particular path, so should really have the \pi vector as subscripts, but we’ve got too many subscripts already.

If we try to maximize this function without paying attention to the constraints, we find that all the parameters go to infinity—making the parameters bigger makes the product bigger. To keep the constraints relevant, we use Lagrangian multipliers, adding one new variable for each constraint.  I’ll use \lambda_S for the constraints on the emission table of state S and \lambda^\prime_S for the constraints on the transitions out of state S.  The objective function to take derivatives of is thus

F(X,\Theta) = \sum_{\pi_1, \ldots, \pi_n} \prod_{c,S,T} a_{ST}^{n_{ST}} (e_{T}(c))^{m_{Tc}} + \sum_S \lambda_S (1-\sum_c e_S(c)) + \sum_S \lambda^\prime_S(1-\sum_T a_{ST}) ~.

Let’s look at the partial derivatives of our objective function with respect to the emission parameters:

\frac{\partial F(X,\Theta)}{\partial e_R(d)} = \sum_{\pi_1, \ldots, \pi_n} \frac{m_{Rd}}{e_R(d)} \prod_{c,S,T} a_{ST}^{n_{ST}} (e_T(c))^{m_{Tc}} - \lambda_R~.

If the partials are set to zero, we get

\lambda_R e_R(d) = \sum_{\pi_1, \ldots, \pi_n} m_{Rd} \prod_{c,S,T} a_{ST}^{n_{ST}} (e_{T}(c))^{m_{Tc}}~.

The sum over all paths formula is the expected count of the number of times that state R produces character $d$. Note that this happens only  for positions i where X_i=d, so we can rewrite it as the sum over all such positions of the probability of being in state R at that position, which is precisely what the forward-backward algorithm computed for us:

\lambda_R e_R(d) = \sum_{i: X_i=d} f(i,R)b(i,R)~.

We can now impose the constraint \sum_d e_R(d) = 1, by adding all these equations:

\lambda_R= \sum_{d} \sum_{i: X_i=d} f(i,R) b(i,R) ~.

Finally we get the new value of the emission parameters:

e_R(d) = \frac{\sum_{i: X_i=d} f(i,R)b(i,R)}{\sum_{d} \sum_{i: X_i=d} f(i,R)b(i,R)} ~.

We can do a very similar series of operations to get update equations for the transition parameters also:

a_{ST} = \frac{\sum_{i} f(i,S)a_{ST}e_T(X_i)b(i+1,T)}{\sum_{R} \sum_{i} f(i,S)a_{SR}e_R(X_i)b(i+1,R)}~.

Note that the optimization we did uses the parameters in the forward-backward algorithm, so we haven’t really found “the optimum”, just a better setting of the parameters, so we need to iterate the whole process, recomputing the forward-backward matrices after resetting all the parameters.

Leave a Comment »

No comments yet.

RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: