Gas station without pumps

2015 April 8

Optimization and model fitting went well today

Filed under: Circuits course — gasstationwithoutpumps @ 21:47

Today’s lecture in BME 101 (the Applied Electronics for Bioengineers class) went very smoothly. I started with a little light entertainment: a video about blowing up capacitors, sent to me by Jameco: http://www.jameco.com/Jameco/content/incorrectly-using-capacitors.html This video reinforced the message that I had given them in lab last week about electrolytic capacitors.

I had two topics to cover: the optimization problem from last weekend’s homework, which only one student had managed to do, and model fitting to extract parameters for the thermistor model from the data collected in lab yesterday.

I gave the students the choice of order for the topics, and they wanted the optimization first. The problem is fairly straightforward from an engineering standpoint: we want to maximize the sensitivity of the circuit that converts temperature to voltage at a specific operating temperature T_{opt}. I first sketched very rough plots of what the voltage would look like as a function of temperature: monotone increasing but not particularly linear. I reminded them that sensitivity is the change in output that you get for a given change in input: dV/dT. I coaxed out of the class the idea that to maximize the function dV/dT, we need to take its derivative and set it to 0: \frac{d^2 V }{d T^2} = 0. We only have one free variable in our circuit to change (the load resistor R), so we need to take that equation and solve it for R, to get the value of R that maximizes sensitivity.

I reminded them of the simple model for the resistance of a thermistor that we’d had on Monday: R_{T} = R_{\infty} e^{B/T}, and they had no trouble coming up with the equation for the output voltage V = V_{in} \frac{R}{R+R_{\infty} e^{B/T}}.  I then suggested using Wolfram Alpha to solve the equation, and switched from the chalkboard to the screen to type

solve d^2 (R/(R+S e^(B/T)))/ dT^2 = 0 for R

(changing R_{\infty} to S, to avoid confusing Wolfram Alpha with subscripts).
Wolfram Alpha conveniently replied with two solutions: R=0 (which would result in a constant 0 output voltage, clearly a minimum sensitivity) and R= S e^{B/T} \frac{B-2 T}{B+ 2T}. I pointed out that the first factor was just the resistance of the thermistor at the temperature we were optimizing for, and the second factor scales that down a little.
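
For readers who want to sanity-check the closed-form answer numerically, here is a small Python sketch (not something we did in class). The values B = 3435 K, R at 25°C = 10 kΩ, and T_{opt} = 310 K are just plausible stand-ins for illustration, not numbers from the lecture:

import numpy as np

# Illustrative, assumed values: a 10 kOhm (at 25 C) thermistor with B = 3435 K,
# optimized for an operating temperature of 310 K.
B = 3435.0                              # thermistor B value, in kelvin
S = 10e3 * np.exp(-B / 298.15)          # R_infinity, from R = 10 kOhm at 25 C
T = 310.0                               # operating temperature, in kelvin
R = np.linspace(1.0, 50e3, 200_000)     # candidate load resistors, in ohms

r = S * np.exp(B / T)                   # thermistor resistance at T
dVdT = R * B * r / (T**2 * (R + r)**2)  # sensitivity dV/dT, divided by V_in
print(R[np.argmax(dVdT)])               # numerically best load resistor
print(r * (B - 2*T) / (B + 2*T))        # closed-form optimum from Wolfram Alpha

The two printed values should agree to within the resolution of the sweep.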

We then moved on to fitting the model. I showed three data sets that students had sent me—one which was a little messy, but still quite usable, one that was a little better, and one that was really beautiful.  I continued with the really good data set.  I explained that gnuplot tries to minimize the error on the y-axis for a function of things on the x-axis, and the students decided that minimizing the temperature error for a given resistance was probably best here. (I suspect it doesn’t make much difference with this data, but it is good to get them to think about that as a decision that needs to be made for each modeling problem.)

On the board, I turned around the equation for the model, to get temperature in terms of R: T = \frac{B}{\ln R - \ln R_{\infty}}.

I then developed the gnuplot script for the problem live, debugging as I went.  Some of the errors to debug were ones I introduced deliberately, others were inadvertent, but all were good for teaching both the process of debugging and the notion of doing sanity checks.  My first mistake was an inadvertent one: I capitalized a variable differently in the parameter list and in the function body, and gnuplot is case-sensitive.  This was very easy to find and fix, and gave me an opportunity to tell students that gnuplot was sensitive to case, since that had not come up previously. My second mistake was a deliberate one: I typed in the model exactly as we had derived it on the board, knowing full well that the model was for °K, but the data was for °C.  I’d even reminded students of that earlier, when we were doing the optimization problem.

I then ran the fit in gnuplot and asked students if we were done, or if there was a sanity check we could do.  A couple of them asked if we could plot the model and the data on the same plot, so I did that. The fit was very obviously completely wrong. So I asked the students what was wrong and how we could fix it. I did dice-assisted cold calling to put 3 or 4 students on the spot, getting not very useful answers, then took answers from a couple of students who raised their hands. The second student pointed out the °C vs °K problem.

So I fixed the formula and ran the fit again, expecting everything to work fine. It didn’t! So we went into debugging mode, trying to see if starting with better estimates of B and R_{\infty} would help. Nope. Then I realized that I had made a second inadvertent error: I’d put in 273.15 with the wrong sign! I fixed that and ran the fit again, and again things failed (producing negative values for R_{\infty}). This time, though, when I put in better initial estimates, everything converged to a very good fit, which gave me a teachable moment about the need for decent initial estimates, so that the optimization algorithm doing the fitting can converge.

The final script was

temp(R,B,Rinf) = B / ( log(R) - log(Rinf)) - 273.15   # model solved for T, converted to Celsius
B=3000; Rinf=1e-4       # rough initial estimates for the fit
fit temp(x,B,Rinf) 'student-thermistor-data-3.txt' using 2:1 via Rinf,B
plot 'student-thermistor-data-3.txt' using 2:1, \
      temp(x,3435,10*exp(-3435/(25+273.15))), \
      temp(x,B,Rinf)

The second curve of the plot uses the data sheet values for the B value and for R25°C=10kΩ.
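
For reference, those data sheet values work out to R_{\infty} = 10 kΩ × e^{-3435/(25+273.15)} ≈ 1×10^{-4} kΩ (about 0.1 Ω), which is also consistent with the initial estimate Rinf=1e-4 in the script, assuming the resistances in the data file were recorded in kΩ (the second curve in the plot command implies the same units; the data file itself isn’t shown here).
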
Here is the plot produced:

Notice that I did not have axis labels, title, nor echoing of the parameters in the key. We covered that last week and I’m hoping that students can carry skills over from one week to the next.

The model fit is slightly better than the data sheet values, but the data sheet values were closer than I had expected based on previous years’ fits. I think that it may be that the students had a properly calibrated thermometer (I’d removed any digital thermometers that reported ice water as warmer than 0.5°C—about 1/3 of them were that far out of calibration), and that this particular pair of students had been very careful with their measurements.

2010 November 26

Lagrangian Multipliers and HMMs

Filed under: Uncategorized — gasstationwithoutpumps @ 00:00

Where in an undergraduate STEM major is the technique of Lagrangian multipliers for solving constrained optimization problems taught?

Doing a search for it on my campus found a few classes, all of which are upper-division or graduate classes: MATH 140 (Industrial Math), Econ 100A (Intermediate Microeconomics), Econ 205C (Advanced Macroeconomic Theory III), ISM 206 (Optimization Theory and Applications), Physics 105 (Classical Mechanics), CMPS 242 (Machine Learning), CMPS 290C (Advanced Topics in Machine Learning), BME 205 (Bioinformatics: Models and Algorithms). Students in my class had a vague recollection of having heard of Lagrangian multipliers, but had never actually used them (except for a couple of exchange students from Europe).

My personal belief is that such a fundamental, useful technique should be included in the lower-division multi-variable differential calculus class. Now, this isn’t a curmudgeonly complaint about how students nowadays don’t learn things the way we did, back in the good old days of slide rules and tables of logarithms (and, yes, I am just barely old enough to have learned with those tools). I didn’t learn about Lagrangian multipliers until I was a professor (having completed a B.S. and M.S. in math and a Ph.D. in computer science).  I was introduced to the method by an engineering grad student (from India, if I remember correctly), who had had a much more practical math education than mine.  In fact, most of my practical math education came either from my father or was self-taught.  Very little of my math education in school or college turned out to have much lasting value to me, other than allowing me the confidence to learn on my own the math I actually needed.

For those of you completely unfamiliar with Lagrangian multipliers, I recommend the Wikipedia page, Lagrangian multipliers. For bioinformatics students, I will try to present one example, using the technique to derive the Baum-Welch algorithm, which is the expectation-maximization algorithm used for training hidden Markov models (HMMs). These are the notes for one lecture in my graduate class in bioinformatics, coming after we have developed the forward and backward algorithms for HMMs. For those who want to learn this stuff on their own, I recommend Chapter 3 of Biological Sequence Analysis by Durbin, Eddy, Krogh, and Mitchison, though I don’t remember them using Lagrangian multipliers.

[I’m presenting these notes in my blog for two reasons: 1) I wanted to plead for more teachers to teach Lagrangian multipliers, and 2) my presentation of it in class sucked.  In my defense, I had only had four hours of sleep the night before, and live-action math requires more alertness than I had with so much sleep debt.]

Our goal is to choose the parameters of the hidden Markov model to maximize the probability of the HMM generating a particular observed set of data.  The EM (Expectation Maximization) algorithm is an iterative approximation algorithm.  It starts with a setting of the parameters, then chooses a new setting of parameters that increases the probability, and repeats.  What we are working on here is doing a single step: given a setting of the parameters, how do we find a better setting?

The parameters are e_S(c), the probability of emitting character c in state S, and a_{ST}, the probability of making a transition to state T from state S (I’m using the notation of the book here, though it is awkward to mix subscript and function notation, and it would have been cleaner to use e_{Sc}). It is common to refer to the parameters as a vector \Theta, so that we can concisely refer to functions of \Theta. We have the constraints that \sum_c e_S(c) = 1 and \sum_T a_{ST} = 1, that is, that the emission probabilities for a state S sum to one, and so do the transition probabilities to a new state. (There are also constraints that all probabilities are non-negative, but those constraints don’t need to be explicitly represented in solving the problem, as the solution we’ll come up with automatically satisfies that.)
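
To make the notation concrete, here is a tiny, purely illustrative setup in Python (mine, not from the lecture) of the parameters for a 2-state HMM over a 4-letter alphabet; the variable names n_states, n_chars, e, and a are invented for this sketch, and each row is normalized so that the constraints hold:

import numpy as np

# Toy, made-up parameters: 2 states, 4 emission characters.
n_states, n_chars = 2, 4
rng = np.random.default_rng(0)
e = rng.random((n_states, n_chars))
e /= e.sum(axis=1, keepdims=True)     # sum_c e_S(c) = 1 for every state S
a = rng.random((n_states, n_states))
a /= a.sum(axis=1, keepdims=True)     # sum_T a_{ST} = 1 for every state S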

The probability of a sequence X_1, \ldots , X_n is the sum over all paths of length n through the HMM of the probability of taking that path and emitting X, that is P_\Theta(X) = \sum_{\pi_1, \ldots, \pi_n} \prod_{i=1}^n a_{\pi_{i-1}\pi_{i}} e_{\pi_i}(X_i). The products are a bit messy to take derivatives of, because some of the parameters may be repeated, but we can simplify them by grouping together repeated factors:

P_\Theta(X) = \sum_{\pi_1, \ldots, \pi_n} \prod_{c,S,T} a_{ST}^{n_{ST}} (e_{T}(c))^{m_{Tc}},

where n_{ST} is the number of times \pi_{i-1}=S and \pi_{i}=T along a particular path and m_{Tc} is the number of times \pi_{i}=T and X_i=c.  Both of these new count variables are specific to a particular path, so should really have the \pi vector as subscripts, but we’ve got too many subscripts already.

If we try to maximize this function without paying attention to the constraints, we find that all the parameters go to infinity—making the parameters bigger makes the product bigger. To keep the constraints relevant, we use Lagrangian multipliers, adding one new variable for each constraint.  I’ll use \lambda_S for the constraints on the emission table of state S and \lambda^\prime_S for the constraints on the transitions out of state S.  The objective function to take derivatives of is thus

F(X,\Theta) = \sum_{\pi_1, \ldots, \pi_n} \prod_{c,S,T} a_{ST}^{n_{ST}} (e_{T}(c))^{m_{Tc}} + \sum_S \lambda_S (1-\sum_c e_S(c)) + \sum_S \lambda^\prime_S(1-\sum_T a_{ST}) ~.

Let’s look at the partial derivatives of our objective function with respect to the emission parameters:

\frac{\partial F(X,\Theta)}{\partial e_R(d)} = \sum_{\pi_1, \ldots, \pi_n} \frac{m_{Rd}}{e_R(d)} \prod_{c,S,T} a_{ST}^{n_{ST}} (e_T(c))^{m_{Tc}} - \lambda_R~.

If the partials are set to zero, we get

\lambda_R e_R(d) = \sum_{\pi_1, \ldots, \pi_n} m_{Rd} \prod_{c,S,T} a_{ST}^{n_{ST}} (e_{T}(c))^{m_{Tc}}~.

The sum over all paths formula is the expected number of times that state R produces character d. Note that this happens only for positions i where X_i=d, so we can rewrite it as the sum over all such positions of the probability of being in state R at that position, which is precisely what the forward-backward algorithm computed for us:

\lambda_R e_R(d) = \sum_{i: X_i=d} f(i,R)b(i,R)~.

We can now impose the constraint \sum_d e_R(d) = 1, by adding all these equations:

\lambda_R= \sum_{d} \sum_{i: X_i=d} f(i,R) b(i,R) ~.

Finally we get the new value of the emission parameters:

e_R(d) = \frac{\sum_{i: X_i=d} f(i,R)b(i,R)}{\sum_{d} \sum_{i: X_i=d} f(i,R)b(i,R)} ~.
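
In code, this emission update is just a weighted counting pass over the sequence. Here is a minimal sketch (mine, not from the lecture), assuming f and b are the forward and backward matrices indexed by [position, state] as in the formulas above, and X is the observed sequence encoded as integers; since the update is a ratio, any common scaling of f and b (such as the overall probability P_\Theta(X)) cancels out:

import numpy as np

def update_emissions(f, b, X, n_states, n_chars):
    # f(i,R)*b(i,R) is proportional to the probability of being in state R
    # at position i, so accumulate it into the count for character X_i = c.
    e_new = np.zeros((n_states, n_chars))
    for i, c in enumerate(X):
        e_new[:, c] += f[i, :] * b[i, :]
    # Dividing each row by its sum imposes sum_d e_R(d) = 1; the row sum is
    # the Lagrangian multiplier lambda_R from the derivation above.
    return e_new / e_new.sum(axis=1, keepdims=True)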

We can do a very similar series of operations to get update equations for the transition parameters also:

a_{ST} = \frac{\sum_{i} f(i,S)a_{ST}e_T(X_i)b(i+1,T)}{\sum_{R} \sum_{i} f(i,S)a_{SR}e_R(X_i)b(i+1,R)}~.
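
A matching sketch for the transition update, written to follow the indexing exactly as in the formula above (some presentations, e.g. Durbin et al., pair the emission with X_{i+1} rather than X_i; which is right depends on exactly how the backward values are defined). Here a and e are the current transition and emission tables, and f, b, X are as in the previous sketch:

import numpy as np

def update_transitions(f, b, X, a, e):
    new_a = np.zeros_like(a)
    for i in range(len(X) - 1):
        c = X[i]
        # Accumulate f(i,S) * a_ST * e_T(X_i) * b(i+1,T) into entry (S,T).
        new_a += np.outer(f[i, :], e[:, c] * b[i + 1, :]) * a
    # Row-normalizing imposes sum_T a_ST = 1 (the lambda'_S constraint).
    return new_a / new_a.sum(axis=1, keepdims=True)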

Note that the optimization we did uses the parameters in the forward-backward algorithm, so we haven’t really found “the optimum”, just a better setting of the parameters; we therefore need to iterate the whole process, recomputing the forward-backward matrices after resetting all the parameters.
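
Put together, the outer EM loop might look like the following sketch. Here forward_backward is just a stand-in name for whatever routine computes the f and b matrices for the current parameters (it is not defined in this post), and X, a, e, and the update functions are as in the sketches above; the iteration count is arbitrary, since a real implementation would watch the change in log-likelihood instead:

for _ in range(20):                                  # illustrative iteration count
    f, b = forward_backward(X, a, e)                 # expectations under the current parameters
    new_e = update_emissions(f, b, X, a.shape[0], e.shape[1])
    new_a = update_transitions(f, b, X, a, e)        # uses the old a and e, as in the formulas
    e, a = new_e, new_a                              # reset the parameters, then repeat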
