Gas station without pumps

2014 August 19

Two papers submitted

I got two papers with a co-author finished and submitted today.

One of them is based on the segmenter algorithms that I wrote up in this blog (with some improved parameterization): Segmenting noisy signals from nanopores, Segmenting filtered signals, and Segmenting noisy signals revisited.

The other one is an HMM paper that uses the automatic segmentation and an HMM that models the DNA motor behavior to do the same analysis that was done by hand in an earlier paper “Error Rates for Nanopore Discrimination Among Cytosine, Methylcytosine, and Hydroxymethylcytosine Along Individual DNA Strands” in PNAS.  The automatic HMM method was more accurate than the hand curation and feature extraction followed by sophisticated machine learning methods like support vector machines and random forests—I think that we were limited before by the biases of the hand curation and feature extraction.

Unfortunately, we were not able to do side-by-side comparison on exactly the same data, as the original data for the PNAS paper was lost in a 2-drive failure on a RAID that had not been properly backed up.

The paper writing did not take much of my time this summer, as my co-author did a lot of the writing and kept me from procrastinating too much.

I’ve also been working on my book, tentatively titled Applied Electronics for Bioengineers, but without a co-author to keep me honest, I’ve been doing a lot of procrastinating.  Yesterday I had planned to show the waveform of the gate signal on a nFET being used for pulse-width modulation, to show the “flat spot” that results from feedback through the gate-drain capacitance.  I fussed with that for quite a while, but never got a recording I liked the looks of—so ended up getting nothing written on the book.  I’ll probably try again tomorrow, but with a backup plan of working on a different section if that one gets too frustrating, or of writing some of the text that will go around the figure, once I get a figure I can use.

The book is currently 215 pages, but that includes the front matter and back matter.  The core of the book is only about 188 pages, with 74 figures.  I probably need a bunch more figures (schematics, graphs, photos, …), but those are what take the longest to produce.  There’s a little over a month left until classes start, but I don’t see myself finishing by then.

2014 June 17

Segmenting noisy signals revisited

This is the fourth in a series—the three previous are

A student (well, junior specialist this year, but student again in the fall) and I were trying to write up a journal paper describing the method and implementation, and I realized that the presentation in this blog was not clear enough for him to rewrite the math, and that I had left in some bits of fluff that were confusing. In this blog post I’ll try to merge the three previous posts into a single, coherent presentation using a Bayesian approach.

The segmenter uses a divide-and-conquer approach:

segment( time interval=[s,t) )
    if the time t-s is too short, return no breakpoints.
    Select the best breakpoint, i, to divide into two intervals [s,i) [i,t)
    If the breakpoint is poor, reject it and return no breakpoints
       otherwise return segment(s,i) + [i] + segment(i,t)

There are two essential parts to this algorithm: selecting the breakpoint and deciding whether to accept it.  Both rely on a Bayesian scoring scheme.  For the score, we’ll need two likelihood models, that is computable functions that compute the probability of seeing the observed data if it is generated by the corresponding model.  To score a potential breakpoint, i, we compare a model with the breakpoint (so that [s,i) and [i,t) are modeled independently) to one where the entire interval is modeled as a unit.

The simplest model, and one that has worked well for DNA data from the nanopore, is that each segment consists of a constant current with independent Gaussian noise added to each sample.  The likelihood for the data x_{s}, \ldots , x_{t-1} is just \prod_{s\leq j < t} P_{\mu,\sigma}(x_{j}), where µ and σ are the mean and standard deviation of the Gaussian generating the samples in [s,t). I’ll abbreviate this product as P_{\mu,\sigma}(s:t).

The log of the likelihood ratio for a breakpoint at position is L_{i}=\ln \frac{P_{\mu_{1},\sigma_{1}}(s:i) P_{\mu_{2},\sigma_{2}}(i:t)}{P_{\mu_{0},\sigma_{0}}(s:t)} = \ln P_{\mu_{1},\sigma_{1}}(s:i) +\ln P_{\mu_{2},\sigma_{2}}(i:t)- \ln P_{\mu_{0},\sigma_{0}}(s:t),

where µ0 and σ0 are the mean and standard deviation for the whole window from s to t, µ1 and σ1 are the mean and standard deviation for s to i, and µ2 and σ3 are the mean and standard deviation for i to t.  Because we are using maximum likelihood estimates for all the µ and σ values, we are guaranteed that the log-likelihood Li is non-negative.

The log-probability for a single sample with a Gaussian distribution is
\ln P_{\mu,\sigma}(x) = \ln \left( \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \right) = -0.5 \ln(2\pi) -\ln(\sigma) - \frac{(x-\mu)^2}{2\sigma^2}
When we chose the optimal μ and σ for a segment, we did it so that σ2 is the variance, that is the average value of (x-\mu)^2. So when we sum the log probabilities over a segment of length n, using μ and σ optimized for that segment, we get n( -0.5 \ln(2\pi) -\ln(\sigma) - 0.5).

That means that the log likelihood ratio we want simplifies to L_{i} = (s-i)\ln \sigma_{1} + (i-t)\ln \sigma_{2} - (s-t)\ln \sigma_{0}. This simplification works for any modeling that results in independent Gaussian residuals, so we can do stepwise sloping segments by doing linear regression almost as easily as doing stepwise constant segments.

Choosing the best breakpoint is now easy—we just pick the value for i that maximizes Li.

Deciding whether to keep or reject the breakpoint requires looking not just at the likelihood ratio, but at the posterior-odds ratio, which is the likelihood ratio times the prior odds ratio. The prior odds ratio is a parameter we can set independent of the data. For example, if we have an estimate of the expected number of segments per second (sps) for a particular experimental setup and our sampling frequency is fs, then the prior probability of a step at each sample is \mbox{\it sps}/f_{s}, the prior odds ratio is \frac{ \mbox{\it sps} }{f_{s}-\mbox{\it sps}}, and the score we want to use (log posterior odds ratio) is L_{i} + \ln \mbox{\it sps} - \ln (f_{s}-\mbox{\it sps}).

If we are willing to accept no more than FP false positives per second, then we want the probability of a score arising by chance at a sample to be less than \mbox{FP}/ f_{s}

Although I have no proof for it, if the samples are all drawn independently from a Gaussian distribution, empirically the log-likelihood scores seem to be perfectly distributed on an exponential distribution P(L_{i} > x) = e^{-x}. Note that this distribution does not depend on the length of the window, μ, or σ, though I did avoid looking at the first and last few positions in the window, in case there are edge effects.

However, if we low-pass filter the Gaussian noise, then the samples are no longer independent, and the log-likelihood scores go up. The distribution with a cutoff frequency of k times the Nyquist frequency is empirically P(L_{i} > x) = e^{-k x}. (The Nyquist frequency is half the sampling frequency and is the highest frequency that can be represented by the samples.)

If these empirically observed distributions are correct, then we have
\frac{\mbox{FP}}{f_{s}} > P(\mbox{score} > x) = P\left(L_{i} > x \right) -\ln \mbox{\it sps} + \ln (f_{s}-\mbox{\it sps})= e^{-k x} \left(\frac{\mbox{\it sps} }{f_{s}-\mbox{\it sps}}\right)^{k}
that is
x > \ln \frac{\mbox{\it sps}}{f_{s}-\mbox{\it sps}} - \frac{1}{k}\ln\left(\frac{\mbox{FP}}{f_{s}}\right).

This is the same result as in More on segmenting noisy signals and Segmenting filtered signals, but there I made the mistake of changing parameterization from FP (false positives per second) to \mbox{\it over} = \frac{\mbox{FP}}{\mbox{\it sps}} (false positives per expected segment), which makes the parameterization more confusing, and I was not clear about where the scaling went for the filtering.

On the example I’ve been checking, I get good results with sps=10 segments/sec and FP=1e-4 false positives/sec, but neither parameter is very critical—changing them by a factor of 1000 does not change the results much.  Dropping FP to 1e-200 does cause segments to be obviously missed.

2014 May 24

Segmenting filtered signals

Last August, I posted on Segmenting noisy signals from nanopores, and in January I updated that with improved parameterization in More on segmenting noisy signals.  Last week, Jacob Schreiber (the programmer for the nanopore group) found some problems with the method. He had re-implemented the method in Cython (rather than Python, as I had prototyped it), and got the same results as me when applied to the same examples I had used. But when he applied it to some new data, he got serious oversegmentation, with hundred of segments chosen where only one should have been.

I tracked down the problem to a difference in how we were applying the method.  I was applying it to data that had been low-pass filtered at 5kHz, recorded at a sampling frequency of 100kHz, then downsampled to 10kHz.  He took the same data without downsampling, then further filtered it with a low-pass filter of 1kHz or 20kHz.  I confirmed that the same data would be correctly segmented with my downsampling and severely oversampled with his extra filtering.  He conjectured that the problem was with the sampling frequency and that the problem could be fixed by just changing the desired gain in log Prob based on sampling frequency.

His argument didn’t seem quite right, as the underlying math was not dependent on sampling frequency.  What it did assume was that the samples within a segment were independent Gaussian-distributed values (more correctly, that the difference between the samples and the model being fitted was Gaussian).  Putting in a low-pass filter removes the independence assumption.  Indeed, with a very low cutoff frequency you expect to see the values changing only slowly, so the segmenter would see it as highly desirable to chop the signal up into short segments, each of which has low variance.

I confirmed that I could get serious oversegmentation even with very low sampling frequency, if the filter cutoff was set much lower.  I got reasonable segmentation results with only minor variations across a wide range of sampling frequencies if the filter cutoff frequency was close to the Nyquist frequency (half the sampling frequency), but oversegmentation if I filtered to 0.1 Nyquist frequency.

There are two ways to address the problem:

  • When filtering, always resample at twice the cutoff frequency.
  • Adjust the thresholds used according to the cutoff frequency and sampling frequency.

Adjusting the thresholds is a cleaner approach, if it can be made to work.  Today I I tried filtering Gaussian noise with a 5th-order Bessel filter (implemented by scipy.signal.lfilter) and looking at the distribution of the log-odds that we threshold:

L_{i}=\log \frac{P_{\theta_{1}}(s:i) P_{\theta_{2}}(i:t)}{P_{\theta_{0}}(s:e)} = \log P_{\theta_{1}}(s:i) + \log P_{\theta_{2}}(i:t) - \log P_{\theta_{0}}(s:t),

where i is the breakpoint, θ1 and θ2 are the parameters for the two segments, and θ0 is the parameter setting for the whole segment.  For stepwise Gaussian models this simplifies to L_{i} = (s-i)\ln \sigma_{1} + (i-t)\ln \sigma_{2} - (s-t)\ln \sigma_{0}.   (To avoid the extra computation of taking square roots, the segmenter actually computes twice this, using variances instead of standard deviations.)

I found that the distributions were still very well fit by exponential distributions, but that the scaling factor changed with the cutoff frequency.   Furthermore, the log-odds seemed to scale linearly with the ratio of Nyquist frequency to cutoff frequency.  Reversing that scaling makes the curves superimpose nicely:

    If I scale the log-odds by filter cutoff frequency as a fraction of the Nyquist frequency, the distributions seem to be almost identical. Unfiltered Gaussians have slightly lower log-odds values. Filtering, then downsampling by averaging blocks of 10 samples has slightly higher log-odds, probably because the averaging does a little more low-pass filtering.

If I scale the log-odds by filter cutoff frequency as a fraction of the Nyquist frequency, the distributions seem to be almost identical. Unfiltered Gaussians have slightly lower log-odds values. Filtering, then downsampling by averaging blocks of 10 samples has slightly higher log-odds, probably because the averaging does a little more low-pass filtering.

Each curve of the plot was computed from only 1000 sequences of 10000 samples, so the results are vary a lot in the tail, due to not having enough data to get the tail exactly.

Because the segmentation method is not very sensitive to the threshold set, simply scaling thresholds by Nyquist frequency/cutoff frequency should work well.   The default parameters had been set for the downsampled 5kHz, so to match the old results in signals  that aren’t downsampled, the thresholds should probably be scaled by about 1.1*Nyquist frequency/cutoff frequency.  (The analog filter cutoff frequency is probably in the metatdata in the data files, but the abf format is messy enough that I’ve not tried to extract it.  The lower cutoff frequency of the analog filter or a subsequent digital filter should be used.)


2013 February 21

Sampling lab went ok

Filed under: Circuits course,Data acquisition — gasstationwithoutpumps @ 21:14
Tags: , , , , ,

Today’s sampling and aliasing lab was one I expected to go fairly quickly, but it took longer than I thought.  The students had two design tasks and then a bunch of observations. The design tasks were supposed to have been easy ones that they did as a prelab, but everyone took a bit longer than I thought, and some really struggled with them.

The first design task was to design a high-pass filter to do level shifting of a signal from a signal generator (the signal generator is capable of being set to center at a voltage other than 0v, but I needed students to practice level shifting before they do the class-D audio amplifier and the EKG labs).  I gave them a very low corner frequency (0.03Hz).  Students didn’t have trouble with the RC time constant (mostly), but they did have trouble with the notion of using a voltage divider as a Thévenin equivalent of a resistor to the desired center voltage, though we had just done that yesterday in class in analyzing the do-now problems.  I think that they all got it in the end, but I’ll definitely have to consider including some sort of level-shifting question on the quiz.

They looked at the signal output from the high-pass filter using the scope (to make sure that the voltage range was appropriate), then hooked it up to the Arduino and ran the DataLogger code.  I had them run the signal into two pins, with one sampling at 40Hz or 50Hz and the other at 1/5 that (8Hz or 10Hz), and then look at various frequencies.  I may have to specify some specific frequencies for them to look at next year, since they tended to pick simple multiples of 1Hz, which does not reveal some of the interesting beat patterns that you get at 4.9Hz and 5.1Hz.  The DataLogger code worked quite well for this application, though one student managed to tickle an error message by leaving the down-sampling field blank (it should probably default to 1 in that case, rather than reporting an error).  One could do all the visualization with a purely software simulation lab, but the students learned a fair amount by designing and wiring the RC filters, as well as getting more experience with the oscilloscopes and function generator.

The second design task was to design a low-pass filter with a corner frequency of 4Hz.  For this one, most of them chose to do a 2.5v virtual ground with an op-amp circuit, though there was no need, since the capacitor blocks any DC and so could have been connected directly to ground. Using a virtual ground actually makes it harder to use the electrolytic capacitors without reverse biasing them.  This may get to be important in the LC filter for the class-D amplifier, so I’ll probably have to talk about proper biasing of electrolytic capacitors in class.

I did do the strobe demo at the beginning of lab time, but it was not as good a demo as I had hoped to do.  I’ll have to think of ways to improve it for next year.  Problems included that the strobe light was not bright enough (you can’t turn off all the lights in the lab) and that the spinning paper propeller did not have an adjustable speed, so I couldn’t match the propeller to the strobe (just the strobe to the propeller).  Perhaps I need to choose a better moving object next year, where the strobe light will have a more obvious relationship to the sampling of sine waves in the rest of the lab.

Tomorrow I’ll need to start teaching about instrumentation amps, and get the students to choose lab partners to work with over the weekend, so that they can come in with questions on Monday, since the first instrumentation amp lab is likely to cause them problems.



2013 February 19

Pressure-sensor lab handout written

Filed under: Circuits course — gasstationwithoutpumps @ 22:32
Tags: , , , ,

In All weekend and handouts still not written, I complained

I spent all day Saturday and Sunday (except for a few hours grading) working on the lab handout for the class-D power amplifier lab.  It was going great on Saturday, but I decided to include a portion of the lab on doing an LC output filter, rather than just directly connecting the speaker between ground and the FETs of the output stage as I had originally planned.  That opened a big can of worms.

and I concluded with

Since I need to do more experimenting with the class-D design to find the problems before handing it over to the students, I’ve moved that lab from week 8 to week 9, giving me another week to finish the handout (which is already 12 pages long, and not finished yet).  That means that I have 2 days to write the new lab 8: the instrumentation amplifier for the strain-gauge pressure sensor.  I’ve built and tested such an amplifier (even demoed it on the first day of class), so there aren’t any surprises waiting for me, but I have to write up not only an explanation of instrumentation amps, but also how to do layout for the protoboard.

Hmm, I just realized that I don’t have any posts on the blog for the rev3.0 instrumentation amp protoboard that I designed using the MCP6004 quad op-amp chip instead of the MCP6002 dual op-amp chip (still with the INA126P instrumentation amp chip).  My old test of the pressure sensor lab was with the old protoboard. Maybe I need to redo the pressure-sensor lab with the new board, to make sure that there aren’t any problems with the PC board design.

So I’ve got today and tomorrow to redo the pressure-sensor lab and write up the lab handout, plus finish the grading and figure out how I’m going to present sampling and aliasing on Wednesday, before the Thursday lab.  (I plan to use my son’s stroboscope, but I’ve not figured out what motion we’ll sample with it.)

Well, it is now Tuesday night ending my 4 day all-work weekend (all my weekends have been like that for quite a while).  I did get the grading done and the grades recorded. I also got the handout done for the pressure sensor lab—I even got feedback from my co-instructor on a draft (pointing out at least three things that I needed to fix).  I also designed and soldered up another amplifier today, this one adding a low-pass filter between the INA126P instrumentation amp and a second-stage op amp. The new protoboard works fine and is easier to work with than my first protoboard design (which wasted too much space in providing power connections—now I just take +5V and Gnd through small screw terminals).

I took the unfiltered instrumentation amp output and the higher-gain filtered output both out to screw terminals and looked at them with the Arduino.  There is less noise on the filtered output, but whether that is due to the filter or to the higher gain (so less discretization noise in the ADC) is not clear.  I’ll probably have to look at them with a higher resolution device than the Arduino ADC to see.

Incidentally, I really like the block of 4 screw terminals with 0.1″ pitch that I’m using for the protoboards and the pressure sensor breakout boards.  They may not be able to handle the current of the 3.5mm and 5mm screw terminals that Adafruit and Sparkfun use, but they take up less space and fit much more neatly on the 0.1″ grid.  They provide me with much easier and more secure connections than the header pins that I used to use.  (With the 0.1″ spacing, if I want header pins, I can just fasten a row of double-headed header pins into the screw connectors.)

I’m worried about how much time soldering the amplifier will take.  Despite all the admonishments I’ve given and will give them, I suspect that half the students will come to lab next week with the prelabs only half done—many won’t have done a carefully checked schematic and layout before coming to lab, so they’ll be rushing to do a half-assed one that is missing crucial connections or shorts power and ground.  And then their soldered board won’t work and they’ll spend many more hours trying to debug it than it would have taken them to check their schematic carefully in the first place before wiring things up.  I made everyone redo the lab writeups for the first soldering lab, because they’d mostly been very sloppy in their schematics (among other problems).  Even on the redone lab reports, after being told that they needed to fix and double-check their schematics, there were still a lot of the same sorts of errors, and the lab reports for lab 5 (the first audio amp lab) were not much better.

I wonder if all the engineering students have the same carelessness about details and poor lab notebook skills.  I have not been teaching students how to keep a lab notebook. I thought that had already been covered in their previous 4 years, and I tend to be a poor example: I had to reverse engineer my previous instrumentation amp circuit for the pressure sensor, because I couldn’t find the schematic anywhere—I then made a clean copy of the schematic with CircuitLab, and put a printout of the schematic and the board together in an envelope, so they would not get separated again. I also printed out the schematics for today’s amplifier design, and put them together with the design specs and the layout information in an envelope with the board I designed today.

I’ve been using this blog as my lab notebook for things related to the course, except for designs that I don’t want to give away to the students (which is why the schematic for the pressure sensor amplifier does not appear in this blog, which in turn is why I lost it).  I’ve only occasionally used a paper lab notebook, generally preferring “README” files in the computer directories associated with particular projects.  It’s not as good if I ever want to patent something, but it is a lot easier for me to maintain.

I still haven’t figured out exactly what I’ll do tomorrow in class.  I’ll bring my son’s stroboscope (a Xenon flash tube one made from a Velleman kit), but I’m not sure what I’ll use for periodic motion.  I’ll also need to prepare a “do-now” problem, to keep them sharp on their voltage dividers and op amps.

Next weekend will be dedicated (again) to doing the class-D amplifier lab handout, and building the class-D amplifier to make sure that the lab is doable.


Next Page »

%d bloggers like this: