Gas station without pumps

2015 April 2

Moving sampling lab early was a good idea

Filed under: Circuits course — gasstationwithoutpumps @ 20:00

This year I moved the sampling and aliasing lab to the first week, from the middle of the quarter.  To do this, I had to remove the design component from the lab: I gave them the circuit for the high-pass filter that re-centers the function generator output halfway between the power rails.  I didn’t tell the students that the function generator can set a DC offset itself (which would have let us make the observations without building the filter), because a big point of the lab was to get students to translate a schematic into components and wires on a breadboard.  I was worried that the lab would not make sense before we’d covered RC filters, but by treating the filter as “magic” to be explained later, we could concentrate on the lab skills of wiring and running the PteroDAQ software, without them having to figure out a design at the same time.  The rest of the quarter will all be design labs.

I started the lab class with a mini-lecture covering:

• block diagram.  I constructed a block diagram starting from the two ends: the function generator and the PteroDAQ, and working out what was needed in the middle to make the output of the function generator match the input of the KL25Z analog-to-digital converter in the PteroDAQ.
• schematic. I redrew the schematic from my book:

The DC-blocking filter they built has a corner frequency of 0.03Hz, and they were looking at waveforms in the 1Hz–25Hz range.

I explained briefly how capacitors block DC but allow AC currents, how electrolytic capacitors work, and how the insulating layer self-repairs when the voltage is in the right polarity and how it can catastrophically fail with the polarity reversed.  (I did mention blowing up capacitors.)  I did not explain RC time constants, high-pass filters, or even voltage dividers, promising that all the theory for this circuit would be coming over the next two weeks.

• resistor color codes. I did not give the full resistor color code, but just explained the red-red-orange-gold bands on the 22kΩ ±5% resistors we had: red-red for the digits 22, orange for the ×10³ multiplier, and gold for the ±5% tolerance.
• breadboard layout. I explained what holes were connected to what on the breadboards, and why the central channel was there (for dual-inline packages, which also dictate the 0.1″ spacing).
• black/red color convention. I insist on students using black for GND and only for GND, and using red for the positive power rail (3.3V in this case) and only for the positive power rail.  I handed out about 4″ each of red, black, blue, and green 22-gauge wire to each pair of students.  (The lab has a couple of big spools of white and orange 24-gauge wire, but I’ve found 24-gauge unreliable in breadboards, so we’re going to use small amounts of 22-gauge instead.)
• plotting two channels with gnuplot.  The assignment called for plotting the same signal twice, once sampled at 50Hz and again down-sampled 10× to 5Hz, so I needed to show them how to plot both on the same axes in gnuplot (plot "data.txt" using 1:2, "data.txt" using 1:3).
• V always relative to ground.  I reminded students that voltage is always the difference between points, and when we talk about the voltage at one node (like at the input), we always mean relative to the special node we call ground.
• SchemeIt to create schematics.  I suggested to students that since the report this week is primarily about learning to use the tools they’ll need all quarter, it would be best if they reproduced the schematic using SchemeIt, rather than just cutting-and-pasting from the book.
• function generator and oscilloscope demo. I demonstrated the use of the function generator and told them about the peak-to-peak voltage being a deliberate lie and needing to double what the function generator claims the output is (since we don’t use the expected 50Ω load, but an effectively infinite resistance).
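
Since the point of the lab is seeing aliasing when a signal is down-sampled 10×, here is a minimal Python/numpy sketch of the effect (my own illustration, not part of the assignment; the 6Hz tone and two-second window are arbitrary choices):

```python
import numpy as np

fs = 50.0                     # PteroDAQ sampling frequency (Hz)
f_sig = 6.0                   # example signal frequency (an assumption)
t = np.arange(0, 2, 1 / fs)   # two seconds of samples at 50 Hz
x = np.sin(2 * np.pi * f_sig * t)

# Down-sample 10x to 5 Hz by keeping every 10th sample.
t_slow, x_slow = t[::10], x[::10]

# A 6 Hz tone sampled at 5 Hz aliases to |6 - 5| = 1 Hz: the slow samples
# fall exactly on a 1 Hz sinusoid, which is what the two plots should show.
alias = np.sin(2 * np.pi * 1.0 * t_slow)
print(np.allclose(x_slow, alias))   # -> True
```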

After the mini-lecture, I had the students set up the function generators and observe the output on the digital oscilloscopes.  They also built their circuits, which I inspected before they powered them up.  There were the usual confusions about what holes were connected, but they seemed to clear up faster than usual this year.  For the morning lab, I had not mentioned the “V relative to ground” convention at the beginning of class and had to do a very short mini-lecture in the middle of class.  Perhaps because I made notes on the board of everything I ended up covering in the first lab, the second lab went a lot more smoothly.

I did find out this evening that one group had not read the assignment properly and missed the whole point of the sampling and aliasing lab—they only recorded one channel of data, not two at different sampling frequencies.  They’ll borrow correct data from another group (with appropriate credit).  I hope that the other group put appropriate metadata in the notes, so they can figure out what they are looking at.  I also hope that finding out at the last minute that they’d done the lab wrong will induce the students to read the assignments before lab in future, when the stakes are higher.  I am requiring pre-lab homework to be turned in on Mondays starting next week, so there will at least be a deadline before lab for them to have done some of the reading and thinking.

The students still have not gotten their full parts and tools kits.  They got breadboards and electrolytic capacitors today—we only loaned the 22kΩ resistors, since the resistor assortments have not arrived yet.

I was in the lab essentially all day 9:45—17:45, except for a break to eat my lunch just outside the lab between the two lab sections.  During that time I helped a lot of the students with installing the software they needed.  Incidentally, the widened hallway outside the lab is a popular study spot for engineering students (there are frequent formal and informal group study sessions there)—the area has been informally named “Jack’s Lounge” after the picture of a generous donor at the entrance (Jack Baskin, after whom both the building and the School of Engineering are named).  I don’t know if the implication is intended to be that Jack’s idea of “lounging” is studying hard, but if so it is rather flattering to him.

Next week’s lab will probably require more of my time, as it involves the temperature measurements, and I have to set up the hot water urn and ice water well before class starts.

2014 August 19

Two papers submitted

I got two papers with a co-author finished and submitted today.

One of them is based on the segmenter algorithms that I wrote up in this blog (with some improved parameterization): Segmenting noisy signals from nanopores, Segmenting filtered signals, and Segmenting noisy signals revisited.

The other one is an HMM paper that uses the automatic segmentation and an HMM that models the DNA motor behavior to do the same analysis that was done by hand in an earlier paper “Error Rates for Nanopore Discrimination Among Cytosine, Methylcytosine, and Hydroxymethylcytosine Along Individual DNA Strands” in PNAS.  The automatic HMM method was more accurate than the hand curation and feature extraction followed by sophisticated machine learning methods like support vector machines and random forests—I think that we were limited before by the biases of the hand curation and feature extraction.

Unfortunately, we were not able to do side-by-side comparison on exactly the same data, as the original data for the PNAS paper was lost in a 2-drive failure on a RAID that had not been properly backed up.

The paper writing did not take much of my time this summer, as my co-author did a lot of the writing and kept me from procrastinating too much.

I’ve also been working on my book, tentatively titled Applied Electronics for Bioengineers, but without a co-author to keep me honest, I’ve been doing a lot of procrastinating.  Yesterday I had planned to show the waveform of the gate signal on an nFET being used for pulse-width modulation, to show the “flat spot” that results from feedback through the gate-drain capacitance.  I fussed with that for quite a while, but never got a recording I liked the looks of—so I ended up getting nothing written on the book.  I’ll probably try again tomorrow, but with a backup plan of working on a different section if that one gets too frustrating, or of writing some of the text that will go around the figure, once I get a figure I can use.

The book is currently 215 pages, but that includes the front matter and back matter.  The core of the book is only about 188 pages, with 74 figures.  I probably need a bunch more figures (schematics, graphs, photos, …), but those are what take the longest to produce.  There’s a little over a month left until classes start, but I don’t see myself finishing by then.

2014 June 17

Segmenting noisy signals revisited

This is the fourth in a series—the three previous posts are Segmenting noisy signals from nanopores, More on segmenting noisy signals, and Segmenting filtered signals.

A student (well, junior specialist this year, but student again in the fall) and I were trying to write up a journal paper describing the method and implementation, and I realized that the presentation in this blog was not clear enough for him to rewrite the math, and that I had left in some bits of fluff that were confusing. In this blog post I’ll try to merge the three previous posts into a single, coherent presentation using a Bayesian approach.

The segmenter uses a divide-and-conquer approach:

segment(time interval [s,t)):
    if the time t-s is too short, return no breakpoints
    select the best breakpoint, i, dividing [s,t) into two intervals [s,i) and [i,t)
    if the breakpoint is poor, reject it and return no breakpoints
    otherwise, return segment(s,i) + [i] + segment(i,t)


There are two essential parts to this algorithm: selecting the breakpoint and deciding whether to accept it.  Both rely on a Bayesian scoring scheme.  For the score, we’ll need two likelihood models, that is, computable functions that give the probability of seeing the observed data if it is generated by the corresponding model.  To score a potential breakpoint, i, we compare a model with the breakpoint (so that [s,i) and [i,t) are modeled independently) to one where the entire interval [s,t) is modeled as a unit.
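
The pseudocode and scoring above can be sketched in Python.  This is my own illustrative implementation, not the code from the paper: it recomputes variances from scratch for every candidate, so it is quadratic, where a real implementation would use cumulative sums.  The threshold is the log posterior-odds cutoff derived later in the post.

```python
import numpy as np

def seg_cost(x):
    """n * ln(sigma) for a segment; the constant terms cancel in the ratio."""
    return len(x) * 0.5 * np.log(x.var() + 1e-300)  # 0.5*ln(var) == ln(sigma)

def segment(x, s, t, threshold, min_len=5):
    """Divide-and-conquer segmenter: return sorted breakpoints within [s,t)."""
    if t - s < 2 * min_len:                 # interval too short to split
        return []
    whole = seg_cost(x[s:t])
    best_i, best_L = None, -np.inf
    for i in range(s + min_len, t - min_len + 1):
        # log-likelihood ratio L_i = (t-s)ln(sig0) - (i-s)ln(sig1) - (t-i)ln(sig2)
        L = whole - seg_cost(x[s:i]) - seg_cost(x[i:t])
        if L > best_L:
            best_i, best_L = i, L
    if best_L < threshold:                  # poor breakpoint: reject it
        return []
    return (segment(x, s, best_i, threshold, min_len) + [best_i]
            + segment(x, best_i, t, threshold, min_len))

rng = np.random.default_rng(0)
steps = np.concatenate([rng.normal(0, 1, 200), rng.normal(4, 1, 200)])
print(segment(steps, 0, len(steps), threshold=10.0))  # a breakpoint near 200
```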

The simplest model, and one that has worked well for DNA data from the nanopore, is that each segment consists of a constant current with independent Gaussian noise added to each sample.  The likelihood for the data $x_{s}, \ldots , x_{t-1}$ is just $\prod_{s\leq j < t} P_{\mu,\sigma}(x_{j})$, where µ and σ are the mean and standard deviation of the Gaussian generating the samples in [s,t). I’ll abbreviate this product as $P_{\mu,\sigma}(s:t)$.

The log of the likelihood ratio for a breakpoint at position i is $L_{i}=\ln \frac{P_{\mu_{1},\sigma_{1}}(s:i) P_{\mu_{2},\sigma_{2}}(i:t)}{P_{\mu_{0},\sigma_{0}}(s:t)} = \ln P_{\mu_{1},\sigma_{1}}(s:i) +\ln P_{\mu_{2},\sigma_{2}}(i:t)- \ln P_{\mu_{0},\sigma_{0}}(s:t)$,

where µ0 and σ0 are the mean and standard deviation for the whole window from s to t, µ1 and σ1 are the mean and standard deviation for s to i, and µ2 and σ2 are the mean and standard deviation for i to t.  Because we are using maximum-likelihood estimates for all the µ and σ values, we are guaranteed that the log-likelihood ratio Li is non-negative.

The log-probability for a single sample with a Gaussian distribution is
$\ln P_{\mu,\sigma}(x) = \ln \left( \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \right) = -0.5 \ln(2\pi) -\ln(\sigma) - \frac{(x-\mu)^2}{2\sigma^2}$
When we choose the optimal μ and σ for a segment, the maximum-likelihood estimate makes σ² the variance, that is, the average value of $(x-\mu)^2$. So when we sum the log probabilities over a segment of length n, using the μ and σ optimized for that segment, the last term sums to n/2, and we get $n( -0.5 \ln(2\pi) -\ln(\sigma) - 0.5)$.

That means that the log likelihood ratio we want simplifies to $L_{i} = (s-i)\ln \sigma_{1} + (i-t)\ln \sigma_{2} - (s-t)\ln \sigma_{0}$. This simplification works for any modeling that results in independent Gaussian residuals, so we can do stepwise sloping segments by doing linear regression almost as easily as doing stepwise constant segments.
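
The simplification is easy to spot-check numerically.  Here is a quick sketch of my own (the mean, standard deviation, and segment length are arbitrary test values) comparing the summed log-probabilities against the per-segment closed form $n(-0.5\ln(2\pi) - \ln\sigma - 0.5)$:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(3.0, 2.0, 1000)   # one segment of Gaussian samples

def loglik(seg):
    """Sum of Gaussian log-probabilities using the ML mean and sigma of `seg`."""
    mu, sigma = seg.mean(), seg.std()   # np.std is the ML (biased) estimate
    return np.sum(-0.5 * np.log(2 * np.pi) - np.log(sigma)
                  - (seg - mu) ** 2 / (2 * sigma ** 2))

n, sigma = len(x), x.std()
closed_form = n * (-0.5 * np.log(2 * np.pi) - np.log(sigma) - 0.5)
print(np.isclose(loglik(x), closed_form))   # -> True
```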

Choosing the best breakpoint is now easy—we just pick the value for i that maximizes Li.

Deciding whether to keep or reject the breakpoint requires looking not just at the likelihood ratio, but at the posterior-odds ratio, which is the likelihood ratio times the prior odds ratio. The prior odds ratio is a parameter we can set independent of the data. For example, if we have an estimate of the expected number of segments per second (sps) for a particular experimental setup and our sampling frequency is fs, then the prior probability of a step at each sample is $\mbox{\it sps}/f_{s}$, the prior odds ratio is $\frac{ \mbox{\it sps} }{f_{s}-\mbox{\it sps}}$, and the score we want to use (log posterior odds ratio) is $L_{i} + \ln \mbox{\it sps} - \ln (f_{s}-\mbox{\it sps})$.

If we are willing to accept no more than FP false positives per second, then we want the probability of a score arising by chance at any given sample to be less than $\mbox{FP}/ f_{s}$.

Although I have no proof of it, if the samples are all drawn independently from a Gaussian distribution, the log-likelihood scores empirically seem to follow an exponential distribution: $P(L_{i} > x) = e^{-x}$. Note that this distribution does not depend on the length of the window, μ, or σ, though I did avoid looking at the first and last few positions in the window, in case there are edge effects.

However, if we low-pass filter the Gaussian noise, then the samples are no longer independent, and the log-likelihood scores go up. The distribution with a cutoff frequency of k times the Nyquist frequency is empirically $P(L_{i} > x) = e^{-k x}$. (The Nyquist frequency is half the sampling frequency and is the highest frequency that can be represented by the samples.)
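
The exponential fit for unfiltered noise is easy to reproduce with a small simulation (my own sketch, scoring only the midpoint of each window; one way to see why the fit is plausible is that twice a log-likelihood ratio is asymptotically χ² with 2 degrees of freedom here, for the two extra parameters, and half a χ²₂ variate is exactly Exp(1)):

```python
import numpy as np

rng = np.random.default_rng(0)

def L_mid(x):
    """Log-likelihood ratio for a breakpoint at the middle of window x."""
    h = len(x) // 2
    s1, s2, s0 = x[:h].std(), x[h:].std(), x.std()
    return len(x) * np.log(s0) - h * np.log(s1) - (len(x) - h) * np.log(s2)

# Score many independent Gaussian windows; Exp(1) has mean 1.
scores = np.array([L_mid(rng.normal(size=500)) for _ in range(2000)])
print(round(scores.mean(), 1))   # close to 1.0
```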

If these empirically observed distributions are correct, then we have
$\frac{\mbox{FP}}{f_{s}} > P(\mbox{score} > x) = P\left(L_{i} > x - \ln \mbox{\it sps} + \ln (f_{s}-\mbox{\it sps}) \right) = e^{-k x} \left(\frac{\mbox{\it sps} }{f_{s}-\mbox{\it sps}}\right)^{k}$
that is
$x > \ln \frac{\mbox{\it sps}}{f_{s}-\mbox{\it sps}} - \frac{1}{k}\ln\left(\frac{\mbox{FP}}{f_{s}}\right)$.
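
As a worked example of the inequality (the numbers are my own, chosen to resemble the 10kHz downsampled data): with sps = 10 segments/s, FP = 1e-4 false positives/s, f_s = 10kHz, and the cutoff at the Nyquist frequency (k = 1), the threshold comes out around 11.5:

```python
import numpy as np

def score_threshold(sps, FP, f_s, k=1.0):
    """x > ln(sps/(f_s - sps)) - (1/k) * ln(FP/f_s), from the inequality above."""
    return np.log(sps / (f_s - sps)) - np.log(FP / f_s) / k

x = score_threshold(sps=10.0, FP=1e-4, f_s=1e4)
print(round(x, 2))   # -> 11.51
```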

This is the same result as in More on segmenting noisy signals and Segmenting filtered signals, but there I made the mistake of changing parameterization from FP (false positives per second) to $\mbox{\it over} = \frac{\mbox{FP}}{\mbox{\it sps}}$ (false positives per expected segment), which makes the parameterization more confusing, and I was not clear about where the scaling went for the filtering.

On the example I’ve been checking, I get good results with sps=10 segments/sec and FP=1e-4 false positives/sec, but neither parameter is very critical—changing them by a factor of 1000 does not change the results much.  Dropping FP to 1e-200 does cause segments to be obviously missed.

2014 May 24

Segmenting filtered signals

Last August, I posted on Segmenting noisy signals from nanopores, and in January I updated that with improved parameterization in More on segmenting noisy signals.  Last week, Jacob Schreiber (the programmer for the nanopore group) found some problems with the method. He had re-implemented the method in Cython (rather than Python, as I had prototyped it), and got the same results as me when applied to the same examples I had used. But when he applied it to some new data, he got serious oversegmentation, with hundreds of segments chosen where only one should have been.

I tracked down the problem to a difference in how we were applying the method.  I was applying it to data that had been low-pass filtered at 5kHz, recorded at a sampling frequency of 100kHz, then downsampled to 10kHz.  He took the same data without downsampling, then further filtered it with a low-pass filter at 1kHz or 20kHz.  I confirmed that the same data would be correctly segmented with my downsampling and severely oversegmented with his extra filtering.  He conjectured that the problem was with the sampling frequency, and that it could be fixed by just adjusting the desired gain in log probability based on the sampling frequency.

His argument didn’t seem quite right, as the underlying math does not depend on the sampling frequency.  What it does assume is that the samples within a segment are independent Gaussian-distributed values (more correctly, that the differences between the samples and the model being fitted are Gaussian).  Putting in a low-pass filter violates the independence assumption.  Indeed, with a very low cutoff frequency you expect the values to change only slowly, so the segmenter sees it as highly desirable to chop the signal up into short segments, each of which has low variance.

I confirmed that I could get serious oversegmentation even with very low sampling frequency, if the filter cutoff was set much lower.  I got reasonable segmentation results with only minor variations across a wide range of sampling frequencies if the filter cutoff frequency was close to the Nyquist frequency (half the sampling frequency), but oversegmentation if I filtered to 0.1 Nyquist frequency.

There are two ways to address the problem:

• When filtering, always resample at twice the cutoff frequency.
• Adjust the thresholds used according to the cutoff frequency and sampling frequency.
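
Both fixes are only a few lines.  Here is a numpy sketch of my own (the frequencies and the unscaled threshold are assumed example values), using block averaging for the resampling:

```python
import numpy as np

fs, cutoff = 100_000, 5_000        # sampling frequency and filter cutoff (Hz)
factor = fs // (2 * cutoff)        # resample so the cutoff is the new Nyquist

rng = np.random.default_rng(1)
x = rng.normal(size=fs)            # stand-in for one second of filtered data

# Fix 1: down-sample by averaging non-overlapping blocks of `factor` samples.
x_ds = x[: len(x) // factor * factor].reshape(-1, factor).mean(axis=1)

# Fix 2: keep the data as-is and scale the log-odds threshold instead,
# by the ratio of Nyquist frequency to cutoff frequency.
threshold = 11.5                   # example unscaled threshold (an assumption)
scaled_threshold = threshold * (fs / 2) / cutoff
print(len(x_ds), scaled_threshold)   # -> 10000 115.0
```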

Adjusting the thresholds is the cleaner approach, if it can be made to work.  Today I tried filtering Gaussian noise with a 5th-order Bessel filter (applied with scipy.signal.lfilter) and looking at the distribution of the log-odds that we threshold:

$L_{i}=\log \frac{P_{\theta_{1}}(s:i) P_{\theta_{2}}(i:t)}{P_{\theta_{0}}(s:t)} = \log P_{\theta_{1}}(s:i) + \log P_{\theta_{2}}(i:t) - \log P_{\theta_{0}}(s:t)$,

where i is the breakpoint, θ1 and θ2 are the parameters for the two segments, and θ0 is the parameter setting for the whole segment.  For stepwise Gaussian models this simplifies to $L_{i} = (s-i)\ln \sigma_{1} + (i-t)\ln \sigma_{2} - (s-t)\ln \sigma_{0}$.   (To avoid the extra computation of taking square roots, the segmenter actually computes twice this, using variances instead of standard deviations.)

I found that the distributions were still very well fit by exponential distributions, but that the scaling factor changed with the cutoff frequency.   Furthermore, the log-odds seemed to scale linearly with the ratio of Nyquist frequency to cutoff frequency.  Reversing that scaling makes the curves superimpose nicely:

If I scale the log-odds by filter cutoff frequency as a fraction of the Nyquist frequency, the distributions seem to be almost identical. Unfiltered Gaussians have slightly lower log-odds values. Filtering, then downsampling by averaging blocks of 10 samples has slightly higher log-odds, probably because the averaging does a little more low-pass filtering.

Each curve of the plot was computed from only 1000 sequences of 10000 samples, so the results vary a lot in the tail, where there is not enough data to estimate the distribution accurately.

Because the segmentation method is not very sensitive to the threshold, simply scaling thresholds by the ratio of Nyquist frequency to cutoff frequency should work well.  The default parameters had been set for the downsampled 5kHz data, so to match the old results on signals that aren’t downsampled, the thresholds should probably be scaled by about 1.1 × Nyquist frequency/cutoff frequency.  (The analog-filter cutoff frequency is probably in the metadata of the data files, but the abf format is messy enough that I’ve not tried to extract it.  The lower cutoff frequency of the analog filter or any subsequent digital filter should be used.)
