# Gas station without pumps

## 2014 August 19

### Two papers submitted

I got two papers with a co-author finished and submitted today.

One of them is based on the segmenter algorithms that I wrote up in this blog (with some improved parameterization): Segmenting noisy signals from nanopores, Segmenting filtered signals, and Segmenting noisy signals revisited.

The other one is an HMM paper that uses the automatic segmentation and an HMM that models the DNA motor behavior to do the same analysis that was done by hand in an earlier paper “Error Rates for Nanopore Discrimination Among Cytosine, Methylcytosine, and Hydroxymethylcytosine Along Individual DNA Strands” in PNAS.  The automatic HMM method was more accurate than the hand curation and feature extraction followed by sophisticated machine learning methods like support vector machines and random forests—I think that we were limited before by the biases of the hand curation and feature extraction.

Unfortunately, we were not able to do side-by-side comparison on exactly the same data, as the original data for the PNAS paper was lost in a 2-drive failure on a RAID that had not been properly backed up.

The paper writing did not take much of my time this summer, as my co-author did a lot of the writing and kept me from procrastinating too much.

I’ve also been working on my book, tentatively titled Applied Electronics for Bioengineers, but without a co-author to keep me honest, I’ve been doing a lot of procrastinating.  Yesterday I had planned to show the waveform of the gate signal on a nFET being used for pulse-width modulation, to show the “flat spot” that results from feedback through the gate-drain capacitance.  I fussed with that for quite a while, but never got a recording I liked the looks of—so ended up getting nothing written on the book.  I’ll probably try again tomorrow, but with a backup plan of working on a different section if that one gets too frustrating, or of writing some of the text that will go around the figure, once I get a figure I can use.

The book is currently 215 pages, but that includes the front matter and back matter.  The core of the book is only about 188 pages, with 74 figures.  I probably need a bunch more figures (schematics, graphs, photos, …), but those are what take the longest to produce.  There’s a little over a month left until classes start, but I don’t see myself finishing by then.

## 2014 June 17

### Segmenting noisy signals revisited

This is the fourth in a series—the three previous are

A student (well, junior specialist this year, but student again in the fall) and I were trying to write up a journal paper describing the method and implementation, and I realized that the presentation in this blog was not clear enough for him to rewrite the math, and that I had left in some bits of fluff that were confusing. In this blog post I’ll try to merge the three previous posts into a single, coherent presentation using a Bayesian approach.

The segmenter uses a divide-and-conquer approach:

segment( time interval=[s,t) )
if the time t-s is too short, return no breakpoints.
Select the best breakpoint, i, to divide into two intervals [s,i) [i,t)
If the breakpoint is poor, reject it and return no breakpoints
otherwise return segment(s,i) + [i] + segment(i,t)


There are two essential parts to this algorithm: selecting the breakpoint and deciding whether to accept it.  Both rely on a Bayesian scoring scheme.  For the score, we’ll need two likelihood models, that is computable functions that compute the probability of seeing the observed data if it is generated by the corresponding model.  To score a potential breakpoint, i, we compare a model with the breakpoint (so that [s,i) and [i,t) are modeled independently) to one where the entire interval is modeled as a unit.

The simplest model, and one that has worked well for DNA data from the nanopore, is that each segment consists of a constant current with independent Gaussian noise added to each sample.  The likelihood for the data $x_{s}, \ldots , x_{t-1}$ is just $\prod_{s\leq j < t} P_{\mu,\sigma}(x_{j})$, where µ and σ are the mean and standard deviation of the Gaussian generating the samples in [s,t). I’ll abbreviate this product as $P_{\mu,\sigma}(s:t)$.

The log of the likelihood ratio for a breakpoint at position is $L_{i}=\ln \frac{P_{\mu_{1},\sigma_{1}}(s:i) P_{\mu_{2},\sigma_{2}}(i:t)}{P_{\mu_{0},\sigma_{0}}(s:t)} = \ln P_{\mu_{1},\sigma_{1}}(s:i) +\ln P_{\mu_{2},\sigma_{2}}(i:t)- \ln P_{\mu_{0},\sigma_{0}}(s:t)$,

where µ0 and σ0 are the mean and standard deviation for the whole window from s to t, µ1 and σ1 are the mean and standard deviation for s to i, and µ2 and σ3 are the mean and standard deviation for i to t.  Because we are using maximum likelihood estimates for all the µ and σ values, we are guaranteed that the log-likelihood Li is non-negative.

The log-probability for a single sample with a Gaussian distribution is
$\ln P_{\mu,\sigma}(x) = \ln \left( \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \right) = -0.5 \ln(2\pi) -\ln(\sigma) - \frac{(x-\mu)^2}{2\sigma^2}$
When we chose the optimal μ and σ for a segment, we did it so that σ2 is the variance, that is the average value of $(x-\mu)^2$. So when we sum the log probabilities over a segment of length n, using μ and σ optimized for that segment, we get $n( -0.5 \ln(2\pi) -\ln(\sigma) - 0.5)$.

That means that the log likelihood ratio we want simplifies to $L_{i} = (s-i)\ln \sigma_{1} + (i-t)\ln \sigma_{2} - (s-t)\ln \sigma_{0}$. This simplification works for any modeling that results in independent Gaussian residuals, so we can do stepwise sloping segments by doing linear regression almost as easily as doing stepwise constant segments.

Choosing the best breakpoint is now easy—we just pick the value for i that maximizes Li.

Deciding whether to keep or reject the breakpoint requires looking not just at the likelihood ratio, but at the posterior-odds ratio, which is the likelihood ratio times the prior odds ratio. The prior odds ratio is a parameter we can set independent of the data. For example, if we have an estimate of the expected number of segments per second (sps) for a particular experimental setup and our sampling frequency is fs, then the prior probability of a step at each sample is $\mbox{\it sps}/f_{s}$, the prior odds ratio is $\frac{ \mbox{\it sps} }{f_{s}-\mbox{\it sps}}$, and the score we want to use (log posterior odds ratio) is $L_{i} + \ln \mbox{\it sps} - \ln (f_{s}-\mbox{\it sps})$.

If we are willing to accept no more than FP false positives per second, then we want the probability of a score arising by chance at a sample to be less than $\mbox{FP}/ f_{s}$

Although I have no proof for it, if the samples are all drawn independently from a Gaussian distribution, empirically the log-likelihood scores seem to be perfectly distributed on an exponential distribution $P(L_{i} > x) = e^{-x}$. Note that this distribution does not depend on the length of the window, μ, or σ, though I did avoid looking at the first and last few positions in the window, in case there are edge effects.

However, if we low-pass filter the Gaussian noise, then the samples are no longer independent, and the log-likelihood scores go up. The distribution with a cutoff frequency of k times the Nyquist frequency is empirically $P(L_{i} > x) = e^{-k x}$. (The Nyquist frequency is half the sampling frequency and is the highest frequency that can be represented by the samples.)

If these empirically observed distributions are correct, then we have
$\frac{\mbox{FP}}{f_{s}} > P(\mbox{score} > x) = P\left(L_{i} > x \right) -\ln \mbox{\it sps} + \ln (f_{s}-\mbox{\it sps})= e^{-k x} \left(\frac{\mbox{\it sps} }{f_{s}-\mbox{\it sps}}\right)^{k}$
that is
$x > \ln \frac{\mbox{\it sps}}{f_{s}-\mbox{\it sps}} - \frac{1}{k}\ln\left(\frac{\mbox{FP}}{f_{s}}\right)$.

This is the same result as in More on segmenting noisy signals and Segmenting filtered signals, but there I made the mistake of changing parameterization from FP (false positives per second) to $\mbox{\it over} = \frac{\mbox{FP}}{\mbox{\it sps}}$ (false positives per expected segment), which makes the parameterization more confusing, and I was not clear about where the scaling went for the filtering.

On the example I’ve been checking, I get good results with sps=10 segments/sec and FP=1e-4 false positives/sec, but neither parameter is very critical—changing them by a factor of 1000 does not change the results much.  Dropping FP to 1e-200 does cause segments to be obviously missed.

## 2014 May 24

### Segmenting filtered signals

Last August, I posted on Segmenting noisy signals from nanopores, and in January I updated that with improved parameterization in More on segmenting noisy signals.  Last week, Jacob Schreiber (the programmer for the nanopore group) found some problems with the method. He had re-implemented the method in Cython (rather than Python, as I had prototyped it), and got the same results as me when applied to the same examples I had used. But when he applied it to some new data, he got serious oversegmentation, with hundred of segments chosen where only one should have been.

I tracked down the problem to a difference in how we were applying the method.  I was applying it to data that had been low-pass filtered at 5kHz, recorded at a sampling frequency of 100kHz, then downsampled to 10kHz.  He took the same data without downsampling, then further filtered it with a low-pass filter of 1kHz or 20kHz.  I confirmed that the same data would be correctly segmented with my downsampling and severely oversampled with his extra filtering.  He conjectured that the problem was with the sampling frequency and that the problem could be fixed by just changing the desired gain in log Prob based on sampling frequency.

His argument didn’t seem quite right, as the underlying math was not dependent on sampling frequency.  What it did assume was that the samples within a segment were independent Gaussian-distributed values (more correctly, that the difference between the samples and the model being fitted was Gaussian).  Putting in a low-pass filter removes the independence assumption.  Indeed, with a very low cutoff frequency you expect to see the values changing only slowly, so the segmenter would see it as highly desirable to chop the signal up into short segments, each of which has low variance.

I confirmed that I could get serious oversegmentation even with very low sampling frequency, if the filter cutoff was set much lower.  I got reasonable segmentation results with only minor variations across a wide range of sampling frequencies if the filter cutoff frequency was close to the Nyquist frequency (half the sampling frequency), but oversegmentation if I filtered to 0.1 Nyquist frequency.

There are two ways to address the problem:

• When filtering, always resample at twice the cutoff frequency.
• Adjust the thresholds used according to the cutoff frequency and sampling frequency.

Adjusting the thresholds is a cleaner approach, if it can be made to work.  Today I I tried filtering Gaussian noise with a 5th-order Bessel filter (implemented by scipy.signal.lfilter) and looking at the distribution of the log-odds that we threshold:

$L_{i}=\log \frac{P_{\theta_{1}}(s:i) P_{\theta_{2}}(i:t)}{P_{\theta_{0}}(s:e)} = \log P_{\theta_{1}}(s:i) + \log P_{\theta_{2}}(i:t) - \log P_{\theta_{0}}(s:t)$,

where i is the breakpoint, θ1 and θ2 are the parameters for the two segments, and θ0 is the parameter setting for the whole segment.  For stepwise Gaussian models this simplifies to $L_{i} = (s-i)\ln \sigma_{1} + (i-t)\ln \sigma_{2} - (s-t)\ln \sigma_{0}$.   (To avoid the extra computation of taking square roots, the segmenter actually computes twice this, using variances instead of standard deviations.)

I found that the distributions were still very well fit by exponential distributions, but that the scaling factor changed with the cutoff frequency.   Furthermore, the log-odds seemed to scale linearly with the ratio of Nyquist frequency to cutoff frequency.  Reversing that scaling makes the curves superimpose nicely:

If I scale the log-odds by filter cutoff frequency as a fraction of the Nyquist frequency, the distributions seem to be almost identical. Unfiltered Gaussians have slightly lower log-odds values. Filtering, then downsampling by averaging blocks of 10 samples has slightly higher log-odds, probably because the averaging does a little more low-pass filtering.

Each curve of the plot was computed from only 1000 sequences of 10000 samples, so the results are vary a lot in the tail, due to not having enough data to get the tail exactly.

Because the segmentation method is not very sensitive to the threshold set, simply scaling thresholds by Nyquist frequency/cutoff frequency should work well.   The default parameters had been set for the downsampled 5kHz, so to match the old results in signals  that aren’t downsampled, the thresholds should probably be scaled by about 1.1*Nyquist frequency/cutoff frequency.  (The analog filter cutoff frequency is probably in the metatdata in the data files, but the abf format is messy enough that I’ve not tried to extract it.  The lower cutoff frequency of the analog filter or a subsequent digital filter should be used.)