Gas station without pumps

2014 February 1

More on segmenting noisy signals

Filed under: Uncategorized — gasstationwithoutpumps @ 10:34
Tags: , ,

Last August, I posted on Segmenting noisy signals from nanopores, but I’ve always been a bit unsatisfied with the parameterization of that method.  The basic idea is to have a parameterized probabilistic model for a segment, Pθ(s:t), where θ is some parameter setting for the model optimized for the half-open time interval [s,t).  We compare the probability of the segment modeled as two subsegments with the segment modeled as a single unit:

L_{i}=\log \frac{P_{\theta_{1}}(s:i) P_{\theta_{2}}(i:t)}{P_{\theta_{0}}(s:e)} = \log P_{\theta_{1}}(s:i) + \log P_{\theta_{2}}(i:t) - \log P_{\theta_{0}}(s:t),

where i is the breakpoint, θ1 and θ2 are the parameters for the two segments, and θ0 is the parameter setting for the whole segment.  I showed in the previous post how to simplify this for stepwise Gaussian models to L_{i} = (s-i)\ln \sigma_{1} + (i-t)\ln \sigma_{2} - (s-t)\ln \sigma_{0}.  Note that translation or scaling of the samples will not affect this log-likelihood, as the means have already cancelled, and scaling adds a constant to all the \ln\sigma values, and that constant also cancels.

To find the best breakpoint in an interval, we check each possible breakpoint, and pick the one that maximizes Li. We then decide whether that Li is big enough to keep the breakpoint or not using a user-settable threshold. If the threshold is set very low, then the user is saying that there is not much noise and we should follow the signal closely. If the threshold is set high, then the user is saying that there is a lot of noise and we need to be very confident of any breakpoint we claim.

What has been bothering me is how we’ve been setting the threshold.  I’d really like to have an easily interpreted parameter that we can set independent of sampling rate or other confounding factors.

I started out by looking at how L_{i} is distributed for sequences that are generated from a single Gaussian model (with no step).  I did a bunch of simulation experiments, and I was surprised to find that the probability density function appears to be a simple exponential distribution Pr(L_{i} \geq x ) = e^{-x} for x\geq 0, and that this distribution does not depend on how big a sequence is being split.

That lack of dependence on how big the sequence is means that I don’t have to adjust the threshold based on how big a window I’m looking at. I had originally thought that there ought to be such an adjustment, but I got better results by keeping the threshold fixed.

If the empirically observed exponential distribution of L_{i} is correct, then we can parameterize the threshold in terms of a false-positives-per-second parameter, F. If we divide by the sampling frequency, we get a false-positive-per-sample F/f_{s}. The threshold T needs to be set so that Pr(L_{i} \geq T ) = e^{-T} \leq F/f_{s}, or T\geq -\ln(F/f_{s}).

But specifying a tolerance for false positives is not good enough, as one generally wants about the same false positive rate independent of how fast the signal is changing, but the threshold should be tighter if you expect the signal to remain constant for a long time, and looser if you want to track the signal changes rapidly.  I think that this can be handled by not using the likelihood ratio, but the posterior odds ratio, in a properly Bayesian way.  The user of the segmenter expresses how finely they want things divided in terms of expected steps per second, sps, as well as an acceptable overcalling rate, over, (say 1% of segments getting split), which sets the false-positive rate F=sps*over.

The prior probability of a step at each sample is \mbox{\it sps}/f_{s}, so the prior odds ratio is \frac{ \mbox{\it sps} }{f_{s}-\mbox{\it sps}}, and the posterior odds ratio is L_{i} \frac{ \mbox{\it sps} }{f_{s}-\mbox{\it sps}}.  The threshold on likelihood would then be set to T=-\ln \frac{ \mbox{\it sps} }{f_{s}-\mbox{\it sps}}-\ln(F/f_{s})\approx -2\ln(\mbox{\it sps}) -\ln(\mbox{\it over})+2\ln(f_{s}).

On the sample event that I’ve been using for testing, setting the segments/second to 10, and the oversegmentation rate to 0.01 gets me pretty much the segmentation I want to see, with an average segment length of 68.79ms (14.5 segments per second).  Adjusting the prior segments/second to 1.0 gets me 11.6 segments per second, and to 100.0 gets me 18.4 segments/second, so the prior is having about the right effect on the segmentation, adjusting the rate of segmentation without overwhelming the data.  Since the underlying segmenter is the same as before (only the parameters used for setting the threshold have changed),  it may still be difficult to find settings that work for some nanopore experiments.

2013 August 10

Segmenting noisy signals from nanopores

Filed under: Uncategorized — gasstationwithoutpumps @ 10:05
Tags: , ,

Today’s post is not about what or how I teach, but about some of the research I’ve been doing lately.  I’ve been collaborating with another faculty member and his students on analyzing nanopore data.

The basic idea of a nanopore is simple: you have a thin membrane separating two salt-water baths, and stick a very tiny hole in it.  You put a voltage difference between the two baths and measure the current through the hole.  If you can arrange for large molecules to pass through the hole, they block some of the current flow, and you can monitor the passage of the molecules through the holes by monitoring the current.

In my collaborator’s lab, they use a lipid bilayer (similar to a cell membrane) as the separating membrane, and a porin protein to punch a hole in the membrane.  (They used to use α-hemolysin as the nanopore, but are now trying MSP-A, which has some better properties.)  Mostly they pass DNA through the pore, since funding is most available for DNA sequencing technologies, but they’ve also explored other large biomolecules.

Side view of the MSPA pore, from PDB file 1UUN

Side view of the MSPA pore, from PDB file 1UUN

Top view of the MSPA pore, showing the 13.8Å diameter hole that it makes in a membrane.

Top view of the MSPA pore, showing the 13.8Å diameter hole that it makes in a membrane.

One problem with nanopores, as with almost all single-molecule techniques, is that the signals are really small, nearly buried in noise (see my post Noise in nanopores).  We are often looking for changes of less than a pA (that’s 10–12 amps) with noise that may be 2pA RMS or more.

When things are working the way we want, the large molecule moves stepwise through the pore. For example, with DNA in the pore, we want to attach some sort of DNA motor to pull the DNA through. But we don’t want a smooth, continuous motion—we want a motor that advances the DNA by one nucleotide rapidly, then stays there for a long time so we can have a constant current to identify what is in the pore, then advances again by one and stays there, and so forth. Luckily for us, many DNA motors do indeed demonstrate this sort of stepwise motion.

One of the subproblems I’ve been helping students with for the past several months is trying to recover a clean stepwise signal from the noisy data. The problem is to summarize the continuous-time current trace by a small number of “segments”, each of which can be summarized by the mean, standard deviation, and duration of the segment.

There were a couple of really bright students working on it, and they had reasonable success, but the methods always seemed a bit fragile—a new set of data taken under different conditions required a lot of tweaking to get the method to segment the data adequately, and sometimes we were unable to find parameters that would work for all parts of the trace—particularly when some parts had higher noise levels than others, or much shorter durations for the fragments.

Since the data is not mine to share, I made up some fake data that looks sort of like a nanopore trace (but cleaner—no 60Hz noise, no baseline drift, just clean steps between Gaussian distributions):

This is a fake nanopore trace that has been segmented by my program.  The light green trace is the raw data, the red vertical lines are the breakpoints and the dark blue lines are the means for each segment.   The noise levels for each segment are realistic, as are the durations and separations between the steps.  The segmentation is almost perfect (there are a couple of very tiny extra segments added).

This is a fake nanopore trace that has been segmented by my program. The light green trace is the raw data, the red vertical lines are the breakpoints and the dark blue lines are the means for each segment. The noise levels for each segment are fairly realistic, as are the durations and separations between the steps. The segmentation is almost perfect (there are a couple of very tiny extra segments added).

The method we were using was a fairly straightforward one: use a low-pass filter to remove noise, then look for large changes in the filtered signal to determine that a breakpoint had occurred, and look for a peak in the derivative of the filtered signal to place the breakpoint precisely. This is a fairly standard technique, and it works well on some signals. But it wasn’t working well for us on some of our data, because some of the time the signal changes very slowly, and sometimes it changes very fast. Tweaking the filter and threshold parameters helped, but when we got adequate segmentation on one part of the signal we either over-segmented some parts or missed segments (or both). We were coming up with more and more baroque schemes to post-process the segments, and still not getting something that was clean enough to pass on to the machine-learning algorithms that were supposed to do the actual analysis we needed.

Finally this week I did what I should have done months earlier—I spent some time trying to figure out what had been published in the past. Segmentation is not a new problem, and it has gone by many names—figuring out what names to look for it under was a big part of the problem. Finally I hit on “change-point detection” and found a free PDF textbook: Detection of Abrupt Changes: Theory and Application by Michèle Basseville and Igor V. Nikiforov.  I only read the first two chapters of the book, when I realized that everything we had done so far was a variant of the “filtered derivative” technique, and that there was a large class of techniques that we hadn’t tried.  The other techniques are based on maximum-likelihood estimates of where the breakpoints are, which is a comfortable framework for me to work in.  The authors were talking mainly about “online” methods (which detect the change after only a small delay) and for the first chapters at least were talking only about stepwise changes in mean with a constant standard deviation.

I grabbed onto one idea fairly quickly—trying to determine whether there was a step in a given time interval and, if so, where it occurred. (This is Section 2.6 of their book: Off-line Change Detection.)  The idea is to have a parameterized probabilistic model for a segment, Pθ(s:t), where θ is some parameter setting for the model optimized for the half-open time interval [s,t).  We compare the probability of the segment modeled as two subsegments with the segment modeled as a single unit:

L_i=\log \frac{P_{\theta_{1}}(s:i) P_{\theta_{2}}(i:t)}{P_{\theta_{0}}(s:e)} = \log P_{\theta_{1}}(s:i) + \log P_{\theta_{2}}(i:t) - \log P_{\theta_{0}}(s:t),

where i is the breakpoint, θ1 and θ2 are the parameters for the two segments, and θ0 is the parameter setting for the whole segment

For a stepwise model, the probabilistic model can be just an independent, identically distributed model for the samples in the segment, with each sample chosen from a Gaussian distribution. The parameters for each model are then just the mean μ and standard deviation σ for the Gaussian, and we can get the maximum-likelihood estimates easily from the samples in the segment.  (In the implementation, I actually keep cumulative sums of x and x2 in order to compute mean and variance quickly for any interval.) Note that with this choice of model, the log-likelihood ratio Li is always non-negative if we choose maximum-likelihood estimates for each segment, since we could get 0 if we chose (possibly non-optimal) parameters on top θ120.  That means that we can always model the data more closely if we add more breakpoints.

The probability of a segment in this model just the product of the probabilities of the samples in the segment, and the log probability of the segment is the sum of the log probabilities of the samples.  So what is the log probability of a single sample from a Gaussian distribution?

\ln \left( \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \right) = -0.5 \ln(2\pi) -\ln(\sigma) - \frac{(x-\mu)^2}{2\sigma^2}

Now comes the magic.

When we chose the optimal μ and σ for a segment, we did it so that σ2 is the variance, that is the average value of (x-\mu)^2. So when we sum the log probabilities over a segment of length n, using μ and σ optimized for that segment, we get n( -0.5 \ln(2\pi) -\ln(\sigma) - 0.5).

That means that the log probability ratio we want simplifies to L_i = (s-i)\ln \sigma_{1} + (i-t)\ln \sigma_{2} - (s-t)\ln \sigma_{0}.

To find the best breakpoint in an interval, we check each possible breakpoint, and pick the one that maximizes Li. We then decide whether that Li is big enough to keep the breakpoint or not using a user-settable threshold. If the threshold is set very low, then the user is saying that there is not much noise and we should follow the signal closely. If the threshold is set high, then the user is saying that there is a lot of noise and we need to be very confident of any breakpoint we claim.

With this simple method for finding a single breakpoint, I wrote an algorithm for recovering a series of steps, with three parameters: the threshold for keeping a breakpoint, a minimum segment duration m, and a window width w.

Here’s how it works for segment(s:t)

  1. I look at a window starting at the beginning of the data (s:s+w), and see if there is a keepable breakpoint in that window (only examining possible breakpoints in (s+m:s+w–m) to avoid creating a too-short segment).
  2. If there is a keepable breakpoint at i, I recursively look at the data on either side, returning segment(s:i) , i, segment(i:t).
  3. If there is no keepable breakpoint I advance the window halfway and try again.  I do this iteratively, but it is equivalent to returning segment(s+w/2:t).

In the current implementation, I have another parameter for the maximum segment duration, and I force in breakpoints if I haven’t seen a keepable one for a long enough interval, but I think that was an unnecessary complication, and I’ll probably remove it from an optimized implementation.

One cool thing about this stepwise segmenter is that it does not need to have any change in the mean values of the segments—a change in the variance suffices.  I made fake data which consisted of 1-second long (10,000 samples) of Gaussian distributions, with each successive second having a standard deviation that was 90% of the previous second.  With a little tweaking to get the right range of thresholds for keeping breakpoints, the algorithm had no trouble segmenting the data:

No change in means is needed for segmenting data.  A 10% change in standard deviation is enough (if the segments are long enough).

No change in means is needed for segmenting data. A 10% change in standard deviation is enough (if the segments are long enough).  The boundaries are hard to find by eye, but the algorithm has no trouble nailing them to within a few hundred samples (tens of milliseconds).

Of course, finding real segments is nice, but what if the standard deviation changes smoothly? Do we still end up segmenting the data?

Yes, even if the standard deviation changes smoothly, the algorithm will still break the trace into segments.  The exact placement of the breakpoints is a bit random, but the average length of the segments depends on the threshold that is chosen.

Yes, even if the standard deviation changes smoothly, the algorithm will still break the trace into segments. The exact placement of the breakpoints is a bit random, but the average length of the segments depends on the threshold that is chosen.  So we can’t claim that because the algorithm has chosen a breakpoint that there is really an abrupt change in characteristics there—there may just have been enough accumulated change to say that what is after the break is not the same as what was before the break.

My algorithm is not quite an online algorithm, since it re-examines data after setting a breakpoint, but it never backtracks further than the window width, so there is limited delay in the decision making. I’ve found it useful to use fairly wide windows (about 30,000 data points or 3 seconds at 10kHz), even when the steps in the data are pretty short (3–100 msec).

Having to choose a threshold for when to insert a breakpoint is a bit of a nuisance, and much of Detection of Abrupt Changes: Theory and Application seems to be dedicated to various schemes for setting thresholds in robust ways that depend on the lengths of the steps, but there is always a parameter that has to be set by the user to indicate what level of detail is the interesting signal, so I’m not sure that there is a lot to gain for our purposes in the fancier schemes.

The magic that simplifies the log-likelihood function does not just apply to stepwise models.  Any model that can be represented as D(t) + \epsilon(0,\sigma), a deterministic function of t plus an independent Gaussian noise term with 0 mean, will have the same property.  I did another implementation of the segmenter that does linear regression on each segment, producing segments that have slanting line segments (not piecewise linear, since there is no continuity at segment boundaries).  This may turn out to be useful for current traces for some molecules, which do not seem to have a stepwise pattern, but ramping up or ramping down of the current. Here is an example of a fake trace made with sloping segments, reconstructed by the stepwise algorithm and the slanted-steps one:

Fake current trace with slanted segments requires a lot of steps to model the data as a stepwise signal.

Fake current trace with slanted segments requires a lot of steps to model the data as a stepwise signal.

Starting from the same deterministic underlying model (but different random noise), modeling the signal with stepwise slanted line segments does not need nearly as many breakpoints.

Starting from the same deterministic underlying model (but different random noise), modeling the signal with stepwise slanted line segments does not need nearly as many breakpoints.  All the reconstructed breakpoints here are within 3 samples (0.3msec) of the breakpoints in the original function, mainly because the segments were all separated by big steps.

The slanted-steps algorithm is slightly slower to compute than the stepwise algorithm, but even with my 4-year-old laptop computer I can do the computations nearly in real time for 10kHz signals in unoptimized Python.  Rewriting in C should run at least 10 times faster than the current Python implementation, and may be able to go 100 times faster with careful coding.  Of course, we have to choose whether the stepwise or slanted steps function are more useful representations of the underlying data.  If we really have a stepwise signal, the slanted steps could merge adjacent steps into a slope and obscure the signal.

« Previous Page

%d bloggers like this: