Today’s post is not about what or how I teach, but about some of the research I’ve been doing lately. I’ve been collaborating with another faculty member and his students on analyzing nanopore data.
The basic idea of a nanopore is simple: you have a thin membrane separating two salt-water baths, and you poke a very tiny hole in it. You put a voltage difference between the two baths and measure the current through the hole. If you can arrange for large molecules to pass through the hole, they block some of the current flow, so you can track the passage of the molecules through the hole by monitoring the current.
In my collaborator’s lab, they use a lipid bilayer (similar to a cell membrane) as the separating membrane, and a porin protein to punch a hole in the membrane. (They used to use α-hemolysin as the nanopore, but are now trying MspA, which has some better properties.) Mostly they pass DNA through the pore, since funding is most available for DNA sequencing technologies, but they’ve also explored other large biomolecules.
One problem with nanopores, as with almost all single-molecule techniques, is that the signals are really small, nearly buried in noise (see my post Noise in nanopores). We are often looking for changes of less than a pA (that’s 10⁻¹² amps) with noise that may be 2 pA RMS or more.
When things are working the way we want, the large molecule moves stepwise through the pore. For example, with DNA in the pore, we want to attach some sort of DNA motor to pull the DNA through. But we don’t want a smooth, continuous motion—we want a motor that advances the DNA by one nucleotide rapidly, then stays there for a long time so we can have a constant current to identify what is in the pore, then advances again by one and stays there, and so forth. Luckily for us, many DNA motors do indeed demonstrate this sort of stepwise motion.
One of the subproblems I’ve been helping students with for the past several months is trying to recover a clean stepwise signal from the noisy data. The problem is to summarize the continuous-time current trace by a small number of “segments”, each of which can be summarized by the mean, standard deviation, and duration of the segment.
There were a couple of really bright students working on it, and they had reasonable success, but the methods always seemed a bit fragile—a new set of data taken under different conditions required a lot of tweaking to get the method to segment the data adequately, and sometimes we were unable to find parameters that would work for all parts of the trace—particularly when some parts had higher noise levels than others, or much shorter durations for the fragments.
Since the data is not mine to share, I made up some fake data that looks sort of like a nanopore trace (but cleaner—no 60Hz noise, no baseline drift, just clean steps between Gaussian distributions).
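Something along these lines produces a comparable fake trace (a minimal sketch, assuming numpy; the levels, durations, and noise level here are made up, not the values I actually used):

```python
import numpy as np

rng = np.random.default_rng(42)

def fake_trace(levels, durations, sigma=2.0, rate=10000):
    """Concatenate one Gaussian-noise segment per (mean level, duration in seconds)."""
    return np.concatenate([rng.normal(mean, sigma, int(dur * rate))
                           for mean, dur in zip(levels, durations)])

trace = fake_trace(levels=[40.0, 25.0, 32.0, 18.0, 28.0],
                   durations=[0.3, 0.05, 0.2, 0.1, 0.4])
```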
The method we were using was a fairly straightforward one: use a low-pass filter to remove noise, then look for large changes in the filtered signal to determine that a breakpoint had occurred, and look for a peak in the derivative of the filtered signal to place the breakpoint precisely. This is a fairly standard technique, and it works well on some signals. But it wasn’t working well for us on some of our data, because some of the time the signal changes very slowly, and sometimes it changes very fast. Tweaking the filter and threshold parameters helped, but when we got adequate segmentation on one part of the signal we either over-segmented some parts or missed segments (or both). We were coming up with more and more baroque schemes to post-process the segments, and still not getting something that was clean enough to pass on to the machine-learning algorithms that were supposed to do the actual analysis we needed.
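For comparison, the filtered-derivative scheme we had been using amounts to something like this (a rough sketch, assuming scipy; the cutoff fc and threshold thresh are exactly the parameters that kept needing retuning):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def filtered_derivative_breakpoints(x, rate=10000, fc=200.0, thresh=1.0):
    """Low-pass filter the signal, then report local peaks of the
    derivative magnitude that exceed a threshold as breakpoints."""
    b, a = butter(4, fc / (rate / 2))   # 4th-order Butterworth low-pass
    smooth = filtfilt(b, a, x)          # zero-phase filtering (no time shift)
    deriv = np.abs(np.diff(smooth))
    return [i for i in range(1, len(deriv) - 1)
            if deriv[i] > thresh
            and deriv[i] >= deriv[i - 1] and deriv[i] >= deriv[i + 1]]
```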
Finally this week I did what I should have done months earlier—I spent some time trying to figure out what had been published in the past. Segmentation is not a new problem, and it has gone by many names—figuring out what names to look for it under was a big part of the problem. Eventually I hit on “change-point detection” and found a free PDF textbook: Detection of Abrupt Changes: Theory and Application by Michèle Basseville and Igor V. Nikiforov. I had only read the first two chapters of the book when I realized that everything we had done so far was a variant of the “filtered derivative” technique, and that there was a large class of techniques that we hadn’t tried. The other techniques are based on maximum-likelihood estimates of where the breakpoints are, which is a comfortable framework for me to work in. The authors were talking mainly about “online” methods (which detect the change after only a small delay) and for the first chapters at least were talking only about stepwise changes in mean with a constant standard deviation.
I grabbed onto one idea fairly quickly—trying to determine whether there was a step in a given time interval and, if so, where it occurred. (This is Section 2.6 of their book: Off-line Change Detection.) The idea is to have a parameterized probabilistic model for a segment, Pθ(s:t), where θ is some parameter setting for the model optimized for the half-open time interval [s,t). We compare the probability of the segment modeled as two subsegments with the segment modeled as a single unit:

Li = ln( Pθ1(s:i) Pθ2(i:t) / Pθ0(s:t) ),

where i is the breakpoint, θ1 and θ2 are the parameters for the two subsegments, and θ0 is the parameter setting for the whole segment.
For a stepwise model, the probabilistic model can be just an independent, identically distributed model for the samples in the segment, with each sample chosen from a Gaussian distribution. The parameters for each model are then just the mean μ and standard deviation σ for the Gaussian, and we can get the maximum-likelihood estimates easily from the samples in the segment. (In the implementation, I actually keep cumulative sums of x and x² in order to compute the mean and variance quickly for any interval.) Note that with this choice of model, the log-likelihood ratio Li is always non-negative if we choose maximum-likelihood estimates for each segment, since we could get 0 by choosing the (possibly non-optimal) parameters θ1=θ2=θ0. That means that we can always model the data more closely if we add more breakpoints.
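The cumulative-sum trick looks roughly like this (a sketch, not the actual implementation; SegmentStats is a name I made up for this post):

```python
import numpy as np

class SegmentStats:
    """Cumulative sums of x and x² so that the mean and the (biased,
    maximum-likelihood) variance of any interval [s,t) cost O(1)."""
    def __init__(self, x):
        x = np.asarray(x, dtype=float)
        self.cum1 = np.concatenate(([0.0], np.cumsum(x)))
        self.cum2 = np.concatenate(([0.0], np.cumsum(x * x)))

    def mean_var(self, s, t):
        n = t - s
        mean = (self.cum1[t] - self.cum1[s]) / n
        var = (self.cum2[t] - self.cum2[s]) / n - mean * mean
        return mean, var
```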
The probability of a segment in this model is just the product of the probabilities of the samples in the segment, and the log probability of the segment is the sum of the log probabilities of the samples. So what is the log probability of a single sample from a Gaussian distribution? It is

ln Pμ,σ(x) = −ln(σ√(2π)) − (x−μ)²/(2σ²).
Now comes the magic.
When we chose the optimal μ and σ for a segment, we did it so that σ² is the variance, that is, the average value of (x−μ)². So when we sum the log probabilities over a segment of length n, using μ and σ optimized for that segment, we get −n ln(σ√(2π)) − n/2.
That means that the log probability ratio we want simplifies to

Li = n0 ln σ0 − n1 ln σ1 − n2 ln σ2,

where n1=i−s and n2=t−i are the lengths of the two subsegments, n0=n1+n2 is the length of the whole segment, and σ0, σ1, and σ2 are the corresponding maximum-likelihood standard deviations. The √(2π) terms and the n/2 terms all cancel.
To find the best breakpoint in an interval, we check each possible breakpoint, and pick the one that maximizes Li. We then decide whether that Li is big enough to keep the breakpoint or not using a user-settable threshold. If the threshold is set very low, then the user is saying that there is not much noise and we should follow the signal closely. If the threshold is set high, then the user is saying that there is a lot of noise and we need to be very confident of any breakpoint we claim.
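In code, the single-breakpoint search is a straightforward loop over the candidates (a sketch built on the hypothetical SegmentStats helper above; eps guards against taking the log of a zero variance, and the lo/hi arguments anticipate the minimum-duration restriction described below):

```python
import numpy as np

def best_breakpoint(stats, s, t, lo=None, hi=None, eps=1e-12):
    """Score each candidate breakpoint i in [lo,hi) for splitting [s,t),
    using Li = n0 ln σ0 − n1 ln σ1 − n2 ln σ2 (note n ln σ = (n/2) ln σ²).
    Return (best Li, best i), or (-inf, None) if there are no candidates."""
    lo = s + 1 if lo is None else lo
    hi = t if hi is None else hi
    _, var0 = stats.mean_var(s, t)
    base = 0.5 * (t - s) * np.log(var0 + eps)       # n0 ln σ0
    best_L, best_i = -np.inf, None
    for i in range(lo, hi):
        _, var1 = stats.mean_var(s, i)
        _, var2 = stats.mean_var(i, t)
        L = base - 0.5 * ((i - s) * np.log(var1 + eps)
                          + (t - i) * np.log(var2 + eps))
        if L > best_L:
            best_L, best_i = L, i
    return best_L, best_i
```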
With this simple method for finding a single breakpoint, I wrote an algorithm for recovering a series of steps, with three parameters: the threshold for keeping a breakpoint, a minimum segment duration m, and a window width w.
Here’s how it works for segment(s:t) (a code sketch follows the list):
- I look at a window starting at the beginning of the data (s:s+w), and see if there is a keepable breakpoint in that window (only examining possible breakpoints in (s+m:s+w−m) to avoid creating a too-short segment).
- If there is a keepable breakpoint at i, I recursively look at the data on either side, returning segment(s:i), i, segment(i:t).
- If there is no keepable breakpoint I advance the window halfway and try again. I do this iteratively, but it is equivalent to returning segment(s+w/2:t).
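As a sketch (recursive for clarity, where my actual code slides the window iteratively; the parameter defaults are placeholders, not tuned values):

```python
def segment(stats, s, t, w=30000, m=20, threshold=50.0):
    """Return a sorted list of breakpoints in [s,t)."""
    while t - s >= 2 * m:
        win_end = min(s + w, t)
        L, i = best_breakpoint(stats, s, win_end,
                               lo=s + m, hi=win_end - m)
        if i is not None and L > threshold:
            # keepable breakpoint: recurse on the data on either side
            return (segment(stats, s, i, w, m, threshold) + [i]
                    + segment(stats, i, t, w, m, threshold))
        if win_end == t:          # window already reaches the end: give up
            break
        s += w // 2               # no keepable breakpoint: slide the window halfway
    return []
```

Feeding in the fake trace from earlier would look like breaks = segment(SegmentStats(trace), 0, len(trace)).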
In the current implementation, I have another parameter for the maximum segment duration, and I force in breakpoints if I haven’t seen a keepable one for a long enough interval, but I think that was an unnecessary complication, and I’ll probably remove it from an optimized implementation.
One cool thing about this stepwise segmenter is that it does not need any change in the mean values of the segments—a change in the variance suffices. I made fake data consisting of 1-second-long (10,000-sample) stretches of Gaussian noise, with each successive second having a standard deviation that was 90% of the previous second’s. With a little tweaking to get the right range of thresholds for keeping breakpoints, the algorithm had no trouble segmenting the data.
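That fake data is trivial to make (again a sketch with numpy; the starting σ and the number of seconds are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
rate = 10000                                     # samples per second
sigmas = [10.0 * 0.9 ** k for k in range(20)]    # each σ is 90% of the last
trace = np.concatenate([rng.normal(0.0, s, rate) for s in sigmas])
```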
Of course, finding real segments is nice, but what if the standard deviation changes smoothly? Do we still end up segmenting the data?
My algorithm is not quite an online algorithm, since it re-examines data after setting a breakpoint, but it never backtracks further than the window width, so there is limited delay in the decision making. I’ve found it useful to use fairly wide windows (about 30,000 data points or 3 seconds at 10kHz), even when the steps in the data are pretty short (3–100 msec).
Having to choose a threshold for when to insert a breakpoint is a bit of a nuisance, and much of Detection of Abrupt Changes: Theory and Application seems to be dedicated to various schemes for setting thresholds in robust ways that depend on the lengths of the steps, but there is always a parameter that has to be set by the user to indicate what level of detail counts as interesting signal, so I’m not sure there is much to gain for our purposes from the fancier schemes.
The magic that simplifies the log-likelihood function does not just apply to stepwise models. Any model that can be represented as x(t) = f(t) + ε(t), a deterministic function of t plus an independent Gaussian noise term with 0 mean, will have the same property. I did another implementation of the segmenter that does linear regression on each segment, producing segments that have slanting line segments (not piecewise linear, since there is no continuity at segment boundaries). This may turn out to be useful for current traces for some molecules, which do not seem to have a stepwise pattern, but a ramping up or down of the current. Here is an example of a fake trace made with sloping segments, reconstructed by the stepwise algorithm and the slanted-steps one.
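The only real change for the slanted-steps version is how σ is estimated for an interval: instead of the variance about the segment mean, we use the variance of the residuals about a least-squares line fit to the segment (a sketch; a drop-in replacement for the variances in the hypothetical best_breakpoint above):

```python
import numpy as np

def residual_var(x, s, t):
    """Maximum-likelihood variance of the residuals about a
    least-squares line fit over the interval [s,t)."""
    idx = np.arange(s, t)
    slope, intercept = np.polyfit(idx, x[s:t], 1)
    resid = x[s:t] - (slope * idx + intercept)
    return np.mean(resid * resid)
```

Done naively this loses the O(1)-per-interval trick, but keeping a few extra cumulative sums (of t, t², and t·x) lets the regression coefficients and residual variance still be computed in constant time per interval.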
The slanted-steps algorithm is slightly slower to compute than the stepwise algorithm, but even on my 4-year-old laptop I can do the computations nearly in real time for 10kHz signals in unoptimized Python. Rewriting in C should run at least 10 times faster than the current Python implementation, and careful coding might gain a factor of 100. Of course, we have to choose whether the stepwise or slanted-steps functions are the more useful representation of the underlying data. If we really have a stepwise signal, the slanted steps could merge adjacent steps into a slope and obscure the signal.