This is the fourth in a series—the three previous are
- Segmenting noisy signals from nanopores
- More on segmenting noisy signals.
- Segmenting filtered signals
A student (well, junior specialist this year, but student again in the fall) and I were trying to write up a journal paper describing the method and implementation, and I realized that the presentation in this blog was not clear enough for him to rewrite the math, and that I had left in some bits of fluff that were confusing. In this blog post I’ll try to merge the three previous posts into a single, coherent presentation using a Bayesian approach.
The segmenter uses a divide-and-conquer approach:
segment( time interval=[s,t) ) if the time t-s is too short, return no breakpoints. Select the best breakpoint, i, to divide into two intervals [s,i) [i,t) If the breakpoint is poor, reject it and return no breakpoints otherwise return segment(s,i) + [i] + segment(i,t)
There are two essential parts to this algorithm: selecting the breakpoint and deciding whether to accept it. Both rely on a Bayesian scoring scheme. For the score, we’ll need two likelihood models, that is computable functions that compute the probability of seeing the observed data if it is generated by the corresponding model. To score a potential breakpoint, i, we compare a model with the breakpoint (so that [s,i) and [i,t) are modeled independently) to one where the entire interval is modeled as a unit.
The simplest model, and one that has worked well for DNA data from the nanopore, is that each segment consists of a constant current with independent Gaussian noise added to each sample. The likelihood for the data is just , where µ and σ are the mean and standard deviation of the Gaussian generating the samples in [s,t). I’ll abbreviate this product as .
The log of the likelihood ratio for a breakpoint at position is ,
where µ0 and σ0 are the mean and standard deviation for the whole window from s to t, µ1 and σ1 are the mean and standard deviation for s to i, and µ2 and σ3 are the mean and standard deviation for i to t. Because we are using maximum likelihood estimates for all the µ and σ values, we are guaranteed that the log-likelihood Li is non-negative.
The log-probability for a single sample with a Gaussian distribution is
When we chose the optimal μ and σ for a segment, we did it so that σ2 is the variance, that is the average value of . So when we sum the log probabilities over a segment of length n, using μ and σ optimized for that segment, we get .
That means that the log likelihood ratio we want simplifies to . This simplification works for any modeling that results in independent Gaussian residuals, so we can do stepwise sloping segments by doing linear regression almost as easily as doing stepwise constant segments.
Choosing the best breakpoint is now easy—we just pick the value for i that maximizes Li.
Deciding whether to keep or reject the breakpoint requires looking not just at the likelihood ratio, but at the posterior-odds ratio, which is the likelihood ratio times the prior odds ratio. The prior odds ratio is a parameter we can set independent of the data. For example, if we have an estimate of the expected number of segments per second (sps) for a particular experimental setup and our sampling frequency is fs, then the prior probability of a step at each sample is , the prior odds ratio is , and the score we want to use (log posterior odds ratio) is .
If we are willing to accept no more than FP false positives per second, then we want the probability of a score arising by chance at a sample to be less than
Although I have no proof for it, if the samples are all drawn independently from a Gaussian distribution, empirically the log-likelihood scores seem to be perfectly distributed on an exponential distribution . Note that this distribution does not depend on the length of the window, μ, or σ, though I did avoid looking at the first and last few positions in the window, in case there are edge effects.
However, if we low-pass filter the Gaussian noise, then the samples are no longer independent, and the log-likelihood scores go up. The distribution with a cutoff frequency of k times the Nyquist frequency is empirically . (The Nyquist frequency is half the sampling frequency and is the highest frequency that can be represented by the samples.)
If these empirically observed distributions are correct, then we have
This is the same result as in More on segmenting noisy signals and Segmenting filtered signals, but there I made the mistake of changing parameterization from FP (false positives per second) to (false positives per expected segment), which makes the parameterization more confusing, and I was not clear about where the scaling went for the filtering.
On the example I’ve been checking, I get good results with sps=10 segments/sec and FP=1e-4 false positives/sec, but neither parameter is very critical—changing them by a factor of 1000 does not change the results much. Dropping FP to 1e-200 does cause segments to be obviously missed.