# Gas station without pumps

## 2015 March 27

### Followup on plagiarism

Filed under: Uncategorized — gasstationwithoutpumps @ 08:26

In Plagiarism detected, I mentioned that an article in Nature Biotechnology plagiarizes from my blog, specifically Supplementary Material page 6 from Segmenting noisy signals from nanopores. I got email from the last author this week, explaining the situation:

We saw your recent blog post about our paper and feel that we owe you an explanation.

I have contacted NBT to see if a post-publication citation to your blog can be made and I will keep you posted on this.

We noted your recent bioRxiv manuscript and will refer to it in future publications using logP-test level-finders.

So one of the two corrections I was seeking has been met (an apology from the authors), and the other (a citation to the blog) is being sought by the authors. It seems that Nature has a very poor policy about citations, discouraging correct attribution.  Yet another reason to consider them a less desirable family of journals (their rip-off pricing for libraries and their preference for sensational articles over careful research are others).

On a related front, referees for our journal submission of the segmenter paper pointed out that several of the ideas are not new (hardly surprising), and that the basic algorithm has been around for quite a while.  They pointed us to a paper by Killick, Fearnhead, and Eckley (http://arxiv.org/pdf/1101.1438.pdf), which supposedly has an exact algorithm that is as efficient as binary segmentation (which only approximates the best breakpoints). I thank the referees for the pointer—that is the sort of thing peer review is supposed to be good for: pointing out to authors where they have missed relevant prior literature.

I’ve only glanced through the paper (I had 16 senior theses to grade in 4 days, plus trying to get a new draft of my book for my applied electronics course done in time for classes starting next Monday), so I can’t say anything about the algorithm they present, but they do give a citation for the binary algorithm that dates back to 1974:

Scott, A. J. and Knott, M. (1974). A cluster analysis method for grouping means in the analysis of variance. Biometrics, 30(3):507–512.

The online version of the journal only goes back to 1999, so I’ve not confirmed that the paper does contain the same algorithm, but it would not surprise me if it did—the binary split method is fairly obvious once the basics of splitting on log-likelihood are understood.  I had looked for papers on the technique and not found them (which surprised me), but I didn’t look as hard as I should have. I did not find the right entry points to the literature—it is scattered over many different disciplines and I relied too much on the one textbook that I did find to give me pointers. And I didn’t read all the textbook, so I may have missed the appropriate pointers—though they do not cite Scott and Knott, so maybe the textbook authors missed an important chunk of the literature, too.

Now that the Killick et al. paper has given me some useful pointers, I have a lot of reading to do.  I don’t know if I’ll have time before the summer, though—my teaching load starting next week is pretty heavy (I was just noticing that my calendar had 24.5 hours scheduled for the first week, not counting time for prepping for classes, setting up the lab, grading, or revising the book for the electronics class: 7 hours of lecture, 12 hours of lab class, 2 office hours, 1.5 hours meeting with the department manager, 2 hours faculty meeting—and the dean wants to meet with me for half an hour sometime also).

Given that the main idea in our segmenter paper is an old one, for it to be salvageable, we’ll have to shrink the basic algorithm to a brief tutorial (with citations to prior inventors) and concentrate on the little changes made after the basic idea: the parameterization of the threshold setting and the correction for low-pass filtering.  There may be a little bit for applying the idea to stepwise slanting segments using linear regression, but I bet that idea is also an old one, buried somewhere in the literature.

This summer I may want to look at implementing the ideas of the Killick et al. paper (or other similar approaches), to see if they really do produce better segmentation as quickly.

## 2015 March 14

### Plagiarism detected

Filed under: Uncategorized — gasstationwithoutpumps @ 20:33

It has recently come to my attention that an article in Nature Biotechnology: doi:10.1038/nbt.2950 “Decoding long nanopore sequencing reads of natural DNA” plagiarizes from my blog, specifically Supplementary Material page 6 from Segmenting noisy signals from nanopores.  Now, I don’t mind their using my work—I would not have published it in such a public form as posting to my blog if I were trying to keep it secret—but standard scholarly practice requires that sources be cited.  Claiming someone else’s work as one’s own is the academic sin.

I don’t know which of the 13 authors of the Nature Biotechnology paper is the plagiarist, but I hold the head of the lab (Jens Gundlach) responsible for the plagiarism, since it seems clear that he did not bother to check that his students and co-workers were citing others’ work appropriately.  It is the job of the head of a lab to create a culture of proper citation—failure to do so is an indication of not doing one’s job as a scholar or as a professor.

I’m undecided about what to do about this plagiarism.  The obvious thing to do would be to complain to the editors, but I have no idea whether that will do any good.  The last time I had a serious plagiarism case like this was when I was in logic minimization, and parts of a paper of mine that had been rejected from the main (almost sole) journal in the field later appeared in a conference article with the editor who had rejected the paper as one of the co-authors.  In that case, complaints to the journal were useless (they just sent the complaints to the editor who had plagiarized from me—thereby ensuring that I would never get any papers published in the field).  I ended up leaving the field in disgust (as several other researchers had done—the field has been pretty stagnant since all new ideas were blocked by the powerful editor) and moving into bioinformatics instead, where rivalries were decided more on the quality of one’s solutions than on publication blocking and theft.

This case is different, though, because the plagiarist is not the editor of the journal, and so the editors may have some leverage to apply to the authors, in order to maintain the credibility of the journal.

The fix I’m looking for is a pretty simple one: an apology from Jens Gundlach for not catching and correcting the plagiarism, and adding a citation to my blog to the published article. If they can’t bring themselves to cite me, they could at least cite another source (like Detection of Abrupt Changes: Theory and Application by Michèle Basseville and Igor V. Nikiforov, whom I cited as my inspiration, though Basseville and Nikiforov don’t describe the recursive algorithm I developed).

To complicate matters slightly, I’ve recently submitted a paper to PeerJ based on the same body of work (though including the improved parameterization developed in some of my later blog posts, and including some empirical evidence that the new algorithm works substantially better than filtered-derivative algorithms).  I would not want someone finding Gundlach’s group’s paper and think I had plagiarized from them, rather than them from me.

I ask my readers—how diligently should I pursue this plagiarism case?  Has anyone had any experience with Nature Biotechnology on such matters? Do they care about plagiarism? Or do they make life hell for anyone who brings up the subject?

Update 27 March 2015: I’ve heard from the lead author—see Followup on plagiarism.

## 2014 August 19

### Two papers submitted

I got two papers with a co-author finished and submitted today.

One of them is based on the segmenter algorithms that I wrote up in this blog (with some improved parameterization): Segmenting noisy signals from nanopores, Segmenting filtered signals, and Segmenting noisy signals revisited.

The other one is an HMM paper that uses the automatic segmentation and an HMM that models the DNA motor behavior to do the same analysis that was done by hand in an earlier paper “Error Rates for Nanopore Discrimination Among Cytosine, Methylcytosine, and Hydroxymethylcytosine Along Individual DNA Strands” in PNAS.  The automatic HMM method was more accurate than the hand curation and feature extraction followed by sophisticated machine learning methods like support vector machines and random forests—I think that we were limited before by the biases of the hand curation and feature extraction.

Unfortunately, we were not able to do side-by-side comparison on exactly the same data, as the original data for the PNAS paper was lost in a 2-drive failure on a RAID that had not been properly backed up.

The paper writing did not take much of my time this summer, as my co-author did a lot of the writing and kept me from procrastinating too much.

I’ve also been working on my book, tentatively titled Applied Electronics for Bioengineers, but without a co-author to keep me honest, I’ve been doing a lot of procrastinating.  Yesterday I had planned to show the waveform of the gate signal on an nFET being used for pulse-width modulation, to show the “flat spot” that results from feedback through the gate-drain capacitance.  I fussed with that for quite a while, but never got a recording I liked the looks of—so ended up getting nothing written on the book.  I’ll probably try again tomorrow, but with a backup plan of working on a different section if that one gets too frustrating, or of writing some of the text that will go around the figure, once I get a figure I can use.

The book is currently 215 pages, but that includes the front matter and back matter.  The core of the book is only about 188 pages, with 74 figures.  I probably need a bunch more figures (schematics, graphs, photos, …), but those are what take the longest to produce.  There’s a little over a month left until classes start, but I don’t see myself finishing by then.

## 2014 June 17

### Segmenting noisy signals revisited

This is the fourth post in a series—the three previous posts are Segmenting noisy signals from nanopores, More on segmenting noisy signals, and Segmenting filtered signals.

A student (well, junior specialist this year, but student again in the fall) and I were trying to write up a journal paper describing the method and implementation, and I realized that the presentation in this blog was not clear enough for him to rewrite the math, and that I had left in some bits of fluff that were confusing. In this blog post I’ll try to merge the three previous posts into a single, coherent presentation using a Bayesian approach.

The segmenter uses a divide-and-conquer approach:

```
segment(time interval [s, t)):
    if the interval t - s is too short, return no breakpoints
    select the best breakpoint i, dividing [s, t) into [s, i) and [i, t)
    if the breakpoint i is poor, reject it and return no breakpoints
    otherwise return segment(s, i) + [i] + segment(i, t)
```


There are two essential parts to this algorithm: selecting the breakpoint and deciding whether to accept it.  Both rely on a Bayesian scoring scheme.  For the score, we’ll need two likelihood models, that is, computable functions that give the probability of seeing the observed data if it is generated by the corresponding model.  To score a potential breakpoint, i, we compare a model with the breakpoint (so that [s,i) and [i,t) are modeled independently) to one where the entire interval is modeled as a unit.

The simplest model, and one that has worked well for DNA data from the nanopore, is that each segment consists of a constant current with independent Gaussian noise added to each sample.  The likelihood for the data $x_{s}, \ldots , x_{t-1}$ is just $\prod_{s\leq j < t} P_{\mu,\sigma}(x_{j})$, where µ and σ are the mean and standard deviation of the Gaussian generating the samples in [s,t). I’ll abbreviate this product as $P_{\mu,\sigma}(s:t)$.

The log of the likelihood ratio for a breakpoint at position i is $L_{i}=\ln \frac{P_{\mu_{1},\sigma_{1}}(s:i) P_{\mu_{2},\sigma_{2}}(i:t)}{P_{\mu_{0},\sigma_{0}}(s:t)} = \ln P_{\mu_{1},\sigma_{1}}(s:i) +\ln P_{\mu_{2},\sigma_{2}}(i:t)- \ln P_{\mu_{0},\sigma_{0}}(s:t)$,

where µ0 and σ0 are the mean and standard deviation for the whole window from s to t, µ1 and σ1 are the mean and standard deviation for s to i, and µ2 and σ2 are the mean and standard deviation for i to t.  Because we are using maximum-likelihood estimates for all the µ and σ values, we are guaranteed that the log-likelihood ratio Li is non-negative.

The log-probability for a single sample with a Gaussian distribution is
$\ln P_{\mu,\sigma}(x) = \ln \left( \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \right) = -0.5 \ln(2\pi) -\ln(\sigma) - \frac{(x-\mu)^2}{2\sigma^2}$
When we chose the optimal μ and σ for a segment, we did it so that σ² is the variance, that is, the average value of $(x-\mu)^2$. So when we sum the log probabilities over a segment of length n, using the μ and σ optimized for that segment, we get $n( -0.5 \ln(2\pi) -\ln(\sigma) - 0.5)$.
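As a sanity check on this simplification, here is a small numeric sketch (my own, not from the original posts) confirming that the summed log-likelihood at the maximum-likelihood parameters collapses to the closed form above:

```python
import math
import random

random.seed(42)
x = [random.gauss(2.0, 0.5) for _ in range(1000)]
n = len(x)
mu = sum(x) / n
# maximum-likelihood (biased) standard deviation
sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / n)

# direct sum of per-sample Gaussian log-probabilities
log_lik = sum(-0.5 * math.log(2 * math.pi) - math.log(sigma)
              - (v - mu) ** 2 / (2 * sigma ** 2) for v in x)
# closed form: n * (-0.5*ln(2*pi) - ln(sigma) - 0.5)
closed_form = n * (-0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5)
print(abs(log_lik - closed_form) < 1e-6)  # True
```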

That means that the log likelihood ratio we want simplifies to $L_{i} = (s-i)\ln \sigma_{1} + (i-t)\ln \sigma_{2} - (s-t)\ln \sigma_{0}$. This simplification works for any modeling that results in independent Gaussian residuals, so we can do stepwise sloping segments by doing linear regression almost as easily as doing stepwise constant segments.

Choosing the best breakpoint is now easy—we just pick the value for i that maximizes Li.
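For concreteness, here is a minimal Python sketch of the whole divide-and-conquer segmenter (my own reconstruction, not the actual implementation: the function names, the fixed min_len, and the plain log-likelihood threshold are all my choices, and the posterior-odds correction discussed below is omitted):

```python
import math
import random

def _log_sigma(sumsq, total, n):
    # ln of the maximum-likelihood standard deviation, from running sums
    var = sumsq / n - (total / n) ** 2
    return 0.5 * math.log(max(var, 1e-300))  # guard against zero variance

def segment(x, s, t, min_len=5, threshold=10.0, breakpoints=None):
    """Return sorted breakpoints in [s, t) for a stepwise-constant model."""
    if breakpoints is None:
        breakpoints = []
    if t - s < 2 * min_len:
        return breakpoints
    # prefix sums let each candidate breakpoint be scored in O(1)
    pre, presq = [0.0], [0.0]
    for v in x[s:t]:
        pre.append(pre[-1] + v)
        presq.append(presq[-1] + v * v)
    n = t - s
    ls0 = _log_sigma(presq[n], pre[n], n)
    best_i, best_L = None, -1.0
    for i in range(min_len, n - min_len):
        ls1 = _log_sigma(presq[i], pre[i], i)
        ls2 = _log_sigma(presq[n] - presq[i], pre[n] - pre[i], n - i)
        # L_i = (s-i)ln(sigma1) + (i-t)ln(sigma2) - (s-t)ln(sigma0),
        # written here with i relative to the start of the window
        L = -i * ls1 - (n - i) * ls2 + n * ls0
        if L > best_L:
            best_i, best_L = s + i, L
    if best_L < threshold:  # reject a poor breakpoint
        return breakpoints
    segment(x, s, best_i, min_len, threshold, breakpoints)
    breakpoints.append(best_i)
    segment(x, best_i, t, min_len, threshold, breakpoints)
    return breakpoints

# usage: a single step from 0 to 5 at sample 100 in light noise
random.seed(0)
x = ([random.gauss(0.0, 0.1) for _ in range(100)]
     + [random.gauss(5.0, 0.1) for _ in range(100)])
print(segment(x, 0, len(x)))  # the list should contain 100
```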

Deciding whether to keep or reject the breakpoint requires looking not just at the likelihood ratio, but at the posterior-odds ratio, which is the likelihood ratio times the prior odds ratio. The prior odds ratio is a parameter we can set independent of the data. For example, if we have an estimate of the expected number of segments per second (sps) for a particular experimental setup and our sampling frequency is fs, then the prior probability of a step at each sample is $\mbox{\it sps}/f_{s}$, the prior odds ratio is $\frac{ \mbox{\it sps} }{f_{s}-\mbox{\it sps}}$, and the score we want to use (log posterior odds ratio) is $L_{i} + \ln \mbox{\it sps} - \ln (f_{s}-\mbox{\it sps})$.

If we are willing to accept no more than FP false positives per second, then we want the probability of a score arising by chance at a sample to be less than $\mbox{FP}/ f_{s}$.

Although I have no proof for it, if the samples are all drawn independently from a Gaussian distribution, the log-likelihood scores empirically seem to follow an exponential distribution $P(L_{i} > x) = e^{-x}$ almost perfectly. Note that this distribution does not depend on the length of the window, μ, or σ, though I did avoid looking at the first and last few positions in the window, in case there are edge effects.

However, if we low-pass filter the Gaussian noise, then the samples are no longer independent, and the log-likelihood scores go up. The distribution with a cutoff frequency of k times the Nyquist frequency is empirically $P(L_{i} > x) = e^{-k x}$. (The Nyquist frequency is half the sampling frequency and is the highest frequency that can be represented by the samples.)

If these empirically observed distributions are correct, then we have
$\frac{\mbox{FP}}{f_{s}} > P(\mbox{score} > x) = P\left(L_{i} > x -\ln \mbox{\it sps} + \ln (f_{s}-\mbox{\it sps})\right) = e^{-k x} \left(\frac{\mbox{\it sps} }{f_{s}-\mbox{\it sps}}\right)^{k}$
that is
$x > \ln \frac{\mbox{\it sps}}{f_{s}-\mbox{\it sps}} - \frac{1}{k}\ln\left(\frac{\mbox{FP}}{f_{s}}\right)$.

This is the same result as in More on segmenting noisy signals and Segmenting filtered signals, but there I made the mistake of changing parameterization from FP (false positives per second) to $\mbox{\it over} = \frac{\mbox{FP}}{\mbox{\it sps}}$ (false positives per expected segment), which makes the parameterization more confusing, and I was not clear about where the scaling went for the filtering.

On the example I’ve been checking, I get good results with sps=10 segments/sec and FP=1e-4 false positives/sec, but neither parameter is very critical—changing them by a factor of 1000 does not change the results much.  Dropping FP to 1e-200 does cause segments to be obviously missed.
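In code, the acceptance threshold above is just a one-liner (a sketch of my own; fs = 10 kHz is an assumed sampling frequency for illustration, not from the post):

```python
import math

def score_threshold(sps, fs, fp, k=1.0):
    """Minimum log posterior-odds score x to accept a breakpoint.

    sps: expected segments per second, fs: sampling frequency (Hz),
    fp: tolerable false positives per second, k: cutoff/Nyquist ratio.
    """
    return math.log(sps / (fs - sps)) - math.log(fp / fs) / k

# the example settings above, with an assumed fs of 10 kHz
print(score_threshold(sps=10, fs=10000, fp=1e-4))  # about 11.5
```

Note that because FP enters only through its logarithm (divided by k), changing it by a factor of 1000 shifts the threshold by just a few units, which is why the parameter is so uncritical.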

## 2014 May 24

### Segmenting filtered signals

Last August, I posted on Segmenting noisy signals from nanopores, and in January I updated that with improved parameterization in More on segmenting noisy signals.  Last week, Jacob Schreiber (the programmer for the nanopore group) found some problems with the method. He had re-implemented the method in Cython (rather than Python, as I had prototyped it), and got the same results as me when applied to the same examples I had used. But when he applied it to some new data, he got serious oversegmentation, with hundreds of segments chosen where only one should have been.

I tracked down the problem to a difference in how we were applying the method.  I was applying it to data that had been low-pass filtered at 5kHz, recorded at a sampling frequency of 100kHz, then downsampled to 10kHz.  He took the same data without downsampling, then further filtered it with a low-pass filter of 1kHz or 20kHz.  I confirmed that the same data would be correctly segmented with my downsampling and severely oversegmented with his extra filtering.  He conjectured that the problem was with the sampling frequency and that it could be fixed by just changing the desired gain in log probability based on the sampling frequency.

His argument didn’t seem quite right, as the underlying math does not depend on the sampling frequency.  What it does assume is that the samples within a segment are independent Gaussian-distributed values (more precisely, that the differences between the samples and the model being fitted are independent Gaussians).  Low-pass filtering destroys that independence.  Indeed, with a very low cutoff frequency you expect the values to change only slowly, so the segmenter sees it as highly desirable to chop the signal into many short segments, each of which has low variance.

I confirmed that I could get serious oversegmentation even with very low sampling frequency, if the filter cutoff was set much lower.  I got reasonable segmentation results with only minor variations across a wide range of sampling frequencies if the filter cutoff frequency was close to the Nyquist frequency (half the sampling frequency), but oversegmentation if I filtered to 0.1 Nyquist frequency.

There are two ways to address the problem:

• When filtering, always resample at twice the cutoff frequency.
• Adjust the thresholds used according to the cutoff frequency and sampling frequency.
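The first option can be sketched as simple block averaging (my own minimal version; the pipeline I used downsampled by averaging blocks of 10 samples):

```python
def downsample(x, factor):
    """Resample by averaging non-overlapping blocks of `factor` samples,
    so a signal filtered at cutoff fc can be kept near 2*fc sampling."""
    return [sum(x[j:j + factor]) / factor
            for j in range(0, len(x) - factor + 1, factor)]

print(downsample([1.0, 1.0, 3.0, 3.0], 2))  # [1.0, 3.0]
```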

Adjusting the thresholds is a cleaner approach, if it can be made to work.  Today I tried filtering Gaussian noise with a 5th-order Bessel filter (implemented by scipy.signal.lfilter) and looking at the distribution of the log-odds that we threshold:

$L_{i}=\log \frac{P_{\theta_{1}}(s:i) P_{\theta_{2}}(i:t)}{P_{\theta_{0}}(s:t)} = \log P_{\theta_{1}}(s:i) + \log P_{\theta_{2}}(i:t) - \log P_{\theta_{0}}(s:t)$,

where i is the breakpoint, θ1 and θ2 are the parameters for the two segments, and θ0 is the parameter setting for the whole segment.  For stepwise Gaussian models this simplifies to $L_{i} = (s-i)\ln \sigma_{1} + (i-t)\ln \sigma_{2} - (s-t)\ln \sigma_{0}$.   (To avoid the extra computation of taking square roots, the segmenter actually computes twice this, using variances instead of standard deviations.)

I found that the distributions were still very well fit by exponential distributions, but that the scaling factor changed with the cutoff frequency.   Furthermore, the log-odds seemed to scale linearly with the ratio of Nyquist frequency to cutoff frequency.  Reversing that scaling makes the curves superimpose nicely:
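A dependency-free sketch of this kind of experiment (my own reconstruction: I used a 5th-order Bessel filter from scipy.signal, while this sketch substitutes a crude first-order IIR low-pass just to show the direction of the effect) scores the midpoint of many noise windows and compares unfiltered to filtered noise:

```python
import math
import random

def log_odds_at_midpoint(x):
    # L_i for a split at the middle of window x, stepwise-constant model
    def log_sigma(seg):
        m = sum(seg) / len(seg)
        return 0.5 * math.log(sum((v - m) ** 2 for v in seg) / len(seg))
    n, i = len(x), len(x) // 2
    return -i * log_sigma(x[:i]) - (n - i) * log_sigma(x[i:]) + n * log_sigma(x)

def smooth(x, alpha):
    # first-order IIR low-pass: y[j] = alpha*x[j] + (1-alpha)*y[j-1]
    y, prev = [], 0.0
    for v in x:
        prev = alpha * v + (1 - alpha) * prev
        y.append(prev)
    return y

random.seed(1)
def noise(n):
    return [random.gauss(0.0, 1.0) for _ in range(n)]

trials, window = 200, 200
raw = [log_odds_at_midpoint(noise(window)) for _ in range(trials)]
filt = [log_odds_at_midpoint(smooth(noise(window), 0.2)) for _ in range(trials)]
# filtering inflates the scores, so a fixed threshold oversegments
print(sum(raw) / trials, sum(filt) / trials)
```

The filtered scores come out well above the unfiltered ones, which is the oversegmentation mechanism described above; the exact inflation depends on the filter.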

If I scale the log-odds by filter cutoff frequency as a fraction of the Nyquist frequency, the distributions seem to be almost identical. Unfiltered Gaussians have slightly lower log-odds values. Filtering, then downsampling by averaging blocks of 10 samples has slightly higher log-odds, probably because the averaging does a little more low-pass filtering.

Each curve of the plot was computed from only 1000 sequences of 10000 samples, so the results vary a lot in the tail, where there is not enough data for an accurate estimate.

Because the segmentation method is not very sensitive to the threshold set, simply scaling thresholds by Nyquist frequency/cutoff frequency should work well.   The default parameters had been set for the downsampled 5kHz data, so to match the old results on signals that aren’t downsampled, the thresholds should probably be scaled by about 1.1×Nyquist frequency/cutoff frequency.  (The analog-filter cutoff frequency is probably in the metadata in the data files, but the abf format is messy enough that I’ve not tried to extract it.  The lower of the analog-filter cutoff and any subsequent digital-filter cutoff should be used.)
