In Plagiarism detected, I mentioned that an article in* Nature Biotechnology* plagiarizes from my blog, specifically Supplementary Material page 6 from Segmenting noisy signals from nanopores. I got email from the last author this week, explaining the situation:

We saw your recent blog post about our paper and feel that we owe you an explanation.

At the time we read your level-finding blog post we had already implemented a recursive level-finding algorithm that we have been using in our lab. Our algorithm made comparison of two data segments using a T-test. We came across your blog and found that the logP value was more useful than the T-test. We wanted to cite your blog, but Nature’s online publication guidelines made it seem that “Only articles that have been published or submitted to a named publication should be in the reference list” (http://www.nature.com/nature/authors/gta/#a5.4). While we wanted to present our methods as transparently as possible, we had no intention of claiming your work as ours.We should have made efforts to contact you and NBT editors about how to best cite your contribution.

I have contacted NBT to see if a post-publication citation to your blog can be made and I will keep you posted on this.

We noted your recent BioarXiv manuscript and will refer to it in future publications using logP-test level-finders.

So one of the two corrections I was seeking has been met (an apology from the authors), and the other (a citation to the blog) is being sought by the authors. It seems that *Nature* has a very poor policy about citations, discouraging correct attribution. Yet another reason to consider them a less desirable family of journals (their rip-off pricing for libraries and their preference for sensational articles over careful research are others).

On a related front, referees for our journal submission of the segmenter paper pointed out that several of the ideas are not new (hardly surprising), and that the basic algorithm has been around for quite a while. They pointed us to a paper by Killick, Fearnhead, and Eckley (http://arxiv.org/pdf/1101.1438.pdf), which supposedly has an exact algorithm that is as efficient as binary segmentation (which only approximates the best breakpoints). I thank the referees for the pointer—that is the sort of thing peer review is supposed to be good for: pointing out to authors where they have missed relevant prior literature.

I’ve only glanced through the paper (I had 16 senior theses to grade in 4 days, plus trying to get a new draft of my book for my applied electronics course done in time for classes starting next Monday), so I can’t say anything about the algorithm they present, but they do give a citation for the binary algorithm that dates back to 1974:

Scott, A. J. and Knott, M. (1974). A cluster analysis method for grouping means in the analysis of variance.

Biometrics, 30(3):507–512.

The online version of the journal only goes back to 1999, so I’ve not confirmed that the paper does contain the same algorithm, but it would not surprise me if it did—the binary split method is fairly obvious once the basics of splitting on log-likelihood are understood. I had looked for papers on the technique and not found them (which surprised me), but I didn’t look as hard as I should have. I did not find the right entry points to the literature—it is scattered over many different disciplines and I relied too much on the one textbook that I did find to give me pointers. And I didn’t read all the textbook, so I may have missed the appropriate pointers—though they do not cite Scott and Knott, so maybe the textbook authors missed an important chunk of the literature, too.

Now that the Killick *et al.* paper has given me some useful pointers, I have a lot of reading to do. I don’t know if I’ll have time before the summer, though—my teaching load starting next week is pretty heavy (I was just noticing that my calendar had 24.5 hours scheduled for the first week, not counting time for prepping for classes, setting up the lab, grading, or revising the book for the electronics class: 7 hours of lecture, 12 hours of lab class, 2 office hours, 1.5 hours meeting with the department manager, 2 hours faculty meeting—and the dean wants to meet with me for half an hour sometime also).

Given that the main idea in our segmenter paper is an old one, for it to be salvageable, we’ll have to shrink the basic algorithm to a brief tutorial (with citations to prior inventors) and concentrate on the little changes made after the basic idea: the parameterization of the threshold setting and the correction for low-pass filtering. There may be a little bit for applying the idea to stepwise slanting segments using linear regression, but I bet that idea is also an old one, buried somewhere in the literature.

This summer I may want to look at implementing the ideas of the Killick *et al.* paper (or other similar approaches), to see if they really do produce better segmentation as quickly.