Gas station without pumps

2012 September 4

Iddo responds

Filed under: Uncategorized — gasstationwithoutpumps @ 21:37

A few days ago, I posted a response to a post by Iddo about accountable research software.

Iddo has a new post that contains some thoughtful reflections on my suggestions: “Should research code be released as part of the peer review process?”

I think I’ve convinced Iddo that not all research code should be carefully tested, released code, ready for users to apply blindly.  I think he and I agree that some code should be polished in that manner, though I remain less convinced than he is that the Bioinformatics Testing Consortium will do much to further that aim.

There have been some other responses to Iddo’s original post.

Deepak Singh takes the view that scientists don’t publish their code because they are horrible, lazy programmers, and that if they would just work harder and be trained better then the problem would go away.  If only it were that simple!  He correctly points out that a lot of industrial code is also hacked together for rapid prototyping and used past the point where it should be. He believes, however, that industrial programmers are better at programming than scientists (true in some cases, and perhaps even on average, but there are certainly exceptions).

I remain unconvinced that simply training scientists to be better programmers would make much difference. I’m a pretty good, highly trained programmer (PhD in computer science, over 40 years of programming experience in various languages), but a lot of my research code is not in a state where I’d be willing to distribute it.  Most of the bigger programs need major refactoring, and I have neither the time, the resources, nor the incentive to do it. I do try to write clean code with decent documentation, but a lot of the bigger projects involved student work that I did not have time to clean up.

Even where the code is all mine, after 5 or 10 years of intermittent work on a project the code has often drifted a long way from the original design, and design decisions that were good ones at the time they were made are no longer good choices—hence the need for major refactoring.

There is some discussion of the need for open source code triggered by Iddo’s article, but the discussion there doesn’t seem to go anywhere.  One commenter advocates open-notebook science—a position I have some agreement with, but I’m often constrained by the fact that the data I’m analyzing is not my data, so I need permission of the data owner before saying anything about it. Because my programming is mostly driven by the data I’m analyzing (looking for patterns and anomalies in data), releasing my notes about what I’m programming and why would release the part of the data that the data owner regards as most precious (the interpretation), even if the data itself is kept hidden.  The commenter would probably claim that the data should be released as soon as it is collected, but that is almost never my decision to make.

2012 August 27

Accountable research software

Filed under: Uncategorized — gasstationwithoutpumps @ 09:03

Iddo Friedberg asks the seemingly reasonable question “Can we make accountable research software?” on his blog Byte Size Biology. As he points out, most research software is built by rapid prototyping methods, rather than careful software development methods, because we usually have no idea what algorithms and data structures are going to work when we start writing the code.  The point of the research is often to discover new methods for analyzing data, which means that there are a lot of false starts and dead ends in the process.

The result, however, is that research code is often incredibly difficult to distribute or maintain.  Like some others in the bioinformatics community, he feels that the solution is for code to be rewritten and carefully tested before publication of results.  He is aware of at least one of the reasons this is not currently done—it is damned expensive and funding agencies have shown almost no willingness to support rewriting research code into distributable code (I know, as I’ve tried to get funding for that).

The rapid prototyping skills needed for research programming and the careful specification, error checking, and testing needed for software engineering are almost completely disjoint. Some people would even argue that the thinking styles needed for the two types of programming are incompatible.  I wouldn’t go quite that far, but they are certainly very different modes of programming.  It will often be the case that different programmers need to be hired for developing research code and for converting it into distributable code.

The “solution” that Iddo proposes, passed on from Ben Templeton, is the Bioinformatics Testing Consortium, a volunteer group of researchers who do some of the quality assurance (QA) steps of software development for each other (code review and testing).  Quite frankly, I don’t see this as being much of a solution.  First, the software has to be in a nearly finished, polished state before the QA steps that they propose make much sense, and getting the code to that state is 90% of the problem.  Second, the volunteer nature of the consortium could easily result in the “tragedy of the commons”, where everyone wants to take more out of the system than they put in.  This is already happening in peer review of papers: people write more papers than they review, so editors are finding it harder and harder to get competent reviewers. Third, the people involved are either going to be careful software developers (who are not the main source of undistributable research code) or rapid prototypers who don’t have the patience and methodical approach of professional testers.

Note: I think that the Bioinformatics Testing Consortium is a good idea. Like many other volunteer projects, it is addressing a real need, though only a small part of the need and with inadequate resources.

I do worry a little about one of the justifications given for distributing research code: the need to replicate experiments.  A proper replication for a computational method is not running the same code over again (and thus making the same mistakes), but re-implementing the method independently.  Having access to the original code is then useful for tracking down discrepancies, as it is often the case that the good results of a method are due to something quite different from what the original researchers thought.  I fear that the push to have highly polished distributable code for all publications will result in a lot less scientific validation of methods by reimplementation, and more “ritual magic” invocation of code that no one understands.  I’ve seen this already with code like DSSP, which almost all protein structure people use for identifying protein secondary structure, with almost no understanding of what DSSP really does or exactly how it defines H-bonds.  It does a good enough job of identifying secondary structure, so no one thinks about the problems.
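To make the DSSP point concrete, here is a minimal sketch of the H-bond criterion as I understand it from the original Kabsch and Sander (1983) paper: an electrostatic energy between the backbone C=O of one residue and the N-H of another, with a bond called whenever the energy falls below -0.5 kcal/mol. This is an independent toy re-implementation, not DSSP's code. The function names and the test coordinates are made up for illustration, and the sketch takes the hydrogen position as given, whereas a real program has to place the hydrogen itself because most crystal structures lack them.

```python
import math

# Constants of the Kabsch & Sander (1983) electrostatic model.
Q1 = 0.42            # partial charge on the C=O group, in units of e
Q2 = 0.20            # partial charge on the N-H group, in units of e
F = 332.0            # dimensional factor: e^2/angstrom -> kcal/mol
HBOND_CUTOFF = -0.5  # kcal/mol; anything lower counts as an H-bond


def hbond_energy(c, o, n, h):
    """Electrostatic energy (kcal/mol) between a C=O and an N-H group.

    Each argument is an (x, y, z) coordinate tuple in angstroms.
    """
    r_on = math.dist(o, n)
    r_ch = math.dist(c, h)
    r_oh = math.dist(o, h)
    r_cn = math.dist(c, n)
    return Q1 * Q2 * F * (1 / r_on + 1 / r_ch - 1 / r_oh - 1 / r_cn)


def is_hbond(c, o, n, h):
    """True if the pair satisfies the energy cutoff."""
    return hbond_energy(c, o, n, h) < HBOND_CUTOFF


# Hypothetical, idealized collinear geometry: C=O 1.23 A, O...N 2.9 A, N-H 1.0 A.
c, o, h, n = (0.0, 0.0, 0.0), (1.23, 0.0, 0.0), (3.13, 0.0, 0.0), (4.13, 0.0, 0.0)
print(hbond_energy(c, o, n, h), is_hbond(c, o, n, h))
```

Note that the definition is purely an energy cutoff on four distances; there is no explicit angle test, which is exactly the sort of detail that users who treat the program as a black box never see.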

I fear that the push for polished code from researchers is an attempt to replace computational researchers with software publishing teams. The notion is that the product of the research is not the ideas and the papers, but just free code for others to use.  It treats bioinformaticians as servants of “real” researchers, rather than as researchers in their own right.  It’s like demanding that no papers on possible drug leads be published until Phase III trials have been completed (though not quite that expensive), and then that the drug be distributed for free.

Certainly there is a place for bioinformatics as a service. The UCSC genome browser is a good example of such a service, and the team of developers, QA people, and IT people needed to build and maintain it is big and expensive, more expensive than the researchers involved in the effort.  There are enough uses and enough users for that service to justify the price, but if we hold all bioinformatics researchers to that level of code quality, we’ll stifle a lot of new ideas.

Requiring that code be turnkey software before publication is not a desirable goal for bioinformatics as a research community.
