Grading is taking much longer than I thought it would: it's almost midnight and I'm still only half done with the set I need to return tomorrow.
The problem is the design of the assignment. This is the first time I've assigned conversion from FASTA to FASTQ and from FASTQ to FASTA, and I had not carefully thought through all the problems that might arise. I should have given it several more weeks of thought, but I was running behind on setting up the course, largely because of the amount of time I'd been putting into the design of the circuits course. I ended up slapping the assignment together quickly, checking only that it was doable, without polishing the assignment handout or building a good test suite.
It is often this way with the first run of an assignment, and it is frustrating both for me and for the students. That’s one reason I try to introduce only one or two new assignments each year, as it often takes a few passes with feedback from different groups of students to knock the rough edges off a programming assignment.
Here are some of the problems with the assignment:
- I did not provide a specific enough output spec to uniquely determine the output file. This means that I had to read the output with a somewhat forgiving program to compare it to the input, not just use diff. Writing that comparison program took time, and it still does not detect some of the things that can go wrong (it is a bit too forgiving).
- I relied on the students reading the Wikipedia description of the FASTA and FASTQ formats, but the description there did not really cover all the common variants (which include extra returns within sequences and quality sequences), so some students wrote very fragile parsers. I’ll have to find or write clearer input specs for next year.
- I did not provide a sufficient set of test files to detect problems. I picked up a few test files from the students themselves, but I only discovered the need for several of the tests on reading the student code and thinking of ways that it didn’t quite work. I would think up ways to break a particular student’s fragile code, then have to apply that test to everyone else’s code to be fair.
- I’m still not satisfied with my test suite, because the conversion task does not really test whether the input parser is working correctly. The input parser needs to remove white space from the sequences, to provide pure DNA or protein sequences internally, but spaces and returns in the output are legal, so blind copying without proper parsing is not easily detected in this task. Some of the best conversion output came from students who had not properly represented sequences internally, which means I need to do deep reading of the code to see if there are problems, and can’t rely on the I/O test to uncover them.
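To make the parsing requirement concrete, here is a minimal sketch of the kind of whitespace-tolerant FASTA parser the assignment calls for. The function name and interface are my own illustration, not the required interface from the handout; the point is that all internal whitespace (including the extra returns within sequences mentioned above) must be stripped so that the internal representation is a pure DNA or protein sequence.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines.

    Tolerates sequences wrapped across multiple lines and stray
    whitespace inside sequence lines, yielding pure sequence strings.
    """
    header = None
    chunks = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Remove ALL whitespace, not just the trailing newline,
            # so the internal representation is a pure sequence.
            chunks.append("".join(line.split()))
    if header is not None:
        yield header, "".join(chunks)
```

A student whose converter blindly copies sequence lines would pass the conversion test but fail any task that actually inspects the joined, whitespace-free sequence this parser produces.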
For next year, I think I want to keep the input-parser assignment, but change the task the parser is used for to one that detects more reliably whether the students have correctly implemented the parser and that has a unique correct output. Conversion between the formats is a useful program to have, but it is not a good test of the parser (although I had initially thought it would be).
One of the students anonymously commented on my previous post:
The choice for the 2nd assignment was pretty unfortunate, IMHO. It was very tedious/time consuming, spec was frustrating, with limited opportunities to learn/do something interesting. I suspect most people won’t roll their own standard format parsers, but rather use a library/package.
Looking forward to more interesting homework :)
I agree with the comment about the spec—the specs for FASTA and FASTQ are frustratingly vague and what is acceptable varies from program to program. Unfortunately this is true of almost all bioinformatics formats—even the formats that have official standards often have ambiguities. Although there is pedagogic value in having students realize that format descriptions are often frustratingly incomplete and inadequate, this may not be the best place in the course to have such frustration, as students are still adapting to Python and there is too much else going on in the assignment.
The point about libraries and packages gets at a more fundamental pedagogical point of the course. Libraries and packages don’t appear by magic—someone has to write them, and part of the point of this course is to prepare people to write such libraries. Yes, lots of people have written FASTA and FASTQ parsers before, some of which are good and some of which are incredibly fragile or disgustingly slow. Most people don’t discover that fragility until the package breaks in the middle of some urgent project. Having people write their own parsers for “easy” formats gives them a greater appreciation for good packages and some idea what to test in a package before relying on it.
Since many of these grad students will be going on to create new tools that do new tasks, I also want them to think about file formats in their new tools before creating yet another vaguely specified and potentially ambiguous format. There is a great tendency for students (whether from a computer science or biology background) to slap together a format that is good enough for the example they are working on, without thinking through what would happen if people started using it for other purposes. Both FASTA and FASTQ have this slapped-together feel, and it causes pain for bioinformaticians on a regular basis. A lot of data wrangling involves dealing with data that doesn’t quite fit in some format and the various incompatible kluges people have used to try to force the format to represent the data anyway.
As for the time-consuming aspect, that is also partly the result of this being the first time for this assignment. My assignments are usually somewhat time-consuming, but this one was a little bigger than I intended. For next year, I’ll probably require only one of the parsers (not sure yet whether the FASTA+QUAL or FASTQ parser is the better one to require), but require some manipulation of the sequences, like end-trimming of sequences to remove low-quality tails. That would cut the amount of code almost in half, while providing a better test that students had gotten the semantics right.
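The end-trimming idea could be as simple as the sketch below. The scan-back rule and the threshold of 20 are illustrative assumptions on my part, not a worked-out spec; production trimmers typically use more sophisticated rules, such as the running-sum method in BWA. What matters pedagogically is that the output depends on the parsed quality values, so a parser that mishandles the input cannot produce correct output.

```python
def trim_low_quality_tail(seq, quals, threshold=20):
    """Return (seq, quals) with the low-quality 3' tail removed.

    Scans back from the end of the read, dropping bases until one
    with Phred quality >= threshold is found.  A deliberately simple
    rule for illustration; real tools often use a running-sum rule.
    """
    end = len(quals)
    while end > 0 and quals[end - 1] < threshold:
        end -= 1
    return seq[:end], quals[:end]
```

Unlike format conversion, this task has a unique correct output for each input, so a plain diff against a reference output would suffice for grading.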