One of our senior grad students sent me an e-mail today:
Subject: Pedagogy question
Date: Thu, 6 Dec 2012 22:07:34 -0800
I was talking about 205 with one of the first year students and it
occurred to me that I didn’t know how you tested people’s code. The
student was sure that you actually read the code and I always imagined
you had a pipeline that used unit testing on the submissions and only
read the code when things went wrong.
How do you do it? Thanks
The answer is that for most assignments I do both. I have a Makefile that runs each student’s assignment on a number of test cases and compares the output to the desired output. For some assignments, where the output is not unique (for example, in sequence alignment, where there may be multiple correct alignments), I do some postprocessing of the outputs to get comparable values: for alignments, I rescore their output alignments and compare sequence lengths and scores, rather than the alignments themselves. I’m doing simple I/O testing, not unit tests, as unit tests need to be designed into the programs as they are built. I’m not giving the students scaffolds to build their programs around, not even library specifications, so I can’t have a generic set of unit tests for their programs.
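A minimal sketch of the rescoring idea (the scoring scheme and function name here are my own illustration, not the actual test harness): two different alignments of the same sequences can both be correct, so the check compares rescored values rather than the alignment text.

```python
def align_score(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Rescore a gapped pairwise alignment.

    row_a, row_b: the two rows of the alignment, equal length,
    with '-' marking gaps.  Returns the total alignment score.
    """
    assert len(row_a) == len(row_b)
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            score += gap       # gap in either row
        elif a == b:
            score += match     # identical bases
        else:
            score += mismatch  # substitution
    return score

# Two different but equally good alignments of the same pair:
# they differ as text, but rescore to the same value.
print(align_score("GAT-ACA", "GATTACA"))  # 4
print(align_score("GA-TACA", "GATTACA"))  # 4
```

Comparing scores (plus sequence lengths, to catch dropped residues) sidesteps the non-uniqueness of the alignment itself.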
I often have to put student-specific code into the Makefile to correct for minor student errors, like misnaming a command-line option for the program or having set the default values for options wrong. Sometimes I have to make a copy of a program and edit it (for example, to remove MS-DOS carriage-return characters from the source code) before I can get it to run, but I try to avoid doing that after the first couple of assignments.
Some assignments do not lend themselves to simple I/O comparisons (such as the simulation tests to determine the frequency of long ORFs on the reverse strand of a gene). For those, I don’t run the programs, but rely on the student-reported histograms and model fits for the output check.
Without the I/O tests, it would be very difficult for me to grade the programs well, as there are often subtle bugs in student code that can only be teased out by testing unusual inputs—I certainly don’t have the days it would take to do a thorough reading and debugging of each student’s code. Each year I try to improve at least one of the automated tests to do a more thorough job of testing—sometimes this improvement involves modifying the assignment to provide more clearly specified or more testable formats, sometimes it is just the addition of some more corner cases to the test suite, and sometimes it is better postprocessing to do more thorough checking of the output. I reuse nearly the same assignments each year (though 2 assignments this year were used for the first time), so that the testing improves gradually from year to year.
After doing the I/O checks, I then read all the programs. I start with the programs that did not pass I/O tests, trying to provide debugging help. That debugging help is usually the most time-consuming part of the grading. On all the programs I try to provide some feedback on the documentation (which is usually awful, though I had one student this year who really got the point of documentation and did a better job of it than I usually do).
In addition to the documentation feedback, I try to provide suggestions to make the Python more efficient or more idiomatic. For example, I had one student this year who always used complicated “while” loops where simple “for” loops would have been much easier to read, and many students wrote multi-line loops that could be more easily expressed with a list comprehension. Students who wrote very good, idiomatic Python got far fewer comments from me than students who were struggling, though I always tried to find at least one useful thing to say to them.
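To illustrate the kind of suggestion I make (this snippet is invented for illustration, not taken from any student’s submission):

```python
# A while loop that manages its own index ...
squares = []
i = 0
while i < 10:
    squares.append(i * i)
    i += 1

# ... is easier to read as a for loop ...
squares = []
for i in range(10):
    squares.append(i * i)

# ... and easier still as a list comprehension.
squares = [i * i for i in range(10)]
print(squares)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

All three produce the same list; the comprehension just says what the list *is* instead of how to build it step by step.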
The code-reading is rarely fun—maybe 10% of the students program well enough to make reading their code pleasant rather than painful, but I think that the feedback the students get from it is one of the most important parts of the course pedagogically. Most of the students have never had such detailed feedback on their programming, because it is expensive to provide. It takes me an average of 10–20 minutes per student to do the code reading (depending on the complexity of the assignment), which is barely feasible for a class of 20 once a week and would be prohibitive for a larger class or with more frequent assignments. The I/O checks generally only take about 10 minutes per student, and could be reduced substantially if I just failed students who didn’t pass the I/O checks, rather than trying to patch my test or their code to allow almost-working code to be tested.
I spend much more time on the weaker programmers than on the good ones, which may be why the senior grad student does not remember much of the feedback (he also took the course when we were still using Perl, rather than Python, and I provided less feedback on good programming style and idioms—it is damned hard to write a good program in Perl, and many Perl idioms are nasty, ugly hacks).
By the end of the course, everyone who passes is at least a minimally competent Python programmer and can write the sort of data-wrangling code that every computational biologist needs. I think that this year students all ended up with a decent grasp of using tuples as keys or values for hashes and of using generators (made with “yield” statements) for doing input parsing.
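A sketch of what I mean by both techniques (the FASTA parser below is a hypothetical illustration, not code from the course): a generator yields one record at a time as it parses the input, and a tuple serves as a dictionary key.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from lines in FASTA format."""
    name, parts = None, []
    for line in lines:
        line = line.rstrip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(parts)   # emit the previous record
            name, parts = line[1:], []
        else:
            parts.append(line)
    if name is not None:
        yield name, "".join(parts)           # emit the final record

fasta = [">seq1", "GATTACA", ">seq2", "ACGT", "ACGT"]

# Count adjacent base pairs across all sequences, keyed by a tuple.
pair_counts = {}
for name, seq in read_fasta(fasta):
    for a, b in zip(seq, seq[1:]):
        pair_counts[(a, b)] = pair_counts.get((a, b), 0) + 1

print(pair_counts[("A", "C")])  # 3
```

The generator lets the caller process records one at a time without ever holding the whole file in memory, and the tuple key avoids fragile string-concatenation tricks for compound keys.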
Several of them also learned how to collaborate without copying. I had several students who talked to each other about each assignment, helping each other learn Python and the algorithms of the course, while writing distinctly separate code. They were also very good about acknowledging their collaborators—a habit that more courses should be trying to develop. There was only one student I had to chide about copying code he did not understand (and that wasn’t working anyway), and even he had properly cited the original author of the code, so was not in any danger of an academic integrity violation.
Most of the students also can do substantially better documentation than when they started the course. Over half the class is now routinely commenting on the meanings of their variables and the return values of their functions, though not always as clearly as I would like.
I’m considering rewriting my old “document-this-code” assignment from when I taught technical writing, updating it for Python, and adding it to the course. The course is already pretty full, though, with 7 programs, a fellowship application, and a research paper, in addition to the “content” material about bioinformatic models and algorithms. The workload is already high for a 5-unit course, which should total only about 120 hours, including the 35 hours of class time (I suspect that the workload varies from 100 to 150 hours already), so I’m hesitant to add another assignment. Still, it seems that many of the students have never seen a decently commented program and have no idea what I mean when I complain about the vagueness of their comments. They understand when I complain about the lack of comments, but don’t know what they should put in the comments.
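Here is the sort of commenting I am asking for, in a made-up example (not from any student): the docstring states what the argument means, what is returned, and the edge-case behavior, instead of restating the code (a vague comment like “# compute gc” adds nothing).

```python
def gc_content(seq):
    """Return the fraction of bases in seq that are G or C.

    seq: a DNA string over the alphabet {A, C, G, T}, upper case.
    Returns a float in [0, 1]; an empty sequence returns 0.0
    rather than raising ZeroDivisionError.
    """
    if not seq:
        return 0.0
    gc = sum(1 for base in seq if base in "GC")  # number of G or C bases
    return gc / len(seq)

print(round(gc_content("GATTACA"), 3))  # 0.286
```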
All the students in my course have had previous programming courses, often several such courses. (I had a couple of students start the course this year without prior programming courses, and drop after the first two assignments, with the intent of trying again after taking some programming courses.) If I remember right, only one of the students started the course with a good documentation style.
I understand why the huge CS courses can’t provide the detailed feedback needed to get students to document well, but I think that it is a shame. Writing programs is like writing papers in many ways, with the same concerns about organization at different scales and the need for clarity, completeness, correctness, and conciseness (the four Cs of technical writing—I was going to write a blog post about that over a year ago, and never finished it—one of my 167 unfinished draft posts). Programming classes should be taught like writing classes, with detailed feedback from experienced programmers, but I’m afraid that administrators are so in love with mass production that they are more likely to want to make programming courses bigger with less feedback to the students rather than move to a high feedback, high cost approach that could actually produce good programmers.
Of course, administrators would love to convert the writing courses into mass-production classes also. Writing classes used to have only 20 students per instructor, in order to provide adequate feedback, but the continued defunding of instructional activity in favor of administrative bloat and constructing new research buildings has resulted in writing classes having more like 30 students per instructor, with the consequent decrease in the quantity and quality of feedback to the students. It is only a matter of time before administrators decide that MOOCs with “peer feedback” are good enough for the peons, and eliminate small classes and professional feedback from writing courses also.