Mark Guzdial’s post Automatically grading programming homework: Echoes of Proust pointed me to an MIT press release, Automatically grading programming homework, which starts with the claim
Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), working with a colleague at Microsoft Research, have developed a new software system that can automatically identify errors in students’ programming assignments and recommend corrections.
While such a debugging aid may be useful for finding and correcting common errors in small programs (such as those found in beginning programming courses), it does not address what I see as one of the main goals of hand-grading student programs: evaluating how well their programs are structured and documented. I often spend as much time grading the comments, decomposition into procedures or methods, data structures chosen, error-handling, and variable names as the algorithmic details—none of which are addressed by this grading scheme.
In a comment on Guzdial’s post, Edward Bujak wrote,
Proust automatic grading caught 85% of the semantic errors since the domain (the specific program) was specified. In a short 6-month consultant position at ETS, we replicated this with the automated grading of the APCS free-response questions. This was when the APCS test was in Pascal. The program prompt was known, so an expert system (ES) was trained for that problem which would query an abstract syntax tree (AST) dynamically constructed from the student's submitted program. We graded on good and bad. The grading was reliable, consistent, and outperformed humans. It never saw the light of day.
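The AST-querying approach Bujak describes can be illustrated with a minimal sketch. Everything here is hypothetical (the rubric, the function name `average`, the good/bad criteria); the point is only that a problem-specific grader can check structural features of a parse tree rather than running the student's code:

```python
import ast

# Hypothetical student submission for a known prompt: write a function
# `average` that loops over its input and returns the mean.
STUDENT_CODE = """
def average(xs):
    total = 0
    for x in xs:
        total = total + x
    return total / len(xs)
"""

def grade(source, func_name="average"):
    """Return 'good' or 'bad' by querying the AST for rubric features."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return "bad"
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == func_name:
            body_nodes = list(ast.walk(node))
            # Rubric (illustrative): the function must contain a loop
            # and a return statement.
            has_loop = any(isinstance(n, (ast.For, ast.While)) for n in body_nodes)
            has_return = any(isinstance(n, ast.Return) for n in body_nodes)
            if has_loop and has_return:
                return "good"
    return "bad"

print(grade(STUDENT_CODE))  # prints "good"
```

A real expert system of the kind Bujak describes would encode far richer, prompt-specific queries than this two-feature rubric, but the mechanism is the same: parse once, then interrogate the tree.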
Given that the AP CS problems are tiny coding problems that are not checked for the things I look for in hand grading anyway, they seem like an ideal application for automated grading. The syntax parser would have to be very forgiving (as able to recover from missing semicolons or mistyped variable names as a human grader is) to grade fairly.
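One part of such a forgiving parser, recovering from mistyped variable names, can be sketched with fuzzy matching against the identifiers the student has already declared. This is my own illustration, not anything from the Proust or ETS systems; the 0.8 cutoff and the sample names are arbitrary:

```python
import difflib

def resolve_identifier(name, declared):
    """Map a possibly mistyped identifier to the closest declared name,
    the way a human grader would silently read `totl` as `total`."""
    matches = difflib.get_close_matches(name, declared, n=1, cutoff=0.8)
    return matches[0] if matches else name

declared = ["total", "count", "average"]
print(resolve_identifier("totl", declared))   # prints "total"
print(resolve_identifier("xyzzy", declared))  # no close match: prints "xyzzy"
```

Recovering from missing semicolons is harder, since it requires error-repair rules in the grammar itself, but the same principle applies: prefer the smallest repair that yields a legal parse.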
Of course, the AP exams are handwritten, not typed, and OCR is still too unreliable for grading handwritten exams. The data entry needed to enable automatic grading of AP CS exams would probably exceed the cost and error rate of hand grading them, so it is no wonder that the expert system Bujak worked on never saw the light of day. Perhaps someday the AP exams will be done with keyboard entry, but the extra opportunities that introduces for cheating make it unlikely to be adopted any time soon.
I suspect that an adequate automatic grader for CS1 problems is possible (if you ignore comments, programming style, variable names, and other important things that CS1 should teach), by combining the generic automatic debugging approaches MIT is using with the problem-specific expert systems of Proust and whatever Bujak worked on for ETS. The effort may be useful for making MOOCs a little less awful at grading, though it would not help with other problems with the pedagogic approaches of mass instruction.