# Gas station without pumps

## 2015 September 1

### Pedagogy for bioinformatics teaching

Filed under: Circuits course — gasstationwithoutpumps @ 10:48

I was complaining recently about the dearth of teaching blogs in my field(s), and, serendipitously, almost immediately afterwards I read a post by lexnederbragt, “Active learning strategies for bioinformatics teaching”:

The more I read about how active learning techniques improve student learning, the more I am inclined to try out such techniques in my own teaching and training.

I attended the third week of Titus Brown’s “NGS Analysis Workshop”. This third week entailed, as one of the participants put it, ‘the bleeding edge of bioinformatics analysis taught by Software Carpentry instructors’ and was a unique opportunity to both learn different analysis techniques, try out new instruction material, as well as experience different instructors and their way of teaching. …

I demonstrated some of my teaching and was asked by one of the students for references for the different active learning approaches I used. Rather then just emailing her, I decided to put these in this blog post.

It is good to see someone blogging about teaching bioinformatics—there aren’t many of us doing it, and most of us are more focused on research than on our pedagogical techniques.  For that matter, in my bioinformatics courses, I’ve only been making minor tweaks to my teaching techniques—increasing wait time after asking questions, randomizing cold calls better, being more aware of the buildup of clutter on the whiteboard, … .  Where I’ve been focusing my pedagogic attention is on my applied electronics course and (to a lesser extent) the freshman design seminar.

I’ll be starting my main bioinformatics course in just over 3 weeks, a first-quarter graduate course that is also taken by seniors doing a BS in bioinformatics.  This will be the 14th time I’ve taught the course (every year since 2001, except for one year when I took a full-year sabbatical).  Although the course has evolved somewhat over that time, it is difficult for me to make major changes to something I’ve taught so often—I’ve already knocked off most of the rough edges, so major changes will always seem inferior, even if they would end up being better after a year or two of tweaking.  I think that major changes in the course would require a change of instructor—something that will have to be planned for, as I’ll be retiring in a few years.

My main goals in this core bioinformatics course are to teach some stochastic modeling (particularly the importance of good null models), dynamic programming (via Smith-Waterman alignment), hidden Markov models, and some Python programming.  The course is pretty intense (the Python programming assignments take up a lot of time), but I think it sets the students up well for the subsequent course in computational genomics (which I do not teach) and for general bioinformatics programming in their research labs. I don’t cover de Bruijn graphs or assembly in this course—those are covered in subsequent courses, though both the exercises Lex mentions seem useful for a course that covers genome assembly.

The live-coding approach that Lex mentions in his blog seems more appropriate for an undergrad course than for a grad course.  I do use that approach for teaching gnuplot in my applied electronics course, though I’ve had trouble getting students to bring their data sets and laptops to class to work on their own plots for the gnuplot classes—I’ll have to emphasize that expectation next spring.

It might be possible to use a live-coding approach near the beginning of the quarter in the bioinformatics course, on the first assignment, when I’m trying to get students to learn the “yield” statement for making generators for input parsing. I’ve been thinking that a partial worked example would help students get started on the first program, so I could try live coding half the assignment and having them finish it for their first homework.

One of the really nice things about Python is how easily one can create input handlers that spit out one item at a time and how cleanly one can interface them to one-pass algorithms. Way too many of the students have only done programming in a paradigm that reads all input, does all processing, and prints all output.  Although there are some bioinformatics programs that need to work that way, most bioinformatics tasks involve too much data for that paradigm, and programs need to process data on the fly, without storing it all.  Getting students to cleanly separate I/O from processing while processing only one item at a time is the primary goal of the first two “warmup” Python programs in the course.
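As a rough illustration of that separation, here is a minimal sketch of a generator-based parser feeding a one-pass consumer. The choice of FASTA as the input format and the names `read_fasta` and `longest_record` are assumptions made for the example; this is not the actual assignment.

```python
import io


def read_fasta(lines):
    """Yield (name, sequence) pairs one record at a time from an iterable of lines.

    Only the record currently being read is held in memory.
    """
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header ends the previous record, so hand it to the caller.
            if name is not None:
                yield name, "".join(chunks)
            name = line[1:]
            chunks = []
        elif name is not None:
            chunks.append(line)
    # Emit the final record, which has no following header to trigger it.
    if name is not None:
        yield name, "".join(chunks)


def longest_record(records):
    """One-pass consumer: return (name, length) of the longest sequence seen."""
    best_name, best_length = None, -1
    for name, sequence in records:
        if len(sequence) > best_length:
            best_name, best_length = name, len(sequence)
    return best_name, best_length


# io.StringIO stands in for a real file; any iterable of lines would work.
example = io.StringIO(">seq1 first\nACGT\nAC\n>seq2\nACGTACGTA\n")
print(longest_record(read_fasta(example)))
```

The consumer never sees a file or a line, only records, so the parser could be swapped for a different format without touching the processing code.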

One thing I will have to demonstrate in doing the live coding is writing the docstring before writing any of the code for a routine.  Students (and professional programmers) have a tendency to code first and document later, which often turns into code-first-think-later, resulting in unreadable, undebuggable code. I should probably make a bigger point of document-first coding in the gnuplot instruction also, though the level of commenting needed in gnuplot is not huge (plot scripts tend to be fairly simple programs).

## 2015 June 17

### PteroDAQ bug fix

Now that my son is home from college, I’m getting him to do some bug fixes to the PteroDAQ data acquisition system he wrote for my class to use. The first fix that we’ve put back into the repository was for a problem that was noticed on the very old slow Windows machines in the lab—at high sampling rates, the recording got messed up.  The recording would start fine, then get all scrambled, then get scrambled in a different way, and eventually return to recording correctly, after which the cycle would repeat.  Looking at the recorded data, it was as if bytes were getting lost and the packets coming from the KL25Z were being read in the wrong frame.  As more bytes got lost the frameshift changed until eventually the packets were back in sync.  There seemed to be 5 changes in behavior for each cycle until things got back in sync.

The threshold was very low on the old Windows machines, but the same thing happened on faster machines once the sampling rate was high enough.

What the program was designed to do was to drop entire packets when the host couldn’t keep up with the data rate and the buffer on the KL25Z filled up, but that didn’t seem to be what was happening.  The checksums on the packets were not failing, so the packets were being received correctly on the host, which meant that the problem had to be before the checksums were added.  That in turn suggested a buffer overflow for the queuing on the KL25Z board.  More careful examination of the recordings indicated that when we got back into sync, exactly 4096 packets of 10 bytes each had been lost, which suggested that the 5 changes in behavior we saw during the cycle corresponded to 5 losses of the 8192-byte buffer.
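The arithmetic behind that inference can be checked in a few lines. The sketch below uses only the numbers quoted above; the last line is one possible reading of why the frame shift would return to sync after five losses, not something confirmed from the firmware.

```python
PACKET_BYTES = 10      # single-channel packet size quoted in the post
QUEUE_BYTES = 8192     # queue size on the KL25Z quoted in the post
lost_packets = 4096    # packets missing when the stream came back into sync

lost_bytes = lost_packets * PACKET_BYTES
print(lost_bytes, lost_bytes // QUEUE_BYTES)   # total bytes lost, and how many full queues that is

# Assumed interpretation: each lost queue shifts the framing by 8192 mod 10 bytes,
# so the cumulative shift after k losses is (k * 8192) mod 10.
shifts = [(k * QUEUE_BYTES) % PACKET_BYTES for k in range(1, 6)]
print(shifts)
```

Under that reading, each overflow moves the frame by 2 bytes, and the fifth overflow brings the shift back to a whole packet.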

We suspected a race condition between pushing data onto the queue and popping it off, so we modified the code to turn off interrupts during queue_pop and queue_avail calls (we also made all the queue variables “volatile”, to make sure that the compiler wouldn’t optimize out reads or writes to them, though I don’t think it was doing so).  This protection for the queue pop and availability calls changed the behavior to what was expected: at low sampling rates everything works fine, and at high sampling rates things start out well until the queue fills up, after which complete packets are dropped when they won’t fit on the queue.  The average sampling rate is then constant, independent of the requested sampling rate, at the rate that the packets are taken out of the queue.
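The intended drop policy can be modeled in a few lines of Python. This is only a toy model: the real queue runs in the KL25Z firmware with interrupts, and the class and method names here (`PacketQueue`, `push`, `pop`) are hypothetical. It shows nothing about the race condition itself, only the behavior once the queue is protected.

```python
from collections import deque


class PacketQueue:
    """Toy model of a fixed-size byte queue that accepts only whole packets."""

    def __init__(self, capacity_bytes=8192):
        self.capacity = capacity_bytes
        self.used = 0
        self.packets = deque()
        self.dropped = 0

    def push(self, packet):
        """Queue the packet if it fits entirely; otherwise drop all of it."""
        if self.used + len(packet) > self.capacity:
            self.dropped += 1
            return False
        self.packets.append(packet)
        self.used += len(packet)
        return True

    def pop(self):
        """Remove and return the oldest packet, or None if the queue is empty."""
        if not self.packets:
            return None
        packet = self.packets.popleft()
        self.used -= len(packet)
        return packet


# Push 1000 ten-byte packets with nothing draining the queue.
queue = PacketQueue()
for _ in range(1000):
    queue.push(bytes(10))
print(len(queue.packets), queue.dropped)
```

Because a packet is either queued whole or dropped whole, whatever the host reads stays correctly framed, which is the property the original bug violated.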

On my old MacBook Pro, the highest sampling rate that can be continued indefinitely for a single channel is 615Hz (about 6150 bytes/sec transferred).  On the household’s newer iMac, the highest sampling rate was 1572Hz (15720 bytes/sec). (Update, 2015 Jun 18: on my son’s System76 laptop, the highest sampling rate was 1576Hz.)

One can record for short bursts at much higher sampling rates, but only for $819.2 /(f_{s} - \max f_{s})$ seconds for a single channel, where $f_{s}$ is the requested sampling rate and $\max f_{s}$ is the highest sustainable rate (8192 bytes at 10 bytes/packet is 819.2 packets in the queue).  At 700Hz, one should be able to record for about 9.6376 seconds on my MacBook Pro (assuming a max sustained rate of 615 Hz).  Sure enough, the first missing packet is the 6748th one, at 9.6386 s.
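The burst-length estimate is easy to wrap in a small helper. The function name `burst_seconds` and its parameters are made up for this sketch; the formula is the one given above, with the queue and packet sizes as defaults.

```python
def burst_seconds(requested_hz, sustained_hz, queue_bytes=8192, packet_bytes=10):
    """Estimate the time until the queue fills when packets arrive faster than they drain."""
    if requested_hz <= sustained_hz:
        return float("inf")   # the host keeps up, so the queue never fills
    packets_in_queue = queue_bytes / packet_bytes
    return packets_in_queue / (requested_hz - sustained_hz)


# 700 Hz requested against a 615 Hz sustained rate, as in the MacBook Pro example.
print(round(burst_seconds(700, 615), 4))
```

Changing `packet_bytes` gives the corresponding estimate for multi-channel recordings, under the same assumption that the queue drains at a steady rate.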

I thought that 2 channels (12-byte packets) should be accepted on my MacBook Pro at (10bytes/12bytes)615Hz, or 512.5Hz, but the observed maximum rate is 533Hz, so it isn’t quite linear in the number of bytes in the packet.  Four channels (16-byte packets) run at 418Hz. There is some fixed overhead in addition to the per-byte cost on the host computer.
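One way to put rough numbers on that overhead is to fit a straight line, time per packet = fixed + per-byte cost × packet size, to the three measurements above. This is a sketch under the assumption that the host-side cost really is linear in packet size; three points are too few to confirm that.

```python
# (packet size in bytes, maximum sustained sampling rate in Hz) from the post
measurements = [(10, 615.0), (12, 533.0), (16, 418.0)]

sizes = [size for size, _ in measurements]
seconds_per_packet = [1.0 / rate for _, rate in measurements]

n = len(measurements)
mean_size = sum(sizes) / n
mean_time = sum(seconds_per_packet) / n

# Ordinary least-squares fit of time = fixed + per_byte * size
covariance = sum((s - mean_size) * (t - mean_time)
                 for s, t in zip(sizes, seconds_per_packet))
variance = sum((s - mean_size) ** 2 for s in sizes)
per_byte = covariance / variance
fixed = mean_time - per_byte * mean_size

print("fixed overhead: {:.3f} ms per packet".format(fixed * 1e3))
print("per-byte cost:  {:.3f} ms per byte".format(per_byte * 1e3))
```

The fit gives a positive fixed term of a few tenths of a millisecond per packet, consistent with the observed rates falling off more slowly than a purely per-byte cost would predict.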

There is another, more fundamental limitation on the PteroDAQ sampling rate—how fast the code can read the analog-to-digital converter and push the bytes into the queue.  That seems to be 6928Hz, at which speed the longest burst we can get without dropping packets should be just under 130ms (it turned out to lose the 819th packet at 118.22ms, so I’m a little off in my estimate).  I determined the max sampling rate by asking for a faster one (10kHz) and seeing what the actual sampling rate was at the beginning of the run, then trying that sampling rate and again checking the achieved sampling rate. With two channels, the maximum sampling rate is only 3593Hz, suggesting that most of the speed limitation is in the conversion time for the analog-to-digital converter.

The current version of PteroDAQ uses long sample times (so that we can handle fairly high-impedance signal sources) and does hardware averaging of 32 readings (to reduce noise). By sacrificing quality (more noise), we could make the conversions much faster, but that is not a reasonable tradeoff currently, when we are mainly limited by how fast the Python program on the host reads and interprets the input stream from the USB port.  We’ll have to look into Python profiling, to see where the time is being spent and try to speed things up.
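For the profiling step, the standard-library `cProfile` and `pstats` modules are the usual starting point. The sketch below profiles a hypothetical stand-in, `parse_packets`, with an assumed packet layout of five 16-bit fields; it is not PteroDAQ's actual host code or packet format.

```python
import cProfile
import io
import pstats
import struct


def parse_packets(data, packet_bytes=10):
    """Hypothetical stand-in for host-side packet decoding (not PteroDAQ's code)."""
    readings = []
    for start in range(0, len(data) - packet_bytes + 1, packet_bytes):
        packet = data[start:start + packet_bytes]
        # Assumed layout: five unsigned 16-bit little-endian fields per packet.
        readings.append(struct.unpack("<5H", packet))
    return readings


# Profile only the call of interest.
profiler = cProfile.Profile()
profiler.enable()
parse_packets(bytes(10 * 50000))
profiler.disable()

# Collect the report in a string and show the five entries with the largest cumulative time.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

Sorting by cumulative time shows which top-level calls dominate; sorting by `"tottime"` instead would point at the individual functions where the time is actually spent.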

## 2013 March 21

### Why Python first?

Filed under: home school,Uncategorized — gasstationwithoutpumps @ 11:21

On one of the mailing lists I subscribe to, I advocated for teaching Python after Scratch to kids (as I’ve done on this blog: Computer languages for kids), and one parent wanted to know why, and whether they should have used Python rather than Java in the home-school course they were teaching.  Here is my off-the-cuff reply:

Python has many advantages over Java as a first text-based language, but it is hard for me to articulate precisely which differences are the important ones.

One big difference is that Python does not require any declaration of variables. Objects are strongly typed, but names can be attached to any type of object—there is no static typing of variables. Python follows the Smalltalk tradition of “duck typing” (“If it walks like a duck and quacks like a duck, then it is a duck”). That means that operations and functions can be performed on any object that supports the necessary calls—there is no need for a complex class inheritance hierarchy.
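A small sketch of what that looks like in practice: the function below (the name `count_words` is made up for the example) never checks the type of its argument, and works on anything that can be iterated to produce strings.

```python
import io


def count_words(lines):
    """Count whitespace-separated words in anything that can be iterated line by line."""
    total = 0
    for line in lines:
        total += len(line.split())
    return total


# These three objects share no base class beyond object; each just supports iteration over strings.
print(count_words(io.StringIO("one two\nthree\n")))   # a file-like object
print(count_words(["four five six", "seven"]))        # a list of strings
print(count_words(("eight nine",)))                   # a tuple of strings

name = 3          # a name bound to an int...
name = "three"    # ...can be rebound to a str with no declaration
```

The same function in a statically typed language would need a declared interface or common supertype for its parameter.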

Java has a lot of machinery that is really only useful in very large projects (where it may be essential), and this machinery interferes with the initial learning of programming concepts.

Python provides machinery that is particularly useful in small, rapid prototyping projects, which is much closer to the sorts of programming that beginners should start with. Python is in several ways much cleaner than Java (no distinction between primitive types and objects, for example), but there is a price to pay—Python can’t do much compile time optimization or error checking, because the types of objects are not known until the statements are executed. There is no enforcement of information hiding, just programmer conventions, so partitioning a large project into independent modules written by different programmers is more difficult to achieve than in statically typed languages with specified interfaces like Java.

As an example of the support for rapid prototyping, I find the “yield” statement in Python, which permits the easy creation of generator functions, a particularly useful feature for separating input parsing from processing, without having to load everything into memory at once, as is usually taught in early Java courses. Callbacks in Java are far more complicated to program.

Here is a simple example of breaking a file into space-separated words and putting the words into a hash table that counts how often they appear, then prints a list of words sorted by decreasing counts:

```
def readword(file_object):
    '''This generator yields one word at a time from a file-like object, using the white-space separation defined by split() to define the words.
    '''
    for line in file_object:
        words=line.strip().split()
        for word in words:
            yield word

import sys
count = dict()
# Count each word as the generator produces it, without storing the whole input.
# Input source assumed to be standard input.
for word in readword(sys.stdin):
    count[word] = count.get(word,0) +1
# Sort the words by decreasing count.
word_list = sorted(count.keys(), key=lambda w:count[w], reverse=True)
for word in word_list:
    print( "{:5d} {}".format(count[word], word) )
```

Note: there is a slightly better way using Counter instead of dict, and there are slightly more efficient ways to do the sorting; this example was chosen for minimal explanation, not because it was the most Pythonic way to write the code. Note: I typed this directly into the e-mail without testing it, but I then cut-and-pasted it into a file, and it seems to work correctly, though I might prefer it if the sort function used count and then alphabetic ordering to break ties. That can be done with one change:

```
word_list = sorted(count.keys(), key=lambda w:(-count[w],w))
```
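For comparison, here is a sketch of the Counter variant mentioned in the note. It uses `collections.Counter` from the standard library and reads from an in-memory `io.StringIO` (a stand-in for `sys.stdin`) so that it runs without piped input; the sample text is made up.

```python
from collections import Counter
import io


def readword(file_object):
    """Yield one whitespace-separated word at a time from a file-like object."""
    for line in file_object:
        for word in line.split():
            yield word


# io.StringIO stands in for sys.stdin so the sketch runs without piped input.
text = io.StringIO("the cat and the dog and the bird\n")

# Counter consumes the generator directly, one word at a time.
count = Counter(readword(text))

# Sort by decreasing count, breaking ties alphabetically.
for word, n in sorted(count.items(), key=lambda item: (-item[1], item[0])):
    print("{:5d} {}".format(n, word))
```

Counter removes the `count.get(word,0) + 1` bookkeeping, at the cost of one more name for a beginner to learn.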

Doing the same task in Java is certainly possible, but requires more setup, and changing the sort key is probably more effort.

Caveat: my main programming languages are Python and C++ so my knowledge of Java is a bit limited.

Bottom-line: I recommend starting kids with Scratch, then moving to Python when Scratch gets too limiting, and moving to Java only once they need to transition to an environment that requires Java (university courses that assume it, large multi-programmer projects, job, … ). It might be better for a student to learn C before picking up Java, as the need for compile-time type checking is more obvious in C, which is very close to the machine. Most of the objects-first approach to teaching programming can be better taught in Python than in either C or Java. For that matter, it might be better to include a radically different language (like Scheme) before teaching Java.

The approach I used with my son was more haphazard, and he started with various Logo and Lego languages, added Scratch and C before Scheme and then Python.  He’s been programming for about 6 years now, and has only picked up Java this year, through the Art of Problem Solving Java course, which is the only Java-after-Python course I could find for him—most Java courses would have been far too slow-paced for him.  It was still a bit low-level for him, but he found ways to challenge himself by stretching the assigned problems into more complicated ones.  His recreational programming is mostly in Python, but he does some JavaScript for web pages, and he has done a little C++ for Arduino programming (mostly the interrupt routines for the Data Logger code he wrote for me).  I think that his next steps should be more CS theory (he’s just finished an Applied Discrete Math course, and the AoPS programming course covers the basics of data structures, so he’s ready for some serious algorithm analysis), computer architecture (he’s started learning about interrupts on the Arduino, but has not had assembly language yet), and parallel programming (he’s done a little multi-threaded programming with queues for communication for the Data Logger, but has not had much parallel processing theory—Python relies pretty heavily on the global interpreter lock to avoid a lot of race conditions).

## 2012 December 19

### Tested Python and Arduino installation

Filed under: Circuits course,Data acquisition — gasstationwithoutpumps @ 18:22

My son and I bicycled up to campus today to test out his data logger code (that will be used in the circuits course) on the Windows computers in the lab.  The lab support staff had told me yesterday that they had gotten Python 2.7.3 installed and the Arduino 1.0.3 development environment, as well as the PySerial module that the data logger code requires.

My son has been working pretty intensely on the code lately, doing a complete refactoring of the code to use Tkinter (instead of PyGUI, which is difficult to install and quite slow) and to have a more user-friendly GUI.  He also wanted to make the code platform independent (Windows, Mac, and Linux), though he’d only tested on our Macs at home before today.  He’s also trying to make the Python part of the code (the user interface) work in both Python 2.7.3 and Python 3.3, though the languages are not precisely compatible.

We found three problems today:

1. Python 2.7.3 was not quite completely installed.  They’d forgotten to update the path to include C:\Python27\  (The path should also include where the Arduino software was installed—I forget where that was now.)
2. The device drivers for the Arduinos were not installed.  On Macs, there are no drivers to install, but on Windows, you need different drivers for different Arduino boards, and it seems you need to have the board plugged into the USB port in order to install the drivers. (Instructions at http://arduino.cc/en/Guide/Windows#t0c4)
3. My son had forgotten to include one of the dependencies in his list of what needed to be installed—the Arduino Timer1 module from http://playground.arduino.cc/code/Timer1/ .

Because they had given me administrator privileges on the machines, my son was able to fix one of the machines to run the data logger code (though he had a couple of minor changes to make in his code for Windows compatibility also).

For the data logger debugging, about all I did was type in my password for him, and once do a search to see where the Arduino code was hidden.

When he has finished debugging and documenting the code, my son will be releasing it with a permissive license on bitbucket, and I’ll be putting links to it here and on the course web page.

## 2012 October 12

### Rapid attrition in grad course

Filed under: Uncategorized — gasstationwithoutpumps @ 21:14

I’m teaching two graduate courses this quarter.  One is our “how to be a grad student” course, which all the first-year students take. It contains a lot of TA training, group advising, and “soft skills” (LaTeX, BibTeX, preparing posters, preparing transparencies, oral presentations, voice projection, …).  It has been going fairly well, but I have the first assignment (a LaTeX assignment) to grade this weekend. The other course is the core bioinformatics course for our grad program and is also required of undergrads in bioinformatics (a very small group, since they are required to take 2–3 grad courses).

I’ve had a fair amount of attrition in the bioinformatics core course.  On the first day of class, I had 25 students.  One week later, 21 students turned in the first assignment.  After that was returned, three more students dropped before the second assignment and one has not yet turned in the second assignment, so I have only 17 to grade.  I don’t think I’ve ever had a 32% attrition rate before.  I don’t know much about the four students who dropped before turning anything in—they may have just been shopping for classes and decided that this one would not meet their needs.

The three who dropped and the one who didn’t turn in the second assignment all were suffering from inadequate prior training in programming—some of them didn’t even have the concept of procedures in their repertoire. The prerequisite for the course is about a year of programming courses in a block-structured or object-oriented programming language, but some of the new grads and grad students from other departments had not had that. A couple of them plan to take lower-level programming classes this year, and come back next year to take the bioinformatics course, after they have the fundamentals.  (One has asked to be allowed to sit in on lectures this year, without doing the assignments, so that he gets at least some exposure to the material—I have no trouble with that.)

Last year, while I was on sabbatical, a good friend of mine who had previously taken and TAed the core course taught it, and he introduced a new first assignment, replacing the rather easy warmup assignment I used in the past (a simple FASTA parser) with a much more challenging one that required parsing several different variants of FASTQ.  He provided a scaffold for the students to work from, but it still took two weeks for the students to do the first assignment, and the students and the TA both thought it was too challenging for a first assignment for students who had never programmed in Python before.

I consulted with all the former students (at least all the ones who responded to my request for advice on the compbio mailing list).  Students were divided about the value of the scaffold—some saw it as essential to learning good Pythonic style, while others felt like a simpler first assignment would be a better way to start.  One student recommended that students build their own scaffolds, which struck me as an excellent idea.  So this year, I came up with two new assignments to replace the old first assignment.  The first assignment was to build a scaffold for future programs, and the second was to build a pair of FASTA and FASTQ parsers and conversion programs to convert between fasta+qual file pairs and fastq.  The second program is somewhat more difficult than the first program I used to use, and about the same difficulty as the one used last year.

The scaffolding assignment had a very simple task (read a text file and output the unique words with counts), but I required a particular structure to the program: a generator function that yielded words, an argument parsing function that used argparse, an output function that would print the words in a two-column format with three different sorting options, and a main program (which I provided) that called the other functions.  Most of the students came up with pretty decent implementations (there were some little bits of fluff that I commented on in the feedback), but a few struggled mightily.
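The overall shape of such a scaffold might look like the sketch below. It follows the four-part structure described above, but the function names, the `--order` option, and the three sort orders (alphabetical, ascending count, descending count) are all assumptions made for illustration; the assignment's actual interface is not given here.

```python
import argparse
import io
import sys


def read_words(lines):
    """Generator: yield one whitespace-separated word at a time."""
    for line in lines:
        for word in line.split():
            yield word


def parse_args(argv):
    """Parse command-line options; the option name and choices are hypothetical."""
    parser = argparse.ArgumentParser(description="Count unique words in a text stream.")
    parser.add_argument("--order", choices=["alpha", "ascend", "descend"],
                        default="descend",
                        help="sort alphabetically, or by increasing or decreasing count")
    return parser.parse_args(argv)


def print_counts(counts, order, out):
    """Write two tab-separated columns (word, count) in the requested order."""
    if order == "alpha":
        key = lambda item: item[0]
    elif order == "ascend":
        key = lambda item: (item[1], item[0])
    else:
        key = lambda item: (-item[1], item[0])
    for word, n in sorted(counts.items(), key=key):
        out.write("{}\t{}\n".format(word, n))


def main(argv, infile, outfile):
    """Tie the pieces together: parse options, count words in one pass, print."""
    options = parse_args(argv)
    counts = {}
    for word in read_words(infile):
        counts[word] = counts.get(word, 0) + 1
    print_counts(counts, options.order, outfile)


# Demonstration with an in-memory stream; a script would instead call
# main(sys.argv[1:], sys.stdin, sys.stdout).
main(["--order", "ascend"], io.StringIO("b a b c b a\n"), sys.stdout)
```

Passing the argument list and the streams into `main` keeps each piece testable on its own, which is much of the point of asking students to build the scaffold before the parsers.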

I’ll be grading the second assignment this weekend, and I’m looking forward to seeing whether the students have benefitted from the scaffolding assignment—I’m hoping that these programs will be well structured and easy to read.  If they are, I’ll be happy with the addition of the scaffolding assignment.  If not, I’ll have to rethink those assignments for next year.
