Gas station without pumps

2015 June 17

PteroDAQ bug fix

Now that my son is home from college, I’m getting him to do some bug fixes to the PteroDAQ data acquisition system he wrote for my class to use. The first fix that we’ve put back into the repository was for a problem that was noticed on the very old slow Windows machines in the lab—at high sampling rates, the recording got messed up.  The recording would start fine, then get all scrambled, then get scrambled in a different way, and eventually return to recording correctly, after which the cycle would repeat.  Looking at the recorded data, it was as if bytes were getting lost and the packets coming from the KL25Z were being read in the wrong frame.  As more bytes got lost the frameshift changed until eventually the packets were back in sync.  There seemed to be 5 changes in behavior for each cycle until things got back in sync.

This happened at a very low sampling rate on the old Windows machines, but even faster machines showed the same problem once the sampling rate was high enough.

What the program was designed to do was to drop entire packets when the host couldn’t keep up with the data rate and the buffer on the KL25Z filled up, but that didn’t seem to be what was happening.  The checksums on the packets were not failing, so the packets were being received correctly on the host, which meant that the problem had to be before the checksums were added.  That in turn suggested a buffer overflow for the queuing on the KL25Z board.  More careful examination of the recordings indicated that when we got back into sync, exactly 4096 packets of 10 bytes each had been lost, which suggested that the 5 changes in behavior we saw during the cycle corresponded to 5 losses of the 8192-byte buffer.
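The byte counts in that argument can be written out as a short check. It assumes that each overflow discards exactly one buffer's worth of data (8192 bytes), which is my reading of the recordings rather than something verified in the firmware:

```latex
\begin{align*}
4096 \times 10\ \text{bytes} &= 40960\ \text{bytes} = 5 \times 8192\ \text{bytes},\\
8192 \bmod 10 &= 2, \qquad 5 \times 2 = 10 \equiv 0 \pmod{10}.
\end{align*}
```

Under that assumption each loss shifts the frame by 2 bytes within the 10-byte packet, so the fifth loss brings the packets back into sync.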

 

We suspected a race condition between pushing data onto the queue and popping it off, so we modified the code to turn off interrupts during queue_pop and queue_avail calls (we also made all the queue variables “volatile”, to make sure that the compiler wouldn’t optimize out reads or writes of them, though I don’t think it was doing so).  This protection for the queue pop and availability calls changed the behavior to what was expected: at low sampling rates everything works fine, and at high sampling rates things start out well until the queue fills up, after which complete packets are dropped when they won’t fit on the queue.  The average sampling rate is then constant, independent of the requested sampling rate, at the rate at which packets are taken out of the queue.

On my old MacBook Pro, the highest sampling rate that can be continued indefinitely for a single channel is 615Hz (about 6150 bytes/sec transferred).  On the household’s newer iMac, the highest sampling rate was 1572Hz (15720 bytes/sec). (Update, 2015 Jun 18: on my son’s System76 laptop, the highest sampling rate was 1576Hz.)

One can record for short bursts at much higher sampling rates, but only for 819.2/(f_{s} - \max f_{s}) seconds for a single channel, where f_{s} is the requested sampling rate and \max f_{s} is the maximum sustained rate (8192 bytes at 10 bytes/packet is 819.2 packets in the queue).  At 700Hz, one should be able to record for about 9.6376 seconds on my MacBook Pro (assuming a max sustained rate of 615 Hz).  Sure enough, the first missing packet is the 6748th one, at 9.6386 s.
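The burst-length estimate can be checked with a few lines of Python. This is only a sketch of the arithmetic: the helper name `burst_seconds` and its parameters are made up for illustration, and it assumes the queue fills at the requested rate while draining at exactly the sustained rate.

```python
QUEUE_BYTES = 8192  # queue size on the KL25Z, as given in the post


def burst_seconds(requested_hz, sustained_hz, packet_bytes=10):
    """Seconds until the queue fills (hypothetical helper, not PteroDAQ code)."""
    if requested_hz <= sustained_hz:
        return float("inf")  # the host keeps up, so the queue never fills
    packets_in_queue = QUEUE_BYTES / packet_bytes
    return packets_in_queue / (requested_hz - sustained_hz)


if __name__ == "__main__":
    t = burst_seconds(700, 615)
    print(f"{t:.4f} s, about {700 * t:.0f} packets sampled before the first loss")
```

For 700Hz against a sustained 615Hz this gives about 9.64 s, or roughly 6746 packets, which is close to the observed first loss at the 6748th packet.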

I thought that 2 channels (12-byte packets) should be accepted on my MacBook Pro at (10bytes/12bytes)615Hz, or 512.5Hz, but the observed maximum rate is 533Hz, so it isn’t quite linear in the number of bytes in the packet.  Four channels (16-byte packets) run at 418Hz. There is some fixed overhead in addition to the per-byte cost on the host computer.
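One way to put a rough number on that fixed overhead is to fit a straight line to the time per packet as a function of packet size. The sketch below does an ordinary least-squares fit in plain Python on the three rates quoted above; the model (time per packet = fixed + per-byte cost × bytes) and the function name `fit_packet_cost` are my assumptions, and three points are far too few to take the result very seriously.

```python
# Sustained rates measured on the MacBook Pro (from the post): packet bytes -> Hz
MEASURED = {10: 615.0, 12: 533.0, 16: 418.0}


def fit_packet_cost(measured):
    """Least-squares fit of seconds-per-packet = fixed + per_byte * bytes."""
    xs = list(measured.keys())
    ys = [1.0 / hz for hz in measured.values()]  # seconds per packet
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    per_byte = sxy / sxx
    fixed = mean_y - per_byte * mean_x
    return fixed, per_byte


if __name__ == "__main__":
    fixed, per_byte = fit_packet_cost(MEASURED)
    print(f"fixed overhead: {fixed * 1e6:.0f} us per packet")
    print(f"per-byte cost:  {per_byte * 1e6:.0f} us per byte")
    for nbytes, hz in MEASURED.items():
        predicted = 1.0 / (fixed + per_byte * nbytes)
        print(f"{nbytes} bytes: measured {hz:.0f} Hz, model {predicted:.0f} Hz")
```

With these three measurements the fit comes out at roughly a third of a millisecond of fixed overhead per packet plus about an eighth of a millisecond per byte.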

There is another, more fundamental limitation on the PteroDAQ sampling rate—how fast the code can read the analog-to-digital converter and push the bytes into the queue.  That seems to be 6928Hz, at which speed the longest burst we can get without dropping packets should be just under 130ms (it turned out to lose the 819th packet at 118.22ms, so I’m a little off in my estimate).  I determined the max sampling rate by asking for a faster one (10kHz) and seeing what the actual sampling rate was at the beginning of the run, then trying that sampling rate and again checking the achieved sampling rate. With two channels, the maximum sampling rate is only 3593Hz, suggesting that most of the speed limitation is in the conversion time for the analog-to-digital converter.
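The one-channel and two-channel maximum rates are enough to separate a per-channel cost from a fixed per-sample cost, under the assumption that the sample period is simply fixed + (number of channels) × per-channel. The function name below is hypothetical, and the per-channel figure lumps the conversion together with whatever else is done per channel (such as pushing its bytes onto the queue).

```python
def per_channel_cost(rate_1ch_hz, rate_2ch_hz):
    """Split the sample period into fixed and per-channel parts (linear model)."""
    period_1 = 1.0 / rate_1ch_hz  # seconds per sample with one channel
    period_2 = 1.0 / rate_2ch_hz  # seconds per sample with two channels
    per_channel = period_2 - period_1
    fixed = period_1 - per_channel
    return fixed, per_channel


if __name__ == "__main__":
    fixed, per_channel = per_channel_cost(6928, 3593)
    share = per_channel / (fixed + per_channel)
    print(f"per channel: {per_channel * 1e6:.0f} us, fixed: {fixed * 1e6:.0f} us")
    print(f"per-channel share of the one-channel period: {share:.0%}")
```

This comes out at about 134 µs per channel against roughly 10 µs of fixed cost, so over 90% of the single-channel sample period scales with the number of channels.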

The current version of PteroDAQ uses long sample times (so that we can handle fairly high-impedance signal sources) and does hardware averaging of 32 readings (to reduce noise). By sacrificing quality (more noise), we could make the conversions much faster, but that is not a reasonable tradeoff currently, when we are mainly limited by how fast the Python program on the host reads and interprets the input stream from the USB port.  We’ll have to look into Python profiling, to see where the time is being spent and try to speed things up.

2014 June 6

Wrapping up Applied Circuits and PteroDAQ bug fix

Today was the last day of the Applied Circuits class, and the students were turning in their 10th lab report.  It’s all over but the grading (though I expect a few more redone assignments to be turned in on Monday).

Today’s class started with my explaining a bug that was just found in the PteroDAQ code.

Symptoms: On the EKG assignment a number of students ran into trouble in the lab with the PteroDAQ software suddenly switching from normal recording to a sawtooth waveform, then spontaneously switching back after a while.  When plotting the saved data with gnuplot, the timestamps were all messed up in the part of the recording where the sparkline had displayed a sawtooth waveform, but returned to normal when the sparkline did.  We had never seen this behavior on the Macs, nor had we seen it previously on the Windows machines.  The one new thing in this lab is that we were using a higher sampling rate on the Windows machines than we had previously used.

Diagnosis 1: The sawtooth waveform suggested to me that the low-order part of the timestamp was being taken as the signal, and the disordered timestamps as intrusion of the data into the timestamp.  I suspected that the old, slow Windows machines were not able to keep up with the USB serial traffic, and that they were losing a byte and getting a framing error.

I suggested this diagnosis to my son, but he showed me the framing character and checksum used to detect malformed packets, and pointed out that even if a malformed packet passed the checksum, synchronization would be re-established within a packet or two. Although the behavior certainly looks like a framing error, the packets should not go out of frame consistently.

Today, after watching the Shakespeare To Go 50-minute production of Hamlet with me, my son accompanied me to the lab to observe the behavior for himself.  We got the behavior very quickly with a 300Hz sampling rate, and the error message for a checksum failure was not printed. He added a print statement in the code where the data packets were getting their checksums checked, so that we could see what data was coming in. We looked at the transition from good behavior to bad, and noted that the packets were still the right length, but the information in the packet seemed to have been shifted. So there seemed to be a framing error, but not in the USB communication.

Diagnosis 2: This shift indicated that the problem was in the software running on the KL25Z board, rather than in the software running on the Windows host.  The packets to be sent out the USB serial port are stored in a queue in the KL25Z, and I conjectured that the queue was overflowing.  Since a circular buffer is used for the queue, writing when the queue is full would overwrite data for a different packet.  Because the queue length in bytes was not a multiple of the size of the items on the queue, this overwriting would be out of phase, and we would observe periodic switching between in-frame and out-of-frame packets, which is what we were observing.
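The out-of-phase overwriting can be reproduced with a toy model. The sketch below is not the PteroDAQ firmware: `ByteRing` is a hypothetical circular buffer with head and tail indices and no overflow check on push, and each stored byte simply records its position within its packet, so the first byte popped after an overflow shows where the reader's frames now start.

```python
class ByteRing:
    """Toy circular buffer with NO overflow check on push (the bug)."""

    def __init__(self, size):
        self.buf = [0] * size
        self.head = 0  # index of the next write
        self.tail = 0  # index of the next read

    def avail(self):
        # Unread byte count as the indices see it; wraps when the writer laps.
        return (self.head - self.tail) % len(self.buf)

    def push_unchecked(self, data):
        for b in data:
            self.buf[self.head] = b
            self.head = (self.head + 1) % len(self.buf)

    def pop(self, n):
        out = []
        for _ in range(n):
            out.append(self.buf[self.tail])
            self.tail = (self.tail + 1) % len(self.buf)
        return out


def frame_offsets_after_stalls(size, packet, stalls):
    """Frame offset seen by the reader after each host stall (0 = in frame)."""
    ring = ByteRing(size)
    payload = list(range(packet))  # byte value = position within the packet
    offsets = []
    for _ in range(stalls):
        unread = ring.avail()  # bytes genuinely waiting before the stall
        while unread < size:  # host stalled: the writer laps the reader once
            ring.push_unchecked(payload)
            unread += packet
        if ring.avail() < packet:  # host resumes once a whole packet seems ready
            ring.push_unchecked(payload)
        offsets.append(ring.pop(packet)[0])
    return offsets


if __name__ == "__main__":
    print(frame_offsets_after_stalls(size=8192, packet=10, stalls=5))
```

With an 8192-byte buffer and 10-byte packets, each overflow moves the frame by 2 bytes (offsets 2, 4, 6, 8, then 0), so in this model the packets drift out of frame and return to it after five overflows.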

We checked the code, and found that the test for sufficient room in the buffer for another packet had not been included in the routine for adding a new event to the queue.  The code to check the remaining space had been written, but not used yet.  A fifteen-minute bug fix, while I went to my office to get my materials for my class, confirmed that this had indeed been the problem.  He will be pushing the fix to https://bitbucket.org/abe_k/pterodaq this evening.  The fix consists of discarding any event for which there is insufficient room in the queue.  Some loss of data is unavoidable when the queue fills up, and this delays the loss as much as possible, as well as minimizing how many events are discarded.  Since the events are timestamped, losing an event will often have little or no consequence for downstream analysis or plotting.
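The drop-whole-events policy is easy to model. The class below is a Python illustration with made-up names (`PacketQueue`, `push`, `pop`, `dropped`), not the code that runs on the KL25Z; it only tracks a byte budget and refuses any packet that would not fit entirely.

```python
from collections import deque


class PacketQueue:
    """Byte-budgeted FIFO that refuses whole packets when full (hypothetical model)."""

    def __init__(self, capacity_bytes=8192):
        self.capacity = capacity_bytes
        self.used = 0  # bytes currently queued
        self.packets = deque()
        self.dropped = 0  # packets discarded for lack of room

    def push(self, packet):
        """Queue the packet only if all of it fits; otherwise drop it whole."""
        if self.used + len(packet) > self.capacity:
            self.dropped += 1
            return False
        self.packets.append(packet)
        self.used += len(packet)
        return True

    def pop(self):
        """Remove and return the oldest packet, or None if the queue is empty."""
        if not self.packets:
            return None
        packet = self.packets.popleft()
        self.used -= len(packet)
        return packet


if __name__ == "__main__":
    q = PacketQueue()
    accepted = sum(q.push(bytes(10)) for _ in range(1000))
    print(accepted, "accepted,", q.dropped, "dropped")
    q.pop()  # the host reads one packet, making room again
    print("next push accepted:", q.push(bytes(10)))
```

Because a packet is either queued completely or not at all, whatever comes out of the queue stays in frame, and the only symptom of overflow is a gap in the timestamps.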

The first part of the class was explaining the bug to the students, which I had to do in a rather handwavy fashion, since most of the class has not had programming, much less queues and circular buffers.

We then talked about a variety of different things: deadlines for redos, when the t-shirts would be available, instructor feedback forms to fill out online, differences between bioengineering programs at the various UCs, what the job market was like for bioengineers, and other general advising sorts of things. I also reviewed some of the goals of the course in terms of engineering thinking, tinkering vs. engineering by design, the usefulness of models that aren’t perfectly accurate, trying to switch them from answer-getting to keeping in mind the meaning of what they were doing, sanity checks, methodical work and lab notebook records, and that the real world trumps any models (“try it and see!”).

Incidentally, this year’s t-shirts have that motto on the shirts:

2014 Applied Circuits t-shirt has the design only on the front, and has the class slogan above the cyberslug.


When the students finally ran out of questions, I had about 10 minutes left, so I introduced them to the Wien-bridge oscillator, first by reminding them of the bridge-null condition (which they almost remembered from the quiz it had been on), then deriving that \omega RC =1 and R_{2} = 2 R_{1}. I even had time to tell them about how the amplitude of the original oscillators was regulated by having a light bulb as a non-linear resistor in place of R_{1}.
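For reference, the derivation runs roughly as follows. I am assuming the usual arrangement: equal R and C in both reactive arms (series RC on top, parallel RC to ground), and a resistive divider in which R_{1} goes to ground and R_{2} is the feedback resistor. The labels on the board in class may have differed.

```latex
\begin{align*}
Z_s &= R + \frac{1}{j\omega C}, \qquad
Z_p = \frac{R}{1 + j\omega RC},\\
\frac{Z_s + Z_p}{Z_p} &= 1 + \left(1 + \frac{1}{j\omega RC}\right)\left(1 + j\omega RC\right)
  = 3 + j\left(\omega RC - \frac{1}{\omega RC}\right),\\
\text{so the ratio is real only when } \omega RC &= 1,
  \text{ where } \frac{Z_p}{Z_s + Z_p} = \frac{1}{3},\\
\text{and the bridge nulls when } \frac{R_1}{R_1 + R_2} &= \frac{1}{3}
  \;\Longrightarrow\; R_2 = 2R_1 .
\end{align*}
```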

After class I attempted to help one of the students get PteroDAQ working with his laptop, but before he had to go to another class we got only as far as determining that the drivers for the USB serial device on the KL25Z board were missing and that the Windows 7 driver installation did not fix the problem. My son and I will have to look for a Windows 8 driver installation that will set things up properly for the KL25Z board. This might be tricky, as we have no access to a Windows 8 machine to test putative solutions on.
