Gas station without pumps

2015 July 11

Improving PteroDAQ

Filed under: Circuits course,Data acquisition — gasstationwithoutpumps @ 19:53

For the past couple of years, I’ve been using the PteroDAQ data acquisition system that my son wrote for the KL25Z board (a second-generation system, replacing the earlier Arduino Data Logger he wrote).  He has been working on and off on a multi-platform version of PteroDAQ for over a year, and I finally asked him to hand the project over to me to complete, as I want it much more than he does (he’d prefer to spend his time working on new products for his start-up company, Futuristic Lights).

It has been a while since he worked on the code, and it was inadequately documented, so we’ve been spending some time together just digging into the code to figure out the interfaces, adding comments and docstrings as we go, so that I can debug the code.  Other than the lack of documentation, the code is fairly well written, so figuring out what is going on is not too bad (except in the GUI built using tkinter—like all GUI code, it is a complicated mess, because the APIs for building GUIs are a complicated mess).

The main goal of his multi-platform code was to support Arduino ATMega-based boards and the KL25Z board, while making it relatively easy to add more boards in the future.  The Arduino code is compiled and downloaded by the standard Arduino IDE, while the KL25Z board code is compiled and downloaded with the MBED online compiler.  He has set up the software with appropriate #ifdef checks, so that the same code files can be compiled for either architecture.  The knowledge of what features are available on each board and how to request them is stored in one module of the Python program that runs on the host computer.  As part of the cleaning up, we’ve been moving some of the code around to make sure that all the board-specific stuff is in that file, with a fairly clean interface.

He believed that he had gotten the new version working on the Arduino boards, but not on the KL25Z board.  It turned out that was almost true—I got the system working with the Leonardo board (which uses USB communication and has a weird way to get reset) almost immediately, but had to do a number of little bug fixes to get it working with other Arduino boards (which use UART protocol and a different way of resetting). It turned out that the system also worked with the KL25Z after those bug fixes, so he was very close to having it ready to turn over to me.

One of the first things I did was to time how fast the boards would run with the new code—the hope was that the cleaner implementation would have less overhead and so support a higher sampling rate.  Initial results on the Arduino boards were good, but quite disappointing on the KL25Z boards, which use a faster processor and so should be able to support much higher speeds.  We tracked the problem down to the very high per-packet overhead of the USB packets in the mbed-provided USBSerial code.  (He had tried writing his own USB stack for bare-metal ARM, but had never gotten it to work, so we were using the mbed code despite its high overhead.)

There was a simple fix for the speed problem: we stopped sending single-character packets and started using the ability of the MBED code (and the Arduino code for the Leonardo) to send multi-character packets.  With this change, we got much better sampling rates (approximate max rate):

channels    Leonardo   Uno/Duemilanove/Redboard   KL25Z, 32× avg   KL25Z, 1× avg
0           13kHz      4.5kHz                     8.2kHz           8.2kHz
1 analog    5kHz       2.7kHz                     5.5kHz           8.1kHz
2 analog    3kHz       1.9kHz                     3kHz             8.1kHz
7 digital   6.5kHz     3.4kHz                     8.2kHz           8.2kHz

(With no analog channels sampled, the hardware-averaging setting doesn’t matter, so the KL25Z rate is the same in both columns for the 0-channel and 7-digital rows.)

It is interesting that the Leonardo (a much slower processor) manages to get a higher data rate than the KL25Z when sending just time stamps.  I think that I can get another factor of 3 or 4 speed on the KL25Z by flushing the packets less often, though, so I’ll try that.

By flushing only when needed, I managed to improve the KL25Z performance to

channels    KL25Z, 32× avg   KL25Z, 1× avg
0           17kHz            17kHz
1 analog    6.3kHz           10kHz??
2 analog    3.3kHz

(With no analog channels sampled, the averaging setting doesn’t matter, so the 0-channel rate applies in both columns.)

Things get a bit hard to measure above 10kHz, because the board runs successfully for several hundred thousand samples, then I start losing characters and getting bad packets. The failure mode using my son’s faster Linux box is different: we lose full packets when going too fast—which is what PteroDAQ is supposed to do—and the speed at which the failure starts happening is much higher (maybe 23kHz). In other words, what I’m seeing now are the limitations of the Python program on my old MacBook Pro. It does bother me that the Mac seems to be quietly dropping characters when the Python program can’t clear the USB serial input fast enough.

The KL25Z slows down when doing 32× hardware averaging, because the analog-to-digital conversion is slow and the averaged conversions are done back to back.  I think that we’ve currently set things up for a 6MHz ADC clock, with short sampling times, which means that a single-ended 32× 16-bit conversion takes around 134µs and the sampling rate is limited by the conversion time (differential measurements are slower, around 182µs).
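
A rough sanity check of those numbers, assuming (my assumption, not a figure from this post) that a single 16-bit conversion costs on the order of 25 ADC-clock cycles and that 32× hardware averaging simply repeats the conversion 32 times:

```python
# Back-of-the-envelope check of the ~134us figure.
# Assumption (not from the post): ~25 ADC-clock cycles per 16-bit
# conversion; 32x hardware averaging repeats the conversion 32 times.
adc_clock_hz = 6e6            # 6 MHz ADC clock, as configured
cycles_per_conversion = 25    # assumed, order-of-magnitude only
averages = 32

total_cycles = cycles_per_conversion * averages           # 800 cycles
conversion_time_us = total_cycles / adc_clock_hz * 1e6    # ~133 us

# The conversion time alone caps the sampling rate:
max_rate_hz = 1e6 / conversion_time_us                    # ~7.5 kHz
```

With those assumptions the arithmetic lands close to the measured 134µs, and the implied ceiling of about 7.5kHz is consistent with the 32× rates topping out well below the 1× rates.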

There is a problem in the current version of the code, in that interrupts that take longer to service than the sampling period result in PteroDAQ lying about its sampling rate.  I can fix this on the KL25Z by using a separate timer, but the Arduino boards have rather limited timer resources, and we may just have to live with it on them.  At least I should add an error flag that indicates when the sampling rate is higher than the board can handle.

We had a lot of trouble yesterday with using the bandgap reference to set the voltage levels.  It turns out that on the Arduino boards, the bandgap channel is a very high impedance, and it takes many conversion times before the conversion settles to the final value (nominally 1.1V).  Switching channels and then reading the bandgap is nearly useless—the MUX has to be left on the bandgap for a long time before reading the value means anything.  If you read several bandgap values in quick succession, you can see the values decaying gradually from the value of the previously read channel to the 1.1V reference.
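
The settling behavior reads like simple exponential decay toward the reference.  A toy model of what the quick-succession reads look like (the per-read decay fraction here is invented for illustration, not measured):

```python
# Toy model of the settling described above: each successive read of the
# high-impedance bandgap channel decays exponentially from the previously
# selected channel's voltage toward the 1.1 V reference.  The per-read
# decay fraction (0.5) is invented for illustration, not measured.
V_BANDGAP = 1.1    # nominal bandgap voltage
v = 3.3            # voltage of the channel read just before switching
decay = 0.5        # assumed fraction of the error remaining after a read

readings = []
for _ in range(8):
    v = V_BANDGAP + (v - V_BANDGAP) * decay
    readings.append(round(v, 3))

# The first reads are dominated by the previous channel's voltage and
# only gradually approach 1.1 V.
```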

The bandgap on the KL25Z is not such a high-impedance source, but there is some strange behavior when reading it with only 1× averaging—some values seem not to occur and the distribution is far from gaussian.  I recorded several thousand measurements with 1×, 4×, 8×, 16×, and 32× averaging:

The unaveraged (1×) reading seems to be somewhat higher than any of the hardware-averaged ones.

I was curious about how the noise reduced on further averaging, and what the distribution was for each of the averaging levels. I plotted log histograms (using kernel-density estimates of the probability density function: gaussian_kde from the scipy python package) of the PteroDAQ-measured bandgap voltages.  The PteroDAQ is not really calibrated—the voltage reference is read 64 times with 32× averaging and the average of those 64 values taken to be 1V,  but the data sheet says that the  bandgap could be as much as 3% off (that’s better than the 10% error allowed on the ATMega chips).
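
The smoothing itself is only a few lines; here is a minimal sketch using synthetic stand-in data (a Gaussian around 1V) rather than actual PteroDAQ readings:

```python
# Minimal sketch of the smoothed log-histogram, using synthetic stand-in
# data (a Gaussian around 1 V) rather than actual PteroDAQ readings.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
readings = 1.0 + 0.002 * rng.standard_normal(5000)

kde = gaussian_kde(readings)         # kernel-density estimate of the PDF
xs = np.linspace(readings.min(), readings.max(), 400)
log_density = np.log10(kde(xs))      # log scale keeps the tails visible

# Plotting (the curve plus a rug of the raw values) would go here, e.g.
# with matplotlib; omitted to keep the sketch minimal.
```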

Without averaging, there is a curious pattern of missing values, which may be even more visible in the rug plot at the bottom than in the log histogram.

The smoothed log-histogram doesn’t show the clumping of values that is more visible in the rug plot.

With eight averages, the distribution begins to look normal, but there is still clumping of values.

With 16 averages, things look pretty good, but the mode is still a bit offset from the mean.

Averaging 32 values seems to have gotten an almost normal distribution.

Interestingly, though the range of values reduces with each successive averaging, the standard deviation does not drop as much as I would have expected (namely, that averaging 32 values would reduce the standard deviation to about 18% of the standard deviation of a single value). Actually, I knew ahead of time that I wouldn’t see that much reduction, since the data sheet shows the effective number of bits only increasing by 0.75 bits from 4× to 32×, while an 8-fold increase in independent reads would be an increase in effective number of bits of 1.5 bits.  The problem, of course, is that the hardware averaging is of reads one right after another, in which the noise is pretty highly correlated.
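
The expected-versus-observed arithmetic in one place:

```python
import math

# Ideal averaging: the standard deviation of N independent reads scales
# as 1/sqrt(N).
ideal_std_ratio_32 = 1 / math.sqrt(32)   # ~0.177, the "about 18%" above

# Each doubling of independent reads adds half an effective bit, so an
# 8-fold increase (4x -> 32x) should add log2(sqrt(8)) = 1.5 bits.
ideal_enob_gain = math.log2(math.sqrt(8))

# The datasheet reports only ~0.75 bits gained from 4x to 32x: the
# back-to-back hardware reads are correlated, not independent.
```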

I think that the sweet spot for averaging is the 4× level—almost as clean as 32×, but 8 times faster.  More averaging improves the shape of the distribution a little, but doesn’t reduce the standard deviation by very much.  Of course, if one has a low-frequency signal with high-frequency noise, then heavier averaging might be worthwhile, but it would probably be better to sample faster with the 4× hardware averaging, and use a digital filter to remove the higher frequencies.

The weird distribution of values for the single read is not a property of the bandgap reference, but of the converter.  I made a voltage divider with a couple of resistors to get a voltage that was a fixed ratio of the supply voltage (so should give a constant reading), and saw a similar weird distribution of values:

The distribution of single reads is far from a normal  noise distribution, with fat tails on the distribution and clumping of values.

With 32× sampling, the mean is 1.31556 and the standard deviation 5.038E-04, with an excellent fit to a Gaussian distribution.


  1. Bitbucket tells me:

    You do not have access to the wiki.

    Use the links at the top to get back.

    On to the code…

    I see from the code that you know about struct. I would have thought that struct would be a faster, clearer, and more portable way to handle the bytes/int conversions.
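
    For example (the packet layout here is invented for illustration; the real PteroDAQ layout may differ):

```python
import struct

# Invented example layout (not the actual PteroDAQ packet): a little-endian
# 4-byte timestamp followed by a 2-byte ADC reading.
packet = bytes([0x78, 0x56, 0x34, 0x12, 0xCD, 0xAB])

timestamp, reading = struct.unpack('<IH', packet)   # one call does both

# The equivalent hand-rolled conversion, for comparison:
ts_manual = packet[0] | packet[1] << 8 | packet[2] << 16 | packet[3] << 24
assert ts_manual == timestamp == 0x12345678
```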

    While you might find it worthwhile to run it under the profiler to see what’s really slowest, my guess is that the _readin method is your culprit. It’s typically slow to read a single byte at a time. Supporting that guess: last I knew, syscalls are substantially faster under Linux than OS X.

    If you change your protocol so that each packet starts with a fixed-size record that includes the length of everything remaining, you can do everything in two read() invocations, which will likely be faster. Something like:

    c, ln = rd(2)
    if c == b'!':
        data = rd(asint(ln) + 2)
        cm = data.pop(0)
        chk = data.pop(-1)
    elif c == b'*':
        ...

    But then you would need to deal with the fact that reading the length along with the sync character could leave you out of sync, which would require more special-case code.

    More of a change to the code structure would be my preference, and it would not require a protocol change: do a non-blocking read of whatever data is available, append it to a buffer, then pull commands out of that buffer; if enough data isn’t found, leave the partial data in the buffer. But because you are using USB, and all your data is in a USB packet that would never be only partially read at the OS level, you will almost certainly never see the partial-data condition. So if you are willing to accept an occasional lost packet in theory (as I understand you already are), you could do a non-blocking read into a buffer, parse valid packets out of the already-read data, and in the unlikely case of a partial packet just drop that packet and pick up with the next one as you regain sync, “eating” characters until you hit a sync character.
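
    A sketch of that buffer-and-resync loop, with an invented framing (a sync byte, one length byte, the payload, and one checksum byte; the real PteroDAQ framing surely differs):

```python
# Sketch of the buffer-and-resync loop described above, with an invented
# framing: b'!' sync byte, one length byte, payload, one checksum byte
# (sum of payload mod 256).  The real PteroDAQ framing may differ.
SYNC = ord('!')

def parse_packets(buf):
    """Extract complete, checksum-valid payloads from buf, in place."""
    found = []
    while True:
        # Regain sync: eat bytes until the next sync character.
        while buf and buf[0] != SYNC:
            buf.pop(0)
        if len(buf) < 2:
            break                      # need at least sync + length
        length = buf[1]
        if len(buf) < 2 + length + 1:
            break                      # partial packet: leave it buffered
        payload = bytes(buf[2:2 + length])
        checksum = buf[2 + length]
        del buf[:2 + length + 1]
        if checksum == sum(payload) & 0xFF:
            found.append(payload)
        # else: corrupt packet dropped; the loop resyncs automatically
    return found

# One whole packet (preceded by line noise) plus the start of another:
buf = bytearray(b'\x00!\x03abc' + bytes([sum(b'abc') & 0xFF]) + b'!')
packets = parse_packets(buf)           # complete packet out, partial kept
```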

    Comment by Michael K Johnson — 2015 July 13 @ 03:52 | Reply

  2. I’ll ask my son (whose bitbucket site it is) to check into the wiki permissions.

    The new code is substantially faster than the old code that is on the bitbucket site, and I expect to push the new code up to the site later this week. I want to try testing on a Windows machine before pushing the new beta release.

    I’ll look into speeding up the Python code by reading more than a byte at a time, but the PySerial class is already buffering for me, so I doubt that there are any gains to be made by buffering again.

    The new code does USB packets for higher speed, but we can end up with partial data, as at high speed we send 63-byte USB packets and don’t flush at the end of every data packet. (For some reason we ran into problems with 64-byte USB serial packets—possibly because of our limited understanding of USB, possibly because of bugs in MBED or Arduino USB stacks).

    While we’re willing to lose data packets if necessary at high speed, I’d rather not lose any that have already been sent over the USB cable—that should be our bottleneck. On my old MacBook Pro, I start losing bytes (screwing up the frame synchronization) after about 500,000 single-analog-channel packets at 14kHz. The bottleneck is the Python program, which is running at over 100% CPU (mostly in the data reading and parsing, as the new sparkline updating only takes about 10% of the CPU when the data rate is low). I think it is a bug in the Mac OS that it throws away bytes after reading them, rather than stopping reading when the buffer is full—running the new PteroDAQ under Linux does not display this behavior, but cleanly discards packets before queuing them for USB transmission.

    On the non-32U4 Arduino boards, we are doing byte-at-a-time transmissions through the UART/USB serial, currently at 500,000 baud. We may try pushing that up to 1 Mbaud, which seems to be the fastest rate others have gotten the UART/USB to work with the Arduino boards.

    Comment by gasstationwithoutpumps — 2015 July 13 @ 08:32 | Reply

    • Oh, so PySerial is already doing the buffering I suggest. Ignore that then!

      Comment by Michael K Johnson — 2015 July 14 @ 02:52 | Reply

      • The Wiki now has better permissions.

        My son looked at the PySerial implementation, and PySerial does not seem to be doing the buffering we thought. We’re looking into implementing our own USB serial interface, particularly since we have had some difficulty with PySerial under Python 2.7 on Mac OS 10.6.8 not opening the ports—not timing out, just freezing.

        Pushing the UART-using Arduinos from 500,000 to 1,000,000 baud did not seem to improve throughput—I think that we are hitting the processor limits on those boards rather than the communication limits. With the newest version of the software under Python 3.4 (including my son’s first try at our own USB serial library on the host—still no buffering), we’re still getting only 2650Hz doing a single analog channel on a Duemilanove board (at 2660Hz, we start dropping packets, though not losing bytes). We’re getting 3350Hz for 8 digital channels (the A-to-D conversion is slow). On the Leonardo (with real USB, no UART), the single analog channel can be taken up to 5375Hz.
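
        A rough link-budget check supports the processor-limit conclusion, assuming 10 bits on the wire per byte (8N1 framing) and a single-analog-channel packet of around 7 bytes (an assumed size, not the actual PteroDAQ packet length):

```python
# Link-budget check for the UART-based Arduino boards.  Assumptions
# (mine, not from the measurements): 10 bits on the wire per byte (8N1
# framing), and ~7 bytes per single-analog-channel data packet.
baud = 500_000
bytes_per_second = baud / 10        # 50,000 bytes/s
packet_bytes = 7                    # assumed packet size

link_limit_hz = bytes_per_second / packet_bytes   # ~7.1 kHz ceiling

# The observed ~2650 Hz is well under this ceiling, so the UART link is
# not the bottleneck; the processor is.
```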

        With the KL25Z board (single 4× analog channel), I can run for 30 seconds at 15kHz, but even at 8kHz I start losing packets after about 700,000 of them, and at 7kHz I lost about 6 bytes out of 3 million packets (each byte loss causing a 7-packet loss until the communication managed to resync). Even at only 6kHz I lost 4 bytes (at a cost of about 7 packets each) in 2.8 million packets. This bottleneck is clearly on the host end—on a faster Linux laptop, the KL25Z board can run over 20kHz, I believe, though we haven’t run long enough tests on my son’s laptop to see what the eventual limit is and whether the Linux drivers eventually lose bytes the way the Mac OS 10.6.8 ones do.

        I’m now wondering whether the bottleneck on my laptop is the overhead of the OS calls or the Python memory management for bytes and strings—the two possibilities call for different solutions. If the limitation were in the OS calls, then I’d expect a fairly abrupt transition between good and bad behavior, with the buffer overflow on the Mac happening quickly at high speed and never at low speed. But if the problem is that garbage collection gets triggered and the Python program listens a lot less for a while, then I’d expect the time before failures start to scale roughly with 1/(sampling rate), since garbage collection gets triggered only after enough trash has accumulated.

        Comment by gasstationwithoutpumps — 2015 July 14 @ 08:19 | Reply

          • Correction: the burst losses seem to be around 50–150 packets lost in a burst, not just 7. On a newer Mac (a desktop iMac running OS 10.7), I’m not seeing the loss of synchronization, but packets are cleanly dropped. With 1× averaging on the KL25Z, 25kHz can run for a while (400k to 1.4M packets), but 25.5kHz fails fast, and even 15kHz drops packets after 3M packets. With 32× averaging, 6276Hz fails quickly, 6275Hz after 300k packets, 6270Hz after about 630k packets, and 6200Hz is OK after 1M packets. With 4× averaging, 22kHz fails fast, and 19kHz–21kHz fail after about 2M packets.

          Comment by gasstationwithoutpumps — 2015 July 14 @ 10:01 | Reply

        • I confirm I can see the wiki now.

          Do you know how to run python code under the profiler? It slows it down substantially, but can help figure out what is running slow. “python -m cProfile” will print out voluminous stats at the end of the run; I’ve used it successfully to pinpoint performance problems.

          Garbage collection in CPython is only used for reaping objects in reference loops; as long as you don’t create those you end up with reference counting responsible for freeing objects when they go out of scope without any garbage collection. If you have reference loops, you can use weakref to break the loops.

          Last I knew, Linux system calls were tremendously faster than OS X.

          Comment by Michael K Johnson — 2015 July 14 @ 15:31 | Reply

            We had already tried python -m cProfile, and found it pretty useless on multi-threaded code that has everything invoked by the GUI. Everything ended up being attributed to the procedure that created the GUI objects, and not to the routines that were called by the various callbacks.

            Comment by gasstationwithoutpumps — 2015 July 14 @ 22:35 | Reply

          • Hitting the comment reply nesting limit…

            You could add an

                if __name__ == "__main__":

            block that doesn’t spin up a new thread and instead just reads packets and then throws them away, and then profile that, in order to avoid the GUI poisoning the profiling data.
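
            Something along these lines, sketched with a stand-in reader object since I don’t know the comm module’s actual interface:

```python
# Headless profiling harness along the lines suggested above: read and
# discard packets with no GUI and no extra thread, under cProfile.  The
# reader object and its read() method are hypothetical stand-ins for
# PteroDAQ's comm module.
import cProfile
import io
import pstats

def read_and_discard(reader, n_packets):
    """Pull n_packets from the reader and throw them away."""
    for _ in range(n_packets):
        reader.read()

class FakeReader:
    """Stand-in for the real serial reader, so the sketch is runnable."""
    def read(self):
        return b'!\x02\x00\x00'

profiler = cProfile.Profile()
profiler.enable()
read_and_discard(FakeReader(), 10_000)
profiler.disable()

# Report the top entries by cumulative time, GUI-free.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats('cumulative').print_stats(5)
stats_text = out.getvalue()
```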

            Comment by Michael K Johnson — 2015 July 15 @ 03:49 | Reply

            • So you are suggesting building a command-line interface to the comm module (or, perhaps more importantly, the core module) that has no gui, but just sets up the communication with the board and processes packets. That may be worth trying, though it won’t tell us how much of the processing time is spent on the gui (which was a huge part of the bottleneck in v0.1 of PteroDAQ, because we had a terrible way of implementing the sparklines). I think that we’ve reduced the gui to a tiny fraction of the time, though, as it does the same amount of work at low sampling rates as high sampling rates, and the whole program uses up little CPU at low sampling rates.

              comm creates threads for reading from the USB serial port, so we still might not see whether the problems are in reading the data and checking checksums in CommPort._readin() or in parsing the data in DataAcquisition._parsedata(), but that is the main question we’re interested in currently.

              Comment by gasstationwithoutpumps — 2015 July 15 @ 09:33 | Reply
