# Gas station without pumps

## 2015 October 2

### What is probability?

Filed under: Uncategorized — gasstationwithoutpumps @ 17:43

Today’s class in Bioinformatics: Models and Algorithms went fairly well.

I started by collecting the first assignment and asking if there were any questions, about anything in the class.  I used a bit longer wait time than I usually use for that prompt, and was rewarded by having a reasonable question asked by someone who had clearly been hesitant to ask.  I’ve been finding that “Wait Time” is one of the most powerful techniques in Teach Like a Champion.

The first part of class was just a quick summary of DNA sequencing techniques, with an emphasis on the effect the different technologies had on the sequence data collected (read lengths, number of reads, error models).  Many of the students had already had much more extensive coverage of DNA sequencing elsewhere (there is an undergraduate course about half of which is sequencing technology), and several students were able to volunteer more up-to-date information about library preparation than I have (since they worked directly with the stuff in wet labs).

I reserved the last 15 minutes of the class for a simple question that I asked the students: “What is probability?”

I managed to elicit many concepts related to probability, which I affirmed and talked about briefly, even though they weren’t directly part of the definition of probability.  This included things like “frequency”, “observable data”, and “randomness”. One volunteered concept that I put off for later was “conditional probability”—we need to get a definition of probability before we can deal with conditional probability.

Somewhat surprising this year was that the first concept volunteered was that we needed an “event space”.  That is usually the hardest concept to lead students to, so I was surprised that it came up first.  It took a while to get someone to bring up the concept of “number”—that probabilities are numeric.  Someone also came up with the idea “the event space equals 1”, which I pressed them to make more precise and rigorous; it quickly became that the sum of the probabilities of events is 1.  I snuck in that probabilities of events meant a function (I usually press the students to come up with the word “function”, but we were running low on time), and got them to give me the range of the function.

Someone in the class volunteered that summation only worked for discrete event spaces and that integration was needed for continuous ones (the same person who had brought up event spaces initially—clearly someone who had paid attention in a probability class—possibly as a math major, since people who just regard probability as a tool to apply rarely remember or think about the definitions).

So by the end of the class we had a definition of probability that is good enough for this course:

• A function from an event space to [0,1]
• that sums (or integrates) to 1.
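The two-part definition can be checked mechanically for a discrete event space; here is a minimal sketch, assuming a fair six-sided die as the example event space:

```python
# A probability "function" for a discrete event space is just a mapping
# from events to [0,1] whose values sum to 1.  Example event space
# (assumed here for illustration): the six faces of a fair die.
from fractions import Fraction

die = {face: Fraction(1, 6) for face in range(1, 7)}

assert all(0 <= p <= 1 for p in die.values())   # range is [0,1]
assert sum(die.values()) == 1                    # sums to 1
```

Note that nothing in the check involves frequency or randomness—only a function and a summation constraint.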

I did not have time to point out that this definition does not inherently have any notion of frequency, observable data, or randomness.  Those all come up in applications of probability, not in the underlying mathematical definition.  One of my main purposes in asking a simple definitional question (material that should have been coming from prerequisite courses) was to get broad student participation, setting the expectation that everyone contributes.  I think I got about 50% of the students saying something in class today, and I’ll try to get the ones who didn’t speak to say something on Monday.  Unfortunately, I only know about 2 names out of the 19 students, and it takes me forever to learn names, so I may have to resort to random cold calling from the class list.

In retrospect, I wish I had spent 5–10 minutes less time on DNA sequencing, so that there was a bit more time to go into probability, but it won’t hurt to review the definition of probability on Monday.

## 2014 March 6

### Guessing correction is not guessing penalty

Filed under: Uncategorized — gasstationwithoutpumps @ 06:19

It bothers me that even the New York Times, in an otherwise excellent article, repeats a common misperception:

Students were docked one-quarter point for every multiple-choice question they got wrong, requiring a time-consuming risk analysis to determine which questions to answer and which to leave blank.

The guessing correction on the SAT meant that you did not have to do any “time-consuming risk analysis”—if you knew nothing about a question, it did not matter whether you guessed or left the question blank—your expected value was zero either way.

Under the no-guessing-correction system, students are forced to guess, or give up 1/5th of a point for each question left blank. Thus the new system rewards gamblers and punishes people who are cautious—rewarding a character trait that should have no influence on test scores.
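The expected values behind both claims are a one-line computation each; a quick sketch, assuming a standard 5-choice question:

```python
from fractions import Fraction

p_right = Fraction(1, 5)   # chance a blind guess on a 5-choice question is right

# Old SAT scoring: +1 for a correct answer, -1/4 for a wrong one, 0 for a blank.
old_guess = p_right * 1 + (1 - p_right) * Fraction(-1, 4)
assert old_guess == 0      # guessing and leaving blank have the same expectation

# New scoring: +1 for a correct answer, 0 for wrong or blank.
new_guess = p_right * 1 + (1 - p_right) * 0
assert new_guess == Fraction(1, 5)   # a blank gives up 1/5 point on average
```

So under the old scoring no risk analysis was needed, and under the new scoring a blank is strictly worse than a random guess.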

Under the old system, you could guess randomly on everything you didn’t know if you wanted to—it wouldn’t hurt on average.  Under the new system, you are forced to.  Forced guessing will also penalize those who aren’t aware of time running out, and are forced to “put down their pencils” before getting a chance to randomly bubble the remaining questions. It will also penalize students who would formerly skip difficult questions, then come back to the blank ones later if they had time—leaving a question blank will come with a penalty that wasn’t there before.

I suppose it is too much for me to expect innumerate reporters (even for the NY Times) to know enough probability to understand expected values, but it saddens me that the College Board, in search of higher market share, is giving up on one of the things that they previously were doing correctly.

## 2013 July 20

### Plotting histograms

Filed under: Uncategorized — gasstationwithoutpumps @ 17:10

In any sort of experimental work, I or my students end up plotting a lot of histograms (of alignment scores, cost functions, segment lengths, … ). What I usually want to see is a probability density function (so the scaling is independent of the number of data points sampled or the bin sizes used). Most of the students end up using some crude built-in histogram plotter (in R or Matplotlib), that ends up with difficult-to-interpret axes and bad bin sizes.

I spent a couple of days experimenting with different approaches to making a Python module that can convert a generic list of numbers into a list of points to plot with gnuplot or Matplotlib. I ended up with 3 different things to control: the number of bins to use, how wide to make each bin, and whether to plot the estimated density function as a step function or linearly interpolated.

If I have n samples, I set the default number of bins to $\mbox{num\_bins} = \sqrt{n} +1$, which seems to give a good tradeoff between having so few bins that resolution is lost and so many that the shape of the distribution is buried in the sampling noise. The “+1” is just to ensure that truncating the estimate never results in 0 bins.

Most of my experimenting was in adjusting the bin widths. I came up with three main approaches, and one minor variant:

fixed-width
This is the standard histogram technique, where the full range of values is split into num_bins equal intervals, and the instances of numbers are counted in the corresponding bins. This approach is very simple to program and very fast, as there is no need to sort the data (if the range is known), and bin selection is a trivial subscript computation. Since I’m projecting the range out a little from the largest and smallest numbers (based on the second largest and second smallest), I ended up sorting anyway. The fixed-width approach is very good for discrete distributions, as it can have lots of bins with 0 counts.

fixed-count
The fixed-count approach tries to make each bin have the same number of samples in it. The intent here is to have finer resolution where there are a lot of samples, and coarser resolution where there are few samples. I implemented this approach by sorting the numbers and setting thresholds in a single sweep across the data. The fixed-count approach gives good resolution at the peaks of the density, but gets very coarse where the density is low. It does not leave any empty bins, so is not as good for discrete distributions as the fixed-width approach.

tapered
The tapered approach is like the fixed-count approach, except that the desired counts taper off in the first and last few bins of the range. This was a rather clumsy attempt to get better resolution in the tails of a distribution.

fixed-area
This approach is a compromise between the other two, trying to keep the product of the number of counts and the bin width roughly constant. I again implemented this by doing a sweep across the sorted numbers. The fixed-area approach provides a useful compromise, giving reasonable resolution both at the peaks and on the tails of continuous distributions, but (like the fixed-count method) does not handle discrete distributions very well, since the density estimate can’t go to zero inside the range of the data.

I made up some test data and tried the different approaches on a number of test cases. Here are a couple of illustrative plots using 3000 points: 1000 each from 2 Gaussian distributions and 1000 from a Weibull (extreme-value) distribution:

Plot of the real density function and the reconstructed ones using each of the approaches to setting bin boundaries. I used a log scale on the y axis so that the tails of the distribution could be analyzed (something I often do when looking at a distribution to estimate p-values). Note that the fixed-width bins get very noisy once the expected number of counts per bin gets below 1, and the fixed-count method has an extremely coarse estimate for the right-hand tail, but the fixed-area estimate is pretty good.

Only the fixed-width estimate drops off as fast as the narrow Gaussian peaks, but it goes all the way to 0. If we used pseudocounts to prevent estimates of 0 probability, then the fixed-width method would have a minimum value, determined by the pseudocount (a pseudocount of 1 would give a minimum density of about 1.6E-3 for this data, about where the single-count bins on the right-hand tail are).

Here is a detailed look at the narrow central Gaussian peak. Because I’m interested in the peak here, rather than the tail, I used a linear scale on the y axis. Note that the fixed-width bins are too wide here, broadening the peak and making it difficult to see as a Gaussian. The fixed-count methods have a very fine resolution—too fine really, as they are picking up a lot of sampling noise. The fixed-area method seems to do the cleanest job of picking up the shape of the peak without excessive noise.

I would release the code on my web site at work, except that we had yet another power failure on campus last night (Friday night seems to be the popular time for power failures and server failures), and the file server where I plan to keep the file will not be rebooted until Monday.

Update 2013 July 25: The file is now available at http://users.soe.ucsc.edu/~karplus/pluck-scripts/density-function.py and I’ve added it to this post:

#!/usr/bin/env python2.7
"""
Thu Jul 18 01:36:21 PDT 2013 Kevin Karplus

The density-function.py file is mainly intended for use as a module
(rather than as a standalone program).
The main entry points are

cumulative      converts a list of sortable items into an iterator over
                sorted pairs of (item, cumulative count)

binned_cumulative
                converts a list of numbers into a sorted list of
                (threshold, cumulative count) where the thresholds are
                bin boundaries (not equal to any numbers on the list)

                The first pair has a cumulative count of 0 and a
                threshold less than any number on the input list.  The
                last pair has a cumulative count equal to the length
                of the input list and a threshold larger than any
                element of the list.

                Users specify the number of bins, and get
                (approximately) one more pair on the list than the
                specified number.

                Different binning methods can be specified:
                    width   fixed-width bins
                    count   fixed-count bins
                    tapered fixed-count bins with smaller counts
                            near the beginning and end
                    area    fixed width*count bins

density_function
                takes the output of binned_cumulative
                and produces a plottable series of pairs
                (threshold, probability density)

                Output format may be
                    steps   two points per bin to make a step-wise function
                    lines   linear interpolation between bin centers

As a stand-alone program, the file converts a list of numbers into a
table of pairs that can be plotted as a probability density function.

The gnuplot command
    plot '<density-function -c 1 -n 6 < example_file' with lines
would plot a 6-bin density function estimate from the first column of a
file called "example_file".
"""

# to ensure compatibility with python3
from __future__ import absolute_import, division, generators, unicode_literals, print_function, nested_scopes, with_statement

import sys
import argparse
from math import sqrt
from itertools import izip,islice

def cumulative(numbers):
    """Takes a list of numbers and yields (x,cum_count) pairs representing
    the cumulative counts of numbers <= x.
    The set of first values of the pairs is exactly the set of numbers.
    (Actually, the list can have any sortable items, not just numbers.)
    """
    if len(numbers)==0: return
    sorted_samples = sorted(numbers)

    old_x = None
    for i,x in enumerate(sorted_samples):
        if x!=old_x:
            if old_x is not None:
                yield (old_x, i)
            old_x = x
    yield (x, len(numbers))

def binned_cumulative(numbers, num_bins=None, method="area"):
    """Takes a list of numbers and returns a sorted list of (x,cum_count) pairs representing
    the cumulative counts of numbers < x.
    An extra pair is included at each end (with 0 count difference from the real ends), projecting
    an approximate end point out from the real ends.
    The first values of the pairs are between values of numbers.
    The list is thinned to try to get num_bins+1 pairs.
    """
    count = len(numbers)
    if num_bins is None:
        num_bins = int(sqrt(count)+1)
    assert(num_bins>0)

    cum_pairs = [x for x in cumulative(numbers)]
    if len(cum_pairs)==0:
        return [(0,0), (1,0)]
    first_x = cum_pairs[0][0]
    if len(cum_pairs)==1:
        return [(first_x-0.5,0), (first_x+0.5,cum_pairs[0][1])]

    # project out bin boundaries past real data, using the first 2 and last 2 values
    cum_pairs.insert(0, (1.5*first_x-0.5*cum_pairs[1][0], 0))
    cum_pairs.append((1.5*cum_pairs[-1][0]-0.5*cum_pairs[-2][0], count))

    if num_bins==1:
        return [cum_pairs[0], cum_pairs[-1]]

    total_width = cum_pairs[-1][0] - cum_pairs[0][0]

    if method=="width":
        # use fixed-width bins (total interval/num_bins)
        counts = [0]*num_bins
        bin_width = total_width/num_bins
        bin_scale = num_bins/total_width
        start = cum_pairs[0][0]
        oldc = 0
        for x,c in cum_pairs:
            subscript = int((x-start)*bin_scale)
            if subscript<num_bins:      # avoid rounding error on last, empty count
                counts[subscript] += c-oldc
            oldc = c
        for i in xrange(1,num_bins):
            counts[i] += counts[i-1]
        return [(start,0)] + [(start+(k+1)*bin_width,c) for k,c in enumerate(counts)]

    # For methods other than fixed-width bins, we currently have no
    # way of producing empty bins, so the number of bins is at most
    # the number of different values in "numbers"
    if num_bins>len(cum_pairs)-2:
        num_bins = len(cum_pairs)-2

    if method=="count":
        # Use bins that have approximately equal counts.
        # This method does a low-to-high sweep setting boundaries,
        # which may result in target counts getting lower towards the end,
        # as earlier boundaries overshoot their target counts.
        remaining_count = count
        remaining_bins = num_bins
        cum_to_find = remaining_count/remaining_bins

        thinned = [cum_pairs[0]]    # zero count at beginning
        for x,y in izip(cum_pairs, islice(cum_pairs,1,None)):
            if x[1] >= cum_to_find:
                thinned.append(((x[0]+y[0])/2, x[1]))
                remaining_count = count-x[1]
                remaining_bins -= 1
                if remaining_bins==0: break
                cum_to_find = x[1] + remaining_count/remaining_bins
        if thinned[-1][1] == cum_pairs[-1][1]:
            # the last bin covered all counts,
            # but may not include the extension
            thinned = thinned[:-1]
        thinned.append(cum_pairs[-1])
        return thinned

    if method=="tapered":
        # This method is like "count" but tapers the bin sizes towards the ends

        approx_bin_count = count/(num_bins-1)       # size for middle bins

        # num_end_bins is how many bins on each end to ramp up size over.
        # The total count for the first num_end_bins is about half the middle bins.
        # (Same for the last num_end_bins)
        num_end_bins = min(num_bins//10, (len(cum_pairs)-num_bins)//2)

        if num_end_bins==0:
            bin_size = count/num_bins
            cum_to_find = [round(i*bin_size) for i in xrange(1,num_bins+1)]
        else:
            effective_num_bins = num_bins-2*num_end_bins+1
            bin_size = count/effective_num_bins     # average count for middle bins

            # ramp up from the beginning
            cum_to_find = [int(round(bin_size/(num_end_bins+1)*i*(i+1)/2))
                           for i in xrange(1,num_end_bins+1)]
            # fill in the middle
            cum_so_far = cum_to_find[-1]
            from_end = [count] + [count-x for x in cum_to_find[0:num_end_bins]]

            middle_bin_size = (count-2*cum_so_far)/(num_bins-2*num_end_bins)
            cum_to_find.extend([int(round(middle_bin_size*(i-num_end_bins+1)+cum_so_far))
                                for i in xrange(num_end_bins, num_bins-num_end_bins-1)])

            # make the second half by reversing the first half and counting from the end
            cum_to_find.extend(from_end[::-1])

        thinned = [cum_pairs[0]]    # zero count at beginning
        bin = 0
        for x,y in izip(cum_pairs, islice(cum_pairs,1,None)):
            if x[1] >= cum_to_find[bin]:
                thinned.append(((x[0]+y[0])/2, x[1]))
                bin += 1
        if thinned[-1][1] == cum_pairs[-1][1]:
            # the last bin covered all counts, but may not include the extension
            thinned = thinned[:-1]
        thinned.append(cum_pairs[-1])
        return thinned

    if method=="area":
        # This method scales the bins so that
        # the product of the count and the binwidth is roughly constant.

        remaining_area = (cum_pairs[-1][0] - cum_pairs[0][0])*count
        remaining_bins = num_bins
        thinned = [cum_pairs[0]]    # zero count at beginning
        for x,y in izip(cum_pairs, islice(cum_pairs,1,None)):
            boundary = (x[0]+y[0])/2
            bin_count = x[1]-thinned[-1][1]
            width = boundary - thinned[-1][0]
            if bin_count*width >= remaining_area/(remaining_bins*remaining_bins):
                thinned.append((boundary, x[1]))
                remaining_area = (cum_pairs[-1][0] - boundary)*(count-x[1])
                remaining_bins -= 1
                if remaining_bins == 0: break

        if thinned[-1][1] == cum_pairs[-1][1]:
            # the last bin covered all counts,
            # but may not include the extension
            thinned = thinned[:-1]
        thinned.append(cum_pairs[-1])
        return thinned

def density_function(cum_pairs, smoothing="steps"):
    """
    This is a generator that yields points.
    Converts a cumulative pair list [ (x0,0) ... (xn,total_count)]
    into a probability density function for plotting.
    Note: x0< x1< ... <xn required.

    Output can be steps or lines between bin centers.
    """
    assert(len(cum_pairs)>=2)
    count = cum_pairs[-1][1]
    if count<=0:
        return

    if smoothing=="steps" or len(cum_pairs)==2:
        yield (cum_pairs[0][0], 0)
    else:
        yield (1.5*cum_pairs[0][0] - 0.5*cum_pairs[1][0], 0)

    for old_pair,pair in izip(cum_pairs, islice(cum_pairs,1,None)):
        old_threshold = old_pair[0]
        threshold = pair[0]
        level = (pair[1]-old_pair[1])/(count*(threshold-old_threshold))
        if smoothing=="steps":
            yield (old_threshold, level)
            yield (threshold, level)
        else:
            yield ((threshold+old_threshold)/2, level)
    if smoothing=="steps" or len(cum_pairs)==2:
        yield (cum_pairs[-1][0], 0)
    else:
        yield (1.5*cum_pairs[-1][0] - 0.5*cum_pairs[-2][0], 0)

# ---------------------------------------------------------
# Below this line are functions primarily for testing or using the
# module as a stand-alone program.

def positive_int(string):
    """Type converter for argparse allowing only int > 0 """
    value = int(string)
    if value<=0:
        msg = "{} is not a positive integer".format(string)
        raise argparse.ArgumentTypeError(msg)
    return value

def parse_args(argv):
    """parse the command-line options.
    Returning all as members of a class.
    """
    parser = argparse.ArgumentParser(description=__doc__,
            formatter_class=argparse.RawDescriptionHelpFormatter)
    parser.set_defaults(column=1
            , num_bins=None
            , method="area"
            , smoothing="steps"
            )
    parser.add_argument("-c", "--column", type=positive_int,
            help="""Which (white-space separated) column of each line to read.
    One-based indexing.
    """)
    parser.add_argument("-n", "--num_bins", type=positive_int,
            help="""Number of bins to use for "tapered" variant.
    Approximate number of bins for "area" variant.
    Default is sqrt(count)+1.
    """)
    parser.add_argument("-m", "--method",
            choices=["width","count","tapered","area"],
            help="""Different algorithms for choosing bin widths:
    width     fixed-width bins (the classic method for histograms)
    count     roughly fixed-count bins, giving finer resolution where
              the probability density is higher.
    tapered   has roughly equal counts in the middle,
              but reduces the counts towards the two ends,
              to get better resolution in the tails, where counts are low
    area      has roughly equal count*bin_width throughout,
              providing good resolution in both high-density
              and low-density regions.
    Default is area.
    """)
    parser.add_argument("-s", "--smoothing",
            choices=["steps","lines"],
            help="""Different ways of outputting the density function:
    steps   a step for each bin (two points per bin)
    lines   straight lines between bin centers
    """)
    return parser.parse_args(argv[1:])

def column(file_obj, col_num=0):
    """yields one number from each line of a file, ignoring blank
    lines or comment lines whose first non-white-space is #
    Columns are numbered with zero-based indexing.
    """
    for line in file_obj:
        line = line.strip()
        if line=="" or line.startswith("#"):
            continue
        fields = line.split()
        yield float(fields[col_num])

def main(argv):
    """Example of using the density_function and binned_cumulative functions.
    This function (and the parse_args function) are not used when density_function.py is used as a module.
    """
    options = parse_args(argv)
    numbers = [x for x in column(sys.stdin, options.column-1)]

    cum_bins = binned_cumulative(numbers,
                num_bins=options.num_bins, method=options.method)
    for x,cum in density_function(cum_bins, smoothing=options.smoothing):
        print(x, "\t", cum)

if __name__ == "__main__":
    try:
        sys.exit(main(sys.argv))
    except EnvironmentError as exc:
        sys.stderr.write("ERROR: " + str(exc) + "\n")
        sys.exit(exc.errno)


## 2012 September 30

### Bad question for math classes

Filed under: Uncategorized — gasstationwithoutpumps @ 11:02

Mr. K, a math teacher, in his blog post Math Stories : Quick check for understanding, suggested asking students the following question:

Pick two numbers, randomly. What is the probability that the product will be even?

This question illustrates the standard problem with the way high schools teach (or rather don’t teach) probability, a problem that still plagues scientists even after they are in grad school. The problem is that “picking randomly” is not an adequate description of a process, as no distribution is given. For example, I can pick two numbers randomly and always get an odd number, if the distribution I pick from has zero probability on the evens.

There are other problems with the question, like what Mr. K meant by “numbers”, but we’ll be generous and assume that in his classes “number” means “integer”, since the odd/even question otherwise makes no sense. It is also possible that Mr. K meant “decimal digit”, in which case the complaint I’m making below does not apply, and I’d have to change my critique to the sloppiness of using “number” to mean “digit”, but in my experience math teachers are fairly careful about the number/digit distinction, but sloppy about the word “randomly”, and so I will continue as if the problem is with the use of “randomly”. (Aside: that last, overly long sentence is an informal version of Bayesian reasoning.)

Often, when no distribution is given, people mean that they used the uniform distribution.  More specifically, they usually mean an independent, identically distributed process (i.i.d) drawn from a uniform distribution, that is, one where each item is chosen independently from the same underlying distribution, which happens to be a uniform distribution.  This is a mathematically easy choice to analyze, though it is often a poor choice for a null model. Bayesian statistics was panned for centuries, because of the poor choice of uniform priors in the initial development of the methods.

But for a countably infinite set like the integers, a uniform distribution can’t be defined—some numbers must be more probable than others or you can’t get the probabilities to sum to one over all integers. (You have the same problem with integrals if you choose the real or complex numbers.)  That means the standard meaning of “pick randomly” doesn’t exist, and the question is badly formed.
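The contrast is easy to make concrete: a non-uniform distribution on the positive integers works fine, while a uniform one cannot exist. A small sketch, assuming the geometric-style distribution P(n) = 2⁻ⁿ as the example:

```python
from fractions import Fraction

# P(n) = 2**-n on the positive integers is a legitimate distribution:
# the partial sum after N terms is exactly 1 - 2**-N, which approaches 1.
partial = sum(Fraction(1, 2**n) for n in range(1, 51))
assert 1 - partial == Fraction(1, 2**50)

# A "uniform" distribution would need a single constant c with the sum
# over all integers equal to 1; but that sum is 0 if c == 0 and
# diverges for any c > 0, so no such distribution exists.
```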

Mr. K probably meant for students to pick integers randomly from a distribution in which odd and even numbers were equally likely, in which case his question becomes trivial, with the probability of an even product being ¾.  We can even go a step further and say that if the probability of choosing an odd number in the underlying distribution is p, then the probability that the product is odd is p².
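Both claims can be verified by exact enumeration once a distribution is fixed; a sketch, assuming uniform independent draws from 1–10 (odd and even equally likely) and then from a skewed set with 3 odd values out of 10:

```python
from fractions import Fraction
from itertools import product

# Odd and even equally likely: draw uniformly from 1..10, twice, independently.
pairs = list(product(range(1, 11), repeat=2))
even = sum(1 for a, b in pairs if (a * b) % 2 == 0)
assert Fraction(even, len(pairs)) == Fraction(3, 4)

# More generally, the product is odd only when both factors are odd,
# so P(odd product) = p**2.  Here p = 3/10 (3 odd values out of 10).
skewed = [1, 3, 5, 2, 4, 6, 8, 10, 12, 14]
pairs = list(product(skewed, repeat=2))
odd = sum(1 for a, b in pairs if (a * b) % 2 == 1)
assert Fraction(odd, len(pairs)) == Fraction(3, 10)**2
```

The point stands, though: none of this is answerable until the distribution is specified.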

The important part of the question is not the properties of odd and even numbers or of multiplying probabilities of independent events, which I believe are the understandings that Mr. K was interested in, but the importance of knowing what distribution you are drawing from, which I don’t believe Mr. K gave any thought to.

The same problem comes up all the time when biologists (and, presumably, other scientists) try to use p-values to establish statistical significance.  The p-value is thought of as the probability that the observed data could have arisen “by chance”, which is as bad a phrase as “picking randomly”. The crucial question is what that “chance” process is—for what distribution are you computing the probability?  This is the “null model” or “null hypothesis” and for the p-value to be at all interpretable, the null model must include every phenomenon other than the hypothesis you are testing.  That is, your null hypothesis is as complicated as your hypothesis, but leaves out one (hopefully crucial) idea.

Most scientists choose a mathematically tractable model that is unlikely to fit any real data (like the i.i.d. with a uniform distribution).  They then show that this ridiculously simple model can’t explain the data, and jump from there to the conclusion that their hypothesis must be correct (a leap of faith more suitable for theology than science).  All that they have really shown is that the stupid model is wrong, which everyone already knew.  (Except sometimes the data is so meager that even the stupid, obviously wrong model can’t be ruled out by frequentist techniques, but those papers rarely get published.)

A large fraction of the papers I read or referee have badly designed null models, often invalidating their conclusions, so I believe that this sloppiness about the use of the word “randomly” is a serious problem in the education of future generations of scientists.  And it isn’t hard to rewrite questions to avoid the sloppiness by being explicit about the random process:

• Pick two integers from a distribution in which odd and even numbers are equally likely.
• Pick two numbers uniformly with replacement from the integers from 1 to 10.
• Roll a fair 20-sided die twice.
• Draw a card from a deck containing just ace through 10 cards (no face cards).  Replace the card, shuffle the deck, and draw again.

For the age group that Mr. K teaches, the dice or card processes are probably the best to use, as the vocabulary used by probability theorists is more likely to confuse than to clarify.