In the senior thesis writing course, I suggested to the class that they replace the histograms that several students were using with kernel density estimates, as a better way to approximate the underlying probability distribution. Histograms are designed to be easy to make by hand, not to convey the best possible estimate or picture of the probability density function. Now that we have computers to draw our graphs for us, we can use computational techniques that are too tedious to do by hand, but that provide better graphs: both better looking and less prone to artifacts.

The basic idea of kernel density estimation is simple: every data point is replaced by a narrow probability density centered at that point, and all the probability densities are averaged. The narrow probability density function is called the *kernel*, and we are estimating a probability density function for the data, hence the name *kernel density estimation*. The most commonly used kernel is a Gaussian distribution, which has two parameters: µ and σ. The mean µ is set to the data point, leaving the standard deviation σ as a parameter that can be used to control the estimation. If σ is made large, then the kernels are very wide, and the overall density estimate will be very smooth and slowly changing. If σ is made small, then the kernels are narrow, and the density estimate will follow the data closely.
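The averaging idea can be sketched directly in a few lines (a minimal illustration of the concept, not how any library implements it; the function names and toy data are my own):

```python
import numpy as np

def gaussian_kernel(x, mu, sigma):
    """Gaussian pdf with mean mu and standard deviation sigma."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def kde(x, data, sigma):
    """Average one Gaussian kernel per data point, centered at that point."""
    return np.mean([gaussian_kernel(x, mu, sigma) for mu in data], axis=0)

data = np.array([1.0, 2.0, 2.5, 7.0])
x = np.linspace(-5, 15, 2001)
density = kde(x, data, sigma=1.0)

# An average of unit-area kernels is itself a proper density,
# so the estimate integrates to (approximately) 1.
print(density.sum() * (x[1] - x[0]))
```

Making σ smaller makes each bump narrower, so the estimate develops a separate peak per data point; making it larger merges the bumps into one smooth curve.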

The scipy Python package has a built-in function, scipy.stats.gaussian_kde, for creating kernel density estimates from a list or numpy array of data (in any number of dimensions). I used this function to create some illustrative plots of the differences between histograms and kernel density estimates.
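A minimal sketch of using that scipy function (the toy data and the bandwidth value are illustrative, not from the thesis course):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
data = rng.normal(loc=5.0, scale=2.0, size=1000)   # toy one-dimensional sample

kde = stats.gaussian_kde(data)        # bandwidth chosen automatically (Scott's rule)
grid = np.linspace(data.min() - 3, data.max() + 3, 500)
density = kde(grid)                   # evaluate the smoothed density on a grid

# A smaller bandwidth factor makes the estimate follow the data more closely:
kde_narrow = stats.gaussian_kde(data, bw_method=0.1)
```

The bw_method parameter plays the role of σ in the discussion above: it scales how wide the per-point Gaussians are.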

With even more data points from the simulation, the right-hand tail can be seen to be well approximated by a single exponential (a straight line on these semilog plots), so the kernel density estimates are doing a very good job of extrapolating the probability density estimates down to the region where there is only one data point every 10 integers.

Here is the source code I used to create the plots. Note that the squishing requires a compensation to the output of the kernel density computation to produce a probability density function that integrates to 1 on the original data space.

```python
#!/usr/bin/env python3
"""
Reads a histogram from stdin and outputs a smoothed probability density
function to stdout using Gaussian kernel density estimation.

Input format:
    # comment lines are ignored
    First two columns are numbers: value number_of_instances
    remaining columns are ignored.

Output format, three columns:
    value   p(value)   integral(x>=value) p(x)
"""

from __future__ import division, print_function

import itertools
import sys

import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

# values and counts are the input histogram,
# with counts[i] instances of values[i]
values = []
counts = []
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    if line.startswith("#"):
        continue
    fields = line.split()
    counts.append(int(fields[1]))
    values.append(float(fields[0]))
counts = np.array(counts)
values = np.array(values)

squish_shift = 40.   # amount to shift data before taking log when squishing

def squish(data):
    """distortion function to make binning correspond better to density"""
    return np.log(data + squish_shift)

def dsquish(data):
    """derivative of squish(data)"""
    return 1. / (data + squish_shift)

instances = np.fromiter(itertools.chain.from_iterable(
        [value] * num for value, num in zip(values, counts)), float)
squish_instances = np.fromiter(itertools.chain.from_iterable(
        [squish(value)] * num for value, num in zip(values, counts)), float)
num_points = len(squish_instances)
# print("DEBUG: instances shape=", instances.shape, file=sys.stderr)

min_v = min(values)
max_v = max(values)

squish_smoothed = stats.gaussian_kde(squish_instances)
smoothed = stats.gaussian_kde(instances)

step_size = 0.5
grid = np.arange(max(step_size, min_v - 10), max_v + 10, step_size)
# print("DEBUG: grid=", grid, file=sys.stderr)

plt.xlabel("Length of longest ORF")
plt.ylabel("Probability density")
plt.title("Estimates of probability density functions")
plt.ylim(0.01 / num_points, 0.1)
plt.semilogy(values, counts / num_points, linestyle='None', marker=".",
             label="histogram bin_width=1")
plt.semilogy(grid, squish_smoothed(squish(grid)) * dsquish(grid),
             label="squished kde")
plt.semilogy(grid, smoothed(grid), label="kde")
num_bins = int(5 * num_points ** 0.25)
plt.hist(values, weights=counts, density=True, log=True, bins=num_bins,
         label="histogram {} bins".format(num_bins))
plt.legend(loc="upper right")
plt.show()
```
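The compensation is the ordinary change-of-variables rule: if q(y) is a density estimated in the squished space y = log(x + squish_shift), then p(x) = q(squish(x)) · dsquish(x) is a density on the original axis, because dsquish is the Jacobian of the transformation. A small self-contained check of that identity (the uniform toy sample here is purely illustrative):

```python
import numpy as np
from scipy import stats

squish_shift = 40.0

def squish(x):
    """shift-then-log distortion, as in the script above"""
    return np.log(x + squish_shift)

def dsquish(x):
    """derivative of squish(x), the Jacobian of the change of variables"""
    return 1.0 / (x + squish_shift)

rng = np.random.default_rng(0)
data = rng.uniform(50.0, 150.0, size=2000)   # toy data, well away from x = -40

# Estimate the density in squished space...
kde_squished = stats.gaussian_kde(squish(data))

# ...then evaluate on the original axis, multiplying by the Jacobian
# so the result integrates to 1 over x rather than over y.
x = np.linspace(0.0, 2000.0, 20001)
p = kde_squished(squish(x)) * dsquish(x)

print(p.sum() * (x[1] - x[0]))   # ≈ 1
```

Without the dsquish factor, the curve would integrate to 1 in log space instead, overstating the density at large x.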

I must admit that Python's absence of an end command for loops gives me serious anxiety.

Comment by xykademiqz — 2015 January 24 @ 10:56

I read a lot of Python programs by programmers new to the language when grading for my bioinformatics course, and errors due to mistakes about where loops end (indenting errors) are relatively rare. Incorrect loop embedding seems to be no more common than in C++ or Pascal, both of which have explicit marks for ends of blocks.

The errors I see in Python code are rarely related to the syntax—they are problems of students not being able to subdivide a problem reasonably or not coming up with good data structures to represent their data. The inability to subdivide a problem is almost independent of programming language, and students seem to choose somewhat better data structures in Python than in C++ or Java, because it is less work to create the data structures.

Comment by gasstationwithoutpumps — 2015 January 24 @ 11:07

*The inability to subdivide a problem is almost independent of programming language.* Truer words have seldom been spoken.

Comment by xykademiqz — 2015 January 24 @ 11:12

The seaborn package has good kernel density estimates as well, under seaborn.kdeplot. Seaborn in general is a great visualization package. Here is an example: http://stanford.edu/~mwaskom/software/seaborn/examples/distplot_options.html

Comment by Jacob Schreiber — 2015 January 24 @ 21:59

[…] Use kernel density estimates instead of histograms when showing empirical probability distributions. My previous post explains the reasons. […]

Pingback by More senior thesis pet peeves | Gas station without pumps — 2015 January 26 @ 22:17