In the senior thesis writing course, I suggested to the class that they replace the histograms that several students were using with kernel density estimates, as a better way to approximate the underlying probability distribution. Histograms are designed to be easy to make by hand, not to convey the best possible estimate or picture of the probability density function. Now that we have computers to draw our graphs for us, we can use computational techniques that are too tedious to do by hand, but that provide better graphs: both better looking and less prone to artifacts.

The basic idea of kernel density estimation is simple: every data point is replaced by a narrow probability density centered at that point, and all the probability densities are averaged. The narrow probability density function is called the *kernel*, and we are estimating a probability density function for the data, hence the name *kernel density estimation*. The most commonly used kernel is a Gaussian distribution, which has two parameters: µ and σ. The mean µ is set to the data point, leaving the standard deviation σ as a parameter that can be used to control the estimation. If σ is made large, then the kernels are very wide, and the overall density estimate will be very smooth and slowly changing. If σ is made small, then the kernels are narrow, and the density estimate will follow the data closely.
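The averaging step can be sketched directly in numpy; the sample data and σ values here are made up for illustration:

```python
import numpy as np

def kde(x, data, sigma):
    """Kernel density estimate: average a Gaussian of standard deviation
    sigma centered at each data point, evaluated at x."""
    x = np.asarray(x, dtype=float)[..., np.newaxis]  # broadcast over the data
    kernels = (np.exp(-0.5 * ((x - data) / sigma) ** 2)
               / (sigma * np.sqrt(2.0 * np.pi)))
    return kernels.mean(axis=-1)  # average the per-point densities

data = np.array([1.0, 2.0, 2.5, 4.0])  # made-up sample
grid = np.linspace(0.0, 5.0, 11)
smooth = kde(grid, data, sigma=1.0)  # wide kernels: smooth, slowly changing
bumpy = kde(grid, data, sigma=0.1)   # narrow kernels: follows the data closely
```

Because each kernel integrates to 1 and we take a plain average, the estimate itself integrates to 1 for any choice of σ.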

The scipy Python package has a built-in function for creating kernel density estimates from a list or numpy array of data (in any number of dimensions). I used this function to create some illustrative plots of the differences between histograms and kernel density estimates.
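A minimal sketch of that scipy call, on synthetic data rather than the thesis data (`bw_method` is scipy's knob for overriding the automatically chosen bandwidth):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=1000)  # synthetic 1-D sample

kde = stats.gaussian_kde(data)       # bandwidth picked automatically (Scott's rule)
grid = np.linspace(data.min() - 3.0, data.max() + 3.0, 200)
density = kde(grid)                  # estimated p(x) at each grid point

# The automatic bandwidth can be overridden for a smoother or rougher estimate:
smoother = stats.gaussian_kde(data, bw_method=0.5)
```

The same call accepts a 2×n (or d×n) array for multi-dimensional estimates.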

With even more data points from the simulation, the right-hand tail can be seen to be well approximated by a single exponential (a straight line on these semilog plots), so the kernel density estimates are doing a very good job of extrapolating the probability density estimates down to the region where there is only one data point every 10 integers.

Here is the source code I used to create the plots. Note that the squishing (a log transform of the data) requires compensating the output of the kernel density computation by the derivative of the transform, so that the result is a probability density function that integrates to 1 on the original data space.

```python
#!/usr/bin/env python3
"""
Reads a histogram from stdin and outputs a smoothed probability density
function to stdout using Gaussian kernel density estimation.

Input format:
    # comment lines are ignored
    First two columns are numbers: value number_of_instances
    remaining columns are ignored.

Output format, three columns:
    value   p(value)   integral(x>=value) p(x)
"""

from __future__ import division, print_function
from scipy import stats
import numpy as np
import sys
import itertools
import matplotlib
import matplotlib.pyplot as plt

# values and counts are the input histogram,
# with counts[i] instances of values[i]
values = []
counts = []
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    if line.startswith("#"):
        continue
    fields = line.split()
    counts.append(int(fields[1]))
    values.append(float(fields[0]))
counts = np.array(counts)
values = np.array(values)

squish_shift = 40.  # amount to shift data before taking log when squishing

def squish(data):
    """distortion function to make binning correspond better to density"""
    return np.log(data + squish_shift)

def dsquish(data):
    """derivative of squish(data)"""
    return 1. / (data + squish_shift)

instances = np.fromiter(itertools.chain.from_iterable(
        [value] * num for value, num in zip(values, counts)), float)
squish_instances = np.fromiter(itertools.chain.from_iterable(
        [squish(value)] * num for value, num in zip(values, counts)), float)
num_points = len(squish_instances)
# print("DEBUG: instances shape=", instances.shape, file=sys.stderr)

min_v = min(values)
max_v = max(values)

squish_smoothed = stats.gaussian_kde(squish_instances)
smoothed = stats.gaussian_kde(instances)

step_size = 0.5
grid = np.arange(max(step_size, min_v - 10), max_v + 10, step_size)
# print("DEBUG: grid=", grid, file=sys.stderr)

plt.xlabel("Length of longest ORF")
plt.ylabel("Probability density")
plt.title("Estimates of probability density functions")
plt.ylim(0.01 / num_points, 0.1)
plt.semilogy(values, counts / num_points, linestyle='None', marker=".",
             label="histogram bin_width=1")
plt.semilogy(grid, squish_smoothed(squish(grid)) * dsquish(grid),
             label="squished kde")
plt.semilogy(grid, smoothed(grid), label="kde")
num_bins = int(5 * num_points ** 0.25)
# density=True replaces the normed=True argument removed from matplotlib
plt.hist(values, weights=counts, density=True, log=True, bins=num_bins,
         label="histogram {} bins".format(num_bins))
plt.legend(loc="upper right")
plt.show()
```
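The compensation can be checked numerically (this is a standalone sanity check, not part of the original script): if y = squish(x), a density q estimated in squished space pulls back to p(x) = q(squish(x)) · dsquish(x), which should again integrate to 1. A self-contained check on synthetic data, reusing the same shift of 40:

```python
import numpy as np
from scipy import stats

squish_shift = 40.  # same shift as in the script above

def squish(x):
    return np.log(x + squish_shift)

def dsquish(x):
    """derivative of squish, the change-of-variables compensation"""
    return 1. / (x + squish_shift)

rng = np.random.default_rng(1)
data = rng.exponential(scale=50.0, size=2000)  # synthetic positive data

kde = stats.gaussian_kde(squish(data))      # estimate density in squished space
x = np.linspace(-squish_shift + 0.1, 600.0, 5000)
p = kde(squish(x)) * dsquish(x)             # pull back to the original space

dx = x[1] - x[0]
total = float(np.sum(p) * dx)               # Riemann sum; should be close to 1
```

Without the dsquish factor the pulled-back curve would not be a valid density at all, which is easy to miss because it still looks plausible on a plot.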