In the senior thesis writing course, I suggested to the class that they replace the histograms that several students were using with kernel density estimates, as a better way to approximate the underlying probability distribution. Histograms are designed to be easy to make by hand, not to convey the best possible estimate or picture of the probability density function. Now that we have computers to draw our graphs for us, we can use computational techniques that are too tedious to do by hand, but that provide better graphs: both better looking and less prone to artifacts.

The basic idea of kernel density estimation is simple: every data point is replaced by a narrow probability density centered at that point, and all the probability densities are averaged. The narrow probability density function is called the *kernel*, and we are estimating a probability density function for the data, hence the name *kernel density estimation*. The most commonly used kernel is a Gaussian distribution, which has two parameters: µ and σ. The mean µ is set to the data point, leaving the standard deviation σ as a parameter that can be used to control the estimation. If σ is made large, then the kernels are very wide, and the overall density estimate will be very smooth and slowly changing. If σ is made small, then the kernels are narrow, and the density estimate will follow the data closely.
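As a sketch of the idea (not the code used for the plots below), a kernel density estimate can be computed directly as an average of Gaussian densities, one per data point, with sigma as the smoothing parameter:

```python
import numpy as np

def kde(x, data, sigma):
    """Average of Gaussian densities, one centered at each data point."""
    # rows: one kernel per data point; columns: evaluation points
    z = (x - data[:, None]) / sigma
    return np.exp(-0.5 * z**2).mean(axis=0) / (sigma * np.sqrt(2.0 * np.pi))

data = np.array([1.0, 2.0, 2.5, 4.0])
xs = np.linspace(-5.0, 10.0, 1501)
dens = kde(xs, data, sigma=0.5)   # smaller sigma follows the data more closely
```

Because each kernel integrates to 1, so does their average; widening sigma lowers and flattens the peaks without changing the total probability.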

The scipy Python package has a built-in function, scipy.stats.gaussian_kde, for creating kernel density estimates from a list or numpy array of data (in any number of dimensions). I used this function to create some illustrative plots of the differences between histograms and kernel density estimates.
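A minimal usage sketch (with synthetic data, not the thesis data): gaussian_kde takes the sample and returns a callable that evaluates the estimated density at any set of points.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(size=1000)      # synthetic stand-in data

kde = stats.gaussian_kde(sample)    # bandwidth chosen by Scott's rule by default
grid = np.linspace(-4.0, 4.0, 81)
density = kde(grid)                 # estimated pdf evaluated on the grid
```

The bandwidth (the σ discussed above, scaled by the sample's standard deviation) can be overridden with the bw_method argument if the default rule over- or under-smooths.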

This plot has two histograms and two kernel density estimates for a sample of 100,000 points. The blue dots are a histogram with bin width 1, and the bar graph uses bins slightly narrower than 5. The red line is the smooth curve from Gaussian kernel density estimation, and the green curve results from Gaussian kernel density estimation on transformed data (ln(x+40)). Note that the kde plots are smoother than the histograms, and less susceptible to boundary artifacts (most of the almost-5-wide bins contain 5 integers, but some have only 4). The rescaling before computing the kde makes the effective kernel width larger for large x values, where there are fewer data points.

With only 1000 points, the histograms get quite crude, but the kde estimates are still quite good, particularly the “squished kde”, which rescales the x axis before applying the kernel density estimate.

With even more data points from the simulation, the right-hand tail can be seen to be well approximated by a single exponential (a straight line on these semilog plots), so the kernel density estimates are doing a very good job of extrapolating the probability density estimates down to the region where there is only one data point every 10 integers.

Here is the source code I used to create the plots. Note that the squishing requires a compensation to the output of the kernel density computation to produce a probability density function that integrates to 1 on the original data space.

#!/usr/bin/env python3
""" Reads a histogram from stdin
and outputs a smoothed probability density function to stdout
using Gaussian kernel density estimation
Input format:
# comment lines are ignored
First two columns are numbers:
value number_of_instances
remaining columns are ignored.
Output format three columns:
value p(value) integral(x>=value)p(x)
"""
from __future__ import division, print_function
from scipy import stats
import numpy as np
import sys
import itertools
import matplotlib
import matplotlib.pyplot as plt
# values and counts are input histogram, with counts[i] instances of values[i]
values = []
counts = []
for line in sys.stdin:
    line = line.strip()
    if not line: continue
    if line.startswith("#"): continue
    fields = line.split()
    counts.append(int(fields[1]))
    values.append(float(fields[0]))
counts=np.array(counts)
values=np.array(values)
squish_shift = 40. # amount to shift data before taking log when squishing
def squish(data):
    """distortion function to make binning correspond better to density"""
    return np.log(data+squish_shift)

def dsquish(data):
    """derivative of squish(data)"""
    return 1./(data+squish_shift)
instances = np.fromiter(itertools.chain.from_iterable( [value]*num for value, num in zip(values,counts)), float)
squish_instances = np.fromiter(itertools.chain.from_iterable( [squish(value)]*num for value, num in zip(values,counts)), float)
num_points = len(squish_instances)
# print("DEBUG: instances shape=", instances.shape, file=sys.stderr)
min_v = min(values)
max_v = max(values)
squish_smoothed = stats.gaussian_kde(squish_instances)
smoothed = stats.gaussian_kde(instances)
step_size=0.5
grid = np.arange(max(step_size,min_v-10), max_v+10, step_size)
# print("DEBUG: grid=",grid, file=sys.stderr)
plt.xlabel("Length of longest ORF")
plt.ylabel("Probability density")
plt.title("Estimates of probability density functions")
plt.ylim(0.01/num_points, 0.1)
plt.semilogy(values, counts/num_points, linestyle='None',marker=".", label="histogram bin_width=1")
plt.semilogy(grid,squish_smoothed(squish(grid))*dsquish(grid), label="squished kde")
plt.semilogy(grid,smoothed(grid), label="kde")
num_bins = int(5*num_points**0.25)
plt.hist(values, weights=counts, density=True, log=True, bins=num_bins,
         label="histogram {} bins".format(num_bins))
plt.legend(loc="upper right")
plt.show()
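The change-of-variables compensation can be checked numerically: if the data are squished by y = ln(x + 40), then a density estimated in y-space must be multiplied by dy/dx = 1/(x + 40) to be a proper density on the original axis. A small self-contained check (with synthetic data, not the thesis data):

```python
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

squish_shift = 40.0

def squish(data):
    """distortion function applied before density estimation"""
    return np.log(data + squish_shift)

def dsquish(data):
    """derivative of squish(data), the change-of-variables factor"""
    return 1.0 / (data + squish_shift)

rng = np.random.default_rng(1)
sample = rng.uniform(50.0, 150.0, size=5000)   # synthetic stand-in data

kde_y = stats.gaussian_kde(squish(sample))     # estimate in squished space
xs = np.linspace(0.0, 300.0, 3001)
p_x = kde_y(squish(xs)) * dsquish(xs)          # map back to the original axis

total = trapezoid(p_x, xs)                     # should be very close to 1
```

Without the dsquish factor, the integral over x would come out too large, since each unit of y corresponds to more than one unit of x.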
