# Gas station without pumps

## 2015 January 22

### Kernel density estimates

Filed under: Uncategorized — gasstationwithoutpumps @ 22:29

In the senior thesis writing course, I suggested to the class that they replace the histograms that several students were using with kernel density estimates, as a better way to approximate the underlying probability distribution.  Histograms are designed to be easy to make by hand, not to convey the best possible estimate or picture of the probability density function. Now that we have computers to draw our graphs for us, we can use computational techniques that are too tedious to do by hand, but that provide better graphs: both better looking and less prone to artifacts.

The basic idea of kernel density estimation is simple: every data point is replaced by a narrow probability density centered at that point, and all the probability densities are averaged.  The narrow probability density function is called the kernel, and we are estimating a probability density function for the data, hence the name kernel density estimation.  The most commonly used kernel is a Gaussian distribution, which has two parameters: µ and σ. The mean µ is set to the data point, leaving the standard deviation σ as a parameter that can be used to control the estimation.  If σ is made large, then the kernels are very wide, and the overall density estimate will be very smooth and slowly changing. If σ is made small, then the kernels are narrow, and the density estimate will follow the data closely.
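The averaging described above can be written out directly. This is a minimal sketch on made-up bimodal data (the sample and the helper `kde_by_hand` are my own illustration, not part of the thesis-course code):

```python
import numpy as np
from scipy import stats

# Toy bimodal sample, invented for illustration
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(5.0, 1.0, 200)])

def kde_by_hand(x, data, sigma):
    """Average one Gaussian of width sigma centered at each data point."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    # shape (len(x), len(data)): one kernel evaluation per (grid point, data point)
    kernels = stats.norm.pdf(x[:, None], loc=data, scale=sigma)
    return kernels.mean(axis=1)

grid = np.arange(-4.0, 9.0, 0.05)
density = kde_by_hand(grid, data, sigma=0.5)

# Riemann-sum check that the estimate integrates to about 1
print(density.sum() * 0.05)
```

Making `sigma` larger smooths the curve; making it smaller lets it chase individual data points, exactly the trade-off described above.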

The scipy Python package has a built-in function for creating kernel density estimates from a list or numpy array of data (in any number of dimensions). I used this function to create some illustrative plots of the differences between histograms and kernel density estimates. This plot has two histograms and two kernel density estimates for a sample of 100,000 points. The blue dots are a histogram with bin width 1, and the bar graph uses bins slightly narrower than 5. The red line is the smooth curve from using Gaussian kernel density estimation, and the green curve results from Gaussian kernel density estimation on transformed data (ln(x+40)). Note that the kde plots are smoother than the histograms, and less susceptible to boundary artifacts (most of the almost-5-wide bins contain 5 integers, but some have only 4). The rescaling before computing the kde makes the effective bins wider for large x values, where there are fewer data points. With only 1000 points, the histograms get quite crude, but the kde estimates are still quite good, particularly the “squished kde”, which rescales the x axis before applying the kernel density estimate.
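For quick experiments, `scipy.stats.gaussian_kde` can be called directly; its `bw_method` argument scales the kernel width (the σ discussed above) relative to the spread of the data. A small sketch on made-up data (the sample here is hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.standard_normal(1000)   # hypothetical data for illustration

kde = stats.gaussian_kde(sample)                  # bandwidth chosen by Scott's rule
wide = stats.gaussian_kde(sample, bw_method=1.0)  # much wider kernels: smoother, flatter

grid = np.arange(-5.0, 5.0, 0.05)
# Wider kernels smear the density out, so the peak is lower
print(kde(grid).max(), wide(grid).max())
```

Both estimates still integrate to 1; the choice of bandwidth only trades smoothness against fidelity to the sample.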

With even more data points from the simulation, the right-hand tail can be seen to be well approximated by a single exponential (a straight line on these semilog plots), so the kernel density estimates are doing a very good job of extrapolating the probability density estimates down to the region where there is only one data point every 10 integers.

Here is the source code I used to create the plots. Note that the squishing requires a compensation to the output of the kernel density computation to produce a probability density function that integrates to 1 on the original data space.

```python
#!/usr/bin/env python3

""" Reads a histogram from stdin
and outputs a smoothed probability density function to stdout
using Gaussian kernel density estimation

Input format:
# comment lines are ignored
First two columns are numbers:
value	number_of_instances
remaining columns are ignored.

Output format (three columns):
value	 p(value)  integral(x>=value)p(x)
"""

from __future__ import division, print_function

from scipy import stats
import numpy as np
import sys
import itertools
import matplotlib
import matplotlib.pyplot as plt

# values and counts are input histogram, with counts[i] instances of values[i]
values = []
counts = []
for line in sys.stdin:
    line = line.strip()
    if not line: continue
    if line.startswith("#"): continue
    fields = line.split()
    counts.append(int(fields[1]))
    values.append(float(fields[0]))

counts=np.array(counts)
values=np.array(values)

squish_shift = 40. # amount to shift data before taking log when squishing

def squish(data):
    """distortion function to make binning correspond better to density"""
    return np.log(data + squish_shift)

def dsquish(data):
    """derivative of squish(data)"""
    return 1. / (data + squish_shift)

# expand the histogram into one entry per observation
instances = np.repeat(values, counts)
squish_instances = squish(instances)
num_points = len(squish_instances)

# print("DEBUG: instances shape=", instances.shape, file=sys.stderr)

min_v = min(values)
max_v = max(values)

squish_smoothed = stats.gaussian_kde(squish_instances)
smoothed = stats.gaussian_kde(instances)

step_size=0.5
grid = np.arange(max(step_size,min_v-10), max_v+10, step_size)

# print("DEBUG: grid=",grid, file=sys.stderr)

plt.xlabel("Length of longest ORF")
plt.ylabel("Probability density")
plt.title("Estimates of probability density functions")

plt.ylim(0.01/num_points, 0.1)
plt.semilogy(values, counts/num_points, linestyle='None',marker=".", label="histogram bin_width=1")
plt.semilogy(grid,squish_smoothed(squish(grid))*dsquish(grid), label="squished kde")
plt.semilogy(grid,smoothed(grid), label="kde")
num_bins = int(5*num_points**0.25)
plt.hist(values, weights=counts, density=True, log=True, bins=num_bins,
         label="histogram {} bins".format(num_bins))

plt.legend(loc="upper right")
plt.show()

```
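The compensation in the script is just the change-of-variables rule for densities: if y = squish(x), then p_X(x) = p_Y(squish(x)) · squish′(x). Here is a small self-contained check of that identity; the exponential sample is synthetic, standing in for the long-tailed ORF-length data:

```python
import numpy as np
from scipy import stats

squish_shift = 40.0

def squish(x):
    """Compress large x values: y = ln(x + shift)."""
    return np.log(x + squish_shift)

def dsquish(x):
    """Derivative of squish, the change-of-variables factor."""
    return 1.0 / (x + squish_shift)

# Synthetic long-tailed sample, invented for illustration
rng = np.random.default_rng(1)
data = rng.exponential(scale=50.0, size=5000)

kde_y = stats.gaussian_kde(squish(data))   # estimate the density in squished space

grid = np.arange(0.0, 600.0, 0.25)
# Map back to the original space: p_X(x) = p_Y(squish(x)) * dsquish(x)
density = kde_y(squish(grid)) * dsquish(grid)

# Riemann-sum check: the compensated estimate integrates to roughly 1
# (a little kernel mass leaks past the x = 0 boundary, so it comes out slightly low)
print(density.sum() * 0.25)
```

Without the `dsquish` factor, the curve would be a valid density in the log-shifted space but not in the original one, and the semilog plots would be shifted and distorted in the tail.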

1. I must admit that python’s absence of end command for loops gives me serious anxiety.

Comment by xykademiqz — 2015 January 24 @ 10:56

• I read a lot of Python programs by programmers new to the language when grading for my bioinformatics course, and errors due to mistakes about where loops end (indenting errors) are relatively rare. Incorrect loop embedding seems to be no more common than in C++ or Pascal, both of which have explicit marks for ends of blocks.

The errors I see in Python code are rarely related to the syntax—they are problems of students not being able to subdivide a problem reasonably or not coming up with good data structures to represent their data. The inability to subdivide a problem is almost independent of programming language, and students seem to choose somewhat better data structures in Python than in C++ or Java, because it is less work to create the data structures.

Comment by gasstationwithoutpumps — 2015 January 24 @ 11:07

• The inability to subdivide a problem is almost independent of programming language,

Truer words have seldom been spoken.

Comment by xykademiqz — 2015 January 24 @ 11:12

2. The seaborn package has good kernel density estimates as well, under seaborn.kdeplot. Seaborn in general is a great visualization package. Here is an example: http://stanford.edu/~mwaskom/software/seaborn/examples/distplot_options.html

Comment by Jacob Schreiber — 2015 January 24 @ 21:59

3. […] Use kernel density estimates instead of histograms when showing empirical probability distributions. My previous post explains the reasons. […]

Pingback by More senior thesis pet peeves | Gas station without pumps — 2015 January 26 @ 22:17
