How can you create a KDE from histogram values only?

Question

I have a set of values for which I'd like to plot a Gaussian kernel density estimate, but I'm running into two problems:

I only have the values of the bars, not the underlying values themselves
I am plotting onto a categorical axis

Here's the plot I've generated so far: The order of the y axis is actually relevant since it is representative of the phylogeny of each bacterial species.

I'd like to add a gaussian kde overlay for each color, but so far I haven't been able to leverage seaborn or scipy to do this.

Here's the code for the above grouped bar plot using python and matplotlib:

import matplotlib.pyplot as plt
import numpy as np

N = len(color1_plotting_values)
fig, ax = plt.subplots(figsize=(20,30))
ind = np.arange(N)    # the x locations for the groups
width = .5         # the width of the bars
p1 = ax.barh(Species_Ordering.Species.values, color1_plotting_values, width, label='Color1', log=True)
p2 = ax.barh(Species_Ordering.Species.values, color2_plotting_values, width, label='Color2', log=True)
for b in p2:
    b.xy = (b.xy[0], b.xy[1]+width)

Thanks!

It looks like you're pulling from a dataframe, have you tried the built in [kde plotting functionality](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.kde.html)? — G. Anderson, Dec 17 '18 at 23:14
Yes, I've tried, but I do not know how to have it interpret the categorical axis properly. The resulting KDE is a KDE of the histogram of the data. However, the data already represents the heights of histogram bars. Think of each bacterial species as a bin and each number as a count of values in that bin. Hope that helps show how the data is formatted! — Joe B, Dec 18 '18 at 00:05
KDE generally involves integration over neighboring data points. For categorical data such as your different species there is no objective distance criterion (much less one that respects the triangle inequality). Using KDE here is hence neither possible nor desirable. — Paul Brodersen, Dec 18 '18 at 10:19
@PaulBrodersen Sorry to intrude. Let's say we forget that the data is categorical and look at it just as a histogram with equal bins, or maybe just as a function on a uniformly sampled domain. Would it be possible to run KDE in such a setting? I mean without access to the samples themselves, just to the binned histogram — filippo, Dec 18 '18 at 15:47
@PaulBrodersen You are right of course that a KDE isn't ideal, but it is quite useful here as we are looking for overall trends in the 'histogram' bars. The y axis here is built from a phylogenetic tree. Phylogenetic trees do have an objective distance criterion that I didn't include above. We've only maintained the ordering of species here because the distance might add information we don't need represented. — Joe B, Dec 18 '18 at 18:21
@filippo Yes! That's basically what I want to know. Can we estimate a Gaussian KDE (or a different KDE) without the samples? — Joe B, Dec 18 '18 at 18:22
Phylogenetic "distance" is not a [true distance in the mathematical sense](https://en.wikipedia.org/wiki/Distance#Mathematics), as it does not satisfy the triangle inequality (example [here](http://www.cs.cmu.edu/~durand/03-711/2010/Felsenstein141-171.pdf)). Therefore, you **cannot** apply a KDE here (certainly not in a figure which I assume is for an academic publication). — Paul Brodersen, Dec 19 '18 at 09:56
@filippo Sort of. In some sense, determining the KDE from a histogram is similar to KDE using weighted samples (which for most KDE methods is a simple extension). The problem is that you don't know the true position of a point within the bin edges. Therefore, if the kernel width is similar to or smaller than the bin width, you run into issues (easy to see if you simulate a bunch of points on the uniform interval, apply a KDE algorithm of your choice, and then compare the result to when you round the point coordinates to, say, 1 significant digit). Broad kernels should be fine, though. — Paul Brodersen, Dec 19 '18 at 10:05
@JoeB There are smoothing methods that do not rely on a kernel and hence do not necessarily need a distance criterion that satisfies the triangle inequality. Inverse distance weighted k-nearest neighbor density estimation comes to mind. I have an implementation for the general case [here](https://github.com/paulbrodersen/inverse_distance_weighting). If I have time later today, I may cook up something that solves your case. In the meantime, it would help if you could post the data, i.e. the bin values, and the phylogenetic distance matrix. — Paul Brodersen, Dec 19 '18 at 10:18
@PaulBrodersen Relax. Take a look at my answer. Integration is not always involved. — , Aug 01 '19 at 09:34

tel · Accepted Answer · 2018-12-19T01:57:24.433

How to plot a "KDE" starting from a histogram

Kernel density estimation requires the underlying data. You could come up with a new method that uses the empirical pdf (i.e. the histogram) instead, but then it wouldn't be a KDE distribution.

Not all hope is lost, though. You can get a good approximation of a KDE distribution by first taking samples from the histogram, and then using KDE on those samples. Here's a complete working example:

import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as sts

n = 100000

# generate some random multimodal histogram data
samples = np.concatenate([np.random.normal(np.random.randint(-8, 8), size=n)*np.random.uniform(.4, 2) for i in range(4)])
h,e = np.histogram(samples, bins=100, density=True)
x = np.linspace(e.min(), e.max())

# plot the histogram
plt.figure(figsize=(8,6))
plt.bar(e[:-1], h, width=np.diff(e), ec='k', align='edge', label='histogram')

# plot the real KDE
kde = sts.gaussian_kde(samples)
plt.plot(x, kde.pdf(x), c='C1', lw=8, label='KDE')

# resample the histogram and find the KDE.
resamples = np.random.choice((e[:-1] + e[1:])/2, size=n*5, p=h/h.sum())
rkde = sts.gaussian_kde(resamples)

# plot the KDE
plt.plot(x, rkde.pdf(x), '--', c='C3', lw=4, label='resampled KDE')
plt.title('n = %d' % n)
plt.legend()
plt.show()

Output:

The red dashed line and the orange line nearly completely overlap in the plot, showing that the real KDE and the KDE calculated by resampling the histogram are in excellent agreement.

If your histograms are really noisy (like what you get if you set n = 10 in the above code), you should be a bit cautious when using the resampled KDE for anything other than plotting purposes:

Overall the agreement between the real and resampled KDEs is still good, but the deviations are noticeable.
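As a comment under the question points out, a KDE from a histogram is similar to a KDE on weighted samples. The following is a minimal sketch of that alternative to resampling, assuming a SciPy version whose `gaussian_kde` accepts a `weights` argument (1.2 or later). The synthetic data and the names `centers`, `keep` and `n_total` are illustrative and not part of the original answer. The bandwidth factor is set by hand because, with weights, the default is derived from an effective sample size rather than from the total count.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical raw data, used only to produce bar heights to start from.
samples = rng.normal(size=10_000)
counts, edges = np.histogram(samples, bins=50)
centers = (edges[:-1] + edges[1:]) / 2

# Weighted KDE: one point per non-empty bin, weighted by its count.
# The factor follows Scott's rule, n**(-1/5), using the total count.
keep = counts > 0
n_total = counts.sum()
weighted_kde = stats.gaussian_kde(
    centers[keep], weights=counts[keep], bw_method=n_total ** (-1 / 5)
)

# Reference KDE on the raw samples (normally unavailable), for comparison.
reference_kde = stats.gaussian_kde(samples)

x = np.linspace(edges[0], edges[-1], 200)
max_abs_diff = np.max(np.abs(weighted_kde(x) - reference_kde(x)))
print(f"largest pointwise difference: {max_abs_diff:.4f}")
```

This avoids the random noise that resampling introduces, but the same caveat applies: the positions of points inside each bin are unknown, so kernels much narrower than the bin width are not meaningful.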

Munge your categorical data into an appropriate form

Since you haven't posted your actual data I can't give you detailed advice. I think your best bet will be to just number your categories in order, then use that number as the "x" value of each bar in the histogram.
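The numbering idea could look roughly like the sketch below. The bar heights are made up, the bandwidth factor of 0.3 is an arbitrary choice, and the rescaling step assumes the bars sit at unit spacing. None of these come from the question's actual data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical bar heights for 12 ordered categories (e.g. species in tree order).
counts = np.array([3, 8, 15, 30, 22, 9, 4, 2, 6, 14, 11, 5])
positions = np.arange(len(counts))   # category i is placed at coordinate i

# Draw pseudo-samples: each category is picked with probability
# proportional to its bar height.
resamples = rng.choice(positions, size=20_000, p=counts / counts.sum())

# Fixed bandwidth factor (arbitrary); broad kernels are the safer choice
# because the data only take integer positions.
kde = stats.gaussian_kde(resamples, bw_method=0.3)

# Evaluate on a fine grid and rescale from density to bar-height units
# (unit spacing, so density * total count is comparable to the counts).
grid = np.linspace(positions[0] - 0.5, positions[-1] + 0.5, 300)
overlay = kde(grid) * counts.sum()
```

On a horizontal bar plot like the one in the question, the curve would be drawn with the axes swapped, for example `ax.plot(overlay, grid)`, since matplotlib places string categories at integer positions starting from 0.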

What is the purpose of defining "x"? Could you use "e" instead? — KaPy3141, Dec 22 '21 at 11:08

Paul Brodersen · Answer 2 · 2018-12-19T11:28:52.433

I have stated my reservations about applying a KDE to OP's categorical data in my comments above. Basically, as the phylogenetic distance between species does not obey the triangle inequality, there cannot be a valid kernel that could be used for kernel density estimation. However, there are other density estimation methods that do not require the construction of a kernel. One such method is k-nearest neighbour inverse distance weighting, which only requires non-negative distances; these need not satisfy the triangle inequality (nor, I think, even be symmetric). The following outlines this approach:

import numpy as np

#--------------------------------------------------------------------------------
# simulate data

total_classes = 10
sample_values = np.random.rand(total_classes)
distance_matrix = 100 * np.random.rand(total_classes, total_classes)

# The distance from each value to itself is zero; hence zero out the diagonal.
distance_matrix -= np.diag(np.diag(distance_matrix))

# --------------------------------------------------------------------------------
# For each sample, compute an average based on the values of the k-nearest neighbors.
# Weigh each sample value by the inverse of the corresponding distance.

# Apply a regularizer to the distance matrix.
# This limits the influence of values with very small distances.
# In particular, this affects how the value of the sample itself (which has distance 0)
# is weighted w.r.t. other values.
regularizer = 1.
distance_matrix += regularizer

# Set number of neighbours to "interpolate" over.
k = 3

# Compute average based on sample value itself and k neighbouring values weighted by the inverse distance.
# The following assumes that the value of distance_matrix[ii, jj] corresponds to the distance from ii to jj.
new_sample_values = np.zeros(total_classes)  # output array for the smoothed values
for ii in range(total_classes):

    # determine neighbours
    indices = np.argsort(distance_matrix[ii, :])[:k+1] # +1 to include the value of the sample itself

    # compute weights
    distances = distance_matrix[ii, indices]
    weights = 1. / distances
    weights /= np.sum(weights) # weights need to sum to 1

    # compute weighted average
    values = sample_values[indices]
    new_sample_values[ii] = np.sum(values * weights)

print(new_sample_values)
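If only the ordering of the species is kept, as the question describes, one possible stand-in for the distance matrix is the difference in rank. The sketch below wraps the loop above in a function and applies it that way. The function name `idw_smooth`, the bar heights and the rank-based distance are all assumptions made for illustration, not part of the original answer.

```python
import numpy as np

def idw_smooth(values, distances, k=3, regularizer=1.0):
    """Smooth `values` by inverse-distance-weighted averaging over the
    k nearest neighbours (plus the sample itself)."""
    values = np.asarray(values, dtype=float)
    d = np.asarray(distances, dtype=float) + regularizer
    smoothed = np.empty_like(values)
    for i in range(len(values)):
        nearest = np.argsort(d[i])[:k + 1]   # +1 keeps the sample itself
        w = 1.0 / d[i, nearest]
        smoothed[i] = np.sum(values[nearest] * w) / np.sum(w)
    return smoothed

# Hypothetical bar heights for 8 species listed in tree order.
bar_heights = np.array([2.0, 9.0, 4.0, 7.0, 1.0, 8.0, 3.0, 6.0])

# Assumed stand-in for a phylogenetic distance matrix: the rank difference.
ranks = np.arange(len(bar_heights))
rank_distance = np.abs(ranks[:, None] - ranks[None, :])

smoothed = idw_smooth(bar_heights, rank_distance, k=2)
print(smoothed)
```

Because the weights are positive and normalised, every smoothed value is a weighted average of the inputs and stays within their range. A real phylogenetic distance matrix could be passed in place of `rank_distance` without changing the function.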

score 2 · Answer 3 · 2019-08-03T10:19:08.343

THE EASY WAY

For now, I am skipping any philosophical argument about the validity of using kernel density estimation in such settings. I will come back to that later.

An easy way to do this is using scikit-learn KernelDensity:

import numpy as np
import pandas as pd
from sklearn.neighbors import KernelDensity
from sklearn import preprocessing

ds=pd.read_csv('data-by-State.csv')

Y=ds.loc[:,'State'].values # State is AL, AK, AZ, etc...

# With categorical data we need some label encoding here...
le = preprocessing.LabelEncoder()
le.fit(Y)                            # le.classes_ is sorted, e.g. ['AK', 'AL', 'AZ', ...]
y=le.transform(Y)                    # y would be [0, 2, 3, ..., 6, 7, 9]
y=y[:, np.newaxis]                   # preparing for kde

kde = KernelDensity(kernel='gaussian', bandwidth=0.75).fit(y)

# You can control the bandwidth so the KDE function performs better
# To find the optimum bandwidth for your data you can try Crossvalidation

x=np.linspace(0,5,100)[:, np.newaxis] # let's get some x values to plot on
log_dens=kde.score_samples(x)
dens=np.exp(log_dens)            # these are the density function values

array([0.06625658, 0.06661817, 0.06676005, 0.06669403, 0.06643584,
       0.06600488, 0.0654239 , 0.06471854, 0.06391682, 0.06304861,
       0.06214499, 0.06123764, 0.06035818, 0.05953754, 0.05880534,
       0.05818931, 0.05771472, 0.05740393, 0.057276  , 0.05734634,
       0.05762648, 0.05812393, 0.05884214, 0.05978051, 0.06093455,
       ..............
       0.11885574, 0.11883695, 0.11881434, 0.11878766, 0.11875657,
       0.11872066, 0.11867943, 0.11863229, 0.11857859, 0.1185176 ,
       0.11844852, 0.11837051, 0.11828267, 0.11818407, 0.11807377])

And these values are all you need to plot your Kernel Density over your histogram. Capito?
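The code above fits the KDE to one encoded label per observation. When only bar heights are available, as in the question, a possible variant is to pass each category once and supply its bar height as a weight. This is a sketch that assumes a scikit-learn version in which `KernelDensity.fit` accepts `sample_weight` (0.20 or later). The counts are made up and the bandwidth is simply reused from above.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Hypothetical bar heights, one per encoded category 0..5.
counts = np.array([4.0, 10.0, 7.0, 2.0, 1.0, 6.0])
codes = np.arange(len(counts))[:, np.newaxis]   # shape (n_categories, 1)

# Each category appears once; its bar height enters as a sample weight.
kde = KernelDensity(kernel='gaussian', bandwidth=0.75)
kde.fit(codes, sample_weight=counts)

# Evaluate a little beyond the first and last category so the tails are covered.
x = np.linspace(-3, 8, 400)[:, np.newaxis]
dens = np.exp(kde.score_samples(x))             # density values along x
```

To overlay the curve on bars of unit spacing, `dens` can be multiplied by `counts.sum()` so that it is in the same units as the bar heights.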

Now, on the theoretical side, if X is a categorical(*), unordered variable with c possible values, then for 0 ≤ h < 1

[kernel formula: image not preserved]

is a valid kernel. For an ordered X,

[kernel formula: image not preserved]

where |x1 - x2| should be understood as how many levels apart x1 and x2 are. As h tends to zero, both of these become indicators and return a relative frequency count. h is often referred to as the bandwidth.

(*) No distance needs to be defined on the variable space. Doesn't need to be a metric space.
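The kernel formulas themselves are not available in this copy of the answer. As a rough illustration of how such a discrete kernel acts on bar heights, the sketch below assumes an ordered kernel of the Wang-van Ryzin form, K(x1, x2) = 1 - h when x1 = x2 and (1 - h)/2 · h^|x1 - x2| otherwise. That form matches the description (it reduces to an indicator as h tends to zero), but it is an assumption and may differ from what the answer originally showed. The function names are hypothetical.

```python
import numpy as np

def ordered_kernel_matrix(n_levels, h):
    """K[i, j] = 1 - h if i == j, else 0.5 * (1 - h) * h**|i - j|
    (assumed Wang-van Ryzin form; not taken from the original answer)."""
    levels = np.arange(n_levels)
    gap = np.abs(levels[:, None] - levels[None, :])
    K = 0.5 * (1.0 - h) * np.power(float(h), gap)
    K[gap == 0] = 1.0 - h
    return K

def smooth_frequencies(counts, h):
    """Kernel-smoothed probability estimate for each ordered level."""
    counts = np.asarray(counts, dtype=float)
    rel_freq = counts / counts.sum()
    est = ordered_kernel_matrix(len(counts), h) @ rel_freq
    return est / est.sum()   # renormalise: mass leaks past the end levels

# Hypothetical bar heights for 7 ordered levels.
counts = np.array([0, 5, 12, 3, 0, 0, 8])
print(smooth_frequencies(counts, h=0.0))   # plain relative frequencies
print(smooth_frequencies(counts, h=0.5))   # smoothed across neighbouring levels
```

With h = 0 the kernel matrix is the identity, so the estimate is just the relative frequency of each level. Larger h spreads each bar's mass over nearby levels, using only how many levels apart they are.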

Devroye, Luc and Gábor Lugosi (2001). Combinatorial Methods in Density Estimation. Berlin: Springer-Verlag.
