3

Is it possible to obtain a KDE for periodic data in sklearn or a different Python package?

Here is a simplified example: I am creating a mockup dataset from two normal distributions and map it to the interval from 0 to 20.

import numpy as np
import matplotlib.pyplot as plt

# create dataset
data = np.hstack((np.random.normal(8, 2, 200), np.random.normal(19, 4, 200))) % 20

When I plot the result of the KDE

# fit
from sklearn.neighbors import KernelDensity
kde = KernelDensity(bandwidth=1, kernel='gaussian')
kde.fit(data[:, None])

# plot
x_d = np.linspace(0, 20, 100)
logprob = kde.score_samples(x_d[:, None])
plt.fill_between(x_d, np.exp(logprob), alpha=0.5)
plt.hist(data, histtype = "step", density = True)
plt.ion()
plt.show()

it (obviously) does not know about the periodicity of the data:

KDE result

as you can see from the estimate having three peaks and being non-smooth at the boundaries.

fuenfundachtzig
  • 7,952
  • 13
  • 62
  • 87

1 Answers1

2

I've been looking for this myself, though finding a package that directly / simply supports this seems surprisingly hard... In the mean time I had to come up with a workaround... its a bit painful but it works.

Since you are on a 1D periodic domain (from 0 to 20) one solution is to map the data to a higher dimensional 2D circular manifold. E.g. each data point gets mapped from 0-20 to 0-2pi, then mapped to the unit circle as x2d = cos(x_r), sin(x_r). At that point you can run the kde to get the density in the higher dimension space, then sample the manifold and normalize to a probability / density

as a final note, if you want to apply this to latitude / longitude data, your manifold would be a unit sphere instead, so you would need to transform accordingly.

scaleFactor=2*np.pi/20. #need to rescale from 0-20 to 0-2pi
data2d=np.array(
    [
        np.cos(data*scaleFactor),
        np.sin(data*scaleFactor)
    ]
).T

#also need to scale the bandwidth appropriately (if you have one picked)
kde2d = KernelDensity(bandwidth=1*scaleFactor, kernel='gaussian')
kde2d.fit(data2d)

x_d=np.linspace(0,20,100)
x_2d=np.array(
    [
        np.cos(x_d*scaleFactor),
        np.sin(x_d*scaleFactor)
    ]
).T

logprob2d = kde2d.score_samples(x_2d)
prob2d=np.exp(logprob2d)
prob2d=prob2d/np.sum(prob2d)
densFactor=len(prob2d)/20. #convert from probability to density
dens2d=prob2d*densFactor
plt.fill_between(
    x_d, 
    dens2d,
    alpha=0.5)
plt.hist(data, histtype = "step", density = True)
plt.ion()
plt.show()

#verify that prob2d is now periodic
print(prob2d[0],prob2d[-1])

There is another alternative as well... create shifted copies of the data and append them (like adding periodic images around a 'central cell')... this can be quite a memory and time hog though. In higher dimensions you end up needing (3^N)-1 copies. This will also make scaling a serious issue since many KDE methods begin scaling non-linearly with respect to sample size after a certain point.