
I'm currently working on some seasonality estimation for a time series dataset.

What I get is a set of candidate frequencies/periods that might occur in the data. This set is somewhat noisy (e.g. it contains periods such as [100, 98, 101, 102] that should actually be treated as "the same" period).

To estimate sharp periods, I try to find peaks via kernel density estimation (KDE, sklearn.neighbors.KernelDensity) as follows:

import numpy as np
from sklearn.neighbors import KernelDensity
from scipy import signal
import matplotlib.pyplot as plt

X1 = np.random.randint(1, 4, 20)
X2 = np.random.randint(10, 13, 200)
X = np.concatenate((X1, X2), axis=0)
# the peaks should be at 2 and 11!

bw = 1

kde = KernelDensity(kernel='gaussian', bandwidth=bw).fit(X.reshape(-1, 1))
estimator = np.linspace(0, 15, 100)
kde_est = np.exp(kde.score_samples(estimator.reshape(-1, 1)))

plt.plot(estimator, kde_est)

peaks_pos = signal.argrelextrema(kde_est, np.greater)[0]

print(estimator[peaks_pos])
# the peaks are at around 2 and 11!

Additionally, I'd like to know what the kernels used for this estimation look like. For the Gaussian case, a set of μ and σ values should be available for all [default] 40 kernels. Can I access this information? I could not find a clue in the documentation or in the attributes of the fitted kde object, but I'm pretty sure this should be available somewhere.

For clarification, why I need this:

In the following example, the two peaks are too close together to be found reliably, but I'm sure they would show up in the individual kernels.

X1 = np.random.randint(1, 4, 20)
X2 = np.random.randint(5, 8, 200)
X = np.concatenate((X1, X2), axis=0)
# the peaks should be at 2 and 6!

bw = 1

kde = KernelDensity(kernel='gaussian', bandwidth=bw).fit(X.reshape(-1, 1))
estimator = np.linspace(0, 15, 100)
kde_est = np.exp(kde.score_samples(estimator.reshape(-1, 1)))

plt.plot(estimator, kde_est)

peaks_pos = signal.argrelextrema(kde_est, np.greater)[0]

print(estimator[peaks_pos])
# the peaks are at around 6 and sometimes 2!

1 Answer


I believe what you are looking for cannot be found in kernel density estimation. In KDE all the kernels have exactly the same shape (the standard deviation is simply the bandwidth) and are centred on the data points, so the means are determined by the values in X.
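Concretely, this means you can reconstruct every kernel yourself from the training data and the bandwidth; there is nothing hidden in the kde object to extract. A minimal sketch, reusing X, bw and the evaluation range from the first example in the question (the manual average at the end is only a sanity check):

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

grid = np.linspace(0, 15, 100)

# One Gaussian per distinct data value: mean = the value itself,
# standard deviation = the bandwidth, weight = its share of the sample.
vals, counts = np.unique(X, return_counts=True)
for v, c in zip(vals, counts):
    plt.plot(grid, c / len(X) * stats.norm.pdf(grid, loc=v, scale=bw),
             color='grey', alpha=0.5)

# Sanity check: averaging one kernel per data point reproduces kde_est.
manual_est = stats.norm.pdf(grid[:, None], loc=X[None, :], scale=bw).mean(axis=1)
plt.plot(grid, manual_est, color='black')
plt.show()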

What you can do to prevent the closeness of the normal distributions from obscuring peaks is to tweak the bandwidth (I managed to get quite consistent 2 peaks by using a bandwidth of 0.7 on your 2nd sample). There are algebraic rules of thumb for this (see: wikipedia), or you can use cross-validation to pick the best bandwidth for your sample (see: blog on the subject).
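Since KernelDensity.score returns the total log-likelihood of held-out data, the cross-validation can be done with a plain grid search. A minimal sketch, not part of the original answer; the bandwidth grid and cv=5 are arbitrary choices:

import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV

# X is the concatenated sample from the question.
# GridSearchCV maximizes the held-out log-likelihood over the bandwidth grid.
search = GridSearchCV(KernelDensity(kernel='gaussian'),
                      {'bandwidth': np.linspace(0.1, 2.0, 20)},
                      cv=5)
search.fit(X.reshape(-1, 1))
print(search.best_params_)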

If however you want to split your dataset into distinct components described by normal distributions with varying shapes (weights, means and covariances), you probably want to use Gaussian mixture modelling. You can find an example of that below. For determining the optimal number of components there are various methods, such as the silhouette criterion or the Akaike information criterion (the latter is built into scikit-learn). Since we know there are 2 normal distributions in the example, I did not implement such a criterion, but you can easily find more information on the internet.

import numpy as np
from scipy import stats, signal
from sklearn.mixture import GaussianMixture
import matplotlib.pyplot as plt

X1 = np.random.randint(1, 4, 20)
X2 = np.random.randint(5, 8, 200)
X = np.concatenate((X1, X2), axis=0)
# the peaks should be at 2 and 6!

components = 2

gmm = GaussianMixture(n_components=components).fit(X.reshape(-1, 1))

# you can now directly get the means from the gaussian mixture model's
# components, skipping the score_samples and signal.argrelextrema steps.
print(gmm.means_)
# the means are around 2 and 6!


# your original method of getting the peaks:
estimator = np.linspace(0, 15, 100)
gmm_est = np.exp(gmm.score_samples(estimator.reshape(-1, 1)))

plt.hist(X, density=True)
plt.plot(estimator, gmm_est, linewidth=5, color='black', alpha=0.7)

peaks_pos = signal.argrelextrema(gmm_est, np.greater)[0]
print(estimator[peaks_pos])


# plotting the separate components:
for n, weight in enumerate(gmm.weights_):
    plt.plot(estimator,
             weight * stats.norm.pdf(estimator, gmm.means_[n][0],
                                     np.sqrt(gmm.covariances_[n][0][0])))
plt.show()

[image of results: histogram of X with the fitted GMM density and the two weighted component curves]
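As mentioned above, the Akaike information criterion is built into scikit-learn's GaussianMixture, so picking the number of components can look like the following sketch (not part of the original answer; the candidate range 1..5 is an arbitrary assumption):

from sklearn.mixture import GaussianMixture

# Fit mixtures with 1..5 components and keep the one with the lowest AIC.
candidates = [GaussianMixture(n_components=k).fit(X.reshape(-1, 1))
              for k in range(1, 6)]
best = min(candidates, key=lambda g: g.aic(X.reshape(-1, 1)))
print(best.n_components, best.means_.ravel())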

  • Thank you for your great comment. I will try your proposed Gaussian de-mixing scheme. As for the bandwidth: I already have a rough estimator. Cross-validation is not an option, as it would cost too much time (=> this should be able to run quickly on a 3D stack). – Ben Müller Oct 20 '17 at 11:29
  • Bandwidth estimation that is quicker than cross-validation can also be found in the statsmodels package (Silverman or Scott rule of thumb; see the sketch after these comments) FYI. http://www.statsmodels.org/stable/generated/statsmodels.nonparametric.bandwidths.bw_scott.html – Dominique Fuchs Oct 20 '17 at 11:38
  • Yes, that's what I implemented (without using this package) :) – Ben Müller Oct 20 '17 at 13:25
  • I tried this on my "more complex" system, but it is not stable enough. Thank you for the advice. sklearn.mixture.GaussianMixture does what I was looking for, but my signal is too weak. – Ben Müller Oct 26 '17 at 11:24
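A minimal sketch of the rule-of-thumb bandwidths mentioned in the comments, using the statsmodels functions linked above (the sample is re-created here so the snippet is self-contained):

import numpy as np
from statsmodels.nonparametric.bandwidths import bw_scott, bw_silverman

# Both rules of thumb are closed-form, so they run much faster than
# cross-validation; the sample mirrors the question's second example.
X = np.concatenate((np.random.randint(1, 4, 20),
                    np.random.randint(5, 8, 200)))
print(bw_scott(X), bw_silverman(X))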