5

I have dataset like the following fromat and im trying to find out the Kernel density estimation with optimal bandwidth.

data = np.array([[1, 4, 3], [2, .6, 1.2], [2, 1, 1.2],
         [2, 0.5, 1.4], [5, .5, 0], [0, 0, 0],
         [1, 4, 3], [5, .5, 0], [2, .5, 1.2]])

but I couldn't figure out how to approach it. also how to find the Σ matrix.

UPDATE

I tried KDE function from scikit-learn toolkits to find out univariate(1D) kde,

# kde function
def kde_sklearn(x, x_grid, bandwidth):
    kde = KernelDensity(kernel='gaussian', bandwidth=bandwidth).fit(x)
    log_pdf = kde.score_samples(x_grid[:, np.newaxis])
    return np.exp(log_pdf)

# optimal bandwidth selection
from sklearn.grid_search import GridSearchCV
grid = GridSearchCV(KernelDensity(), {'bandwidth': np.linspace(.1, 1.0, 30)}, cv=20)
grid.fit(x)
bw = grid.best_params_

# pdf using kde
pdf = kde_sklearn(x, x_grid, bw)
ax.plot(x_grid, pdf, label='bw={}'.format(bw))
ax.legend(loc='best')
plt.show()

Can any one help me to extend this to multivariate / in this case 3D data?

jquery404
  • 653
  • 1
  • 12
  • 26
  • I'm wondering if I can help with this, but need to understand a little more. I can see that each datapoint will have three values, but then the way you've written it, those triplets are further grouped into groups of three. Is there a reason why your input data is grouped twice? Also want to double check what you mean by Σ matrix. I'm assuming you mean the estimated data covariance - so you can use a rule of thumb bandwidth of Σ^(-1/2) ? If so, are you intending that as a place to start the bandwidth optimisation, or instead of optimisation? – J Richard Snape Jun 10 '15 at 08:45
  • Did my answer help? If not - feel free to add some comments as I might be able to adapt it to your needs. – J Richard Snape Jun 11 '15 at 12:59
  • @JRichardSnape you are right, I grouped the data wrong way, actually in my code it was just like you implemented but when copy the code in SO I messed up. and yes by Σ I meant the covariance matrix. – jquery404 Jun 12 '15 at 04:15
  • OK. But I'm still not sure whether my answer below helped - does that get you what you wanted? Or is there something more to your question? If you want to output the covariance matrix, I can add that. – J Richard Snape Jun 12 '15 at 10:34

1 Answers1

5

Interesting problem. You have a few options:

  1. Continue with scikit-learn
  2. Use a different library. For instance, if the kernel you are interested in is the gaussian - then you could use scipy.gaussian_kde which is arguably easier to understand / apply. There is a very good example of this technique in this question.
  3. roll your own from first principles. This is very difficult and I don't recommend it

This blog post goes into detail about the relative merits of various library implementations of Kernel Density Estimation (KDE).


I'm going to show you what (in my opinion - yes this is a bit opinion based) is the simplest way, which I think is option 2 in your case.

NOTE This method uses a rule of thumb as described in the linked docs to determine bandwidth. The exact rule used is Scott's rule. Your mention of the Σ matrix makes me think rule of thumb bandwidth selection is OK for you, but you also talk about optimal bandwidth and the example you present uses cross-validation to determine the best bandwidth. Therefore, if this method doesn't suit your purposes - let me know in comments.

import numpy as np
from scipy import stats
data = np.array([[1, 4, 3], [2, .6, 1.2], [2, 1, 1.2],
         [2, 0.5, 1.4], [5, .5, 0], [0, 0, 0],
         [1, 4, 3], [5, .5, 0], [2, .5, 1.2]])

data = data.T #The KDE takes N vectors of length K for K data points
              #rather than K vectors of length N

kde = stats.gaussian_kde(data)

# You now have your kde!!  Interpreting it / visualising it can be difficult with 3D data
# You might like to try 2D data first - then you can plot the resulting estimated pdf
# as the height in the third dimension, making visualisation easier.

# Here is the basic way to evaluate the estimated pdf on a regular n-dimensional mesh
# Create a regular N-dimensional grid with (arbitrary) 20 points in each dimension
minima = data.T.min(axis=0)
maxima = data.T.max(axis=0)
space = [np.linspace(mini,maxi,20) for mini, maxi in zip(minima,maxima)]
grid = np.meshgrid(*space)

#Turn the grid into N-dimensional coordinates for each point
#Note - coords will get very large as N increases...
coords = np.vstack(map(np.ravel, grid))

#Evaluate the KD estimated pdf at each coordinate
density = kde(coords)

#Do what you like with the density values here..
#plot them, output them, use them elsewhere...

Caveat

this may give terrible results, depending on your particular problem. Things to bear in mind are obviously:

  1. as your number of dimensions goes up, the number of observed data points you want will have to go up exponentially - your sample data of 9 points in 3 dimensions is pretty sparse - although I assume the dots indicate that in fact you have many more.

  2. As mentioned in the main body - the bandwidth is selected in a particular way - this may result in over- (or conceivably but unlikely under-) smoothing of the estimated pdf.

Community
  • 1
  • 1
J Richard Snape
  • 20,116
  • 5
  • 51
  • 79
  • thank you so much for your help. it was quite helpful. I still got 2 questions i hope you can help me with that. (1) how i can use my own bandwidth like we can do in sklearn.kde (gridcrossover) (2) as u said, plot in 2d first then height in 3d, can you also show me how to do this here what i tried http://codepad.co/s/ac8623 – jquery404 Jun 13 '15 at 04:26
  • I will have a look at these questions when I have some time. – J Richard Snape Jun 13 '15 at 06:42
  • Sorry, I haven't had chance to look at this. 1) You can set your own bandwidth: http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.gaussian_kde.set_bandwidth.html#scipy-stats-gaussian-kde-set-bandwidth I haven't used this and examples seem to be for 1D case. In terms of visualisation - what I was recommending was start with 2D *input data* rather than 3D. The code you've put on that link doesn't have right imports etc so won't run at all. – J Richard Snape Jun 15 '15 at 12:50
  • hi can you help me to show the bandwidth matrix from your code.. i tried kde.factor but it giving me float number. but for multivariate case (3d) isn't it suppose to show 3x3 bandwidth matrix. thanks – jquery404 Jul 07 '15 at 15:53
  • 2
    You need to look at `kde.covariance` too. `kde.factor` multiplies `kde.covariance` to get the kernel covariance matrix or bandwidth that I think you want to see (that I think you call Σ above). It's detailed at the bottom of [the docs page](http://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.stats.gaussian_kde.html) – J Richard Snape Jul 07 '15 at 16:03