
I am facing the following problem: I have a (large) sample of unevenly distributed points $(X_i,Y_i)$ in a 2D space. I would like to determine the local extrema of the density of the distribution.

Does the KernelDensity function allow estimating the density of the sample at a point outside the sample?

If yes, I cannot find the right syntax.

Here is an example:

import numpy as np
import pandas as pd
mean0=[0,0]
cov0=[[1,0],[0,1]]
mean1=[3,3]
cov1=[[1,0.2],[0.2,1]]
A=pd.DataFrame(np.vstack((np.random.multivariate_normal(mean0, cov0, 5000),np.random.multivariate_normal(mean1, cov1, 5000))))
A.columns=['X','Y']
A.describe()

from sklearn.neighbors import KernelDensity
kde = KernelDensity(bandwidth=0.04, metric='euclidean',
                    kernel='gaussian', algorithm='ball_tree')
kde.fit(A)

If I make this query

kde.score_samples([(0,0)])

I get a negative number, which is clearly not a density!

array([-2.88134574])

I don't know if it's the right approach. I would then like to use that function with an optimizer to get the local extrema. (Which library/function would you recommend?)

EDIT: yes, this is a log-density, not a density, so it can be a negative number.
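To get an actual density value, you just exponentiate what `score_samples` returns. A minimal sketch of evaluating the density at arbitrary query points (using plain NumPy arrays instead of a DataFrame, and a wider bandwidth than 0.04, which I assume is too narrow for data on this scale):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = np.vstack((rng.multivariate_normal([0, 0], [[1, 0], [0, 1]], 5000),
               rng.multivariate_normal([3, 3], [[1, 0.2], [0.2, 1]], 5000)))

kde = KernelDensity(bandwidth=0.5, kernel='gaussian').fit(X)

# score_samples returns the *log*-density; exponentiate to get the density
query = np.array([[0.0, 0.0], [3.0, 3.0], [10.0, 10.0]])
log_dens = kde.score_samples(query)   # shape (3,), entries may be negative
dens = np.exp(log_dens)               # actual density values, always > 0
```

The query points can be anywhere in the plane; points far from both modes (like `(10, 10)`) simply get a density close to zero.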

Fagui Curtain
  • Without trying, I can't tell if the input format of A is correct (n_samples × n_features is needed, usually arrays, but maybe a DataFrame works too). But apart from that: this negative number you obtain is a **log-density** (which is also documented [here](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KernelDensity.html#sklearn.neighbors.KernelDensity.score_samples)). Another remark: for good results, one should tune the bandwidth parameter through CV (GridSearchCV in sklearn, or optimization-based CV in statsmodels). – sascha Jun 10 '16 at 18:43
  • OK, I'll pursue this idea. Is there a library/function for getting all the local extrema? – Fagui Curtain Jun 11 '16 at 04:15

0 Answers