I'm using kernel density estimation with a custom metric. The metric is obviously slower than the built-in euclidean distance, but works fine. When doing

kde=KernelDensity(...)
kde.fit(X)

I get results in a reasonable amount of time.

When I then calculate

surface=np.exp(kde.score_samples(meshgrid))

where meshgrid is a numpy array of size (about) 64000x2, kde calculates the distance for each point in the grid. I seem to basically misunderstand why that's necessary... The density is already calculated by the .fit() method, and score_samples "should" simply evaluate the density at each point in the grid - right? Am I overlooking something?

When I do all the calculations with the built-in euclidean metric, the computation is fairly fast, with no hint that .score_samples iterates over gazillions of points...

Any hint is appreciated.

1 Answer

You need to compute the density at the meshgrid points if you want to score the samples: .fit() does not evaluate the density anywhere, it only stores the training data (and builds the tree), so score_samples has to evaluate the kernel sum - and therefore the distances - at every query point. Depending on how you pass the metric, this will be done using a brute-force approach, which means computing the distances to all the points.
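To see why, here is a rough sketch of what a brute-force evaluation of a Gaussian KDE does (an illustration, not sklearn's actual implementation):

import numpy as np
from scipy.special import logsumexp

def kde_log_density_brute(X_train, X_query, h):
    # The log-density at each query point is a sum of kernel terms over
    # *all* training points, so every query/train distance is needed.
    d = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=-1)
    n, dim = X_train.shape
    log_norm = -np.log(n) - dim * np.log(h) - 0.5 * dim * np.log(2.0 * np.pi)
    return logsumexp(-0.5 * (d / h) ** 2, axis=1) + log_norm

With a 64000-point grid that is 64000 × n_train distance evaluations, and with a "pyfunc" metric each one is a Python function call instead of compiled code, which is why the euclidean version feels so much faster.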

You can use your metric with the built-in BallTree, which might save you some computation, but that depends on your dataset and the metric you use.
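For concreteness, a minimal sketch of that setup (the data, bandwidth, and distance function here are made-up placeholders):

import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))             # illustrative training data
meshgrid = rng.uniform(-3, 3, (64000, 2))  # query grid, as in the question

def my_distance(a, b):
    # stand-in for the slow custom metric; a and b are 1-D point arrays
    return np.sqrt(np.sum((a - b) ** 2))

kde = KernelDensity(
    bandwidth=0.5,
    algorithm="ball_tree",                 # tree can prune groups of points
    metric="pyfunc",
    metric_params={"func": my_distance},
)
kde.fit(X)
surface = np.exp(kde.score_samples(meshgrid))

The ball tree bounds whole groups of training points, so score_samples can sometimes skip an entire node at once; how much that saves depends on the bandwidth, the data distribution, and on the custom function being a proper metric (the tree relies on the triangle inequality).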

Andreas Mueller
  • What do you mean exactly by "how you pass the metric"? I wrote a class that does some precomputations (and also calculates the distance), and then pass one method of that class in a dict to metric_params in KernelDensity. I also use the built-in ball tree. The call is something like `KernelDensity(...metric="pyfunc",metric_params={"func":fancyClass.distanceMethod,"more_metric params":more_values},algorithm="ball_tree")` –  May 16 '15 at 19:31
  • Well, that is exactly what I meant. That is the best you can do, I think. – Andreas Mueller May 18 '15 at 16:06