
This is my code:

import numpy as np
from scipy.stats import gaussian_kde
from numpy import linspace
from pylab import plot, show, hist

attribute_file = "path"

with open(attribute_file) as f:
    attribute_values = [line.rstrip('\n') for line in f]

obs = []

# Assume the list obs has been loaded from attribute_values

obs = np.sort(np.asarray(obs), kind='mergesort')
x_min = obs[0]
x_max = obs[-1]

# obtain the pdf (my_pdf is a callable!)
my_pdf = gaussian_kde(obs)

# plot the result
x = linspace(x_min, x_max, 1000)

plot(x, my_pdf(x), 'r')  # estimated density

hist(obs, density=True, alpha=.3)  # histogram (normed=1 on older matplotlib)
show()

new_values = np.asarray([-1, 0, 2, 3, 4, 5, 768])[:, np.newaxis]
for e in new_values:
    print(str(e) + " - " + str(my_pdf(e) * 100 * 2))

Problem: The obs array contains all my observations. I need to calculate a score (between 0 and 1) for new values:

[-1, 0, 2, 3, 4, 500, 768]

So the value -1 should still get a decent score: it never appears in the observations, but it is right next to 1, which is very common among them.

Usi Usi
  • What should your score represent? Using KDE you'll get high scores for values close to the frequent ones in your dataset. If you're interested in a different outcome, maybe you should think about using a different model. – liborm Aug 21 '15 at 18:12

2 Answers


The reason for that is that you have many more 1's in your observations than 768's. So even if -1 is not exactly 1, it gets a high predicted value, because the histogram has a much larger value at 1 than at 768.

Up to a multiplicative constant, the prediction at a point x is:

$$\hat{f}(x) \propto \sum_{x_i \in D} K\left(\frac{x - x_i}{h}\right)$$

where K is your kernel, D your observations and h your bandwidth. Looking at the docs for gaussian_kde, we see that if no value is provided for bw_method, it is estimated automatically (Scott's rule by default), which here doesn't suit you.
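
To make the formula concrete, here is a minimal sketch of that kernel sum with a standard Gaussian kernel (an illustration, not scipy's exact implementation: gaussian_kde also scales the bandwidth by the data's standard deviation and normalizes the result):

import numpy as np

def kernel_sum(x, data, h):
    # Sum over all observations x_i of K((x - x_i) / h),
    # with K the standard Gaussian kernel (up to a constant factor).
    data = np.asarray(data, dtype=float)
    return np.exp(-0.5 * ((x - data) / h) ** 2).sum()

# With a large h, faraway observations still contribute at x = -1;
# with a small h, only observations very close to x contribute.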

So you can try different values: the larger the bandwidth, the more points far from your new data are taken into account, the limit case being an almost constant predicted function.

On the other hand, a very small bandwidth only takes really close points into account, which is what I think you want.
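
You can also probe this numerically at the two points of interest before plotting (a sketch, assuming obs is loaded as in the question):

for h in [0.01, 0.1, 1, 5]:
    kde = gaussian_kde(obs, bw_method=h)
    print("h={0}: pdf(-1)={1}, pdf(768)={2}".format(h, kde(-1)[0], kde(768)[0]))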

Some graphs to illustrate the influence of the bandwidth (four subplots showing the estimated density and histogram for bandwidths 0.01, 0.1, 1 and 5):

Code used:

import matplotlib.pyplot as plt

# x and obs as defined in the question
fig, axarr = plt.subplots(2, 2, figsize=(10, 10))
for i, h in enumerate([0.01, 0.1, 1, 5]):
    my_pdf = gaussian_kde(obs, h)
    axarr[i // 2, i % 2].plot(x, my_pdf(x), 'r')  # estimated density
    axarr[i // 2, i % 2].set_title("Bandwidth: {0}".format(h))
    axarr[i // 2, i % 2].hist(obs, density=True, alpha=.3)  # histogram

With your current code, for x = -1, the value of K((x - x_i)/h) for all the x_i's equal to 1 is smaller than 1, but you add up a lot of these values (there are 921 ones in your observations, and also 357 twos).

On the other hand, for x = 768, the value of the kernel is 1 for all the x_i's equal to 768, but there are not many such points (39, to be precise). So here a lot of "small" terms make a larger sum than a small number of larger terms.
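
You can reproduce this effect with synthetic data shaped like these counts (a sketch; 921, 357 and 39 are the counts quoted above):

import numpy as np
from scipy.stats import gaussian_kde

fake_obs = np.concatenate([np.ones(921), np.full(357, 2.0), np.full(39, 768.0)])
kde = gaussian_kde(fake_obs)
print(kde(-1)[0], kde(768)[0])  # the mass near 1 dominates: kde(-1) comes out larger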

If you don't want this behavior, you can decrease the bandwidth of your Gaussian kernel: this way the penalty (K(-2)) paid because of the distance between -1 and 1 will be higher. But I think this would be overfitting your observations.

Finding a formula to decide whether a new sample is acceptable (compared to your empirical distribution) or not is more of a statistical problem; you can have a look at stats.stackexchange.com.

You can always try a low value for the bandwidth, which will give you a peaked predicted function. Then you can normalize this function by dividing it by its maximal value.

After that, all predicted values will be between 0 and 1:

maxDensityValue = np.max(my_pdf(x))  # peak of the estimated density on the grid
for e in new_values:
    print("{0} {1}".format(e, my_pdf(e) / maxDensityValue))
P. Camilleri
  • Nice explanation, thank you too... what do you suggest to achieve what I need? – Usi Usi Aug 22 '15 at 00:55
  • Thanks for the precise improvements to your answer... Could you also give an example of the last part? How can I find the maximum value to normalize the function? – Usi Usi Aug 23 '15 at 08:52
  • @UsiUsi Not sure, but it seems that it's always going to be `my_pdf(1)`. Otherwise, just use `np.max(my_pdf(x))`. – P. Camilleri Aug 23 '15 at 13:26
  • Sorry, I was away for a short holiday. It seems ok... But what about normalizing the scores? It is difficult to find a threshold... I need something more sophisticated. These are my results: score for [-1] is [0.59625501], score for [1] is [0.98929683], score for [0] is [0.84244511], score for [10] is [0.00987971], score for [128] is [0.05891493], score for [256] is [0.10650007], score for [300] is [1.00605929e-08], score for [4] is [0.5433299], score for [500] is [3.18585441e-07], score for [768] is [0.02945747], score for [512] is [0.43053219] – Usi Usi Aug 28 '15 at 15:48
  • This looks like a statistical problem rather than a programming one... I'd suggest 1e-2 as a threshold, but purely as a rule of thumb. – P. Camilleri Aug 28 '15 at 15:57
  • I granted the bounty... if not, how can I grant it? – Usi Usi Aug 31 '15 at 00:59
  • Dear Massias, could you explain what x and x_i are in the formula? – Usi Usi Sep 11 '15 at 14:52
  • x is the point where you want to predict your function; the x_i's (i varying) are all the points in your dataset ("observations"). – P. Camilleri Sep 11 '15 at 19:30

-1 and 0 are both very close to 1, which occurs very frequently, so they will be predicted to have a higher value. (This is why 0 has a higher value than -1 even though neither shows up: 0 is closer to 1.)
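
A quick check of this ordering (assuming my_pdf from the question):

for e in [-1, 0, 1]:
    print(e, my_pdf(e)[0])  # the closer e is to the frequent value 1, the higher the density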

What you need is a smaller bandwidth. Look at the line in your graph to see this: right now, numbers as far away as 80, which don't show up at all, are getting a lot of value because of their proximity to 1 and 2.
Just pass a scalar as bw_method to achieve this:

my_pdf = gaussian_kde(obs, 0.1)

This may not be the exact scalar you want, but try changing 0.1 to 0.05 or even less, and see what fits what you are looking for.

Also, if you want a value between 0 and 1, you need to make sure that my_pdf() never returns a value over 0.005, because you are multiplying it by 200.
Here is what I mean:

for e in new_values:
    print(str(e) + " - " + str(my_pdf(e) * 100 * 2))

The value you are outputting is:

my_pdf(e)*100*2 == my_pdf(e)*200
# You want the max value to be 1, so
1 >= my_pdf(e)*200
# Divide both sides by 200
0.005 >= my_pdf(e)

So my_pdf() needs to have a maximum value of 0.005, or you can just scale the output.

For the maximum value to be 1 and stay proportionate to the input, no matter the input, you need to first collect the outputs and then scale them based on the largest value.
Example:

orig_val = []  # intermediate list for the raw outputs

for e in new_values:
    orig_val.append(my_pdf(e) * 100 * 2)  # fill with the raw scores

max_val = max(orig_val)
for value, raw in zip(new_values, orig_val):
    print(str(value) + " - " + str(raw / max_val))  # scale by the largest value

Learn more about gaussian_kde here: scipy.stats.gaussian_kde

ThatGuyRussell
  • Thank you so much for your answer, it helps me a lot. But I can't understand the part "Also, if you want a value between 0 and 1, you need to make sure that my_pdf() never returns a value over 0.005, because you are multiplying it by 200." Can you add more info about it? What I need is a sort of threshold... If the score is greater than the threshold, the value is reliable; if it is lower than the threshold, my algo must discard it... Thanks – Usi Usi Aug 22 '15 at 00:13
  • Certainly @UsiUsi, I will clear that up in my code now; tell me if it helps! By the way, did changing the bandwidth work for you? – ThatGuyRussell Aug 22 '15 at 00:37
  • Sure, I did a fast test, and with your suggestion the KDE follows the original distribution better... this way I get better results... to be clearer, now I get a higher score for the values that are very common in the original observations, and a lower score for all other values... That's in part what I need... What I am missing at the moment is a way to calculate a score only between 0 and 1. I need a formula to create a threshold that gives the algo a way to decide whether a new value is trustworthy or not – Usi Usi Aug 22 '15 at 00:54
  • @UsiUsi Let me know if my solution for the scaling helped! – ThatGuyRussell Aug 22 '15 at 01:14
  • @UsiUsi Also, if this solution does work for you, please accept it so other users searching for the answer to this question can know that this solution works. – ThatGuyRussell Aug 22 '15 at 03:04
  • Don't mean to sound petty, but you can also accept another answer if it's better. – P. Camilleri Aug 22 '15 at 11:45
  • @M.Massias Of course. Sorry if I made it seem like accepting a better option wasn't a choice. Your answer wasn't up at the time; it is a great answer. – ThatGuyRussell Aug 22 '15 at 14:32