
I have a random variable as follows:

f(x) = 1 with probability g(x)

f(x) = 0 with probability 1-g(x)

where 0 < g(x) < 1.

Assume g(x) = x. Suppose I observe this variable without knowing the function g and obtain 200 samples, as follows:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binned_statistic

# use a descriptive name instead of shadowing the built-in `list`
samples = np.empty(shape=(200, 2))

g = np.random.rand(200)
for i in range(len(g)):
    samples[i] = (g[i], np.random.choice([0, 1], p=[1 - g[i], g[i]]))

print(samples)
plt.plot(samples[:, 0], samples[:, 1], 'o')
plt.show()

Plot of 0s and 1s

Now, I would like to recover the function g from these points. The best approach I could come up with is to draw a histogram and use the mean statistic in each bin:

bin_means, bin_edges, bin_number = binned_statistic(samples[:, 0], samples[:, 1], statistic='mean', bins=10)
plt.hlines(bin_means, bin_edges[:-1], bin_edges[1:], lw=2)
plt.show()

Histogram mean statistics

Instead, I would like to have a continuous estimation of the generating function.

I guess it is about kernel density estimation but I could not find the appropriate pointer.

  • You can find KDEs in `statsmodels` and `sklearn`, and `scipy` has one too. If you just want a plot, look at `seaborn` and its `distplot` or `kdeplot`. But why do you want a KDE for binary data? – Marvin Taschenberger Jul 14 '17 at 14:34
  • @MarvinTaschenberger My remarks about KDE may be misleading. It seems that I have a logistic regression problem: https://en.wikipedia.org/wiki/Logistic_regression#Example:_Probability_of_passing_an_exam_versus_hours_of_study. But I am not trying to fit a model; I want to plot it in a smooth fashion. – user1860037 Jul 14 '17 at 15:25
  • 1
    This also looks relevant: http://thestatsgeek.com/2014/09/13/checking-functional-form-in-logistic-regression-using-loess/ – user1860037 Jul 14 '17 at 15:35

1 Answer


This is straightforward without explicitly fitting an estimator yourself:

import seaborn as sns

sns.lmplot(x=..., y=..., data=..., y_jitter=.02, logistic=True)

Plug in your exogenous variable as x and, analogously, your dependent variable as y (both as column names of the DataFrame passed via data). y_jitter jitters the points vertically for better visibility when you have many data points. logistic=True is the main point here: it draws the logistic regression line for the data.

Seaborn is built on top of matplotlib and works great with pandas, in case you want to move your data into a DataFrame.
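For the data from the question, a complete call might look like this (a sketch: the column names 'x' and 'y' and the random seed are my own choices, and logistic=True requires statsmodels to be installed):

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.random(200)
y = rng.binomial(1, x)                # f(x) = 1 with probability g(x) = x
df = pd.DataFrame({"x": x, "y": y})

# logistic=True fits and draws a logistic regression curve through the 0/1 points
grid = sns.lmplot(x="x", y="y", data=df, y_jitter=.02, logistic=True)
plt.show()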

  • Now I understand that what I was looking for is Locally Weighted Scatterplot Smoothing (LOWESS). Thank you for pointing to sns. `df = pd.DataFrame()`, `df['x'] = list[:,0]`, `df['y'] = list[:,1]`, `sns.lmplot(x='x', y='y', data=df, lowess=True)`, `plt.show()` – user1860037 Jul 15 '17 at 06:32
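The LOWESS smoothing the last comment arrives at can also be computed directly with statsmodels, without going through seaborn (a sketch: the sample size, seed, and frac are illustrative choices, with data generated from the question's true g(x) = x):

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(42)
x = rng.random(2000)
y = rng.binomial(1, x)              # f(x) = 1 with probability g(x) = x

# lowess returns an array sorted by x: column 0 is x, column 1 is the estimate of g(x)
smoothed = lowess(y, x, frac=0.3)
x_s, g_hat = smoothed[:, 0], smoothed[:, 1]

plt.plot(x, y, 'o', alpha=0.2)
plt.plot(x_s, g_hat, 'r-')          # should track the diagonal here, since g(x) = x
plt.show()

The frac parameter controls the width of the local smoothing window and trades bias against variance; it has to be tuned to the data.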