3

Given a distribution, let's say, a gaussian:

import pandas as pd
import numpy as np

gaussian_distribution = np.random.normal(0,1,10_000)

This sample looks like this:

[Image: histogram of the Gaussian sample]

What I want to do is to resample this distribution to somehow get a uniform distribution, so:

Pr(X) = Pr(X+W)

I am not worried about ending up with n < 10_000; I just want to remove the peak of the distribution.

I read something about interpolating a distribution for this, but I could not figure out how it works.

Victor Maricato
  • 1
    What is X and what is W in Pr(X) = Pr(X+W)? What do you mean by resample the distribution? – Chachni Mar 09 '21 at 20:48
  • 1
    Does this answer your question? https://stackoverflow.com/questions/63738389/pandas-sampling-from-a-dataframe-according-to-a-target-distribution/63739234#63739234 – anon01 Mar 11 '21 at 06:38
  • @Chachni The Pr(X) = Pr(X+W) means that the probability is uniform. The Pr describes the probability density function. Resample the distribution means downsample the original distribution in a way that it is now distributed as the desired distribution (in this case, uniform). Resampling is just to clarify that I do not want a brand new uniform distribution, I want the original one to look like a uniform distribution. – Victor Maricato Mar 11 '21 at 15:14
  • @anon01 Yes, that is exactly what I was looking for – Victor Maricato Mar 11 '21 at 15:30
  • If I find time this weekend I'll write an improved/clarified version here – anon01 Mar 11 '21 at 16:03
  • Could you also provide some guidance on how to find those "sample_probs" for an arbitrary target and source distribution, not only Gaussian and uniform? The arbitrary target is only curiosity; my use case involves only uniform. But an arbitrary source would be helpful: my real distribution is approximately gamma distributed, not Gaussian. I used a Gaussian in the question only for simplicity. :S – Victor Maricato Mar 11 '21 at 16:27
  • There is an issue: samples from a normal distribution are *unbounded*; they can be arbitrarily large, though such values are rare. That means the output after resampling would be *unbounded* as well. This is a problem because a uniform distribution is *bounded* to an interval and has zero density outside it. So there is no well-defined transform that maps normal samples to uniform ones. You need to provide bounds such as (-3, 3). – tstanisl Mar 12 '21 at 23:41

2 Answers

3

I am not sure why you would want to do this, or why it is important to keep the original samples as opposed to resampling a uniform distribution with boundaries corresponding to your histogram's. But here is an approach, as you requested: take a histogram of sufficient granularity and resample the points falling into each bin inverse-proportionally to the bin height. You would end up taking an equal number (roughly) of points from each bin interval.

import numpy as np

x = np.random.randn(10_000)
counts, bins = np.histogram(x, bins=10)
n_per_bin = counts.min()  # the smallest bin sets how many points every bin keeps
subsampled = []
for i in range(len(bins) - 1):
    if i == len(bins) - 2:
        # last bin is inclusive on both sides
        section = x[(x >= bins[i]) & (x <= bins[i + 1])]
    else:
        section = x[(x >= bins[i]) & (x < bins[i + 1])]
    # draw the same number of points, without replacement, from each bin
    sub_section = np.random.choice(section, n_per_bin, replace=False)
    subsampled.extend(sub_section)

A limitation of this quick & dirty solution is that the smallest bin dictates the height of your resulting uniform distribution. As a consequence, using fewer bins will make the subsampled points less uniform but will let you retain more of them. You could also cut off the tails to remedy this.
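Cutting off the tails could look roughly like the sketch below. The function name `subsample_uniform`, the bounds of -2 and 2, and the fixed seeds are illustrative assumptions, not part of the snippet above; pick bounds that suit your data.

```python
import numpy as np

def subsample_uniform(x, low, high, n_bins=10, seed=0):
    """Keep an equal number of points from each bin of [low, high]."""
    rng = np.random.default_rng(seed)
    x = x[(x >= low) & (x <= high)]              # cut off the tails first
    edges = np.linspace(low, high, n_bins + 1)
    # Bin index of every point; clip so that x == high lands in the last bin
    idx = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)
    counts = np.bincount(idx, minlength=n_bins)
    per_bin = counts.min()                       # the sparsest bin sets the height
    kept = [rng.choice(x[idx == i], size=per_bin, replace=False)
            for i in range(n_bins)]
    return np.concatenate(kept)

x = np.random.default_rng(0).normal(0, 1, 10_000)
flat = subsample_uniform(x, -2.0, 2.0)
```

With the far tails removed, the sparsest bin holds many more points than it does over the full range, so noticeably more of the original sample should survive.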

Original: histogram of x

Subsampled: histogram of subsampled
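The same idea does not depend on the source being Gaussian, since the histogram itself serves as the density estimate. As a hedged sketch, here is a probabilistic variant for a gamma-shaped sample: each point is kept with probability inversely proportional to the count of its bin. The gamma parameters, the interval 0.5 to 4.0, and the bin count are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.gamma(shape=2.0, scale=1.0, size=10_000)   # stand-in for a real sample

low, high, n_bins = 0.5, 4.0, 20                   # assumed target interval
inside = x[(x >= low) & (x <= high)]
edges = np.linspace(low, high, n_bins + 1)
idx = np.clip(np.digitize(inside, edges) - 1, 0, n_bins - 1)
counts = np.bincount(idx, minlength=n_bins)

# Keep probability is inversely proportional to the local bin count,
# scaled so that the sparsest bin is kept in full.
keep_prob = counts.min() / counts[idx]
resampled = inside[rng.random(inside.size) < keep_prob]
```

Unlike the equal-count loop, this keeps a random number of points per bin, so the result is only uniform on average. It does avoid the explicit per-bin loop, and it breaks down if any bin in the interval is empty, because every keep probability then becomes zero.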

  • I could not yet try your approach, but it seems to do what I was expecting to indeed. I think that the behaviour in which the smallest bin delimitates the others is expected, if that wasn't the case, the final distribution would not be uniform. – Victor Maricato Mar 08 '21 at 16:31
-2

There is a function called np.random.uniform

import numpy as np
import matplotlib.pyplot as plt

low, high = 0, 0.1  # bounds of the uniform interval
s = np.random.uniform(low, high, 1000)
count, bins, ignored = plt.hist(s, 30, density=True)
# overlay the expected uniform density, 1 / (high - low)
plt.plot(bins, np.full_like(bins, 1 / (high - low)), linewidth=2, color='r')
plt.show()
