I have a large dataset (~300,000 data points) from which I sample ~300,000 numbers. I first form an empirical CDF, and then use interp1d to create an interpolation object for the inverse of the CDF. To draw a sample, I generate a random number from a uniform distribution and evaluate the interpolated function at that point:
import numpy as np
from scipy.interpolate import interp1d
import random
import click
import time

def sampleFromDistr(data, sampleSize):
    t0 = time.time()
    # form the empirical CDF
    sortedData = np.sort(np.array(data))
    yvals = np.arange(len(sortedData)) / float(len(sortedData))
    # linear interpolation object for the inverse of the CDF
    f = interp1d(yvals, sortedData)
    # create the sample set
    sample = []
    with click.progressbar(range(sampleSize), label='sampling') as bar:
        for i in bar:
            # sampling one by one
            sample.append(float(f(random.uniform(0, max(yvals)))))
    t1 = time.time()
    print t1 - t0
    return sample
The problem is that this piece of code runs really slowly, and its speed seems to depend on the data. So I ran a few tests, using uniformly distributed numbers as my dataset:
>>> test = [random.random() for s in range(1000)]
>>> sample = sampleFromDistr(test, 10)
sampling [####################################] 100%
0.00515699386597
>>> sample = sampleFromDistr(test, 100)
sampling [####################################] 100%
0.0200171470642
>>> sample = sampleFromDistr(test, 1000)
sampling [####################################] 100%
0.16183590889
>>> sample = sampleFromDistr(test, 10000)
sampling [####################################] 100%
1.56129717827
>>> sample = sampleFromDistr(test, 100000)
sampling [####################################] 100%
16.2284870148
>>> sample = sampleFromDistr(test, 1000000)
sampling [####################################] 100%
174.504947901
This was surprising, because with a dataset of ~300,000 elements the estimated time for sampling was ~2 hours. So I tried increasing the dataset size:
>>> test = [random.random() for s in range(10000)]
>>> sample = sampleFromDistr(test, 1000000)
sampling [#####-------------------------------] 15% 00:09:42
I also took a look at the interp1d source code. Is it the part where interp1d looks for the nearest neighbors, by calling numpy.searchsorted(), that is slowing the code down? If so, what could I do to make the code faster?
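One quick way to check where the time goes is to time a single scalar evaluation against one vectorized call over the same number of points. Here is a minimal sketch using the standard timeit module, mirroring the 1000-element test set from the timings above (exact numbers are machine-dependent):

import timeit

setup = """
import numpy as np
from scipy.interpolate import interp1d
import random
data = [random.random() for s in range(1000)]
sortedData = np.sort(np.array(data))
yvals = np.arange(len(sortedData)) / float(len(sortedData))
f = interp1d(yvals, sortedData)
"""

# 10000 scalar evaluations, one call per point (as in the loop above)
print timeit.timeit('f(random.uniform(0, yvals[-1]))',
                    setup=setup, number=10000)
# the same 10000 points evaluated in a single vectorized call
print timeit.timeit('f(np.random.uniform(0, yvals[-1], 10000))',
                    setup=setup, number=1)

If the second number is far smaller, most of the cost is per-call overhead (argument checking plus a searchsorted over a one-element array) rather than the search itself.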
EDIT: I have found out that bisect.bisect() is 10 times faster than numpy.searchsorted(). Is it possible to change this part of the original interp1d method without modifying the original file?
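For reference, a comparison along these lines shows the gap for single-value lookups (a sketch; the exact ratio varies with machine and array size):

import timeit

setup = """
import numpy as np
import bisect
import random
x = np.sort(np.random.uniform(0, 1, 1000))
"""

# scalar lookup via numpy.searchsorted
print timeit.timeit('np.searchsorted(x, random.random())',
                    setup=setup, number=100000)
# scalar lookup via the pure-Python bisect module
print timeit.timeit('bisect.bisect(x, random.random())',
                    setup=setup, number=100000)

As for avoiding changes to the installed file: subclassing interp1d and overriding _call_linear, as in the attempt below, keeps SciPy's source untouched, with the caveat that _call_linear is a private method whose behavior may differ between SciPy versions.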
EDIT 2: My attempt at a solution:
import numpy as np
from scipy.interpolate import interp1d
import random
import pdb
import click
import time
import bisect

clip = np.clip

class interp1dMod(interp1d):
    def _call_linear(self, x_new):
        x_new_indices = bisect.bisect_left(self.x, x_new)
        x_new_indices = x_new_indices.clip(1, len(self.x) - 1).astype(int)
        lo = x_new_indices - 1
        hi = x_new_indices
        x_lo = self.x[lo]
        x_hi = self.x[hi]
        y_lo = self._y[lo]
        y_hi = self._y[hi]
        slope = (y_hi - y_lo) / (x_hi - x_lo)[:, None]
        y_new = slope * (x_new - x_lo)[:, None] + y_lo
        return y_new

def sampleFromDistr(data, sampleSize):
    t0 = time.time()
    sortedData = np.sort(np.array(data))
    yvals = np.arange(len(sortedData)) / float(len(sortedData))
    f = interp1dMod(yvals, sortedData)
    sample = []
    with click.progressbar(range(sampleSize), label='sampling') as bar:
        for i in bar:
            sample.append(float(f(random.uniform(0, max(yvals)))))
    t1 = time.time()
    print t1 - t0
    return sample
which results in the following error: AttributeError: 'int' object has no attribute 'clip'. What am I doing wrong?
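My current guess: bisect.bisect_left() returns a plain Python int, whereas numpy.searchsorted() returns an array, so the .clip()/.astype() lines carried over from the SciPy source no longer apply. A scalar-only version of the lookup, assuming one query point per call and 1-D y data, might look roughly like this (just a sketch, not tested against interp1d's other code paths):

class interp1dMod(interp1d):
    def _call_linear(self, x_new):
        # bisect_left returns a plain int, so clamp it with min/max
        # instead of ndarray.clip()
        hi = bisect.bisect_left(self.x, x_new)
        hi = max(1, min(hi, len(self.x) - 1))
        lo = hi - 1
        x_lo = self.x[lo]
        x_hi = self.x[hi]
        y_lo = self._y[lo]
        y_hi = self._y[hi]
        # scalar slope: no [:, None] broadcasting needed for 1-D y
        slope = (y_hi - y_lo) / (x_hi - x_lo)
        return slope * (x_new - x_lo) + y_lo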