I have a large dataset (~300,000 data points) from which I sample about ~300,000 numbers. I first form an empirical CDF, and then use interp1d to create an interpolation object for the inverse of the CDF. Then I generate random numbers from a uniform distribution and evaluate the interpolated inverse CDF at each of them, which gives the sampled number:

import numpy as np
from scipy.interpolate import interp1d
import random
import click
import time

def sampleFromDistr(data, sampleSize):
    t0 = time.time()

    # forming empirical CDF
    sortedData = np.sort(np.array(data))
    yvals = np.arange(len(sortedData)) / float(len(sortedData))

    # linear interpolation object for the inverse of the cdf
    f = interp1d(yvals, sortedData)

    # create the sample set
    sample = []
    with click.progressbar(range(sampleSize), label='sampling') as bar:
        for i in bar:

            # sampling one by one
            sample.append(float(f(random.uniform(0, max(yvals)))))

    t1 = time.time()
    print(t1 - t0)

    return sample

The problem is that this code runs very slowly, and its speed seems to depend on the data.

So I ran a few tests, using uniformly distributed numbers as my dataset:

>>> test = [random.random() for s in range(1000)]
>>> sample = sampleFromDistr(test, 10)
sampling  [####################################]  100%
0.00515699386597
>>> sample = sampleFromDistr(test, 100)
sampling  [####################################]  100%
0.0200171470642
>>> sample = sampleFromDistr(test, 1000)
sampling  [####################################]  100%
0.16183590889
>>> sample = sampleFromDistr(test, 10000)
sampling  [####################################]  100%          
1.56129717827
>>> sample = sampleFromDistr(test, 100000)
sampling  [####################################]  100%          
16.2284870148
>>> sample = sampleFromDistr(test, 1000000)
sampling  [####################################]  100%          
174.504947901

This was surprising, because with a dataset of ~300,000 elements the estimated time for sampling was ~2 hours. So I tried increasing the dataset size:

>>> test = [random.random() for s in range(10000)]
>>> sample = sampleFromDistr(test, 1000000)
sampling  [#####-------------------------------]   15%  00:09:42

I also took a look at the interp1d source code. Is it the part where interp1d looks for the nearest neighbors, by calling numpy.searchsorted(), that is slowing the code down? If so, what could I do to make the code faster?

EDIT: I have found out that bisect.bisect() is 10 times faster than numpy.searchsorted(). Is it possible to change this part of the original interp1d method without modifying the original file?
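A comparison along these lines can be reproduced with `timeit` (a sketch; the exact ratio will depend on array size and hardware):

import timeit

setup = '''
import bisect
import numpy as np
a = np.sort(np.random.random(300000))
'''

# one scalar lookup per call, repeated many times
print(timeit.timeit('bisect.bisect(a, 0.5)', setup=setup, number=100000))
print(timeit.timeit('np.searchsorted(a, 0.5)', setup=setup, number=100000))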

EDIT 2: My attempt at a solution:

import numpy as np
from scipy.interpolate import interp1d
import random
import pdb
import click
import time
import bisect

clip = np.clip


class interp1dMod(interp1d):
    def _call_linear(self, x_new):
        # find where x_new would be inserted into self.x
        x_new_indices = bisect.bisect_left(self.x, x_new)
        # clip the indices to the valid range, as in the scipy source
        x_new_indices = x_new_indices.clip(1, len(self.x) - 1).astype(int)
        lo = x_new_indices - 1
        hi = x_new_indices

        x_lo = self.x[lo]
        x_hi = self.x[hi]
        y_lo = self._y[lo]
        y_hi = self._y[hi]

        slope = (y_hi - y_lo) / (x_hi - x_lo)[:, None]

        y_new = slope * (x_new - x_lo)[:, None] + y_lo

        return y_new


def sampleFromDistr(data, sampleSize):
    t0 = time.time()
    sortedData = np.sort(np.array(data))
    yvals = np.arange(len(sortedData)) / float(len(sortedData))
    f = interp1dMod(yvals, sortedData)

    sample = []
    with click.progressbar(range(sampleSize), label='sampling') as bar:
        for i in bar:
            sample.append(float(f(random.uniform(0, max(yvals)))))
    t1 = time.time()
    print(t1 - t0)
    return sample

which results in the following error: `AttributeError: 'int' object has no attribute 'clip'`. What am I doing wrong?

sodiumnitrate
  • Why do you define `clip=np.clip`, but later use `x_new_indices.clip()`? While an array has a `clip` method, ordinary numbers don't. Without digging to the details, it appears that `bisect` is returning a number, not an array. – hpaulj Aug 20 '15 at 16:51
  • @hpaulj I had not realized that `bisect` was returning a number and thought I was not able to import the modules correctly. – sodiumnitrate Aug 20 '15 at 17:21
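(To illustrate the point in the comment above with a minimal demonstration, not from the original thread: plain Python ints lack the `.clip` method that NumPy arrays have.)

>>> import bisect
>>> bisect.bisect_left([0.1, 0.4, 0.7], 0.5)  # returns a plain Python int
2
>>> (2).clip(1, 5)
Traceback (most recent call last):
  ...
AttributeError: 'int' object has no attribute 'clip'
>>> import numpy as np
>>> np.array([2]).clip(1, 5)  # NumPy arrays do have .clip
array([2])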

1 Answer

You're doing a lot of strange things here. :-)

- You're computing `max(yvals)` again for every sample, which means looping over `len(sortedData)` numbers each time, and you're using the Python built-in rather than the array method to do it.
- You're not taking advantage of vectorization: each point goes through a slow Python-level loop and a separate call to `f`.
- Even your progress bar seems to slow things down.
- In your new code you're using `bisect.bisect`, but that will only return a single Python integer, so calling the result `x_new_indices` seems strange, and, as noted in the comments, a plain `int` has no `.clip` method.
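For what it's worth, if you wanted the bisect-based subclass to run at all, you would need to collect the per-point bisect results into an array yourself. Here is a sketch of that idea, using the question's imports (note that `_call_linear` is a private scipy detail that may change between versions, and this still loops in Python, so it stays slow):

class interp1dModBisect(interp1d):
    def _call_linear(self, x_new):
        # bisect handles one scalar at a time, so look up each new
        # point separately and gather the indices into an array
        x_new = np.atleast_1d(x_new)
        x_new_indices = np.array([bisect.bisect_left(self.x, v)
                                  for v in x_new])
        # clip to the valid range, as in the scipy source
        x_new_indices = x_new_indices.clip(1, len(self.x) - 1)
        lo = x_new_indices - 1
        hi = x_new_indices

        x_lo = self.x[lo]
        x_hi = self.x[hi]
        y_lo = self._y[lo]
        y_hi = self._y[hi]

        slope = (y_hi - y_lo) / (x_hi - x_lo)[:, None]
        return slope * (x_new - x_lo)[:, None] + y_lo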

Anyway, if we're restricting ourselves to numpy (and not subclassing scipy.stats.rv_continuous), I'd do something like

def sampleFromDistr_vectorized(data, sampleSize):
    t0 = time.time()

    # forming empirical CDF
    sortedData = np.sort(np.array(data))
    yvals = np.arange(len(sortedData)) / float(len(sortedData))

    # linear interpolation object for the inverse of the cdf
    f = interp1d(yvals, sortedData)

    # get the random numbers
    r = np.random.uniform(0, yvals.max(), sampleSize)
    # interpolate
    sample = f(r)
    t1 = time.time()
    print(t1 - t0)
    return sample

which gives me (here sampleFromDistr_onemax_noprogressbar is the intermediate version that computes max(yvals) once and drops the progress bar):

>>> test = np.random.random(10**3)
>>> sample = sampleFromDistr(test, 10**4)
sampling  [####################################]  100%          
1.4801428318023682
>>> sample = sampleFromDistr_onemax_noprogressbar(test, 10**4)
0.26591944694519043
>>> sample = sampleFromDistr_vectorized(test, 10**4)
0.00497126579284668

and so

>>> test = np.random.random(10**6)
>>> sample = sampleFromDistr_vectorized(test, 10**6)
0.3583641052246094

versus

>>> sample = sampleFromDistr(test, 10**6)
sampling  [------------------------------------]    0%  12:23:25

(At less than a second I'd just run with it, but if this started using too much time I'd use an alias method, which is O(1) after preprocessing. But here it's not worth the headache.)
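For completeness, here is a sketch of the alias method alluded to above, in Vose's formulation; the function names build_alias and alias_draw are illustrative, not from any library. Setup is O(n) and every subsequent draw is O(1):

import numpy as np

def build_alias(probs):
    # Vose's alias method: O(n) preprocessing of a discrete distribution
    n = len(probs)
    prob = np.zeros(n)
    alias = np.zeros(n, dtype=int)
    scaled = np.asarray(probs, dtype=float) * n
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s] = scaled[s]
        alias[s] = l
        scaled[l] += scaled[s] - 1.0
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:  # leftovers equal 1.0 up to rounding
        prob[i] = 1.0
    return prob, alias

def alias_draw(prob, alias, size):
    # each draw: pick a column uniformly, then flip a biased coin
    cols = np.random.randint(0, len(prob), size)
    coins = np.random.uniform(size=size)
    return np.where(coins < prob[cols], cols, alias[cols])

For the empirical case in the question the weights are all equal, so the table is trivial; the method pays off for skewed discrete weights.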

DSM