How to resample without replacement, treating three consecutive points as one unit per draw

Question

The goal is to sample n data points from the original population. The population is serially correlated (think of it as time series data), so I want each draw to pick three neighboring data points as one unit. In other words, every draw must select three consecutive data points, and the sampling must be done without replacement.

The draws repeat until the sample contains n data points. Each chosen data point must be unique. (Assume all population data points are unique.)

How can I write this in code? I would like the code to be fast.

import numpy as np

def subsampling(self, population, size, consecutive=3):
    # Pick seed points; nothing here stops two seeds from being neighbors
    seed_samples = np.random.choice(population,
                                    size=size // consecutive,
                                    replace=False)
    target_samples = set(seed_samples)
    # Add the next (consecutive - 1) neighbors of each seed
    # (searchsorted assumes population is sorted)
    for dpoint in seed_samples:
        start = np.searchsorted(population, dpoint, side='right')
        neighbors = population[start:(start + consecutive - 1)]
        # update() adds each neighbor; add() would fail on an unhashable array
        target_samples.update(neighbors)

    return sorted(target_samples)

This code is my rough attempt, but it doesn't give the correct size because there can be duplicates.

You can check whether the `neighbors` are already present in `target_samples`; if they are, drop the `dpoint` and draw a new one, and keep going until you have n `dpoints` with no overlapping neighbors. It is significantly longer, but it gets the work done – federicober, Jul 02 '20 at 14:10
Can you share an example of a given input with the desired and undesired results? The explanation isn't too clear – yoav_aaa, Jul 02 '20 at 14:10
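
The retry idea from the first comment could look roughly like the sketch below. It works on indexes rather than values, and the function name `sample_blocks_by_rejection`, its parameters, and the `max_tries` guard are all hypothetical, not something from the thread. The guard is there because the loop can stall when the remaining gaps are too short to hold another block.

```python
import random

def sample_blocks_by_rejection(n_population, n_blocks, consecutive=3,
                               seed=None, max_tries=100_000):
    """Return sorted start indexes of n_blocks non-overlapping runs.

    Hypothetical helper: draws a start at random, and redraws whenever the
    run starting there would overlap an index that is already taken.
    """
    if n_blocks * consecutive > n_population:
        raise ValueError("population too small for the requested blocks")
    rng = random.Random(seed)
    taken = set()    # every index already covered by an accepted block
    starts = []
    tries = 0
    while len(starts) < n_blocks:
        tries += 1
        if tries > max_tries:
            # Leftover gaps may all be shorter than one block
            raise RuntimeError("gave up; remaining gaps may be too short")
        start = rng.randrange(n_population - consecutive + 1)
        block = range(start, start + consecutive)
        if any(i in taken for i in block):
            continue  # overlaps an earlier block: drop it and redraw
        taken.update(block)
        starts.append(start)
    return sorted(starts)

starts = sample_blocks_by_rejection(1000, 100, seed=1)
```

Each start `s` stands for the indexes `s, s+1, s+2`. The Python-level loop and the retries make this slower than a vectorized approach, especially when the blocks cover a large share of the population.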

score 1 · Accepted Answer · answered Jul 02 '20 at 14:15

Suppose the population has 1000 entries and you want 200 non-overlapping triplets.

One simple method: draw 200 unique random numbers x[0], x[1], ..., x[199] from 0 to 599. The range has 600 values because each triplet uses two positions beyond its start, so 600 = 1000 - 200*2. Sort the values; the required indexes for the triplets are then:

0. x[0], x[0]+1, x[0]+2
1. x[1]+2, x[1]+3, x[1]+4
2. x[2]+4, x[2]+5, x[2]+6
...
n. x[n]+2*n, x[n]+2*n+1, x[n]+2*n+2
...
199. x[199]+398, x[199]+399, x[199]+400
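
A minimal NumPy sketch of the scheme above, generalized from triplets to runs of `consecutive` points. The function name `sample_blocks` and the `rng` parameter are assumptions made for illustration. If `size` is not a multiple of `consecutive`, this sketch returns only the whole blocks that fit.

```python
import numpy as np

def sample_blocks(population, size, consecutive=3, rng=None):
    """Draw about `size` points as non-overlapping runs of neighbors."""
    rng = np.random.default_rng() if rng is None else rng
    population = np.asarray(population)
    n_blocks = size // consecutive
    # Each block uses (consecutive - 1) positions beyond its start, so
    # remove them up front: 1000 - 200*2 = 600 slots in the example.
    n_slots = len(population) - n_blocks * (consecutive - 1)
    if n_blocks > n_slots:
        raise ValueError("population too small for the requested size")
    # Unique slot numbers, sorted: x[0] < x[1] < ... < x[n_blocks - 1]
    x = np.sort(rng.choice(n_slots, size=n_blocks, replace=False))
    # Block k starts at x[k] + (consecutive - 1) * k
    starts = x + (consecutive - 1) * np.arange(n_blocks)
    # Expand every start into its run: start, start + 1, ...
    idx = (starts[:, None] + np.arange(consecutive)).ravel()
    return population[idx]

sample = sample_blocks(np.arange(1000), 600, rng=np.random.default_rng(0))
```

Because the starts are strictly increasing by at least `consecutive`, the blocks cannot overlap, and no retry loop is needed. Everything after the single `choice` call is vectorized.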

answered Jul 02 '20 at 14:15 by 6502


Thanks. I thought I was doing something wrong but I couldn't figure it out. This idea is simple and elegant. – hbadger19042 Jul 02 '20 at 14:29
