
The goal is to sample n data points from the original population. The population has serial correlation (think of it as time series data), so I want each draw to pick three neighboring data points as one unit. Sampling must be done without replacement.

The draws repeat until the number of sampled data points reaches n. Each chosen data point has to be unique. (Assume the population data points are all unique.)

How can I implement this? Ideally the code should be fast.

import numpy as np

def subsampling(self, population, size, consecutive=3):
    # Pick seed points at random, without replacement. Nothing prevents two
    # seeds from being neighbors, so their blocks can overlap.
    seed_samples = np.random.choice(population,
                                    size=size // consecutive,
                                    replace=False)
    target_samples = set(seed_samples)
    # Add the (consecutive - 1) points that follow each seed.
    # Assumes `population` is sorted.
    for dpoint in seed_samples:
        start = np.searchsorted(population, dpoint, side='right')
        neighbors = population[start:start + consecutive - 1]
        target_samples.update(neighbors)  # update(), since add() cannot take an array

    return sorted(target_samples)

This is my rough attempt, but it does not return the correct number of points, because blocks from nearby seeds can overlap and produce duplicates.

hbadger19042
  • Did you try something? – Bando Jul 02 '20 at 13:41
  • @Bandoleras I added my trial code. – hbadger19042 Jul 02 '20 at 14:02
  • You can check whether the `neighbors` are already present in `target_samples`; if they are, drop that `dpoint` and draw a new one, and keep going until you have n `dpoints` with no overlapping neighbors. It is significantly slower, but it gets the work done – federicober Jul 02 '20 at 14:10
  • Can you share an example of a given input and the desired result, plus an undesired result? The explanation isn't too clear – yoav_aaa Jul 02 '20 at 14:10

1 Answer

Suppose the population is 1000 entries and you want 200 non-overlapping triplets.

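One simple method: draw 200 unique random integers x[0], x[1], ..., x[199] from the range 0 to 599 (there are 600 = 1000 - 200*2 candidates, because each triplet is collapsed to a single slot). Sort the values in ascending order; the indexes required for the triplets are then: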

0. x[0], x[0]+1, x[0]+2
1. x[1]+2, x[1]+3, x[1]+4
2. x[2]+4, x[2]+5, x[2]+6
...
i. x[i]+2*i, x[i]+2*i+1, x[i]+2*i+2
...
199. x[199]+398, x[199]+399, x[199]+400
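
The index scheme above can be sketched with NumPy roughly as follows. This is a minimal sketch, not a definitive implementation: the function name `sample_blocks` and its `rng` parameter are hypothetical, and it assumes `size` is a multiple of `consecutive` (any remainder is dropped) and that `population` is already in time order.

```python
import numpy as np

def sample_blocks(population, size, consecutive=3, rng=None):
    """Sample `size` points as non-overlapping runs of `consecutive` neighbors.

    Hypothetical helper illustrating the sorted-offset scheme; `rng` is an
    optional numpy Generator so that results can be reproduced.
    """
    rng = np.random.default_rng() if rng is None else rng
    population = np.asarray(population)
    n_blocks = size // consecutive
    # Collapse every block to a single slot: this many start slots remain.
    n_slots = len(population) - n_blocks * (consecutive - 1)
    if n_blocks > n_slots:
        raise ValueError("population too small for the requested sample")
    # Unique, sorted collapsed starts x[0] < x[1] < ...
    x = np.sort(rng.choice(n_slots, size=n_blocks, replace=False))
    # Expand again: the i-th block starts at x[i] + (consecutive - 1) * i.
    starts = x + (consecutive - 1) * np.arange(n_blocks)
    # Broadcasting yields every index of every block; flatten in block order.
    idx = (starts[:, None] + np.arange(consecutive)).ravel()
    return population[idx]
```

For the numbers in this answer, `sample_blocks(np.arange(1000), 600)` would return 200 triplets (600 points) in ascending order. There is no Python-level loop and no rejection step, so the cost is dominated by the random draw and the sort.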
6502