
I have some data currently stored in 3 lists; let's call them a, b and c. The lists all contain n elements. I would like to take a random sample of my data, say of size sample_n, to create a smaller dataset to play around with, but I want to take the same random sample from each list. That is, if I randomly select element i, I would like to take element i from each list (a[i], b[i] and c[i]). I do not want to generate 3 separate sets of random numbers, because then the three lists' elements would no longer match. In other words, running the random sampling on each list separately is not what I want.

I would think all I need to do is generate a separate list of random numbers, random_list, that is of length sample_n and then do something like

sample_a, sample_b, sample_c = [], [], []
for i in random_list:
    # use the same index i for all three lists so the elements stay paired
    sample_a.append(a[i])
    sample_b.append(b[i])
    sample_c.append(c[i])

However, I don't know how to generate a list of random numbers! I was also wondering whether there is a more efficient method than the one I have in mind here.

2 Answers

1

You can shuffle a list of indices and then keep the first sample_n of them. First way, with NumPy:

import numpy as np
indices = list(range(20))
np.random.shuffle(indices)
indices
[9, 13, 0, 19, 17, 10, 14, 5, 7, 18, 8, 3, 16, 4, 15, 11, 12, 6, 1, 2]

Second way:

import random
indices = list(range(20))
random.shuffle(indices)
indices
[5, 3, 11, 7, 19, 12, 0, 13, 2, 4, 10, 18, 1, 16, 17, 14, 8, 6, 9, 15]

Or, if indices are allowed to repeat (sampling with replacement; note that the upper bound of randint is exclusive):

np.random.randint(1,5, size=20)
array([1, 2, 3, 4, 3, 3, 4, 3, 1, 4, 2, 3, 4, 2, 3, 2, 1, 4, 1, 3])
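
Whichever way the indices are produced, the point is to draw them once and reuse them for every list. A minimal sketch using the standard library follows; the contents of `a`, `b` and `c` and the value of `sample_n` are made-up placeholders, not the asker's data.

```python
import random

# Hypothetical example data standing in for the three lists of length n.
a = list(range(10))
b = [x * 10 for x in a]
c = [x * 100 for x in a]
sample_n = 4

# Draw sample_n distinct indices once (no repeats)...
random_list = random.sample(range(len(a)), sample_n)

# ...and apply the same indices to every list so the elements stay paired.
sample_a = [a[i] for i in random_list]
sample_b = [b[i] for i in random_list]
sample_c = [c[i] for i in random_list]
```

`random.sample` avoids shuffling all n indices when only sample_n of them are needed.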

Efficiency. It's faster to store a, b and c as the columns of a single 2D array, so that one index array selects matching rows from all three at once:

X = np.array([['a','b','c'], ['d','e','f'], ['g','h','i'], ['j','k','l'], ['m','n','o'], ['p', 'q','r'], ['s', 't', 'u']])
idx = np.random.randint(0, len(X), size=7)

and then access the sampled columns using X[idx, 0], X[idx, 1], X[idx, 2].
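
If repeats are not wanted, the same idea can be sketched with NumPy's `Generator.choice` and `replace=False`. The arrays below are illustrative placeholders; note that stacking columns of different types into one array converts them to a common dtype.

```python
import numpy as np

# Hypothetical example data; in practice a, b and c are the three lists.
a = np.arange(10)
b = a * 10
c = a * 100
sample_n = 4

# One row per element, one column per original list: shape (10, 3).
data = np.column_stack((a, b, c))

# Draw sample_n distinct row indices.
rng = np.random.default_rng()
idx = rng.choice(len(data), size=sample_n, replace=False)

# Indexing with an integer array picks the same rows from every column.
sample_data = data[idx]
sample_a, sample_b, sample_c = sample_data.T
```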

mathfux
  • Thank you! I think the third option is what I am going to go for. My code is now: `sample_n = 1000` `data = np.column_stack((a, b, c))` `idx = np.random.randint(0, len(a), size = sample_n)` `idx = np.reshape(idx, (sample_size, 1))` #Force 1D vector. Final dumb question (on this topic at least): I am now struggling to copy this sample_data into a new array. I have tried: `for element in range(len(idx)):` `sample_data[idx[element], :] = data[idx[element], :]` but I get the error "list indices must be integers or slices, not tuple". Any ideas? Sorry for the formatting! – imwellhowareyou Aug 18 '20 at 12:32
  • reshape is unnecessary. You could just use `sample_data[idx,:]=data[idx,:]` – mathfux Aug 18 '20 at 12:43
  • Thank you @mathfux, this is really good to know as my Python coding is currently very inelegant. I am still getting an error but I believe I ran into this same error about a week ago (for some reason it's not calling the `idx` elements as integer values) so hopefully I can learn from my past mistakes and fix this up! – imwellhowareyou Aug 18 '20 at 13:14
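
One plausible reading of the error in the comments above, offered as an assumption since the definition of `sample_data` is not shown: it was created as a plain Python list, and lists do not accept `[row, :]` style indexing. The sketch below reproduces that failure on a list and then builds the sample directly from the NumPy array; the data values are placeholders.

```python
import numpy as np

# Hypothetical example data standing in for a, b and c.
a = list(range(10))
b = [x * 10 for x in a]
c = [x * 100 for x in a]
sample_n = 4

data = np.column_stack((a, b, c))
idx = np.random.randint(0, len(a), size=sample_n)  # 1D index array, repeats allowed

# A plain list rejects two-part indexing, which is what the error message describes.
sample_list = []
list_indexing_failed = False
try:
    sample_list[0, :] = data[0, :]
except TypeError:
    list_indexing_failed = True

# With a NumPy array no loop or reshape is needed: index the rows in one step.
sample_data = data[idx, :]
```

Because `data[idx, :]` returns a new array, there is no need to preallocate `sample_data` first.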
0

Sorry, having searched earlier and found nothing, I have just now found this:

Random sample of paired lists in Python

  • note that in numpy, instead of `zip` you should probably go for `np.column_stack` or something similar – Adam.Er8 Aug 18 '20 at 11:16
  • Thank you Adam. I will look into this too. I have come across it but am slightly intimidated by using `zip`. It seems quite a powerful command. – imwellhowareyou Aug 18 '20 at 12:41
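
For plain lists, the `zip` approach mentioned in the comments can be sketched roughly as follows: pair the elements up, sample the pairs, then split them apart again. The list contents and `sample_n` are illustrative assumptions.

```python
import random

# Hypothetical example data standing in for a, b and c.
a = list(range(10))
b = [x * 10 for x in a]
c = [x * 100 for x in a]
sample_n = 4

# zip bundles the i-th elements together: (a[i], b[i], c[i]).
paired = list(zip(a, b, c))

# Sampling the bundles keeps matching elements together (no repeats).
sample = random.sample(paired, sample_n)

# zip(*sample) transposes the sampled tuples back into three sequences.
sample_a, sample_b, sample_c = map(list, zip(*sample))
```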