How can I shuffle a multidimensional array by row only in Python (that is, without shuffling the columns)?

I am looking for the most efficient solution, because my matrix is very large. Is it also possible to do this efficiently in place on the original array (to save memory)?

Example:

import numpy as np
X = np.random.random((6, 2))
print(X)
Y = ...  # ??? shuffle by row only, not columns ???
print(Y)

The original matrix:

[[ 0.48252164  0.12013048]
 [ 0.77254355  0.74382174]
 [ 0.45174186  0.8782033 ]
 [ 0.75623083  0.71763107]
 [ 0.26809253  0.75144034]
 [ 0.23442518  0.39031414]]

The expected output has the rows shuffled, but not the columns, e.g.:

[[ 0.45174186  0.8782033 ]
 [ 0.48252164  0.12013048]
 [ 0.77254355  0.74382174]
 [ 0.75623083  0.71763107]
 [ 0.23442518  0.39031414]
 [ 0.26809253  0.75144034]]
robert
  • Option 1: shuffled view onto an array. I guess that would mean a custom implementation. (almost) no impact on memory usage, Obv. some impact at runtime. It really depends on how you **intend to use** this matrix. – Dima Tisnek Feb 26 '16 at 09:19
  • Option 2: shuffle array in place. `np.random.shuffle(x)`, docs state that "this function only shuffles the array along the first index of a multi-dimensional array", which is good enough for you, right? Obv., some time taken at startup, but from that point, it's as fast as original matrix. – Dima Tisnek Feb 26 '16 at 09:21
  • Compared to `np.random.shuffle(x)`, **shuffling the indices of the nd-array and getting the data from the shuffled indices** is a more efficient way to solve this problem. For a more detailed comparison, refer to my answer [below](http://stackoverflow.com/questions/35646908/numpy-shuffle-multidimensional-array-by-row-only-keep-column-order-unchanged/43716153#43716153) – John May 01 '17 at 08:20

5 Answers

You can use numpy.random.shuffle().

This function only shuffles the array along the first axis of a multi-dimensional array. The order of the sub-arrays is changed but their contents remain the same.

In [2]: import numpy as np

In [3]: X = np.random.random((6, 2))

In [4]: X
Out[4]:
array([[0.71935047, 0.25796155],
       [0.4621708 , 0.55140423],
       [0.22605866, 0.61581771],
       [0.47264172, 0.79307633],
       [0.22701656, 0.11927993],
       [0.20117207, 0.2754544 ]])

In [5]: np.random.shuffle(X)

In [6]: X
Out[6]:
array([[0.71935047, 0.25796155],
       [0.47264172, 0.79307633],
       [0.4621708 , 0.55140423],
       [0.22701656, 0.11927993],
       [0.20117207, 0.2754544 ],
       [0.22605866, 0.61581771]])
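
A point that is easy to miss in the session above: np.random.shuffle works in place and returns None, so its result should not be assigned back to X. The following is a minimal sketch of that behaviour; the small integer array and the `original` copy are stand-ins used only for the comparison.

```python
import numpy as np

X = np.arange(12).reshape(6, 2)   # small stand-in for the real matrix
original = X.copy()               # copy kept only to compare afterwards

result = np.random.shuffle(X)     # reorders the rows of X in place

print(result)                     # None: nothing is returned
# Every original row is still present, with its columns intact.
same_rows = sorted(map(tuple, X.tolist())) == sorted(map(tuple, original.tolist()))
print(same_rows)
```

Writing `X = np.random.shuffle(X)` would therefore replace the array with None.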

For other functionality, you can also check out the following functions:

The function random.Generator.permuted was introduced in NumPy 1.20.0.

The new function differs from shuffle and permutation in that the subarrays indexed by an axis are permuted rather than the axis being treated as a separate 1-D array for every combination of the other indexes. For example, it is now possible to permute the rows or columns of a 2-D array.
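
As a rough illustration of that difference, here is a small sketch using the Generator API. It assumes NumPy 1.20 or newer; the seed and the tiny integer array are arbitrary choices made only so the effect is easy to see.

```python
import numpy as np

rng = np.random.default_rng(0)        # seed chosen only for reproducibility
base = np.arange(12).reshape(6, 2)    # row i is [2*i, 2*i + 1]

# Whole rows are moved around; each row keeps its two values together.
shuffled_copy = rng.permutation(base, axis=0)   # returns a new array
X = base.copy()
rng.shuffle(X, axis=0)                          # same idea, but in place

# permuted shuffles each 1-D slice along the axis independently,
# so with axis=1 every row keeps its values but may reorder them.
within_rows = rng.permuted(base, axis=1)

print(shuffled_copy)
print(within_rows)
```

For the question as asked (reorder rows, leave columns alone), Generator.shuffle or Generator.permutation along axis 0 is the matching call; permuted answers a different question.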

Mazdak
  • I wonder if this could be sped up by numpy, maybe taking advantage of concurrency. – Georg Schölly Feb 26 '16 at 08:34
  • @GeorgSchölly I think this is the most optimized approach available in Python. If you want to speed it up, you need to change the algorithm. – Mazdak Feb 26 '16 at 08:37
  • I completely agree. I just realized that you are using `np.random` instead of the Python `random` module which also contains a shuffle function. I'm sorry for causing confusion. – Georg Schölly Feb 26 '16 at 11:21
  • This shuffle is not always working, see my new answer here below. Why is it not always working? – robert Feb 26 '16 at 14:54
  • This method returns a `NoneType` object - any solution for keeping the object a numpy array? **EDIT**: sorry all good: I had `X = np.random.shuffle(X)`, which returns a `NoneType` object, but the key was just `np.random.shuffle(X)`, since it is shuffled *in place*. – MJimitater Nov 06 '20 at 14:55
  • What about the labels Y? For example, if I wanted to use this on a scikit-learn dataset, how would I get the labels correctly shuffled to match? – wwjdm Jun 16 '21 at 18:20

You can also use np.random.permutation to generate a random permutation of the row indices and then index into the rows of X using np.take with axis=0. np.take can also write the result back into the input array X itself through its out= option, which would save us memory. Thus, the implementation would look like this -

np.take(X,np.random.permutation(X.shape[0]),axis=0,out=X)

Sample run -

In [23]: X
Out[23]: 
array([[ 0.60511059,  0.75001599],
       [ 0.30968339,  0.09162172],
       [ 0.14673218,  0.09089028],
       [ 0.31663128,  0.10000309],
       [ 0.0957233 ,  0.96210485],
       [ 0.56843186,  0.36654023]])

In [24]: np.take(X,np.random.permutation(X.shape[0]),axis=0,out=X);

In [25]: X
Out[25]: 
array([[ 0.14673218,  0.09089028],
       [ 0.31663128,  0.10000309],
       [ 0.30968339,  0.09162172],
       [ 0.56843186,  0.36654023],
       [ 0.0957233 ,  0.96210485],
       [ 0.60511059,  0.75001599]])

Additional performance boost

Here's a trick to speed up np.random.permutation(X.shape[0]) with np.argsort() -

np.random.rand(X.shape[0]).argsort()
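
Why this yields a usable row order: sorting n random keys and recording where each key came from gives every index from 0 to n-1 exactly once. A small sketch of that check, with n = 6 chosen arbitrarily:

```python
import numpy as np

n = 6
keys = np.random.rand(n)     # one uniform random key per row
perm = keys.argsort()        # indices that would sort the keys

# perm contains each of 0..n-1 exactly once, so it is a valid row order.
is_permutation = np.array_equal(np.sort(perm), np.arange(n))
print(perm, is_permutation)
```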

Speedup results -

In [32]: X = np.random.random((6000, 2000))

In [33]: %timeit np.random.permutation(X.shape[0])
1000 loops, best of 3: 510 µs per loop

In [34]: %timeit np.random.rand(X.shape[0]).argsort()
1000 loops, best of 3: 297 µs per loop

Thus, the shuffling solution could be modified to -

np.take(X,np.random.rand(X.shape[0]).argsort(),axis=0,out=X)

Runtime tests -

These tests include the two approaches listed in this post and the np.random.shuffle-based one in @Kasramvd's solution.

In [40]: X = np.random.random((6000, 2000))

In [41]: %timeit np.random.shuffle(X)
10 loops, best of 3: 25.2 ms per loop

In [42]: %timeit np.take(X,np.random.permutation(X.shape[0]),axis=0,out=X)
10 loops, best of 3: 53.3 ms per loop

In [43]: %timeit np.take(X,np.random.rand(X.shape[0]).argsort(),axis=0,out=X)
10 loops, best of 3: 53.2 ms per loop

So, it seems these np.take-based approaches are worth using only if memory is a concern; otherwise, the np.random.shuffle-based solution looks like the way to go.

Divakar
  • This sounds nice. Can you add timing information to your post, comparing your np.take vs. the standard shuffle? The np.random.shuffle on my system is faster (27.9 ms) than your take (62.9 ms), but as I read in your post, there is a memory advantage? – robert Feb 26 '16 at 08:58
  • @robert Just added, check it out! – Divakar Feb 26 '16 at 09:02

After a bit of experimentation, I found the most memory- and time-efficient way to shuffle data (row-wise) in an nD array. First, shuffle the indices of the array, then use the shuffled indices to get the data, e.g.:

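import numpy as np

rand_num2 = np.random.randint(5, size=(6000, 2000))
perm = np.arange(rand_num2.shape[0])  # row indices 0..n-1
np.random.shuffle(perm)               # shuffle the indices in place
rand_num2 = rand_num2[perm]           # index with the shuffled row order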
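
Because the permutation is an ordinary index array, the same one can be applied to a second, matched array, which is what the comments about labels and training sets are asking for. The sketch below assumes a hypothetical feature matrix X and label vector y, built so that the pairing is easy to verify.

```python
import numpy as np

X = np.arange(12).reshape(6, 2)   # features: row i is [2*i, 2*i + 1]
y = np.arange(6)                  # hypothetical labels: y[i] belongs to row i

perm = np.random.permutation(X.shape[0])
X_shuffled = X[perm]              # fancy indexing returns reordered copies
y_shuffled = y[perm]

# Each label still sits next to its own row.
still_matched = np.array_equal(X_shuffled[:, 0], 2 * y_shuffled)
print(still_matched)
```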

In more detail:
Here, I use memory_profiler to measure memory usage and Python's built-in "time" module to record time, comparing all the previous answers.

import time

import numpy as np


def main():
    # shuffle data itself
    rand_num = np.random.randint(5, size=(6000, 2000))
    start = time.time()
    np.random.shuffle(rand_num)
    print('Time for direct shuffle: {0}'.format((time.time() - start)))
    
    # Shuffle index and get data from shuffled index
    rand_num2 = np.random.randint(5, size=(6000, 2000))
    start = time.time()
    perm = np.arange(rand_num2.shape[0])
    np.random.shuffle(perm)
    rand_num2 = rand_num2[perm]
    print('Time for shuffling index: {0}'.format((time.time() - start)))
    
    # using np.take()
    rand_num3 = np.random.randint(5, size=(6000, 2000))
    start = time.time()
    np.take(rand_num3, np.random.rand(rand_num3.shape[0]).argsort(), axis=0, out=rand_num3)
    print("Time taken by np.take, {0}".format((time.time() - start)))

Result for Time

Time for direct shuffle: 0.03345608711242676   # 33.4msec
Time for shuffling index: 0.019818782806396484 # 19.8msec
Time taken by np.take, 0.06726956367492676     # 67.2msec

Memory profiler Result

Line #    Mem usage    Increment   Line Contents
================================================
    39  117.422 MiB    0.000 MiB   @profile
    40                             def main():
    41                                 # shuffle data itself
    42  208.977 MiB   91.555 MiB       rand_num = np.random.randint(5, size=(6000, 2000))
    43  208.977 MiB    0.000 MiB       start = time.time()
    44  208.977 MiB    0.000 MiB       np.random.shuffle(rand_num)
    45  208.977 MiB    0.000 MiB       print('Time for direct shuffle: {0}'.format((time.time() - start)))
    46                             
    47                                 # Shuffle index and get data from shuffled index
    48  300.531 MiB   91.555 MiB       rand_num2 = np.random.randint(5, size=(6000, 2000))
    49  300.531 MiB    0.000 MiB       start = time.time()
    50  300.535 MiB    0.004 MiB       perm = np.arange(rand_num2.shape[0])
    51  300.539 MiB    0.004 MiB       np.random.shuffle(perm)
    52  300.539 MiB    0.000 MiB       rand_num2 = rand_num2[perm]
    53  300.539 MiB    0.000 MiB       print('Time for shuffling index: {0}'.format((time.time() - start)))
    54                             
    55                                 # using np.take()
    56  392.094 MiB   91.555 MiB       rand_num3 = np.random.randint(5, size=(6000, 2000))
    57  392.094 MiB    0.000 MiB       start = time.time()
    58  392.242 MiB    0.148 MiB       np.take(rand_num3, np.random.rand(rand_num3.shape[0]).argsort(), axis=0, out=rand_num3)
    59  392.242 MiB    0.000 MiB       print("Time taken by np.take, {0}".format((time.time() - start)))
John
  • Hi, can you provide the code that produces this output? – AvidLearner Feb 07 '18 at 07:13
  • I lost the code to produce the [memory_profiler](https://pypi.python.org/pypi/memory_profiler) output. But it can be very easily reproduced by following the steps in the given link. – John Feb 08 '18 at 11:40
  • What I like about this answer is that if I have two matched arrays (which coincidentally I do) then I can shuffle both of them and ensure that data in corresponding positions still match. This is useful for randomising the order of my training set – Spoonless Dec 14 '18 at 09:59

I tried many solutions, and in the end I used this simple one:

import numpy as np
from sklearn.utils import shuffle
x = np.array([[1, 2],
              [3, 4],
              [5, 6]])
print(shuffle(x, random_state=0))

output:

[[5 6]
 [3 4]
 [1 2]]

If you have a 3D array, loop through the first axis (axis=0) and apply this function, like:

np.array([shuffle(item) for item in array_3d])
Minions

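You can shuffle the entries within each row of a two-dimensional array A using the np.vectorize() function. Note that with this signature np.random.permutation is applied to each row separately, so the values inside every row are permuted independently while the order of the rows stays the same: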

shuffle = np.vectorize(np.random.permutation, signature='(n)->(n)')

A_shuffled = shuffle(A)
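
A quick way to see what this call does to a 2-D array is to compare the sorted contents of each row before and after. This is only a sketch: the name `shuffle_rows` and the small array A are placeholders chosen for the illustration.

```python
import numpy as np

shuffle_rows = np.vectorize(np.random.permutation, signature='(n)->(n)')

A = np.arange(12).reshape(3, 4)
A_shuffled = shuffle_rows(A)

# Each row holds the same values as before, possibly in a new order;
# the rows themselves stay in their original positions.
same_values_per_row = np.array_equal(np.sort(A_shuffled, axis=1), A)
print(A_shuffled)
print(same_values_per_row)
```

So this reorders values inside each row; to reorder the rows themselves while keeping their contents fixed, np.random.shuffle(A) or indexing with a permutation of the row indices is the matching tool.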