Hi, I'm new to Python and NumPy, and I'd like to ask: what is the most efficient way to split an ndarray into 3 parts of 20%, 60% and 20%?

    import numpy as np
    row_indices = np.random.permutation(10)

Let's assume the ndarray has 10 items: [7 9 3 1 2 4 5 6 0 8]. The expected result is the ndarray separated into 3 parts, part1, part2 and part3:
part1: [7 9]
part2: [3 1 2 4 5 6]
part3: [0 8]

desertnaut
Tom Souza

2 Answers

Here's one way:

    # data array
    In [85]: a = np.array([7, 9, 3, 1, 2, 4, 5, 6, 0, 8])

    # percentages (ratios) array
    In [86]: p = np.array([0.2, 0.6, 0.2])  # must sum to 1

    In [87]: np.split(a, (len(a) * p[:-1].cumsum()).astype(int))
    Out[87]: [array([7, 9]), array([3, 1, 2, 4, 5, 6]), array([0, 8])]
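One caveat with the cumulative-ratio approach is that `astype(int)` truncates, so a product that lands just below an integer because of floating-point error (for example, `(0.1 + 0.7) * 10` evaluates to `7.999999999999999`) would move a split point down by one. A minimal sketch of a guard, assuming rounding to the nearest index is acceptable; `split_by_ratios` is a hypothetical helper name, not a NumPy function:

```python
import numpy as np

def split_by_ratios(a, ratios):
    """Split `a` along axis 0 into consecutive parts sized by `ratios`.

    Hypothetical helper for illustration; not part of NumPy.
    """
    ratios = np.asarray(ratios, dtype=float)
    if not np.isclose(ratios.sum(), 1.0):
        raise ValueError("ratios must sum to 1")
    # Interior split points only; round before casting so that a value
    # such as 7.999999999999999 becomes 8 instead of being truncated to 7.
    split_points = np.round(len(a) * ratios[:-1].cumsum()).astype(int)
    return np.split(a, split_points)

a = np.array([7, 9, 3, 1, 2, 4, 5, 6, 0, 8])
parts = split_by_ratios(a, [0.2, 0.6, 0.2])
print([part.tolist() for part in parts])
# prints [[7, 9], [3, 1, 2, 4, 5, 6], [0, 8]]
```

Note that `np.round` rounds exact halves to the nearest even integer, so part sizes can still differ by one element from the nominal percentages.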

Alternative to np.split:

np.split can be slower when working with large data, so we could alternatively slice in a loop:

    # boundaries: 0 followed by the cumulative end index of each part
    split_idx = np.r_[0, (len(a) * p.cumsum()).astype(int)]
    # pair each boundary with the next one and slice
    out = [a[i:j] for (i, j) in zip(split_idx[:-1], split_idx[1:])]
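Whether the slicing loop is actually faster depends on the array size, the machine and the NumPy version, so it may be worth measuring. A rough sketch using the standard `timeit` module; the function names `with_split` and `with_slices` and the array size are assumptions made for this illustration:

```python
import timeit
import numpy as np

a = np.arange(1_000_000)
p = np.array([0.2, 0.6, 0.2])  # must sum to 1

def with_split(a, p):
    # np.split with the interior split points
    return np.split(a, (len(a) * p[:-1].cumsum()).astype(int))

def with_slices(a, p):
    # plain slicing between consecutive boundaries
    split_idx = np.r_[0, (len(a) * p.cumsum()).astype(int)]
    return [a[i:j] for i, j in zip(split_idx[:-1], split_idx[1:])]

# Both approaches should produce the same parts.
same = all(np.array_equal(x, y)
           for x, y in zip(with_split(a, p), with_slices(a, p)))
print("same result:", same)

# Timings vary between environments; no particular outcome is assumed here.
for fn in (with_split, with_slices):
    seconds = timeit.timeit(lambda: fn(a, p), number=1000)
    print(f"{fn.__name__}: {seconds:.4f} s for 1000 calls")
```

Both variants return slices of the original array rather than copies, so the measured cost is mostly per-call overhead, not data movement.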
Divakar

I normally just go for the most obvious solution, although there are much fancier ways to do the same thing. It takes a second to implement and doesn't even require debugging (since it's extremely simple):

    # 20/60/20 split: cut at 20% and 80% of the length
    part1 = [a[i, ...] for i in range(int(len(a) * 0.2))]
    part2 = [a[i, ...] for i in range(int(len(a) * 0.2), int(len(a) * 0.8))]
    part3 = [a[i, ...] for i in range(int(len(a) * 0.8), len(a))]

A few things to notice, though:

  1. The boundaries are truncated to integers, so you may get a split that is only roughly 20-60-20
  2. You get back lists of elements, so you might have to re-numpyfy them with np.asarray()
  3. You can use this method for indexing multiple objects (e.g. labels and inputs) for the same elements
  4. If you build the indices once before the splits (indices = list(range(a.shape[0]))), you can also shuffle them, thus taking care of data shuffling at the same time
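Points 3 and 4 can be sketched together: shuffle the row indices once, cut them at the 20% and 80% marks, and reuse the resulting index arrays on every object. This is an array-indexing variant of the list comprehensions above, and the names `X`, `y` and the seed are made up for the illustration:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Hypothetical data: 10 samples with 3 features each, one label per sample.
X = np.arange(30).reshape(10, 3)
y = np.arange(10)

# Shuffle the row indices once, then cut them at the 20% and 80% marks.
indices = rng.permutation(len(X))
first_cut = int(len(X) * 0.2)
second_cut = int(len(X) * 0.8)
idx1, idx2, idx3 = np.split(indices, [first_cut, second_cut])

# The same index arrays select matching rows from both objects.
X1, X2, X3 = X[idx1], X[idx2], X[idx3]
y1, y2, y3 = y[idx1], y[idx2], y[idx3]
print(len(X1), len(X2), len(X3))
# prints 2 6 2
```

Indexing with an integer array returns copies that are already ndarrays, so no np.asarray() conversion is needed in this variant.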
Ido_f