I'm trying to use a loop to compute summary statistics (min/max/median) for a NumPy array. The array has 20 rows and 3 columns. The loop needs to go through each column and find the min, max, and median of that column. I've included an example of the array below. P.S. As you'll be able to see, I'm attempting a data frame approach, but I'm not tied to that idea and would be open to whatever works (as long as it uses a loop).

print(linnerud.data)

[[  5. 162.  60.]
 [  2. 110.  60.]
 [ 12. 101. 101.]
 [ 12. 105.  37.]
 [ 13. 155.  58.]
 [  4. 101.  42.]
 [  8. 101.  38.]
 [  6. 125.  40.]
 [ 15. 200.  40.]
 [ 17. 251. 250.]
 [ 17. 120.  38.]
 [ 13. 210. 115.]
 [ 14. 215. 105.]
 [  1.  50.  50.]
 [  6.  70.  31.]
 [ 12. 210. 120.]
 [  4.  60.  25.]
 [ 11. 230.  80.]
 [ 15. 225.  73.]
 [  2. 110.  43.]]
  • Does this answer your question? [numpy max vs amax vs maximum](https://stackoverflow.com/questions/33569668/numpy-max-vs-amax-vs-maximum) – sushanth May 23 '20 at 02:49
  • Looping over a numpy array is generally going to be much slower than using numpy functions or methods. Is there a specific reason you need a loop? – Blckknght May 23 '20 at 03:22
  • It is a specific requirement of the assignment I'm doing that a loop needs to be used. As seen below, I've already answered it without using a loop, but I won't get any marks because I don't use one. – TaipanCodes May 23 '20 at 03:28

3 Answers

You can just do

np.min(data, axis=0)
np.max(data, axis=0)
np.median(data, axis=0)

Note that axis=0 means "along the first axis" (down the rows), which gives one result per column. For your data,

import numpy as np
import sklearn.datasets
linnerud = sklearn.datasets.load_linnerud()

np.min(linnerud.data, axis=0)
# [ 1. 50. 25.]

np.max(linnerud.data, axis=0)
# [ 17. 251. 250.]

np.median(linnerud.data, axis=0)
# [ 11.5 122.5  54. ]
stevemo
  • I used a nearly identical approach and got the same results; however, I couldn't figure out how to achieve these results using a loop. So whilst it is a sound approach, it doesn't use a loop. – TaipanCodes May 23 '20 at 03:08
  • Ah I misunderstood. You are trying to implement an algorithm from scratch. – stevemo May 23 '20 at 18:01

What you need to do is look up sorting algorithms and sort the array columns. Bubble sort or selection sort might be a good start, for instance. Sorting makes finding the median much easier, since the median is then the middle item (or the average of the two middle items).
Since you walk down each column, i.e. along axis 0, the second index (the column) stays constant while the first index (the row) varies. So typically your loops will look like

for i in range(something):
    for j in range(something_else):
        # do what you please with one element
        element = array[j, i]    #along a column

The ranges will change according to your application.
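As a minimal sketch of how that loop pattern could be filled in, the following copies each column with the nested index loops, bubble sorts the copy, and reads the min, max, and median off the sorted values. The function name `column_stats` and the small `demo` array (the first four rows of the question's data) are only illustrative choices, not part of the original answer.

```python
import numpy as np

def column_stats(array):
    """Return (mins, maxs, medians) lists, one entry per column, using explicit loops."""
    n_rows, n_cols = array.shape
    mins, maxs, meds = [], [], []
    for i in range(n_cols):                # one pass per column
        column = []
        for j in range(n_rows):            # walk down the column
            column.append(float(array[j, i]))
        # Bubble sort the copied column in place.
        for end in range(len(column) - 1, 0, -1):
            for k in range(end):
                if column[k] > column[k + 1]:
                    column[k], column[k + 1] = column[k + 1], column[k]
        mid = n_rows // 2
        if n_rows % 2:                     # odd length: single middle item
            median = column[mid]
        else:                              # even length: mean of the two middle items
            median = 0.5 * (column[mid - 1] + column[mid])
        mins.append(column[0])             # smallest value after sorting
        maxs.append(column[-1])            # largest value after sorting
        meds.append(median)
    return mins, maxs, meds

# First four rows of the array from the question, used as a small demo.
demo = np.array([[5., 162., 60.], [2., 110., 60.], [12., 101., 101.], [12., 105., 37.]])
print(column_stats(demo))
# ([2.0, 101.0, 37.0], [12.0, 162.0, 101.0], [8.5, 107.5, 60.0])
```

Sorting the whole column is more work than min and max need on their own, but it keeps everything in one loop structure and gives the median for free.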

amzon-ex

Min and max are quite easy. Iterate through the items, updating the min and max whenever a new value is smaller or larger. For the median, sort each column (at least to just over halfway) and return the middle item (if the length is odd) or the average of the two items closest to the middle (if the length is even).

import numpy as np

arr = np.array([[  5., 162.,  60.], [  2., 110.,  60.], [ 12., 101., 101.], 
    [ 12., 105.,  37.], [ 13., 155.,  58.], [  4., 101.,  42.], 
    [  8., 101.,  38.], [  6., 125.,  40.], [ 15., 200.,  40.], 
    [ 17., 251., 250.], [ 17., 120.,  38.], [ 13., 210., 115.], 
    [ 14., 215., 105.], [  1.,  50.,  50.], [  6.,  70.,  31.], 
    [ 12., 210., 120.], [  4.,  60.,  25.], [ 11., 230.,  80.], 
    [ 15., 225.,  73.], [  2., 110.,  43.]] )

def minmax( arr ):
    """  Return min & max arrays of a 2d array. """
    mn = arr[0].copy()
    mx = arr[0].copy()
    for row in arr[1:]:
        mn = np.minimum(mn, row) # Item by item minimum
        mx = np.maximum(mx, row) # item by item maximum
    return mn, mx    

def median( arr ):
    data = arr.copy()  # data will be modified. 
    # Selection sort the lowest 'half'+1 of data.  Once the middle two items
    # are known the median can be calculated, so there is no need to sort all.
    size = len(data)
    for ix, d in enumerate( data[:size // 2 + 1 ] ):
        mn = d      # Set mn to the next item in the array
        mnix = ix   # Set mnix to the next index
        # Find min in the rest of the array
        for jx, s in enumerate( data[ ix+1: ] ):
            if s < mn:             # If a new mn 
                mn = s             # Set mn to s
                mnix = jx + ix+1   # Set mnix to the index
        # Swap contents of data[ix] and data[mnix], the minimum found.
        # If mnix == ix it still works.
        data[ix], data[mnix] = mn, data[ix]
    key0 = (size - 1) // 2
    key1 = size - 1 - key0
    # Return the average of the two middle keys
    # (the keys are the same if there is an odd number of items in arr).
    return 0.5 * ( data[key0] + data[key1] )

def medians( arr ):
    res = np.zeros( arr.shape[1] )  # float result, so integer input is not truncated
    # Iterate through arr transposed. i.e. column by column
    for ix, col in enumerate( arr.T ):
        res[ix] = median( col )
    return res

print( minmax( arr ), medians( arr ) )
# (array([ 1., 50., 25.]), array([ 17., 251., 250.])) [ 11.5 122.5  54. ]
# Numpy versions
print( arr.min( axis = 0 ), arr.max( axis = 0 ), np.median( arr, axis = 0 ))
# [ 1. 50. 25.] [ 17. 251. 250.] [ 11.5 122.5  54. ]

This shows how much effort NumPy saves you, and it runs faster too.
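If you want to check the speed claim on your own machine, a rough sketch with the standard library's `timeit` could look like this. The helper `loop_min` and the random test data are assumptions made for illustration; actual timings depend on the machine and array size, so none are shown.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((20, 3))  # same shape as the question's array; values are made up

def loop_min(array):
    """Column minimums found with plain Python loops."""
    result = []
    for i in range(array.shape[1]):          # each column
        smallest = array[0, i]
        for j in range(1, array.shape[0]):   # remaining rows in that column
            if array[j, i] < smallest:
                smallest = array[j, i]
        result.append(smallest)
    return result

# Time 1000 calls of each version.
loop_time = timeit.timeit(lambda: loop_min(data), number=1000)
numpy_time = timeit.timeit(lambda: data.min(axis=0), number=1000)
print(f"loop: {loop_time:.4f} s, numpy: {numpy_time:.4f} s")
```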

Tls Chris
  • Thank you! This works nearly perfectly! I'm just going to change how the different results appear in the final print statement, but otherwise that has solved my issue. It's just annoying that I have to use a loop when I'd already found the answer using a previously suggested method. – TaipanCodes May 24 '20 at 03:12
  • It's useful to know how the functions can be implemented even when they sit in libraries. – Tls Chris May 24 '20 at 20:40