14

Is it possible to calculate the mean of multiple arrays, when they may have different lengths? I am using numpy. So let's say I have:

numpy.array([[1, 2, 3, 4, 8],    [3, 4, 5, 6, 0]])
numpy.array([[5, 6, 7, 8, 7, 8], [7, 8, 9, 10, 11, 12]])
numpy.array([[1, 2, 3, 4],       [5, 6, 7, 8]])

Now I want to calculate the mean while ignoring elements that are 'missing'. (Naturally, I cannot just append zeros, as this would distort the mean.)

Is there a way to do this without iterating through the arrays?

PS. These arrays are all 2-D, and within each array both rows always have the same length, i.e. the 1st array is 5 and 5, the 2nd is 6 and 6, and the 3rd is 4 and 4.

An example:

np.array([[1, 2],    [3, 4]])
np.array([[1, 2, 3], [3, 4, 5]])
np.array([[7],       [8]])

This must give

(1+1+7)/3  (2+2)/2   3/1
(3+3+8)/3  (4+4)/2   5/1

And graphically:

[1, 2]    [1, 2, 3]    [7]
[3, 4]    [3, 4, 5]    [8]

Now imagine these 2-D arrays stacked on top of each other, with every overlapping value contributing to the mean at that coordinate.

CDspace
hjweide
  • What is wrong with `mean()`? I am not sure I understand what you want, or what `mean()` is not doing for you. – George Apr 07 '12 at 20:49
  • "ignoring elements that are missing" is still pretty vague. Could you give a very simple example with both data and the value that you'd like to produce for that data? – Nolen Royalty Apr 07 '12 at 20:49
  • Your question is not very clear. Can you clarify how you intend to calculate the mean and what is your expected result? – Abhijit Apr 07 '12 at 20:50
  • mean() does not seem to work if the arrays are of different lengths – hjweide Apr 07 '12 at 20:50
  • I have added an example to show what I mean, I hope it helps @Nolen – hjweide Apr 07 '12 at 20:58
  • Why would it be (3+3+8)/3 instead of (2+2+8)/3? What puts 8 in position 3 instead of position 2? – Nolen Royalty Apr 07 '12 at 21:00
  • @NolenRoyalty my graphical description should help now – hjweide Apr 07 '12 at 21:06
  • What do you mean "without iterating through the arrays". How do you want to find the means? By magic? – Joel Cornett Apr 07 '12 at 21:08
  • @Joel, numpy provides a np.mean() function which returns the mean of one or more arrays in the dimension that you specify. [Here](http://docs.scipy.org/doc/numpy-1.6.0/reference/generated/numpy.mean.html) – hjweide Apr 07 '12 at 21:11
  • @Casper: How do you suppose the `np.mean()` function works? – Joel Cornett Apr 07 '12 at 22:09
  • @Joel, by iterating I meant manually, i.e. with for-loops. I know numpy uses a similar method, but I wanted the convenience of not setting up these variable length for-loops myself, and instead having numpy deal with it. – hjweide Apr 08 '12 at 07:21
  • @Casper: See my posted answer for the 'manual' version of this iteration. As you can see, it is not very complex/difficult to code. If you're doing this often, I would probably just make this into a function. – Joel Cornett Apr 08 '12 at 15:42

4 Answers

14

I often needed this for plotting the mean of performance curves with different lengths.

[Figure: plot of multiple curves with different lengths]

I solved it with a simple function (based on @unutbu's answer):

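import numpy as np

def tolerant_mean(arrs):
    lens = [len(i) for i in arrs]
    # One column per input, as many rows as the longest input; start fully masked
    arr = np.ma.empty((np.max(lens), len(arrs)))
    arr.mask = True
    for idx, l in enumerate(arrs):
        arr[:len(l), idx] = l  # assigning values unmasks those entries
    # Entries that are still masked are ignored by mean() and std()
    return arr.mean(axis=-1), arr.std(axis=-1)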

# list_of_ys_diff_len is a list of 1-D curves; ax is an existing matplotlib Axes
y, error = tolerant_mean(list_of_ys_diff_len)
ax.plot(np.arange(len(y))+1, y, color='green')

So applying that function to the list of above-plotted curves yields the following:

[Figure: plot of the mean and std of multiple curves with different lengths]
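For readers without the plotted data, here is a minimal sketch of the same function applied to three short made-up curves. The values in `curves` are illustrative assumptions, not the data behind the plots, and the function is repeated only so the snippet runs on its own.

```python
import numpy as np

def tolerant_mean(arrs):
    # Same approach as the function above, repeated so this snippet is self-contained.
    lens = [len(a) for a in arrs]
    arr = np.ma.empty((np.max(lens), len(arrs)))
    arr.mask = True
    for idx, a in enumerate(arrs):
        arr[:len(a), idx] = a
    return arr.mean(axis=-1), arr.std(axis=-1)

# Illustrative curves of lengths 3, 2 and 1 (made-up values).
curves = [[1.0, 2.0, 3.0], [3.0, 4.0], [5.0]]
y, error = tolerant_mean(curves)

print(y)      # mean at each position, using only the curves that reach it
print(error)  # population standard deviation (ddof=0) at each position
```

Position 0 averages three values, position 1 two, and position 2 only one, so the spread at the tail is zero simply because a single curve remains there.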

dsalaj
12

numpy.ma.mean allows you to compute the mean of non-masked array elements. However, to use numpy.ma.mean, you have to first combine your three numpy arrays into one masked array:

import numpy as np
x = np.array([[1, 2], [3, 4]])
y = np.array([[1, 2, 3], [3, 4, 5]])
z = np.array([[7], [8]])

arr = np.ma.empty((2,3,3))
arr.mask = True
arr[:x.shape[0],:x.shape[1],0] = x
arr[:y.shape[0],:y.shape[1],1] = y
arr[:z.shape[0],:z.shape[1],2] = z
print(arr.mean(axis = 2))

yields

[[3.0 2.0 3.0]
 [4.66666666667 4.0 5.0]]
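The same idea can probably be wrapped in a helper so the target shape does not have to be hard-coded. The sketch below is not part of the original answer: the name `masked_mean_2d` is hypothetical, and it assumes every input is a 2-D array whose data is aligned at the top-left corner.

```python
import numpy as np

def masked_mean_2d(arrays):
    # Hypothetical helper: stack 2-D arrays of differing shapes into one
    # fully masked 3-D array, then average along the stacking axis so that
    # cells an array does not cover stay masked and are ignored.
    rows = max(a.shape[0] for a in arrays)
    cols = max(a.shape[1] for a in arrays)
    stacked = np.ma.masked_all((rows, cols, len(arrays)))
    for k, a in enumerate(arrays):
        stacked[:a.shape[0], :a.shape[1], k] = a  # assignment unmasks these cells
    return stacked.mean(axis=2)

x = np.array([[1, 2], [3, 4]])
y = np.array([[1, 2, 3], [3, 4, 5]])
z = np.array([[7], [8]])
result = masked_mean_2d([x, y, z])
print(result)
```

`np.ma.masked_all` creates the array already fully masked, which replaces the `np.ma.empty` plus `arr.mask = True` pair.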
unutbu
  • 1
    Wow, thanks, I would never have found that! I wish there was an easier way, but I guess this will have to do. Thanks! – hjweide Apr 07 '12 at 21:32
3

The function below also works, averaging the columns of lists of different lengths:

import numpy as np

def avgNestedLists(nested_vals):
    """
    Averages a nested (possibly ragged) list column by column and returns
    a list of the column means, regardless of each row's length.
    """
    output = []
    maximum = 0
    for lst in nested_vals:
        if len(lst) > maximum:
            maximum = len(lst)
    for index in range(maximum): # Go through each index of longest list
        temp = []
        for lst in nested_vals: # Go through each list
            if index < len(lst): # If not an index error
                temp.append(lst[index])
        output.append(np.nanmean(temp))
    return output

Using the first row of each array in your first example:

avgNestedLists([[1, 2, 3, 4, 8], [5, 6, 7, 8, 7, 8], [1, 2, 3, 4]])

Outputs:

[2.3333333333333335,
 3.3333333333333335,
 4.333333333333333,
 5.333333333333333,
 7.5,
 8.0]

np.amax(nested_vals) or np.max(nested_vals) was not used at the start to find the maximum length because, when the nested lists have different sizes, it returns an array rather than a single number.
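Since the function already relies on `np.nanmean`, a different route is to pad every list with NaN up to the longest length and let `np.nanmean` skip the padding in a single call. This is only a sketch of that alternative technique: the name `nan_padded_mean` is hypothetical, and it assumes the data contains no genuine NaN values, because those would be ignored as well.

```python
import numpy as np

def nan_padded_mean(nested_vals):
    # Hypothetical helper: pad each list with NaN to the longest length,
    # then average column-wise while ignoring the NaN padding.
    longest = max(len(lst) for lst in nested_vals)
    padded = np.full((len(nested_vals), longest), np.nan)
    for row, lst in enumerate(nested_vals):
        padded[row, :len(lst)] = lst
    return np.nanmean(padded, axis=0)

result = nan_padded_mean([[1, 2, 3, 4, 8], [5, 6, 7, 8, 7, 8], [1, 2, 3, 4]])
print(result)
```

The explicit loop now only copies the data. The averaging itself is vectorized.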

Colonel_Old
1

OP, I know you were looking for a non-iterative built-in solution, but the following really only takes 3 lines (2 if you combine transpose and means but then it just gets messy):

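import numpy as np

arrays = [
    np.array([[1, 2], [3, 4]]),
    np.array([[1, 2, 3], [3, 4, 5]]),
    np.array([[7], [8]])
    ]

# Takes a list rather than a generator, because len() is needed
mean = lambda x: float(sum(x)) / len(x)

# Group row i of every array together
transpose = [[item[i] for item in arrays] for i in range(len(arrays[0]))]

# Average column i over the rows that are long enough to contain it
means = [[mean([j[i] for j in t if i < len(j)]) for i in range(len(max(t, key = len)))] for t in transpose]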

Outputs:

>>> means
[[3.0, 2.0, 3.0], [4.666666666666667, 4.0, 5.0]]
Joel Cornett
  • Thanks, neat and elegant! I don't really need to calculate this often, but will note it for future reference. – hjweide Apr 08 '12 at 20:25
  • @Casper: No problem. FYI, I was on CodeGolf and found another nifty way to get the transpose of an array. You could probably substitute it for the one I have up there: `transpose = zip(*arrays)` – Joel Cornett Apr 08 '12 at 20:57
  • 1
    Well in numpy I tend to just use np.transpose(my_array), and apparently you can use my_array.T as well- although I prefer the former for readability. – hjweide Apr 09 '12 at 06:55
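The `zip(*arrays)` idea from the comments can be taken a step further with `itertools.zip_longest`, which pads the shorter rows so that missing entries can be filtered out. This is a pure-Python sketch, not something posted in the thread: the name `ragged_mean` is hypothetical, and it assumes that `None` never appears as real data.

```python
from itertools import zip_longest

def ragged_mean(arrays):
    # Hypothetical helper: average same-position values across 2-D nested
    # lists whose rows may differ in length from one array to the next.
    means = []
    for rows in zip(*arrays):              # rows with the same index in every array
        row_means = []
        for column in zip_longest(*rows):  # shorter rows are padded with None
            present = [v for v in column if v is not None]
            row_means.append(sum(present) / len(present))
        means.append(row_means)
    return means

arrays = [
    [[1, 2], [3, 4]],
    [[1, 2, 3], [3, 4, 5]],
    [[7], [8]],
]
means = ragged_mean(arrays)
print(means)
```

Each column is divided by the number of values actually present, which gives the (1+1+7)/3, (2+2)/2, 3/1 pattern from the question.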