2
ex_array = [-8.23294593e-02, -4.07239507e-02,  6.08131029e-02,  2.72433402e-02,
   -4.73587631e-02,  5.15452252e-02,  1.32902476e-01,  1.22322232e-01,
    2.71845990e-02, -1.16927038e-01, -2.62239877e-01, -1.46526396e-01,
   -1.82859136e-01, -1.02089602e-01, -1.91863501e-04, -5.42572200e-02,
   -1.41798506e-01,  2.32538185e-02,  1.44525705e-01,  1.33945461e-01,
    5.01618120e-02, -1.32664337e-01, -2.97395262e-01, -1.02531532e-01,
   -7.80204566e-02, -5.46991495e-02,  1.05868862e-01,  7.25526818e-03,
    5.04192997e-02,  7.41281286e-02,  1.75069159e-01,  1.64488914e-01,
    7.55396024e-02, -6.23800645e-02, -1.76950023e-01, -5.91491004e-02,
   -4.00535768e-02,  6.59473071e-04,  5.98125666e-02, -1.49608356e-02,
   -1.45519585e-02,  1.49876707e-01,  1.92880709e-01,  2.33158881e-01,
    7.59751625e-02, -2.46659059e-02, -1.40025102e-01, -3.02416639e-02]

I need to compute the median for every 12 values. Each value represents a month (from January to December), so I would like to obtain the median for each month of the year. Like this:

Representation of what I'm asking

Approaches:

  • I could convert the array to a dataframe, and add a new column representing each month. Later, grouping by month and computing the median. But I feel it has to be an easier solution.

  • Another solution I thought to convert to a dataframe and slice every 12 values, each time starting from a different value. It works but I'm having problems obtaining a workable array. Adding example for the first three months:

'''

sol_array = []

sol_array.append(pd.DataFrame(ex_array).iloc[0::12].median().to_string())
sol_array.append(pd.DataFrame(ex_array).iloc[1::12].median().to_string())
sol_array.append(pd.DataFrame(ex_array).iloc[2::12].median().to_string())

But this is the outcome. The 0 and the apostrophes shouldn't be there.

['0   -0.075844',
 '0   -0.089111',
 '0    0.042705',
 '0    0.002147',
 '0   -0.010528',
 '0    0.109443',
 '0    0.198334',
 '0    0.20983',
 '0    0.075139',
 '0   -0.062405']

So, do you know another way to obtain the same outcome. I only have 120 values, so it is still viable to arrange the groups manually (only 10 groups) but I feel it's not an ideal solution.

Or, do you know how to correct the above method I and obtain a workable array?

goodname
  • 49
  • 4
  • have you tried rolling median option? https://pandas.pydata.org/docs/reference/api/pandas.core.window.rolling.Rolling.median.html – NoobVB May 21 '22 at 11:16
  • @NoobVB doesn't work. (For what I understand) that's for consecutive values. I want the median for every 12 values in a non-consecutive position. – goodname May 21 '22 at 11:22

2 Answers2

3

Let us use numpy operations:

np.median(np.reshape(ex_array, (12, -1), 'F'), axis=1)

array([-0.08017496, -0.04771155,  0.06031283, -0.00385278, -0.03095536,
        0.06283668,  0.15979743,  0.14921719,  0.06285071, -0.08965355,
       -0.21959495, -0.08084032])
Shubham Sharma
  • 68,127
  • 6
  • 24
  • 53
  • 2
    @goodname As you don't want a DataFrame as output, this is an ideal approach! – mozway May 21 '22 at 11:43
  • @mozway out of curiosity I timed the three alternatives and found using statistics to be a >4x faster solution than this. See the edit to my answer. – Nick May 21 '22 at 22:40
  • @Nick on which size of data? I would test on tens of thousands of rows to expect seeing the benefit of numpy. Otherwise the overhead always makes pure python solutions faster ;) – mozway May 22 '22 at 01:54
  • @mozway yeah, you're right, somewhere around 5,000 (YMMV) entries numpy becomes faster. Still, that's huge compared to OPs data (which only has 120 rows), where the pure python approach is 3.5x faster. – Nick May 22 '22 at 03:44
2

A couple of other alternatives:

list(pd.DataFrame(ex_array).groupby(lambda i:i%12).median()[0])

or

import statistics
[statistics.median(ex_array[i] for i in range(j, len(ex_array), 12)) for j in range(12)]

In both cases the output (for the data in your question) is

[
 -0.08017495795, -0.0477115501, 0.06031283475, -0.00385278371,
 -0.030955360799999998, 0.0628366769, 0.15979743200000002, 0.14921718750000001,
 0.0628507072, -0.08965355124999999, -0.21959495, -0.0808403162
]

Out of interest I timed the three alternatives (including @Shubham answer) using code like this (adapted for each version):

import timeit
timeit.timeit(setup='''
import statistics
ex_array = [-8.23294593e-02, -4.07239507e-02,  6.08131029e-02,  2.72433402e-02,
   -4.73587631e-02,  5.15452252e-02,  1.32902476e-01,  1.22322232e-01,
    2.71845990e-02, -1.16927038e-01, -2.62239877e-01, -1.46526396e-01,
   -1.82859136e-01, -1.02089602e-01, -1.91863501e-04, -5.42572200e-02,
   -1.41798506e-01,  2.32538185e-02,  1.44525705e-01,  1.33945461e-01,
    5.01618120e-02, -1.32664337e-01, -2.97395262e-01, -1.02531532e-01,
   -7.80204566e-02, -5.46991495e-02,  1.05868862e-01,  7.25526818e-03,
    5.04192997e-02,  7.41281286e-02,  1.75069159e-01,  1.64488914e-01,
    7.55396024e-02, -6.23800645e-02, -1.76950023e-01, -5.91491004e-02,
   -4.00535768e-02,  6.59473071e-04,  5.98125666e-02, -1.49608356e-02,
   -1.45519585e-02,  1.49876707e-01,  1.92880709e-01,  2.33158881e-01,
    7.59751625e-02, -2.46659059e-02, -1.40025102e-01, -3.02416639e-02]
''',
stmt='''
[statistics.median(ex_array[i] for i in range(j, len(ex_array), 12)) for j in range(12)]
''',
number=10000
)

Not unsurprisingly for the data in the question the numpy solution at 0.39sec was more than 10x faster than the pandas solution (5.45sec). However the statistics solution was >4x faster than numpy at 0.09sec. As the array gets bigger, this advantage is lost, however the break-even point is around 5,000 entries.

Nick
  • 138,499
  • 22
  • 57
  • 95