I have a dataframe like this, which represents a histogram with a bin width of .003. I want to find the median of the histogram, but I am unsure how. The median should be the value where half of the histogram's area lies to the left and half to the right.

    Count Value
    0     0.262584
    0     0.265584
    0     0.268584
    1     0.271584
    1     0.274584
    2     0.277584
    2     0.280584
    1     0.283584
    0     0.286584
    3     0.289584
    3     0.292584
    10    0.295584
    7     0.298584
    22    0.301584
    2     0.304584
    17    0.307584
    19    0.310584
    19    0.313584
    32    0.316584
    17    0.319584
    17    0.322584
    25    0.325584
    32    0.328584
    18    0.331584
    24    0.334584
    43    0.337584
    38    0.340584
    30    0.343584
    21    0.346584
    53    0.349584
    45    0.352584
    36    0.355584
    46    0.358584
    58    0.361584
    34    0.364584
    71    0.367584
    50    0.370584
    73    0.373584
    60    0.376584
    97    0.379584
    67    0.382584
    84    0.385584
    70    0.388584
    106   0.391584
    91    0.394584
    148   0.397584
    70    0.400584
    166   0.403584
    88    0.406584
    155   0.409584
    126   0.412584
    128   0.415584
    181   0.418584
    81    0.421584
    216   0.424584
    95    0.427584
    193   0.430584
    67    0.433584
    164   0.436584
    68    0.439584
    133   0.442584
    60    0.445584
    92    0.448584
    38    0.451584
    63    0.454584
    40    0.457584
    43    0.460584
    24    0.463584
    32    0.466584
    19    0.469584
    11    0.472584
    11    0.475584
    13    0.478584
    4     0.481584
    6     0.484584
    3     0.487584
    5     0.490584
    3     0.493584
    4     0.496584
    5     0.499584
    1     0.502584
    3     0.505584
    1     0.508584
    1     0.511584
    1     0.514584
    0     0.517584
    1     0.520584
    3     0.523584
    0     0.526584
    0     0.529584
    0     0.532584
    0     0.535584
    0     0.538584
    1     0.541584
    1     0.544584
    3     0.547584
    0     0.550584
    0     0.553584
    0     0.556584
    1     0.559584
    0     0.562584
    0     0.565584
    0     0.568584
    1     0.571584
    0     0.574584
    0     0.577584
    0     0.580584
    1     0.583584
    0     0.586584
    1     0.589584

My histogram looks like this:

[image: histogram of the data above]

and the code I am using to find the median is this:

    import pandas as pd
    import numpy as np

    df = pd.read_csv(r'C:\file')
    list1 = df['Value'].tolist()
    # Median of the Value column (bin positions); Count is not used here.
    median = np.median(list1)
    print(median)

which returns 0.42758.

I am not sure this is the correct method, even though the value looks reasonable, so I wanted to see what people's thoughts were.

EDIT:

This is clearly not the correct method. Here is another example where it doesn't work:

1   0.283396
0   0.286396
0   0.289396
0   0.292396
0   0.295396
3   0.298396
0   0.301396
0   0.304396
0   0.307396
0   0.310396
0   0.313396
1   0.316396
0   0.319396
0   0.322396
0   0.325396
1   0.328396
1   0.331396
2   0.334396
0   0.337396
0   0.340396
1   0.343396
5   0.346396
0   0.349396
1   0.352396
0   0.355396
0   0.358396
0   0.361396
1   0.364396
0   0.367396
1   0.370396
1   0.373396
2   0.376396
0   0.379396
1   0.382396
0   0.385396
0   0.388396
1   0.391396
0   0.394396
1   0.397396
1   0.400396
3   0.403396
4   0.406396
0   0.409396
3   0.412396
0   0.415396
3   0.418396
2   0.421396
5   0.424396
1   0.427396
3   0.430396
8   0.433396
1   0.436396
2   0.439396
1   0.442396
4   0.445396
4   0.448396
5   0.451396
1   0.454396
7   0.457396
8   0.460396
4   0.463396
5   0.466396
9   0.469396
4   0.472396
5   0.475396
6   0.478396
11  0.481396
4   0.484396
4   0.487396
6   0.490396
6   0.493396
10  0.496396
14  0.499396
7   0.502396
10  0.505396
7   0.508396
9   0.511396
8   0.514396
3   0.517396
12  0.520396
9   0.523396
9   0.526396
11  0.529396
8   0.532396
9   0.535396
15  0.538396
9   0.541396
7   0.544396
10  0.547396
6   0.550396
12  0.553396
9   0.556396
7   0.559396
6   0.562396
5   0.565396
11  0.568396
7   0.571396
12  0.574396
8   0.577396
8   0.580396
6   0.583396
9   0.586396
9   0.589396
18  0.592396
10  0.595396
14  0.598396
16  0.601396
14  0.604396
16  0.607396
12  0.610396
19  0.613396
18  0.616396
25  0.619396
22  0.622396
20  0.625396
16  0.628396
22  0.631396
18  0.634396
26  0.637396
26  0.640396
18  0.643396
26  0.646396
39  0.649396
31  0.652396
31  0.655396
37  0.658396
35  0.661396
46  0.664396
49  0.667396
47  0.670396
43  0.673396
46  0.676396
53  0.679396
52  0.682396
47  0.685396
49  0.688396
67  0.691396
58  0.694396
61  0.697396
52  0.700396
74  0.703396
79  0.706396
81  0.709396
62  0.712396
73  0.715396
97  0.718396
73  0.721396
107 0.724396
98  0.727396
89  0.730396
96  0.733396
85  0.736396
97  0.739396
102 0.742396
103 0.745396
126 0.748396
113 0.751396
112 0.754396
134 0.757396
126 0.760396
107 0.763396
120 0.766396
120 0.769396
135 0.772396
153 0.775396
143 0.778396
132 0.781396
145 0.784396
119 0.787396
124 0.790396
155 0.793396
99  0.796396
117 0.799396
127 0.802396
126 0.805396
102 0.808396
118 0.811396
76  0.814396
92  0.817396
75  0.820396
72  0.823396
59  0.826396
42  0.829396
49  0.832396
33  0.835396
38  0.838396
24  0.841396
12  0.844396
5   0.847396
15  0.850396
4   0.853396
6   0.856396
4   0.859396
2   0.862396
2   0.865396
1   0.868396
0   0.871396
2   0.874396
1   0.877396

The histogram looks like this:

[image: histogram of the second example]

and the median value from the code above is .581896, which is clearly not the value where half the area lies to the left and half to the right. In this example it is probably somewhere around .7.
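One way to check a candidate value against the "half the area on each side" definition is to compute the fraction of the total count that falls at or below it. The sketch below uses a small made-up histogram (the `values` and `counts` arrays and the helper name `fraction_left` are hypothetical, not from the data above) that is skewed to the right like the second example.

```python
import numpy as np

def fraction_left(counts, values, x):
    """Fraction of the total count in bins whose Value is <= x."""
    counts = np.asarray(counts, dtype=float)
    values = np.asarray(values, dtype=float)
    return counts[values <= x].sum() / counts.sum()

# Hypothetical toy histogram with most of the mass on the right.
values = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
counts = np.array([1, 0, 1, 2, 6, 10])

# np.median on the Value column only sees the bin positions.
mid = np.median(values)

# Only 2 of the 20 counts lie at or below `mid`, far from one half.
print(mid, fraction_left(counts, values, mid))
```

Because `np.median(values)` never looks at `counts`, it returns the middle of the bin positions regardless of how the counts are distributed.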

Stefano Potter
  • Aren't you simply calculating the median of the bin edges and discarding the information about the histogram? – cel Nov 08 '15 at 17:59
  • Yeah, I am pretty sure this isn't the correct method; it just finds the middle `Value` and does not take into account the areas covered. I will add an edit in a minute that shows an example where it clearly doesn't work. Again, I want to find the median where half the area lies to the right and half to the left. – Stefano Potter Nov 08 '15 at 18:01
  • What you conceptually want to do is: 1) Normalize the counts by the total number of observed counts. 2) Iteratively sum the normalized counts until you have accumulated more than 0.5, then return the average of the previous value and the current value as an approximation to the median. This will not be efficient, but it should help you understand how this can be solved in general. A combination of `np.cumsum` and `np.searchsorted` could also perform quite well. – cel Nov 08 '15 at 18:08
  • What you are looking for is the "weighted median", since you have the summarized distribution with frequencies. Try the wquantiles package https://pypi.python.org/pypi/wquantiles – ayhan Nov 08 '15 at 18:09
  • I'll give that a look, thanks. – Stefano Potter Nov 08 '15 at 18:14
  • And the correct way to find the median is to take the cumulative frequencies. The interval where the cumulative frequency exceeds 50% of the total contains the median (i.e. go from the minimum value to the maximum, summing the frequencies). You can just use the start/end points of the intervals, or you can use interpolation. – ayhan Nov 08 '15 at 18:15
  • My memory of statistics is a little fuzzy, would you by chance have an example of how to do this? The starting and end points are obviously straightforward to get, but finding the cumulative frequency I am unsure of. – Stefano Potter Nov 08 '15 at 18:17
  • Here on SO, we do not like to do coding for you. As I hinted before, for the cumulative frequency you may want to have a look at `np.cumsum`. I would suggest attempting to solve this from the hints we gave you and put your attempt in your question. – cel Nov 08 '15 at 18:26
  • I understand. I think I see what you were hinting at: find the cumsum, then write a function that iterates through the `Value` column until .51 of the cumulative sum is reached, and that should be approximately the median. – Stefano Potter Nov 08 '15 at 18:33
  • `np.average` has a `weights` argument; I don't think there's an equivalent for median http://stackoverflow.com/questions/20601872/numpy-or-scipy-to-calculate-weighted-median – Andy Hayden Nov 08 '15 at 18:38
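
The cumulative-frequency approach suggested in the comments could be sketched roughly as follows, using `np.cumsum` and `np.searchsorted` plus linear interpolation inside the bin that contains the median. This is only a sketch under stated assumptions: the function name `histogram_median` is made up for illustration, `Value` is assumed to be the left edge of each bin (the question does not say whether it is an edge or a center), the bins are assumed to be sorted with equal width, and observations are assumed to be spread uniformly within a bin.

```python
import numpy as np

def histogram_median(counts, values, bin_width):
    """Approximate median of binned data.

    Assumes `values` are the left edges of equal-width bins in
    ascending order, and that counts are uniform within each bin.
    """
    counts = np.asarray(counts, dtype=float)
    values = np.asarray(values, dtype=float)
    cum = np.cumsum(counts)
    total = cum[-1]
    if total <= 0:
        raise ValueError("histogram has no counts")
    half = total / 2.0
    # Index of the first bin whose cumulative count reaches half the total.
    i = np.searchsorted(cum, half)
    # Count accumulated strictly before bin i.
    below = cum[i] - counts[i]
    # Linear interpolation inside bin i.
    return values[i] + bin_width * (half - below) / counts[i]

# Toy check: total count is 8, and 4 counts lie to the left of 3.0.
print(histogram_median([1, 1, 2, 4], [0.0, 1.0, 2.0, 3.0], 1.0))  # 3.0
```

For the dataframe in the question, the call would presumably look like `histogram_median(df['Count'], df['Value'], 0.003)`. If `Value` is actually the bin center, the result would need to be shifted down by half a bin width.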
