0

My Hive table currently looks like this:

Numbers
0
0
-0.12745098
-0.218905473
0.026011561
0.235294118
-0.028
-0.052356021
0.052753355
0.008032129
0.012768817
0.115384615
0.040816327

The type is DOUBLE_TYPE. I would like to calculate the median. I would expect the answer to be 0.008032129, since this is the 7th observation ordering my numbers.

When I run this code (as suggested here How to calculate median in Hive):

select percentile_approx(Numbers, 0.5) AS Numbers
from tryout1

The answer I get is : 0.0040160642570281121. This is unexpected, and not even one of the numbers in my list! Does anyone know why Hive gives me this number, and what I should fix to make it work? If you know an entirely different way to calculate the median, I am also very interested!

James Z
  • 12,209
  • 10
  • 24
  • 44
LisaH
  • 25
  • 7

2 Answers2

1

Indeed the function percentile_approx in hive is not performing well. Kudos to Liza for getting an approx answer:

FROM MY TRIALs:

select percentile_approx(numbers , 0.5 , 10 ) as A_mdn from tryout1 ;
-0.007249852187499999

FROM LIZA:

select (percentile(cast((numbers*1000000) as BIGINT), 0.5))/1000000 as A_mdn from tryout1;
0.008032
jose_bacoy
  • 12,227
  • 1
  • 20
  • 38
0

You can use the percentile function to compute the median and Try to cast the complete column into int or BIGINT and see if you come close to the answer. Try this:

select percentile(cast(g_rek_brutowinst as BIGINT), 0.5) AS g_rek_brutowinst from tryout1
Sahil Desai
  • 3,418
  • 4
  • 20
  • 41
  • I don't think casting it as BIGINT will work, since that would reduce all my numbers to 0. – LisaH Mar 01 '18 at 09:11
  • yes, try this. Lets see we can get proper output or not. – Sahil Desai Mar 01 '18 at 09:12
  • Yes, it gave me the answer 0. Which makes sense , since it casts to INT. But in this case, its important that I get 0.008032129 and not 0. – LisaH Mar 01 '18 at 09:14
  • Yes, I tried your exact code. It gave me the answer 0. Not 0.008032129. – LisaH Mar 01 '18 at 09:18
  • Like this? select percentile(g_rek_brutowinst, 0.5) AS g_rek_brutowinst from tryout1 – LisaH Mar 01 '18 at 09:23
  • I used this simple query SELECT percentile(Numbers, 0.5) median FROM df and got the proper result as 0.008032129. – Sahil Desai Mar 01 '18 at 10:50
  • If I may ask: what are you using? Because it gives me this error : Error while compiling statement: FAILED: NoMatchingMethodException No matching method for class org.apache.hadoop.hive.ql.udf.UDAFPercentile with (double, double). Possible choices: _FUNC_(bigint, array) _FUNC_(bigint, double) – LisaH Mar 01 '18 at 12:00
  • can you show me your query. there is might be some mistake from your side. – Sahil Desai Mar 01 '18 at 12:01
  • SELECT percentile(g_rek_brutowinst, 0.5)median FROM tryout1 BTW: I'm currently also trying this select (percentile(cast((g_rek_brutowinst*1000000) as BIGINT), 0.5))/1000000 AS g_rek_brutowinst from tryout1 , which is not very pretty but helps with keeping some digits. – LisaH Mar 01 '18 at 12:32