this:

drop table if exists temp_a;
create table temp_a
as
select 
 case when rand(123) < 0.4 then 1 
      when  rand(123) >= 0.4 and rand(123) < 0.8 then 2 
      else 3 end as label 
from source_data ;

select label, count(1) as count from temp_a group by label;

But the result is:

label   count(1)
1       111175
2       80509
3       87690

Distribution:

label   count / sum
1       40%
2       28%
3       32%

I expected the distribution to be 40% / 40% / 20%, but it is not. Why?

cai zoro
  • Are you trying to use the output of a random function and expecting it to be certain? It's `random`, isn't it, so the distribution will be random? – Koushik Roy Mar 03 '23 at 06:49
  • In Hive, the rand function returns values like -0.03, not values in [0, 1]. – cai zoro Mar 06 '23 at 03:37
  • 0.03 is between 0 and 1, and rand() can return any value in that range, such as 0.16972572083627802: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-MathematicalFunctions The requirement is still unclear. – Koushik Roy Mar 06 '23 at 05:01
  • -0.03, not +0.03. – cai zoro Mar 06 '23 at 06:02
  • Hive's rand() function always generates a value between 0 and 1. It can't be negative. If your Hive is giving you a negative number, please tell me your tool name and Hive version. – Koushik Roy Mar 06 '23 at 06:39
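
One reading that fits the counts in the question: if each of the three rand(123) calls in the CASE expression produces its own draw for a row (an assumption about how the query was evaluated, not something confirmed in this thread), the expected shares are 0.4 for label 1, 0.6 × 0.6 × 0.8 = 0.288 for label 2, and 0.312 for label 3, which is close to the observed 40% / 28% / 32%. The Python sketch below only simulates that assumption; it does not run Hive, and the function names are hypothetical.

```python
import random
from collections import Counter


def shares_separate_draws(n_rows, seed=123):
    """Assumption: every rand() call in the CASE is an independent draw."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_rows):
        if rng.random() < 0.4:
            label = 1
        elif rng.random() >= 0.4 and rng.random() < 0.8:
            label = 2
        else:
            label = 3
        counts[label] += 1
    return {label: counts[label] / n_rows for label in (1, 2, 3)}


def shares_single_draw(n_rows, seed=123):
    """Contrast: one draw per row, reused by every branch of the CASE."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_rows):
        r = rng.random()
        if r < 0.4:
            label = 1
        elif r < 0.8:
            label = 2
        else:
            label = 3
        counts[label] += 1
    return {label: counts[label] / n_rows for label in (1, 2, 3)}


# 279374 is the total row count from the question (111175 + 80509 + 87690).
separate = shares_separate_draws(279_374)
single = shares_single_draw(279_374)
print("separate draws:", separate)
print("single draw:   ", single)
```

If this reading is right, drawing the random value once per row (for example in a subquery column) and branching on that single value would be the natural way to aim for 40% / 40% / 20%.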
