I'm testing the NTILE function on a simple dataset like this:
(id: string, value: double)
A 10
B 3
C 4
D 4
E 4
F 30
C 30
D 10
A 4
H 4
Running the following query against HIVE (on MapReduce)
SELECT tmp.id, tmp.sum_val, NTILE(4) OVER (ORDER BY tmp.sum_val) AS quartile FROM (SELECT id, sum(value) AS sum_val FROM testntile GROUP BY id) AS tmp
works fine with the following result:
(id, sum_val, quartile)
B 3 1
H 4 1
E 4 2
D 14 2
A 14 3
F 30 3
C 34 4
Running the same query against Hive on Spark (v 1.5) still works fine.
Running the same query against Spark SQL 1.5 (CDH 5.5.1)
val result = sqlContext.sql("SELECT tmp.id, tmp.sum_val, NTILE(4) OVER (ORDER BY tmp.sum_val) AS quartile FROM (SELECT id, sum(value) AS sum_val FROM testntile GROUP BY id) AS tmp")
result.collect().foreach(println)
I get the following wrong result:
[B,3.0,0]
[E,4.0,0]
[H,4.0,0]
[A,14.0,0]
[D,14.0,0]
[F,30.0,0]
[C,34.0,0]
IMPORTANT: the result is NOT deterministic because "sometimes" correct values are returned
Running the same algorithm directly on the dataframe
val x = sqlContext.sql("select id, sum(value) as sum_val from testntile group by id")
val w = Window.partitionBy("id").orderBy("sum_val")
val resultDF = x.select( x("id"),x("sum_val"), ntile(4).over(w) )
still returns a wrong result.
Am I doing something wrong? Any ideas? Thanks in advance for your answers.