
I'm testing the NTILE function on a simple dataset like this:

(id: string, value: double)
A 10
B 3
C 4
D 4
E 4
F 30
C 30
D 10
A 4
H 4

Running the following query against Hive (on MapReduce)

SELECT tmp.id, tmp.sum_val, NTILE(4) OVER (ORDER BY tmp.sum_val) AS quartile
FROM (SELECT id, SUM(value) AS sum_val FROM testntile GROUP BY id) AS tmp

works fine with the following result:

(id, sum_val, quartile)
B 3 1
H 4 1
E 4 2
D 14 2
A 14 3
F 30 3
C 34 4
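For reference, this bucket assignment can be checked with a minimal standalone sketch of the NTILE semantics (plain Python, not Spark; `ntile_buckets` is a hypothetical helper of mine): the rows are ordered, split into 4 near-equal buckets, and the earlier buckets take the extra rows.

```python
def ntile_buckets(size, n):
    """Bucket numbers NTILE(n) assigns to `size` ordered rows:
    the first (size % n) buckets each get one extra row."""
    base, extra = divmod(size, n)
    buckets = []
    for b in range(1, n + 1):
        buckets += [b] * (base + (1 if b <= extra else 0))
    return buckets

# Grouped rows from the question, ordered by sum_val:
rows = [("B", 3), ("H", 4), ("E", 4), ("D", 14),
        ("A", 14), ("F", 30), ("C", 34)]
for (id_, sum_val), q in zip(rows, ntile_buckets(len(rows), 4)):
    print(id_, sum_val, q)
# B 3 1 / H 4 1 / E 4 2 / D 14 2 / A 14 3 / F 30 3 / C 34 4
```

With 7 rows and 4 buckets, the first three buckets get two rows each and the last gets one, which matches the Hive output above (ties like E/H may order either way).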

Running the same query against Hive on Spark (v 1.5) still works fine.

Running the same query against Spark SQL 1.5 (CDH 5.5.1)

val result = sqlContext.sql(
  """SELECT tmp.id, tmp.sum_val, NTILE(4) OVER (ORDER BY tmp.sum_val) AS quartile
    |FROM (SELECT id, SUM(value) AS sum_val FROM testntile GROUP BY id) AS tmp""".stripMargin)
result.collect().foreach(println)

I get the following wrong result:

[B,3.0,0]
[E,4.0,0]
[H,4.0,0]
[A,14.0,0]
[D,14.0,0]
[F,30.0,0]
[C,34.0,0]

IMPORTANT: the result is not deterministic; sometimes the correct values are returned.

Running the same algorithm directly on the DataFrame

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.ntile

val x = sqlContext.sql("SELECT id, SUM(value) AS sum_val FROM testntile GROUP BY id")
val w = Window.partitionBy("id").orderBy("sum_val")
val resultDF = x.select(x("id"), x("sum_val"), ntile(4).over(w))

still returns a wrong result.

Am I doing something wrong? Any ideas? Thanks in advance for your answers.

  • I cannot reproduce the problem. Regarding the last query on DataFrame it is not the same. – zero323 Dec 15 '15 at 17:30
  • Hi, thanks for your reply. In the last example I am not trying to execute exactly the same query; instead I am trying to obtain the same result by means of the low-level Spark window functions introduced in version 1.4 (unfortunately the final result is wrong in this case too :-/). – steccami Dec 16 '15 at 09:03

1 Answer


If you use Window.partitionBy("id").orderBy("sum_val"), you partition the window by id before applying the ntile function. Since the ids are already unique after the GROUP BY, every window partition contains exactly one row, so ntile assigns the same bucket to every id.

To get your first result, remove partitionBy("id") and use only Window.orderBy("sum_val").
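The effect is easy to see in a standalone sketch of the NTILE semantics (plain Python, not Spark; the `ntile` helper is mine): NTILE runs independently inside each window partition.

```python
def ntile(size, n):
    """Bucket numbers NTILE(n) assigns to `size` ordered rows."""
    base, extra = divmod(size, n)
    out = []
    for b in range(1, n + 1):
        out += [b] * (base + (1 if b <= extra else 0))
    return out

# With partitionBy("id"), every partition holds a single row,
# so every id gets the same quartile:
print(ntile(1, 4))  # [1]
# Without partitionBy, all 7 grouped rows share one partition:
print(ntile(7, 4))  # [1, 1, 2, 2, 3, 3, 4]
```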

This is how I modify your code:

// Note: a window without partitionBy moves all rows into a single partition,
// so Spark will warn about performance on large data.
val w = Window.orderBy("sum_val")
val resultDF = x.orderBy("sum_val").select(x("id"), x("sum_val"), ntile(4).over(w))

And this is the print of resultDF.show():

+---+-------+-----+
| id|sum_val|ntile|
+---+-------+-----+
|  B|      3|    1|
|  E|      4|    1|
|  H|      4|    2|
|  D|     14|    2|
|  A|     14|    3|
|  F|     30|    3|
|  C|     34|    4|
+---+-------+-----+
