2

I have a data set :

+-----+-------------------+---------------------+------------------+
|query|similar_queries    |model_score          |count             |
+-----+-------------------+---------------------+------------------+
|shirt|funny shirt        |0.0034038130658784866|189.0             |
|shirt|shirt womens       |0.0019435265241921438|136.0             |
|shirt|watch              |0.001097496453284101 |212.0             |
|shirt|necklace           |6.694577024597908E-4 |151.0             |
|shirt|white shirt        |0.0037413097560623485|217.0             |
|shirt|shoes              |0.0022062579255572733|575.0             |
|shirt|crop top           |9.065831060804897E-4 |173.0             |
|shirt|polo shirts for men|0.007706416273211698 |349.0             |
|shirt|shorts             |0.002669621942466027 |200.0             |
|shirt|black shirt        |0.03264296242546658  |114.0             |
+-----+-------------------+---------------------+------------------+

I am ranking the data set based on "count" first.

lazy val countWindowByFreq = Window.partitionBy(col(QUERY)).orderBy(col(COUNT).desc)
val ranked_data = data.withColumn("count_rank", row_number over countWindowByFreq)

+-----+-------------------+---------------------+------------------+----------+
|query|similar_queries    |model_score          |count             |count_rank|
+-----+-------------------+---------------------+------------------+----------+
|shirt|shoes              |0.0022062579255572733|575.0             |1         |
|shirt|polo shirts for men|0.007706416273211698 |349.0             |2         |
|shirt|white shirt        |0.0037413097560623485|217.0             |3         |
|shirt|watch              |0.001097496453284101 |212.0             |4         |
|shirt|shorts             |0.002669621942466027 |200.0             |5         |
|shirt|funny shirt        |0.0034038130658784866|189.0             |6         |
|shirt|crop top           |9.065831060804897E-4 |173.0             |7         |
|shirt|necklace           |6.694577024597908E-4 |151.0             |8         |
|shirt|shirt womens       |0.0019435265241921438|136.0             |9         |
|shirt|black shirt        |0.03264296242546658  |114.0             |10        |
+-----+-------------------+---------------------+------------------+----------+

I am now trying to rank the content using a rolling window on row_number(4 rows) and rank within the window based on model_score. For e.g.:

In the first window, row_number 1 to 4, the new rank(new column) will be

1. polo shirts for men
2. white shirt
3. shoes
4. watch

In the first window, row_number 5 to 8, the new rank(new column) will be

5. funny shirt
6. shorts
7. shirt womens 
8. crop top

In the first window, row_number 9 to rest, the new rank(new column) will be

9. black shirt 
10. shirt womens

Can some one tell me if there how can I achieve with this spark and Scala ? Is there any pre-defined functions I can use ?

I tried :

lazy val MODEL_RANK = Window.partitionBy(col(QUERY)) .orderBy(col(MODEL_SCORE).desc).rowsBetween( 0, 3)

but this gives me :

sql.AnalysisException: Window Frame ROWS BETWEEN CURRENT ROW AND 3 FOLLOWING must match the required frame ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW;

Also, tried with .rowsBetween(-3, 0) but that also gives me error :

org.apache.spark.sql.AnalysisException: Window Frame ROWS BETWEEN 3 PRECEDING AND CURRENT ROW must match the required frame ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW;
user3407267
  • 1,524
  • 9
  • 30
  • 57
  • What is the expected output data frame? – ollik1 May 14 '19 at 06:16
  • @ollik1 The expected o/p is 1. polo shirts for men 2. white shirt 3. shoes 4. watch 5. funny shirt 6. shorts 7. shirt womens 8. crop top 9. black shirt 10. shirt womens – user3407267 May 14 '19 at 07:09

1 Answers1

3

Since you have count_rank computed, next step is to find a way to group the row in a set of fours. It can be done as follows:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val ranked_data_grouped = ranked_data
  .withColumn("bucket", (($"count_rank" -1)/4).cast(IntegerType))

ranked_data_grouped will look like:

+-----+-------------------+---------------------+------------------+----------+-------+
|query|similar_queries    |model_score          |count             |count_rank|bucket |
+-----+-------------------+---------------------+------------------+----------+-------+
|shirt|shoes              |0.0022062579255572733|575.0             |1         |0      |
|shirt|polo shirts for men|0.007706416273211698 |349.0             |2         |0      |      
|shirt|white shirt        |0.0037413097560623485|217.0             |3         |0      |
|shirt|watch              |0.001097496453284101 |212.0             |4         |0      |
|shirt|shorts             |0.002669621942466027 |200.0             |5         |1      |
|shirt|funny shirt        |0.0034038130658784866|189.0             |6         |1      |
|shirt|crop top           |9.065831060804897E-4 |173.0             |7         |1      |
|shirt|necklace           |6.694577024597908E-4 |151.0             |8         |1      |
|shirt|shirt womens       |0.0019435265241921438|136.0             |9         |2      |
|shirt|black shirt        |0.03264296242546658  |114.0             |10        |2      |
+-----+-------------------+---------------------+------------------+----------+-------+

Now, all you have to do is, partition by bucket & order by model_score:

val output = ranked_data_grouped
  .withColumn("finalRank", row_number().over(Window.partitionBy($"bucket").orderBy($"model_score".desc)))
vdep
  • 3,541
  • 4
  • 28
  • 54
  • But this doesn't give finalRank from 1 to n.. it gives me 1..4 again 1..4 and so on.. Is there a way to get final rank 1.. n ie.. 1..4 (group 0) followed 5..8 (rank 1 to 4 of group1).. – user3407267 May 14 '19 at 07:29
  • 1
    I got it- val output = ranked_data_grouped .withColumn("finalRanksTemp", row_number().over(Window.partitionBy($"bucket").orderBy(col("model_score").desc))) .withColumn("finalRanks", row_number().over(Window.partitionBy($"query").orderBy(col("bucket"), col("finalRanksTemp")))) – user3407267 May 14 '19 at 07:33