
Given the following table, the task is to find, for example, the top N values of C2 for each C1.

C1 C2 
1  1
1  2
1  3
1  4
1  ...
2  1
2  2
2  3
2  4
2  ...
....

So if N = 3, the results are

C1 C2 
1  1
1  2
1  3
2  1
2  2
2  3
....

The proposed solutions use a window function with PARTITION BY.

For example,

SELECT rs.Field1, rs.Field2
FROM (
    SELECT Field1, Field2,
           RANK() OVER (PARTITION BY Section
                        ORDER BY RankCriteria DESC) AS Rank
    FROM table
) rs
WHERE Rank <= 3

I assume what it does is sort each partition and then pick the top N.

However, if some categories have fewer than N elements, we could get the top N without sorting, because the top N must include every element in the category.

The above query uses RANK(). My question also applies to other window functions such as ROW_NUMBER() or DENSE_RANK().
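As a side note, these functions differ only in how they number ties, which changes what "top N" means at the boundary. A minimal sketch using the question's placeholder names (the aliases rn, rnk, and drnk are made up for illustration):

SELECT Field1, Field2,
       -- for RankCriteria values 9, 7, 7, 5 within one Section:
       ROW_NUMBER() OVER (PARTITION BY Section ORDER BY RankCriteria DESC) AS rn,   -- 1, 2, 3, 4: ties broken arbitrarily; rn <= N keeps exactly N rows
       RANK()       OVER (PARTITION BY Section ORDER BY RankCriteria DESC) AS rnk,  -- 1, 2, 2, 4: ties share a rank; rnk <= N can keep more than N rows
       DENSE_RANK() OVER (PARTITION BY Section ORDER BY RankCriteria DESC) AS drnk  -- 1, 2, 2, 3: like RANK() but with no gaps after ties
FROM table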

Is there a way to skip the sorting in that case?

I am also not sure whether the underlying engine can optimize this case: does the inner partition/order take the outer WHERE constraint into account before sorting?

Using partition + order + where is a way to get the top-N elements from each category. It works perfectly if each category has more than N elements, but incurs an unnecessary sorting cost otherwise. My question is whether there is another approach that works well in both cases. Ideally it would do the following:

for each category {
    if # of elements <= N:
        keep all elements (no sort needed)
    else:
        sort and keep the top N
}

For example, the following works, but is there better SQL?

WITH table_with_count AS (
    SELECT Field1, Field2, Section, RankCriteria,
           COUNT(*) OVER (PARTITION BY Section) AS c
    FROM table
),

rs AS (
    SELECT Field1, Field2,
           RANK() OVER (PARTITION BY Section
                        ORDER BY RankCriteria DESC) AS Rank
    FROM table_with_count
    WHERE c > 10
)

(SELECT Field1, Field2 FROM rs WHERE Rank <= 10)
UNION
(SELECT Field1, Field2 FROM table_with_count WHERE c <= 10)
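For what it's worth, both branches can be computed in a single scan by evaluating the count and the rank over the same subquery, so the CTE is not read twice. A sketch with the question's placeholder names; note it still sorts every group, which is exactly the cost being asked about:

SELECT Field1, Field2
FROM (
    SELECT Field1, Field2,
           COUNT(*) OVER (PARTITION BY Section) AS c,
           RANK()   OVER (PARTITION BY Section
                          ORDER BY RankCriteria DESC) AS Rank
    FROM table
) t
WHERE c <= 10      -- small groups: keep everything
   OR Rank <= 10   -- large groups: keep the top 10

Strictly speaking, the c <= 10 branch is redundant for correctness: in a group with at most 10 rows no rank can exceed 10, so Rank <= 10 already returns those groups whole.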
Joe C
  • If some categories have fewer than N elements, then what do you need as output? – Zaynul Abadin Tuhin Feb 22 '19 at 06:32
  • I want to return all elements if there are fewer than N. The example query with "WHERE Rank <= 10" does something similar. – Joe C Feb 22 '19 at 06:36
  • It doesn't, and it shouldn't. For starters, such an optimization would break the contract: after the window is applied, the output data should be sorted within groups by the `ORDER BY`. Additionally, considering the specific implementation, we're much better at sorting than at looking around. Not to mention that sorting the groups is the least important factor contributing to the cost of the query. – 10465355 Feb 22 '19 at 11:42
  • @user10465355 . . . Huh? The ordering of the result set has nothing to do with the `order by` in window functions. – Gordon Linoff Feb 22 '19 at 12:52
  • @GordonLinoff Isn't it? If I'm not mistaken, Spark guarantees both the partitioner and the order within partitions (similar to `repartition` followed by `SORT BY` / `sortWithinPartitions`, custom optimizer rules notwithstanding) after a window with both `PARTITION BY` and `ORDER BY` has been applied (though I cannot say how that relates to the SQL standard). At the same time, the sort is pushed into the shuffle, so it is an intrinsic part of the logic. – 10465355 Feb 22 '19 at 12:56

2 Answers


No, and there really shouldn't be. Overall, what you describe here is the XY problem.

You seem to:

  • Worry about sorting, while in fact sorting (with an optional secondary sort) is the most efficient way of shuffling / repartitioning data, as it doesn't lead to a proliferation of file descriptors. In practice Spark strictly prefers sort over alternatives (hashing) for exactly that reason.
  • Worry about "unnecessary" sorting of small groups, when in fact the real problem is the intrinsic inefficiency of window functions, which require a full shuffle of all the data and therefore exhibit the same behavior pattern as the infamous groupByKey.

There are more efficient patterns (MLPairRDDFunctions.topByKey being the most prominent example), but these haven't been ported to the Dataset API and would require a custom Aggregator. It is also possible to approximate the selection (for example through quantile approximation), but this increases the number of passes over the data and in many cases won't provide any performance gain.
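To make the quantile idea concrete, here is a rough two-pass sketch using percentile_approx (treat its availability as an assumption: Hive and recent Spark SQL provide it) with the question's placeholder names; the 0.99 cutoff is a made-up value you would tune to your group sizes:

WITH cutoffs AS (
    -- pass 1: an approximate per-group threshold near the top of RankCriteria
    SELECT Section,
           percentile_approx(RankCriteria, 0.99) AS cutoff
    FROM table
    GROUP BY Section
)
-- pass 2: keep rows at or above the threshold; no per-group sort, but the
-- result may contain somewhat more or fewer than N rows per group,
-- since the quantile is approximate
SELECT t.Field1, t.Field2
FROM table t
JOIN cutoffs c ON t.Section = c.Section
WHERE t.RankCriteria >= c.cutoff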


This is too long for a comment.

There is no such optimization. Basically, all the data is sorted when using windowing clauses. I suppose that a database engine could actually use a hash algorithm for the partition by and a sort algorithm for the order by, but I don't think that is a common approach.

In any case, the operation is over the entire set, and it should be optimized for this purpose. Trying not to order some subsets would add a lot of overhead: for instance, counting the number of rows in each subset and then running a separate sort for each subset that still needs one.

Also note that the comparison to "3" occurs (logically) after the window function. I don't think window functions are typically optimized for such post-filtering (although once again, it is a possible optimization).
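To illustrate that evaluation order: a window function cannot even be referenced in the same query's WHERE clause, precisely because the filter is evaluated first, so the filter has to live in an outer query and applies only after every partition has been ranked. A sketch with the question's placeholder names:

-- Illegal: WHERE is evaluated before window functions are computed
-- SELECT Field1, Field2
-- FROM table
-- WHERE RANK() OVER (PARTITION BY Section ORDER BY RankCriteria DESC) <= 3

-- Legal: filter in an outer query, logically after the window
SELECT Field1, Field2
FROM (
    SELECT Field1, Field2,
           RANK() OVER (PARTITION BY Section
                        ORDER BY RankCriteria DESC) AS rnk
    FROM table
) t
WHERE rnk <= 3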

Gordon Linoff
  • I realized my question was confusing. Using partition+order+where is a way to get the top-N elements from each category. It works perfectly if each category has more than N elements, but has an additional sorting cost otherwise. My question is whether there is another approach that works well in both cases. Ideally it does the following: ``` for each category { if (# of elements <= N) { keep all } else { sort and get the top N } } ``` – Joe C Feb 23 '19 at 06:57
  • @JoeC . . . Not in SQL. Counting the number of rows in each "subpartition" is about the same effort as enumerating them. – Gordon Linoff Feb 23 '19 at 12:13
  • How about "WITH less_eq_than_n AS ( SELECT Field1, Field2 FROM table GROUP BY Section HAVING COUNT( * ) <= n ), greater_than_n AS ( SELECT Field1, Field2 FROM table GROUP BY Section HAVING COUNT( * ) > n ), rs AS ( SELECT Field1,Field2, Rank() over (Partition BY Section ORDER BY RankCriteria DESC ) AS Rank FROM greater_than_n ) (SELECT rs.Field1,rs.Field2 FROM rs WHERE Rank <= 3) union less_eq_than_n" – Joe C Feb 23 '19 at 19:30
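For reference, the query sketched in that last comment would not parse as written: Field1 and Field2 cannot be selected alongside GROUP BY Section, so the per-group counts have to be joined back to the base table. A corrected sketch under the same placeholder names, with N = 3:

WITH counts AS (
    SELECT Section, COUNT(*) AS c
    FROM table
    GROUP BY Section
),
ranked AS (
    SELECT t.Field1, t.Field2,
           RANK() OVER (PARTITION BY t.Section
                        ORDER BY t.RankCriteria DESC) AS Rank
    FROM table t
    JOIN counts c ON t.Section = c.Section
    WHERE c.c > 3        -- only the large groups get ranked (and sorted)
)
(SELECT Field1, Field2 FROM ranked WHERE Rank <= 3)
UNION
(SELECT t.Field1, t.Field2
 FROM table t
 JOIN counts c ON t.Section = c.Section
 WHERE c.c <= 3)         -- small groups are returned whole, unsorted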