
In Snowpark (Python API, version 0.11.0), I try to order a DataFrame by an attribute COUNT_OBJ and then show the top 5 EVENTDATEs. I realized that the subsequent "select" destroys the ordering of the DataFrame. Is that to be expected?
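The pattern is roughly the following (a sketch, assuming a DataFrame df with the columns COUNT_OBJ and EVENTDATE mentioned above):

from snowflake.snowpark.functions import col

# order by COUNT_OBJ descending, then project EVENTDATE and show the first 5 rows
df.sort(col("COUNT_OBJ").desc()).select("EVENTDATE").show(5)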

As a long-time Spark developer, I find this behavior unexpected.

EDIT: More output, as requested in the comments:

(screenshot with the additional query output omitted)

Raphael Roth
  • With a small table and unique values for COUNT_OBJ I am not able to reproduce (even setting USE_CACHED_RESULT=False). Do you have duplicate values for COUNT_OBJ? Can you maybe select COUNT_OBJ as well in second example and show the values? – Sergiu Oct 10 '22 at 07:41
  • @Sergiu there are some duplicates, but not in the top 5. It seems that the ordering is completely random (and non-deterministic); re-running the query will produce other results. I updated the pictures above – Raphael Roth Oct 10 '22 at 07:51
  • I think it happens as explained [here](https://community.snowflake.com/s/article/SELECT-query-with-LIMIT-clause-returns-non-deterministic-result-if-ORDER-BY-clause-exists-in-different-level). The queries on the SELECT (after the ORDER BY) include the LIMIT outside of the ORDER BY, therefore producing non-deterministic results. – Sergiu Oct 10 '22 at 08:03
  • Maybe related: https://github.com/snowflakedb/snowpark-python/issues/646 – Raphael Roth Feb 01 '23 at 08:42

2 Answers


As pointed out by @Sergiu in the comments, the problem is the "limit" in the generated SQL (see https://community.snowflake.com/s/article/SELECT-query-with-LIMIT-clause-returns-non-deterministic-result-if-ORDER-BY-clause-exists-in-different-level).

(screenshot of the generated SQL omitted)

In my case, the solution is to move the "limit" so that it is applied before the "select":
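A minimal sketch of that reordering (assuming a DataFrame df with columns EVENTDATE and COUNT_OBJ, as in the question):

from snowflake.snowpark.functions import col

# take the top 5 rows of the sorted result first, then project EVENTDATE
df.sort(col("COUNT_OBJ").desc()).limit(5).select("EVENTDATE").show()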

In this case, the ordering is preserved.

Raphael Roth

I think what's happening here is that you are sorting the df, but "show" will pull rows at random, not necessarily the first 5 entries. If you want to order the entire DataFrame and then get the first few entries, you can use limit and cache_result():

from snowflake.snowpark.functions import col
df.sort(col("count_obj"), ascending=False).cache_result().limit(5).show()
LMP