
In Snowpark (Python API, version 0.11.0), I try to order a DataFrame by an attribute COUNT_OBJ and then show the top 5 EVENTDATEs. I realized that the subsequent "select" destroys the ordering of the DataFrame. Is that to be expected?
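The pattern is roughly the following (a sketch, assuming a DataFrame df with the columns COUNT_OBJ and EVENTDATE mentioned above):

from snowflake.snowpark.functions import col

# order by COUNT_OBJ descending, then project EVENTDATE and show the first 5 rows
df.sort(col("COUNT_OBJ").desc()).select("EVENTDATE").show(5)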

As a long-time Spark developer, I find this behavior unexpected.

EDIT: More output, as requested in the comments:

(screenshot with the additional query output omitted)

Raphael Roth
  • With a small table and unique values for COUNT_OBJ I am not able to reproduce (even setting USE_CACHED_RESULT=False). Do you have duplicate values for COUNT_OBJ? Can you maybe select COUNT_OBJ as well in second example and show the values? – Sergiu Oct 10 '22 at 07:41
  • @Sergiu there are some duplicates, but not in the top 5. It seems that the ordering is completely random (and non-deterministic); re-running the query will produce other results. I updated the pictures above – Raphael Roth Oct 10 '22 at 07:51
  • I think it happens as explained [here](https://community.snowflake.com/s/article/SELECT-query-with-LIMIT-clause-returns-non-deterministic-result-if-ORDER-BY-clause-exists-in-different-level). The queries on the SELECT (after the ORDER BY) include the LIMIT outside of the ORDER BY, therefore producing non-deterministic results. – Sergiu Oct 10 '22 at 08:03
  • Maybe related: https://github.com/snowflakedb/snowpark-python/issues/646 – Raphael Roth Feb 01 '23 at 08:42

2 Answers


As pointed out by @Sergiu in the comments, the problem is the "limit" in the generated SQL (see https://community.snowflake.com/s/article/SELECT-query-with-LIMIT-clause-returns-non-deterministic-result-if-ORDER-BY-clause-exists-in-different-level).

(screenshot of the generated SQL omitted)

In my case, the solution is to move the "limit" so that it is applied before the "select":
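A minimal sketch of that reordering (assuming a DataFrame df with columns EVENTDATE and COUNT_OBJ, as in the question):

from snowflake.snowpark.functions import col

# take the top 5 rows of the sorted result first, then project EVENTDATE
df.sort(col("COUNT_OBJ").desc()).limit(5).select("EVENTDATE").show()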

In this case, the ordering is preserved.

Raphael Roth

I think what's happening here is that you are sorting the df, but "show" will pull rows at random, not necessarily the first 5 entries. If you want to order the entire DataFrame and then get the first few entries, you can use limit and cache_result():

from snowflake.snowpark.functions import col
df.sort(col("count_obj"), ascending=False).cache_result().limit(5).show()
LMP