dplyr's arrange operation does not seem to be commutative within a remote source, when does it matter?

Question

Using the cars dataset (with a postgres connection con_psql):

tmp_cars_sdf <-
    copy_to(con_psql, cars, name = "tmp_cars_sdf", overwrite = T)

We see that the following two sequences of operations, that only interchange the order of filter and arrange, lead to different SQL translations:

tmp_cars_sdf %>%
    filter(speed == 4) %>%
    arrange(dist) %>%
    sql_render

# <SQL> SELECT *
#     FROM "tmp_cars_sdf"
# WHERE ("speed" = 4.0)
# ORDER BY "dist"

vs

tmp_cars_sdf %>%
    arrange(dist) %>%
    filter(speed == 4) %>%
    sql_render

# <SQL> SELECT *
#     FROM (SELECT *
#               FROM "tmp_cars_sdf"
#           ORDER BY "dist") "dbplyr_006"
# WHERE ("speed" = 4.0)

I am not an expert in SQL but it seems like the ordering is only guaranteed in the former, and not the latter, when collecting results, or using the remote table in further operations:

Is order in a subquery guaranteed to be preserved? https://dba.stackexchange.com/questions/82930/database-implementations-of-order-by-in-a-subquery

What should I do about this?

Edit:

Having done some further investigation I am not sure whether this is important from a data analysis point of view (i.e. treating remote tables just like data frames in ram).

tmp_cars_sdf %>%
    arrange(dist) %>%
    group_by(speed) %>%
    filter(dist > 10) %>%
    mutate(lag_dist = lag(dist)) %>%
    sql_render

# <SQL> SELECT "speed", "dist", LAG("dist", 1, NULL) OVER 
# (PARTITION BY "speed" ORDER BY "dist") AS "lag_dist"
# FROM (SELECT *
#           FROM "tmp_cars_sdf"
#       ORDER BY "dist") "dbplyr_014"
# WHERE ("dist" > 10.0)

While it seems that the arrange() seems to be translated into ORDER BY in a subquery, it turns out that this is not important, at least for calculating a new variable based on the order since the ordering information is not 'forgotten'.

I think this has been discussed on the `dbplyr` bug tracker: https://github.com/tidyverse/dbplyr/issues/276, I don't know if they are likely to "fix it" or whether this is just an issue that can't be solved 100% consistently. — Marius, Aug 19 '19 at 05:30
I don't know enough about SQL to know whether this is "terrible". If there's a good solution you should absolutely suggest it on the bug tracker, I guess it would need to work for a variety of different DB backends and uses. — Marius, Aug 19 '19 at 05:40

score 1 · Answer 1 · answered Aug 20 '19 at 04:20

You are correct, ordering is only preserved in your first example, not in your second. In fact, your second example may cause errors in some versions of SQL that do not accept ORDER BY clauses in subqueries.

When you use dbplyr the remote table is essentially defined by the SQL query that constructs the current state of the table. dbplyr translation into SQL works incrementally/one command at a time, taking the existing query and augmenting it to reflect the next transformation. This often results in the previous query becoming a subquery of the new query.

In general, only when results are returned to R (such as by collect) does the query evaluate. Some tricks can be used to force & save intermediate evaluations, but I have not found any that guarantee ordering.

Given the above, my recommendation is:

Only use the arrange command for preparing output for viewing by a human as databases generally do not care about order.
Use the order by clauses within functions instead of arrange when order matters for creating new variables. For example:
- Use my_data %>% mutate(new = lag(old, order_by = "date"))
- Instead of my_data %>% arrange(date) %>% mutate(new = lag(old))

You may also find this helpful for row numbers/rank.

dplyr's arrange operation does not seem to be commutative within a remote source, when does it matter?

Edit:

1 Answers1