Using the cars
dataset (with a postgres connection con_psql
):
tmp_cars_sdf <-
copy_to(con_psql, cars, name = "tmp_cars_sdf", overwrite = T)
We see that the following two sequences of operations, that only interchange the order of filter
and arrange
, lead to different SQL translations:
tmp_cars_sdf %>%
filter(speed == 4) %>%
arrange(dist) %>%
sql_render
# <SQL> SELECT *
# FROM "tmp_cars_sdf"
# WHERE ("speed" = 4.0)
# ORDER BY "dist"
vs
tmp_cars_sdf %>%
arrange(dist) %>%
filter(speed == 4) %>%
sql_render
# <SQL> SELECT *
# FROM (SELECT *
# FROM "tmp_cars_sdf"
# ORDER BY "dist") "dbplyr_006"
# WHERE ("speed" = 4.0)
I am not an expert in SQL but it seems like the ordering is only guaranteed in the former, and not the latter, when collecting results, or using the remote table in further operations:
Is order in a subquery guaranteed to be preserved? https://dba.stackexchange.com/questions/82930/database-implementations-of-order-by-in-a-subquery
What should I do about this?
Edit:
Having done some further investigation I am not sure whether this is important from a data analysis point of view (i.e. treating remote tables just like data frames in ram).
tmp_cars_sdf %>%
arrange(dist) %>%
group_by(speed) %>%
filter(dist > 10) %>%
mutate(lag_dist = lag(dist)) %>%
sql_render
# <SQL> SELECT "speed", "dist", LAG("dist", 1, NULL) OVER
# (PARTITION BY "speed" ORDER BY "dist") AS "lag_dist"
# FROM (SELECT *
# FROM "tmp_cars_sdf"
# ORDER BY "dist") "dbplyr_014"
# WHERE ("dist" > 10.0)
While it seems that the arrange()
seems to be translated into ORDER BY
in a subquery, it turns out that this is not important, at least for calculating a new variable based on the order since the ordering information is not 'forgotten'.