I have looked all over and still cannot get dplyr's first(), last(), and nth() to work within sparklyr. A reproducible example is below. First, some session info:
R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server 7.4 (Maipo)
I am running dplyr 0.7.4, sparklyr 0.8.3, and Spark 2.2.0.
Here is the (desired) result of running dplyr code outside of sparklyr:
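library(dplyr)

set.seed(999)
# rep() has no `by` argument (it was silently ignored); `times = 2` yields the same class column
df <- data.frame(group = letters[rep(1:4, each = 2)],
                 class = letters[rep(1:4, times = 2)],
                 value = rnorm(8), stringsAsFactors = FALSE)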
> df
  group class      value
1     a     a -0.2817402
2     a     b -1.3125596
3     b     c  0.7951840
4     b     d  0.2700705
5     c     a -0.2773064
6     c     b -0.5660237
7     d     c -1.8786559
8     d     d -1.2667911
df %>%
  group_by(group) %>%
  summarize(value = sum(value),
            class = first(class))
# A tibble: 4 x 3
  group  value class
  <chr>  <dbl> <chr>
1 a     -1.59  a
2 b      1.07  c
3 c     -0.843 a
4 d     -3.15  c
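However, when I copy that data.frame to Spark, the result is not what I expect: every value in the class column is the literal string `class` rather than the first class in each group: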
library(sparklyr)
# `sc` is an already-open Spark connection created with spark_connect()
df <- sdf_copy_to(sc, df, "df", memory = FALSE, overwrite = TRUE)
df %>%
  group_by(group) %>%
  summarize(value = sum(value),
            class = first(class))
# Source:   lazy query [?? x 3]
# Database: spark_connection
  group  value class
  <chr>  <dbl> <chr>
1 d     -3.15  `class`
2 c     -0.843 `class`
3 b      1.07  `class`
4 a     -1.59  `class`
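In case it helps with diagnosis, the SQL that dplyr generates for this pipeline can be printed with show_query() instead of running it. This is only a minimal sketch: master = "local" is an assumed connection setting, and the names local_df, spark_df, and agg are placeholders introduced here for illustration.

```r
library(dplyr)
library(sparklyr)

# Assumption: a local Spark installation; substitute your own master/config.
sc <- spark_connect(master = "local")

# Same data as in the example above.
set.seed(999)
local_df <- data.frame(group = letters[rep(1:4, each = 2)],
                       class = letters[rep(1:4, times = 2)],
                       value = rnorm(8), stringsAsFactors = FALSE)
spark_df <- sdf_copy_to(sc, local_df, "df", memory = FALSE, overwrite = TRUE)

# Build the lazy query; nothing is executed on Spark yet.
agg <- spark_df %>%
  group_by(group) %>%
  summarize(value = sum(value),
            class = first(class))

# Print the translated SQL so the handling of first(class) is visible.
show_query(agg)

spark_disconnect(sc)
```

Comparing how first(class) appears in that SQL with how sum(value) appears should show whether the column name is being passed as a column reference or as a quoted string.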
I also checked whether this was a namespace issue, but qualifying the call with dplyr:: only produces a different error:
df %>%
  group_by(group) %>%
  summarize(value = sum(value),
            class = dplyr::first(class))
Error in x[[n]] : object of type 'builtin' is not subsettable
In my real (non-reproducible) code I also sometimes got the following error, depending on how I changed the code, but I have not been able to reproduce it with this example:
Error in nth(x, -1L, order_by = order_by, default = default) :
object 'class' not found
Any help (including alternatives) would be greatly appreciated!