I have looked all over and I'm still unable to get dplyr's `first`, `last`, and `nth` functions to work within sparklyr. I have a reproducible example below. First, some session info:

R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server 7.4 (Maipo)

I am running dplyr 0.7.4, sparklyr 0.8.3, and Spark 2.2.0.

Here is the (desired) result of running dplyr code outside of sparklyr:

library(dplyr)

set.seed(999)

df <- data.frame(group = letters[rep(1:4, each = 2)],
                 class = letters[rep(1:4, times = 2)],
                 value = rnorm(8), stringsAsFactors = FALSE)

> df
  group class      value
1     a     a -0.9677497
2     a     b -1.1210094
3     b     c  1.3254637
4     b     d  0.1339774
5     c     a  0.9387494
6     c     b  0.1725381
7     d     c  0.9576504
8     d     d -1.3626862

df %>% 
  group_by(group) %>% 
  summarize(value = sum(value),
            class = first(class))

# A tibble: 4 x 3
  group  value class
  <chr>  <dbl> <chr>
1 a     -1.59  a    
2 b      1.07  c    
3 c     -0.843 a    
4 d     -3.15  c 

However, when I copy that data.frame to Spark, the result is not what I expect:

library(sparklyr)  # `sc` below is an existing connection from spark_connect()
df <- sdf_copy_to(sc, df, "df", memory = FALSE, overwrite = TRUE)

df %>% 
  group_by(group) %>% 
  summarize(value = sum(value),
            class = first(class))

# Source:   lazy query [?? x 3]
# Database: spark_connection
  group  value class  
  <chr>  <dbl> <chr>  
1 d     -3.15  `class`
2 c     -0.843 `class`
3 b      1.07  `class`
4 a     -1.59  `class`

I also checked whether this was a namespace issue, but qualifying the call did not solve the problem:

df %>% 
  group_by(group) %>% 
  summarize(value = sum(value),
            class = dplyr::first(class))

Error in x[[n]] : object of type 'builtin' is not subsettable

In my real (non-reproducible) code I also sometimes got the following error, depending on how I changed the code, but I haven't been able to reproduce it with this example.

Error in nth(x, -1L, order_by = order_by, default = default) : 
  object 'class' not found

Any help (including alternatives) would be greatly appreciated!

Hutch3232
  • You can check [here](https://github.com/rstudio/sparklyr/issues/1051) – akrun Jul 23 '18 at 20:30
  • Thanks for the link. I tried to use the top_n function based on that link and could not get it to work. Same with first_value, though I think that function is expecting a numeric column. – Hutch3232 Jul 23 '18 at 20:48
  • `first` and `last` are non-deterministic outside a window frame context, so if you don't mind that, `min` or `max` should do just fine. `first_value` works fine (at least with 2.3.1) with the same limitation. You can also use window functions (`mutate` in place of `summarise`, plus `row_number`) to get a deterministic value, but the cost of that will be much higher. Finally, there is always the "use SQL" path. `first` and `last` won't work directly because _these are straightforward wrappers around `[[`_ and as such make no sense in a Spark context. – zero323 Jul 24 '18 at 00:49
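
The window-function route mentioned in the comment above can be sketched roughly as follows. This is a minimal sketch, not code from the thread: the helper name `first_class_by_id` and the ordering column `id` are assumptions (the question's data has no `id` column), and it has not been run against the sparklyr versions mentioned here. It is written with plain dplyr verbs so that it can be checked on a local data frame first.

```r
library(dplyr)

# Hypothetical helper (not from the thread): for each group, return the group
# total of `value` and the `class` of the row with the smallest `id`.
# Assumes `tbl` has columns group, class, value and an ordering column `id`;
# the `id` column is an assumption added to make "first" well defined.
first_class_by_id <- function(tbl) {
  tbl %>%
    group_by(group) %>%
    mutate(total = sum(value, na.rm = TRUE),  # group total as a window aggregate
           rn = row_number(id)) %>%           # rank rows within the group by id
    ungroup() %>%
    filter(rn == 1) %>%                       # keep the first row of each group
    select(group, value = total, class)
}

# Local check on a plain data frame with easy-to-verify values.
local_df <- data.frame(id = 1:8,
                       group = letters[rep(1:4, each = 2)],
                       class = letters[rep(1:4, times = 2)],
                       value = 1:8,
                       stringsAsFactors = FALSE)

result <- first_class_by_id(local_df)
print(result)
```

On a Spark table that includes such an `id` column, the same helper would be called on the `tbl_spark` instead of `local_df`. As the comment notes, the window computation is likely to cost more than a plain aggregate.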

1 Answer

I had the same problem. This should work:

df %>% 
  group_by(group) %>% 
  summarize(value = sum(value),
            class = first_value(class))

It works well with both character and numeric columns.

By the way, I'm using dplyr 0.8.0.1 and sparklyr 0.9.4.
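
A likely reason `first_value` works is that it is not a dplyr function: the SQL translation layer does not recognize it and passes the call through unchanged, leaving the database's own `first_value` to evaluate it. The sketch below illustrates that pass-through with dbplyr's simulated connection, so no Spark cluster is needed. It only shows dbplyr's generic translation, which is an assumption about how this carries over; sparklyr's own translation may differ in detail.

```r
library(dplyr)
library(dbplyr)

# A lazy table backed by dbplyr's simulated generic connection (no Spark needed).
# The column names mirror the question; the single row is placeholder data.
lf <- lazy_frame(group = "a", class = "a", value = 1)

query <- lf %>%
  group_by(group) %>%
  summarise(value = sum(value, na.rm = TRUE),
            class = first_value(class))  # unknown to dbplyr, so passed through

# Render the SQL that would be sent to the backend and print it.
sql_text <- as.character(sql_render(query))
cat(sql_text, "\n")
```

On a real Spark table, calling `show_query()` on the pipeline is one way to confirm what SQL is actually generated.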

EstevaoLuis
Ayar Paco