
data.table provides an rleid function which I find invaluable: it acts as a ticker that increments whenever the watched variable(s) change, ordered by some other variable(s).

library(dplyr)


tbl = tibble(time = as.integer(c(1, 2, 3, 4, 5, 6, 7, 8)), 
             var  = c("A", "A", "A", "B", "B", "A", "A", "A"))

> tbl
# A tibble: 8 × 2
   time   var
  <int> <chr>
1     1     A
2     2     A
3     3     A
4     4     B
5     5     B
6     6     A
7     7     A
8     8     A

The desired result is:

> tbl %>% mutate(rleid = data.table::rleid(var))
# A tibble: 8 × 3
   time   var rleid
  <int> <chr> <int>
1     1     A     1
2     2     A     1
3     3     A     1
4     4     B     2
5     5     B     2
6     6     A     3
7     7     A     3
8     8     A     3
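
For reference, the same run index can be computed without data.table using base R's rle(). This is just a sketch (the helper name rleid_base is mine), and it assumes the rows are already ordered by time:

rleid_base <- function(x) {
  # lengths/values of consecutive runs of identical values
  r <- rle(as.character(x))
  # repeat each run's index once per row in that run
  rep(seq_along(r$lengths), r$lengths)
}

tbl %>% mutate(rleid = rleid_base(var))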

I was wondering whether I could reproduce something similar using the tools provided by sparklyr. In testing, the best I could do was get to the point at which I needed a forward fill, but I couldn't work out how to do that fill.

library(sparklyr)

spark_install(version = "2.0.2")
sc <- spark_connect(master = "local", 
                    spark_home = spark_home_dir())


spk_tbl = copy_to(sc, tbl, overwrite = TRUE)

spk_tbl %>% 
  mutate(var2 = (var != lag(var, 1L, order = time))) %>%  # Thanks @JaimeCaffarel
  mutate(var3 = if(var2) { paste0(time, var) } else { NA })

Source:   query [8 x 4]
Database: spark connection master=local[4] app=sparklyr local=TRUE

   time   var  var2  var3
  <int> <chr> <lgl> <chr>
1     1     A  TRUE    1A
2     2     A FALSE  <NA>
3     3     A FALSE  <NA>
4     4     B  TRUE    4B
5     5     B FALSE  <NA>
6     6     A  TRUE    6A
7     7     A FALSE  <NA>
8     8     A FALSE  <NA>
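
For anyone who wants to finish the forward-fill route, here is a rough sketch in raw Spark SQL through the DBI interface. It assumes copy_to() registered the data under the temporary view name "tbl", and relies on LAST(..., TRUE) ignoring nulls so that each row picks up the most recent non-null run key:

DBI::dbGetQuery(sc, "
  SELECT time, var,
         -- carry the last non-null run key forward, in time order
         LAST(var3, TRUE) OVER (ORDER BY time
                                ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS run_key
  FROM (
    -- key the first row of each run; all other rows stay NULL
    SELECT time, var,
           CASE WHEN LAG(var, 1) OVER (ORDER BY time) IS NULL
                  OR var != LAG(var, 1) OVER (ORDER BY time)
                THEN CONCAT(CAST(time AS STRING), var) END AS var3
    FROM tbl
  ) flagged
")

Turning run_key into a dense integer index would still take another step (e.g. a DENSE_RANK over the first time of each key), which is why the cumsum approach in the comments below ends up being the simpler answer.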

I've tried using SparkR; however, I much prefer the sparklyr interface and its ease of use, so I'd ideally like to do this in Spark SQL.

I can, of course, already do this by partitioning the data into small enough chunks, collecting each chunk, running a function locally, and sending the results back.
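
A rough sketch of that workaround for this toy example (collecting everything in one go; with real data you would collect one chunk per partition and row-bind the results):

local_tbl <- spk_tbl %>% 
  dplyr::arrange(time) %>% 
  dplyr::collect()                                   # pull the (small) data back to R

local_tbl <- local_tbl %>% 
  dplyr::mutate(rleid = data.table::rleid(var))      # run rleid locally

spk_rleid <- copy_to(sc, local_tbl, "tbl_rleid", overwrite = TRUE)   # send it back to Spark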

For context, the reason I find rleid useful is that I work with a lot of train data, where it's handy to be able to index which run each record belongs to.

Thanks for any help, Akhil

Akhil Nair
  • I think you can use this: `tbl %>% mutate(rleid = (var != lag(var, 1, default = "asdf"))) %>% mutate(rleid = cumsum(rleid))` It's basically this solution: http://stackoverflow.com/a/33510765/2026277 – Jaime Caffarel Feb 11 '17 at 11:52
  • @JaimeCaffarel I didn't notice the neat `cumsum` way to do that... Unfortunately it seems `cumsum` doesn't work in Spark-SQL (or at least I can't make it work). `spk_tbl %>% arrange(time) %>% mutate(rleid = (var != lag(var, 1, order = time, default = FALSE))) %>% mutate(rleid = cumsum(rleid))` – Akhil Nair Feb 11 '17 at 12:05
  • OH! I'm wrong - I just needed to cast the boolean to an int first. Thanks! Consider adding as an answer + I can accept. Thanks again – Akhil Nair Feb 11 '17 at 12:15

2 Answers


A working solution in sparklyr would be this:

spk_tbl %>% 
  dplyr::arrange(time) %>% 
  dplyr::mutate(rleid = (var != lag(var, 1, order = time, default = FALSE))) %>% 
  dplyr::mutate(rleid = cumsum(as.numeric(rleid)))
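
As an optional check, dplyr::show_query() prints the Spark SQL that this lazy pipeline generates without collecting anything, which makes it easy to confirm the boolean-to-numeric cast and the window ordering:

spk_tbl %>% 
  dplyr::arrange(time) %>% 
  dplyr::mutate(rleid = (var != lag(var, 1, order = time, default = FALSE))) %>% 
  dplyr::mutate(rleid = cumsum(as.numeric(rleid))) %>% 
  dplyr::show_query()   # prints the generated SQL; nothing is collected
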
Jaime Caffarel

Try this:

tbl %>% mutate(run = c(0,cumsum(var[-1L] != var[-length(var)])))
# A tibble: 8 × 3
   time   var   run
  <int> <chr> <dbl>
1     1     A     0
2     2     A     0
3     3     A     0
4     4     B     1
5     5     B     1
6     6     A     2
7     7     A     2
8     8     A     2
hvollmeier
  • Thanks. I'd just point out that this still suffers from the same casting issue in Spark SQL as the original comment answer. I also prefer the `lead`/`lag` solution as it enforces a deterministic ordering! – Akhil Nair Feb 11 '17 at 18:38