I have a Spark DataFrame in a sparklyr session, and I'm trying to extract elements from an array column.
df <- copy_to(sc, data.frame(A = c(1, 2), B = c(3, 4))) ## BUILD DATAFRAME
dfnew <- df %>% mutate(C = Array(A, B)) %>% select(C) ## CREATE ARRAY COL
> dfnew ## VIEW DATAFRAME
# Source: spark<?> [?? x 1]
  C
  <list>
1 <dbl [2]>
2 <dbl [2]>
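To check what the arrays actually contain, rather than the <dbl [2]> placeholders, the result is small enough to pull into R; tidyr::unnest() here is just one convenient way to flatten the list column:

dfnew %>% collect() %>% tidyr::unnest(C) ## INSPECT VALUES LOCALLY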
dfnew %>% sdf_schema() ## VERIFY COLUMN TYPE IS ARRAY
$C
$C$name
[1] "C"

$C$type
[1] "ArrayType(DoubleType,true)"
I can extract an element with "mutate"...
dfnew %>% mutate(myfirst_element = C[[1]])
# Source: spark<?> [?? x 2]
  C         myfirst_element
  <list>              <dbl>
1 <dbl [2]>               3
2 <dbl [2]>               4
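Note that C holds (1, 3) in row 1 and (2, 4) in row 2, yet C[[1]] came back as 3 and 4: the [[ ]] index apparently maps onto Spark's zero-based array subscript rather than R's one-based convention. To see exactly what the translation layer generates (the rendered SQL will depend on your sparklyr/dbplyr versions):

dfnew %>% mutate(myfirst_element = C[[1]]) %>% show_query() ## INSPECT GENERATED SQL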
But I want to extract an element on the fly with "select". However, all attempts just return the full column:
> dfnew %>% select("C"[1])
# Source: spark<?> [?? x 1]
  C
  <list>
1 <dbl [2]>
2 <dbl [2]>
> dfnew %>% select("C"[[1]])
# Source: spark<?> [?? x 1]
  C
  <list>
1 <dbl [2]>
2 <dbl [2]>
> dfnew %>% select("C"[[1]][1])
# Source: spark<?> [?? x 1]
  C
  <list>
1 <dbl [2]>
2 <dbl [2]>
> dfnew %>% select("C"[[1]][[1]])
# Source: spark<?> [?? x 1]
  C
  <list>
1 <dbl [2]>
2 <dbl [2]>
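As far as I can tell, these fail because the arguments never reach Spark at all: "C"[1], "C"[[1]] and friends are ordinary R subsetting on the character vector "C", each evaluates locally to the string "C", and so every call collapses to plain select(C). A hedged one-step alternative, assuming the same [[ ]] translation that worked with "mutate" above, is "transmute", which computes the new column and drops everything else:

dfnew %>% transmute(myfirst_element = C[[1]]) ## EXTRACT AND SELECT IN ONE STEP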
I've also tried using "sdf_select" (from the sparklyr.nested package), without success:
> dfnew %>% sdf_select("C"[[1]][1])
# Source: spark<?> [?? x 1]
  C
  <list>
1 <dbl [2]>
2 <dbl [2]>
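If nothing in the dplyr layer cooperates, one fallback is hand-written Spark SQL through sparklyr's DBI interface. A hedged sketch; the view name "dfnew_view" is made up for illustration, and note the zero-based subscript on the Spark side:

library(DBI)
sdf_register(dfnew, "dfnew_view") ## REGISTER AS A TEMP VIEW
dbGetQuery(sc, "SELECT C[0] AS myfirst_element FROM dfnew_view")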
In PySpark you can access the elements explicitly, e.g. col("C")[1]; in Scala you can use getItem or element_at; and in SparkR you can also use element_at. But does anyone know a solution in a sparklyr setting? Thanks in advance for any help.
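Update: given the pass-through behaviour above, element_at (a Spark >= 2.4 builtin, one-based unlike the zero-based [] subscript) looks like it should also work inside "mutate" or "transmute", though I haven't verified it:

dfnew %>% transmute(myfirst_element = element_at(C, 1L)) ## ONE-BASED INDEXING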