I am using sparklyr for a project. I have a Spark DataFrame with lists in some of the columns, and I'd like to separate them into multiple rows, i.e. have one value in each row, exactly like separate_rows does in tidyr.
So basically my dataframe is like this:
  | x     | y
1 | [a,b] | [c,d]
And I'd like to have something like this in the end:
  | x | y
1 | a | c
2 | b | d
As suggested in this post, explode is a good start, but it only handles one column at a time; if I use it twice, I end up with 4 rows here instead of the 2 I want. In this very simple example I could manage to keep only the rows I want, but things get messier if there are more than two elements in the lists...
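To make the problem concrete, here is a minimal local sketch in plain R (not Spark), assuming tidyr 1.0 or later is installed. The tibble `df` is a hypothetical stand-in for the Spark DataFrame, and `unnest` only plays the role of explode here; it is not the sparklyr call.

```r
library(tidyr)

# Local (non-Spark) stand-in for the example: one row, two list columns.
df <- tibble::tibble(x = list(c("a", "b")), y = list(c("c", "d")))

# Unnesting one column after the other pairs every x with every y: 4 rows.
crossed <- unnest(unnest(df, cols = x), cols = y)

# Unnesting both columns together keeps elements aligned by position: 2 rows.
paired <- unnest(df, cols = c(x, y))

print(crossed)
print(paired)
```

The second result is the behaviour I am after on the Spark side.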
Something I thought about would be to:
1) Merge the columns x and y into a single column which would contain [[a,c], [b,d]]
2) Then use explode to have [a,c] and then [b,d]
3) Then explode again, but into columns (rather than into rows).
Only I don't know how to do 1) and 3).
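For reference, the same three steps on ordinary R vectors would look roughly like the base R sketch below. The names `pairs`, `rows` and `result` are only illustrative, and none of this runs on Spark; the Spark equivalent of steps 1) and 3) is the part I am missing.

```r
# Plain R vectors standing in for the contents of one row.
x <- c("a", "b")
y <- c("c", "d")

# Step 1: merge x and y element-wise into a list of pairs,
# i.e. list(c("a", "c"), c("b", "d")).
pairs <- unname(Map(c, x, y))

# Step 2: turn each pair into its own row (a 2 x 2 character matrix).
rows <- do.call(rbind, pairs)

# Step 3: split each pair back into separate columns.
result <- data.frame(x = rows[, 1], y = rows[, 2], stringsAsFactors = FALSE)

print(result)
```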
Thank you for the help!
Here is a reproducible example obtained with collect and dput:
structure(list(ref_amount = list(list(967.66, 1592.56), list(
967.66, 1592.56)), ref_theta = list(list(5.26977034898459,
5.16119062369122), list(5.26977034898459, 5.16119062369122))), .Names = c("ref_amount",
"ref_theta"), row.names = c(NA, -2L), class = c("tbl_df", "tbl",
"data.frame"))
– Vincent Jul 26 '18 at 17:07