
I am having trouble splitting a column's values when the elements contain different numbers of space-separated strings. I can do it with plyr, e.g.:

library(plyr)
column <- c("jake", "jane jane","john john john")
df <- data.frame(1:3, name = column)
df$name <- as.character(df$name)
df2 <- ldply(strsplit(df$name, " "), rbind)
View(df2)

As a result, we get a data frame whose number of columns equals the maximum number of strings in any element.

When I tried to do the same in dplyr, I used the do() function:

library(dplyr)
df2 <- df %>%
  do(data.frame(strsplit(.$name, " ")))

but I get an error:

Error in data.frame("jake", c("jane", "jane"), c("john", "john", "john" : 
arguments imply differing number of rows: 1, 2, 3

It seems to me that rbind should be used somewhere, but I do not know where.

Jaap
Nicolabo

1 Answer


You're having trouble because strsplit() returns a list whose elements have different lengths (1, 2, and 3 here), and data.frame() cannot combine those into columns. We would need to apply as.data.frame.list() to each element to get it into the format that dplyr requires, and even then it would take a bit more work to get usable results. Long story short, it doesn't seem like a suitable operation for do().
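To make the shape problem concrete, here is a minimal base R sketch (my own illustration, not part of the question's code) that pads each piece of the strsplit() result with NA up to the longest length and only then row-binds. The names `parts`, `n`, `padded`, and `wide` are just placeholders.

```r
column <- c("jake", "jane jane", "john john john")

# strsplit() returns a list; its elements have different lengths
parts <- strsplit(column, " ")
lengths(parts)  # 1 2 3

# Pad every element with NA up to the longest length
n <- max(lengths(parts))
padded <- lapply(parts, function(p) {
  length(p) <- n  # `length<-` fills the extra slots with NA
  p
})

# Now every row has the same width, so rbind() works
wide <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
wide
```

This is roughly the padding that `ldply(..., rbind)` takes care of in the plyr version.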

I think you might be better off using separate() from tidyr. It can easily be used with dplyr functions and chains. It's not clear whether you want to keep the first column since your ldply result for df2 does not have it, so I left it off.

library(tidyr)
# `into` is given as a character vector; recent tidyr versions reject a numeric one
separate(df[-1], name, as.character(1:3), " ", extra = "merge")
#      1    2    3
# 1 jake <NA> <NA>
# 2 jane jane <NA>
# 3 john john john

You could also use cSplit() from the splitstackshape package. It is also very efficient since it relies on data.table.

library(splitstackshape)
cSplit(df[-1], "name", " ")
#    name_1 name_2 name_3
# 1:   jake     NA     NA
# 2:   jane   jane     NA
# 3:   john   john   john

Or, to get the same column names as in the separate() result:

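library(data.table)  # setnames() comes from data.table
df2 <- cSplit(df[-1], "name", " ")
# Rename the columns to "1", "2", "3" by reference
setnames(df2, as.character(seq_along(df2)))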
df2
#       1    2    3
# 1: jake   NA   NA
# 2: jane jane   NA
# 3: john john john
Rich Scriven
    OK, thanks a lot. But what if we do not know how many strings a given element of the column contains? – Nicolabo Dec 01 '14 at 22:55
    If you don't know how many columns there will be, then I would use `cSplit` because it does that work for you. Nice first question by the way. Clearly asked and reproducible. +1 – Rich Scriven Dec 01 '14 at 22:58
    @Nicolabo, you could first use `stringr::str_count` to determine the maximum number of columns you'd need and then use `tidyr::separate`. Something like this: `len = max(str_count(string = df$name, pattern = " "));` `vec_names = paste0("X", 1:(len + 1));` `separate(df[-1], name, vec_names, " ", extra = "merge");` – steadyfish Sep 27 '15 at 18:20
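
For readability, the idea in the last comment can be written out as a short script. This is only a sketch under a few assumptions: the data frame is rebuilt here with a hypothetical `id` column so the block runs on its own, the names `n_cols` and `col_names` are placeholders, and `fill = "right"` is my addition to pad short rows with NA without a warning.

```r
library(tidyr)
library(stringr)

df <- data.frame(
  id = 1:3,  # hypothetical id column, stands in for the unnamed first column
  name = c("jake", "jane jane", "john john john"),
  stringsAsFactors = FALSE
)

# Number of pieces = number of separators + 1, taken over the longest element
n_cols <- max(str_count(df$name, " ")) + 1
col_names <- paste0("X", seq_len(n_cols))

# Split into as many columns as the longest element needs
df2 <- separate(df["name"], name, into = col_names, sep = " ", fill = "right")
df2
```

Because the column count is computed from the data, the same code should work unchanged if a later row contains four or more strings.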