tidyr separate only first n instances

Question

I have a data.frame in R, which, for simplicity, has one column that I want to separate. It looks like this:

V1
Value_is_the_best_one
This_is_the_prettiest_thing_I've_ever_seen
Here_is_the_next_example_of_what_I_want

My real data is very large (millions of rows), so I'd like to use tidyr's separate function (because it's amazingly fast) to separate out JUST the first few instances. I'd like the result to be the following:

V1       V2     V3     V4 
Value    is     the    best_one
This     is     the    prettiest_thing_I've_ever_seen
Here     is     the    next_example_of_what_I_want

As you can see, the separator is _ the V4 column can have different numbers of the separators. I want to keep V4 (not discard it), but not have to worry about how much stuff is in there. There will always be four columns (i.e. none of my rows have only V1-V3).

Here is my starting tidyr command I've been working with:

separate(df, V1, c("V1", "V2", "V3", "V4"), sep="_")

This gets rid of V4 (and spits out warnings, which isn't the biggest deal).

@aosmith Yes, thank you. I read the documentation like 10 times and somehow kept misunderstanding that! Please put it into an answer! — Gaius Augustus, May 09 '16 at 22:58

score 34 · Accepted Answer · answered May 09 '16 at 23:09

34

You need the extra argument with the "merge" option. This allows only as many splits as you have new columns defined.

separate(df, V1, c("V1", "V2", "V3", "V4"), extra = "merge")

     V1 V2  V3                             V4
1 Value is the                       best_one
2  This is the prettiest_thing_I've_ever_seen
3  Here is the    next_example_of_what_I_want

answered May 09 '16 at 23:09

aosmith

34,856
9
84
118

What if you want to merge the other way? eg, let's say that you've got "John Q Public." I want to break it into two strings: "John Q" and "Public." Is there a straightforward way to do that, other than manually splitting and subsetting? – David Bruce Borenstein Jun 30 '16 at 15:04
1

@DavidBruceBorenstein It sounds like you need to set the `sep` argument so you only split on the last space. – aosmith Jun 30 '16 at 15:16

score 6 · Answer 2 · answered May 10 '16 at 03:42

Here is another option with extract

library(tidyr)
extract(df1, V1, into = paste0("V", 1:4), "([^_]+)_([^_]+)_([^_]+)_(.*)")
#      V1 V2  V3                             V4
# 1 Value is the                       best_one
# 2  This is the prettiest_thing_I've_ever_seen
# 3  Here is the    next_example_of_what_I_want

Another option is stri_split from library(stringi) where we can specify the number of splits

library(stringi)
do.call(rbind, stri_split(df1$V1, fixed="_", n=4))

tidyr separate only first n instances

2 Answers2

Linked

Related