
I have a data.frame in R which, for simplicity, has one column that I want to separate. The following sample snippet, using tidyr::separate, almost does the job:

 library(dplyr)  # provides %>%
 library(tidyr)  # provides separate()

 tmp2 <- data.frame( varTreatName = c(
   "resp_Nadd_belowCanopy", "resp_NPadd_belowCanopy"
   , "resp_sd_Nadd_belowCanopy", "resp_sd_NPadd_belowCanopy"))
 tmp2 %>% separate(
    "varTreatName", c("varName","treatment","canopyPosition")
    , extra = "merge")

which yields:

  varName treatment    canopyPosition
1    resp      Nadd       belowCanopy
2    resp     NPadd       belowCanopy
3    resp        sd  Nadd_belowCanopy
4    resp        sd NPadd_belowCanopy

Here, the surplus pieces are merged into the last column. In my case, however, it is the first component, varName, that can contain the delimiter (as in 'resp_sd'), and it is the same delimiter that separates the remaining factors (treatment and canopyPosition). Yet the merge occurs only on the last pieces.

Hence, in the last line of the example above I expect to extract: 'resp_sd', 'NPadd', 'belowCanopy'.

How can I merge the first instances instead of the last ones in order to separate only the last n instances?

Thomas Wutzler

2 Answers

4

While screening similar, already answered questions, I discovered tidyr::extract in this answer, which can be used to do the job:

 tmp2 %>% extract(
   "varTreatName", c("varName","treatment","canopyPosition")
   , regex = "(.*)_([^_]+)_([^_]+)$")

yielding the expected result:

  varName treatment canopyPosition
1    resp      Nadd    belowCanopy
2    resp     NPadd    belowCanopy
3 resp_sd      Nadd    belowCanopy
4 resp_sd     NPadd    belowCanopy
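
As far as I understand it, the regex works because `(.*)` is greedy: it takes as much as it can while still leaving two underscore-free pieces (`[^_]+`) for the last two groups. The following is a minimal sketch in base R that checks this on a single string and builds the pattern for the last n pieces. `lastn_regex` is a hypothetical helper name, not part of tidyr, and it assumes a single-character delimiter that has no special meaning inside a character class.

```r
# Hypothetical helper (not part of tidyr): build a regex whose first group
# keeps everything before the last n delimiter-separated pieces.
# Assumption: delim is a single character that is safe inside [^...].
lastn_regex <- function(n, delim = "_") {
  piece <- paste0(delim, "([^", delim, "]+)")
  paste0("^(.*)", strrep(piece, n), "$")
}

rx <- lastn_regex(2)
rx
# "^(.*)_([^_]+)_([^_]+)$"

x <- "resp_sd_NPadd_belowCanopy"
# regexec() returns the full match followed by the capture groups;
# [-1] drops the full match and keeps only the groups.
parts <- regmatches(x, regexec(rx, x))[[1]][-1]
parts
# "resp_sd" "NPadd" "belowCanopy"
```

The resulting string can be passed as the `regex` argument of `extract`, with one name in `into` per capture group.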
Thomas Wutzler
2

tidyr::separate takes regular expressions, so you can also do something like this:

library(dplyr)
library(tidyr)

tmp2 %>% 
  separate("varTreatName", c("varName","treatment","canopyPosition"),
           sep = "_(?!s)", extra = "merge")

Result:

  varName treatment canopyPosition
1    resp      Nadd    belowCanopy
2    resp     NPadd    belowCanopy
3 resp_sd      Nadd    belowCanopy
4 resp_sd     NPadd    belowCanopy
acylam
  • Thanks for this answer. Could you please explain how the regex for the separator works? In a microbenchmark of the example, the extract-based solution was about one third faster. – Thomas Wutzler May 18 '18 at 13:05
  • @ThomasWutzler `separate` uses a regular expression for the `sep` argument to split columns on. `_(?!s)` means a literal "_" that is not followed by an "s". So I am splitting on every underscore except the one inside `resp_sd`, because an "s" follows that underscore. – acylam May 18 '18 at 13:44
  • @ThomasWutzler I think `extract` is faster because it is only one match, while `separate` has multiple matches to search for. – acylam May 18 '18 at 13:46
  • Thanks @useR for the explanation of the regular expression. I see that it's very specific to the pattern that follows the separator in the example. – Thomas Wutzler May 20 '18 at 04:15
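
To illustrate the point made in the comments, here is a small base R sketch of the negative lookahead using `strsplit` with `perl = TRUE`. The third input string, with a canopy position `sunlit`, is made up for illustration and does not occur in the question's data; it shows how the split goes wrong as soon as some other piece starts with an "s".

```r
x <- c("resp_Nadd_belowCanopy",
       "resp_sd_NPadd_belowCanopy",
       "resp_sd_Nadd_sunlit")   # hypothetical value, not from the question

# Split on every "_" that is NOT followed by "s" (negative lookahead).
pieces <- strsplit(x, "_(?!s)", perl = TRUE)

pieces[[1]]
# "resp" "Nadd" "belowCanopy"
pieces[[2]]
# "resp_sd" "NPadd" "belowCanopy"
pieces[[3]]
# "resp_sd" "Nadd_sunlit"   <- the underscore before "sunlit" is skipped too
```

A regex anchored at the end of the string, as in the `extract` answer, does not depend on which letters follow the delimiter.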