
I am using cSplit to split a column into three separate columns. The separator is " / ".

However, one of my fields contains an embedded "/". The third element of the third row should remain "f/j" after the split.

When I try it on the following example, it creates an extra (fourth) column:

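library(splitstackshape)  # provides cSplit()

name <- c("abc / efg / hij", "abc / abc / hij", "efg / efg / f/j", "abd / efj / hij")
y <- c(1, 1.2, 3.4, 5)

dt <- data.frame(name, y)
dt
# Splitting on a bare "/" also splits "f/j", which creates the fourth column
dt <- cSplit(dt, "name", "/", drop = FALSE)
dt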

When I try it in my original data set, which has over 5,000 lines, it produces the following error:

Error in fread(x, sep[i], header = FALSE):

Expecting 3 cols, but line 2307 contains text after processing all cols. Try again with fill=TRUE. Another reason could be that fread's logic in distinguishing one or more fields having embedded sep='/' and/or '\n' characters within unbalanced unescaped quotes has failed. If quote='' doesn't help, please file an issue to figure out if the logic could be improved.
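
A quick way to find rows like this in a larger data set is to count the pieces each value yields when split on a bare "/". This is a minimal base R sketch, not part of the original attempt: `name` is the sample vector from above standing in for the real column, and `n_pieces` and `bad_rows` are names introduced here for illustration.

```r
# Sample vector from the question, standing in for the real column
name <- c("abc / efg / hij", "abc / abc / hij", "efg / efg / f/j", "abd / efj / hij")

# Count the pieces each value yields when split on a literal "/"
n_pieces <- lengths(strsplit(name, "/", fixed = TRUE))

# Rows that do not yield exactly three pieces contain an embedded "/"
bad_rows <- which(n_pieces != 3)
bad_rows
name[bad_rows]
```

On the real data, the same check should point at the rows that trip up fread, such as the one reported as line 2307.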

Uwe Keim
TCS
  • Use the extra whitespace character surrounding the `/` you want to split on: `cSplit(dt,"name"," / ", drop=FALSE)`? – Abdou Sep 03 '17 at 17:03
  • Sorry, I should have included this in my original post. When I tried this on my data set, I got the following error message: Error in fread(x, sep[i], header = FALSE) : 'sep' must be 'auto' or a single character – TCS Sep 03 '17 at 17:53
  • `dt$name <- gsub("([^/]+)/([^/]+)/(.*)", "\\1_\\2_\\3", dt$name); cSplit(dt, "name", "_", drop=F)`? This replaces the target `/` characters with underscores and then splits on those underscores instead. – Abdou Sep 03 '17 at 18:02
  • It works. If you post it as an answer, I can accept it as the right one. – TCS Sep 03 '17 at 18:56

2 Answers


If your data is structured the same way as your name vector, you could use the following, which relies on the targeted / characters being surrounded by whitespace:

cSplit(dt,"name"," / ", drop=FALSE)

But as you mentioned, that has led to the following error:

Error in fread(x, sep[i], header = FALSE) : 'sep' must be 'auto' or a single character

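I have not pinned down the exact cause, but the message suggests that the separator is passed to fread, which accepts only a single character. A workaround is to replace the targeted / characters with an underscore (or any other character that does not occur in your data) and then split on the underscores. The following serves as an illustration: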

# Replace only the first two "/" in each value; any later "/" is left alone
dt$name <- gsub("([^/]+)/([^/]+)/(.*)", "\\1_\\2_\\3", dt$name)
cSplit(dt, "name", "_", drop = FALSE)

#           name   y name_1 name_2 name_3
# 1: abc_efg_hij 1.0    abc    efg    hij
# 2: abc_abc_hij 1.2    abc    abc    hij
# 3: efg_efg_f/j 3.4    efg    efg    f/j
# 4: abd_efj_hij 5.0    abd    efj    hij
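
Since each target is the literal string " / " (a slash with one space on either side), a fixed-string replacement may be a simpler alternative to the regular expression. This is only a sketch under that assumption: `name_us` is a name introduced here for illustration, and the underscore is assumed not to occur in the data, which the guard checks.

```r
# Sample vector from the question
name <- c("abc / efg / hij", "abc / abc / hij", "efg / efg / f/j", "abd / efj / hij")

# Guard: the placeholder character must not already occur in the data
stopifnot(!any(grepl("_", name, fixed = TRUE)))

# Replace only the literal " / "; the bare "/" inside "f/j" is untouched
name_us <- gsub(" / ", "_", name, fixed = TRUE)
name_us
```

The resulting vector can then be split on "_" with cSplit as above. If the spacing around the separator varies in the real data, this fixed replacement will miss those cases.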

I hope this helps.

Abdou

You should be able to just set fixed = FALSE:

cSplit(dt, "name", " / ", fixed = FALSE, drop = FALSE)
##               name   y name_1 name_2 name_3
## 1: abc / efg / hij 1.0    abc    efg    hij
## 2: abc / abc / hij 1.2    abc    abc    hij
## 3: efg / efg / f/j 3.4    efg    efg    f/j
## 4: abd / efj / hij 5.0    abd    efj    hij
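
If you want to check the result without the package, splitting on the multi-character separator can also be sketched in base R. This is a stand-in for illustration, not what cSplit does internally; `parts`, `mat` and `res` are names introduced here, and every value is assumed to contain exactly three fields.

```r
# Sample data from the question
name <- c("abc / efg / hij", "abc / abc / hij", "efg / efg / f/j", "abd / efj / hij")
y <- c(1, 1.2, 3.4, 5)

# Split on the literal three-character separator " / "
parts <- strsplit(name, " / ", fixed = TRUE)

# Assumption: every value has exactly three fields
stopifnot(all(lengths(parts) == 3))

# Bind the pieces row-wise and attach them to the original columns
mat <- do.call(rbind, parts)
colnames(mat) <- paste0("name_", 1:3)
res <- data.frame(name, y, mat, stringsAsFactors = FALSE)
res
```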
A5C1D2H2I1M1N2O1R2T1