strsplit split on either or depending on

Question

Once again I'm struggling with strsplit. I'm transforming some strings to data frames, but there's a forward slash (/) and some whitespace in my strings that keep bugging me. I could work around it, but I'm eager to learn whether I can use some fancy either/or in strsplit. My working example below should illustrate the issue.

The strsplit function I'm currently using:

str_to_df <- function(string){
t(sapply(1:length(string), function(x) strsplit(string, "\\s+")[[x]])) }

One type of string I get:

string1 <- c('One\t58/2', 'Two 22/3', 'Three\t15/5')
str_to_df(string1)
#>      [,1]    [,2]  
#> [1,] "One"   "58/2"
#> [2,] "Two"   "22/3"
#> [3,] "Three" "15/5"

Another type I get in the same spot:

string2 <- c('One 58 / 2', 'Two 22 / 3', 'Three 15 / 5')
str_to_df(string2)
#>      [,1]    [,2] [,3] [,4]
#> [1,] "One"   "58" "/"  "2" 
#> [2,] "Two"   "22" "/"  "3" 
#> [3,] "Three" "15" "/"  "5"

They obviously create different outputs, and I can't figure out how to code a solution that works for both. Below is my desired outcome. Thank you in advance!

desired_outcome <- structure(c("One", "Two", "Three", "58", "22",
                               "15", "2", "3", "5"), .Dim = c(3L, 3L))
desired_outcome
#>      [,1]    [,2] [,3]
#> [1,] "One"   "58" "2" 
#> [2,] "Two"   "22" "3" 
#> [3,] "Three" "15" "5"

You can split on any run of non-word characters (anything that is not a letter, digit, or underscore): `t(simplify2array(strsplit(string1, '\\W+')))` – alistaire, Apr 23 '18 at 15:57
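One thing to keep in mind with `\\W+`, shown as a small sketch: it also splits on characters such as a decimal point, whereas an explicit character class splits only on what it lists. The input `x` below is a made-up example and does not come from the question.

```r
# Hypothetical input containing a decimal value (not from the question)
x <- "One 5.8 / 2"

# \\W+ treats every non-word character as a delimiter, including "."
by_nonword <- strsplit(x, "\\W+")[[1]]

# An explicit class splits only on slashes and whitespace
by_class <- strsplit(x, "[/[:space:]]+")[[1]]

print(by_nonword)  # elements: "One", "5", "8", "2"
print(by_class)    # elements: "One", "5.8", "2"
```

For the strings in the question both patterns give the same three pieces, so the choice only matters if the real data can contain other punctuation.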

kath · Answer 1 · 2018-04-23T15:38:55.263

This works:

str_to_df <- function(string){
  t(sapply(1:length(string), function(x) strsplit(string, "[/[:space:]]+")[[x]])) }

string1 <- c('One\t58/2', 'Two 22/3', 'Three\t15/5')
string2 <- c('One 58 / 2', 'Two 22 / 3', 'Three 15 / 5')

str_to_df(string1)
#      [,1]    [,2] [,3]
# [1,] "One"   "58" "2" 
# [2,] "Two"   "22" "3" 
# [3,] "Three" "15" "5"

str_to_df(string2)
#      [,1]    [,2] [,3]
# [1,] "One"   "58" "2" 
# [2,] "Two"   "22" "3" 
# [3,] "Three" "15" "5"

Another approach with tidyr could be:

library(tibble)  # as_tibble()
library(tidyr)   # separate(), and the %>% pipe

string1 %>% 
  as_tibble() %>% 
  separate(value, into = c("Col1", "Col2", "Col3"), sep = "[/[:space:]]+")

# A tibble: 3 x 3
#   Col1  Col2  Col3 
#   <chr> <chr> <chr>
# 1 One   58    2    
# 2 Two   22    3    
# 3 Three 15    5

You don't need `sapply`, since `strsplit` will return a list with an element for each input, so you can just use `simplify2array` (which is what `sapply` uses to simplify), so `t(simplify2array(strsplit(string, "[/[:space:]]+")))` — alistaire, Apr 23 '18 at 15:55
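A minimal sketch of that suggestion, using the two example vectors from the question. The name `str_to_mat` is made up for illustration, and the sketch assumes every string splits into the same number of pieces; with ragged input `simplify2array` would not return a matrix.

```r
string1 <- c('One\t58/2', 'Two 22/3', 'Three\t15/5')
string2 <- c('One 58 / 2', 'Two 22 / 3', 'Three 15 / 5')

# strsplit() is vectorised: it returns one list element per input string,
# so there is no need to loop over the indices.
str_to_mat <- function(string) {
  pieces <- strsplit(string, "[/[:space:]]+")
  # simplify2array() binds the pieces as columns; t() turns them into rows
  t(simplify2array(pieces))
}

m1 <- str_to_mat(string1)
m2 <- str_to_mat(string2)

print(m1)
print(identical(m1, m2))  # both inputs give the same 3 x 3 character matrix
```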

akrun · Accepted Answer · 2018-04-23T15:49:59.467

We can create a function that splits on one or more spaces, tabs, or forward slashes:

f1 <- function(str1) do.call(rbind, strsplit(str1, "[/\t ]+"))
f1(string1)
#    [,1]    [,2] [,3]
#[1,] "One"   "58" "2" 
#[2,] "Two"   "22" "3" 
#[3,] "Three" "15" "5" 

f1(string2)
#     [,1]    [,2] [,3]
#[1,] "One"   "58" "2" 
#[2,] "Two"   "22" "3" 
#[3,] "Three" "15" "5"

Or we can use read.csv after replacing each run of spaces, tabs, and slashes with a common delimiter (a comma):

read.csv(text=gsub("[\t/ ]+", ",", string1), header = FALSE)
#     V1 V2 V3
#1   One 58  2
#2   Two 22  3
#3 Three 15  5
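To see what `read.csv` actually receives, it can help to look at the intermediate strings. A small sketch reusing `string1` and `string2` from the question; the names `csv1`, `csv2` and `df` are only for illustration.

```r
string1 <- c('One\t58/2', 'Two 22/3', 'Three\t15/5')
string2 <- c('One 58 / 2', 'Two 22 / 3', 'Three 15 / 5')

# Every run of tabs, slashes, and spaces collapses into a single comma
csv1 <- gsub("[\t/ ]+", ",", string1)
csv2 <- gsub("[\t/ ]+", ",", string2)

print(csv1)                   # elements: "One,58,2", "Two,22,3", "Three,15,5"
print(identical(csv1, csv2))  # both inputs normalise to the same text

# read.csv() then parses the text and converts the column types
df <- read.csv(text = csv1, header = FALSE)
str(df)
```

Unlike the `strsplit` version, which returns a character matrix, this gives a data frame whose second and third columns are parsed as integers. Because a whole run of delimiters becomes one comma, an input such as `'Two / 3'` turns into `"Two,3"`, so a value missing before the slash does not leave an empty field behind.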

edited Apr 23 '18 at 15:49

answered Apr 23 '18 at 15:31


I really like your last solution, i.e. `read.csv(text=gsub("[\t/ ]+", ",", sold_chr), header = FALSE)`; it even handles cases where there are no numeric values, e.g. `string3 <- c('One 58 / ', 'Two / 3', 'Three 15 / 5')`. I am truly grateful. However, I also realized that I have instances with two or more words, that is, spaces between alphanumeric characters, e.g. `string4 <- c('Two / 3', 'Three 15 / 5', 'Four Cats / 5', 'Five B dogs 7 / 0')`. Could you possibly point me towards a resource that could help me solve this or, if it's not too demanding, suggest a solution? Thx – Eric Fail Apr 23 '18 at 17:18

@EricFail You could do `read.csv(text=sub("^([A-Za-z ]+)\\s(\\d*)\\s*[/]\\s*(\\d*)$", "\\1,\\2,\\3", string4), header = FALSE, fill = TRUE)` – akrun Apr 24 '18 at 01:36
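Spelled out on `string4`, the regex in the comment above works roughly like this. The variable names `pattern`, `csv4` and `df4` are introduced here only for readability.

```r
string4 <- c('Two / 3', 'Three 15 / 5', 'Four Cats / 5', 'Five B dogs 7 / 0')

# Group 1: the label (letters and spaces)
# Group 2: optional digits before the slash
# Group 3: optional digits after the slash
pattern <- "^([A-Za-z ]+)\\s(\\d*)\\s*[/]\\s*(\\d*)$"

# Rewrite each string as "label,before,after"; an empty group leaves an empty field
csv4 <- sub(pattern, "\\1,\\2,\\3", string4)
print(csv4)  # elements: "Two,,3", "Three,15,5", "Four Cats,,5", "Five B dogs,7,0"

# Empty fields in the numeric columns are read as NA
df4 <- read.csv(text = csv4, header = FALSE, fill = TRUE)
print(df4)
```

Because each capture group keeps its own position, a missing number stays in its own column as `NA` instead of shifting the remaining values to the left.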
