Need to separate strings into multiple variables based on numeric versus non-numeric

Question

I have a data frame with one variable. It looks something like this:

df <- data.frame(c("25 Edgemont 52 Sioux County", "57 Burke 88 Papillion-LaVista South"))

To provide more context, each observation/row is a basketball game score. I would like to separate into four data frame columns that splits the numbers and team names up. So for example, the first row would end up as "25" in first column, "Edgemont" in second column, "52" in third column, and Sioux City in fourth column.

I've tried the below and various SO suggestions but can't get the desired results:

df2 <- strsplit(gsub("([0-9]*)([a-z]*)([0-9]*)([a-z]*)", "\\1 \\2 \\3 \\4", df), " ")

Try `library(stringr)` and then `str_extract_all(df[[1]], "\\d+|[^\\d\\s](?:\\D*[^\\d\\s])?")` — Wiktor Stribiżew, Jan 26 '20 at 00:15

G. Grothendieck · Accepted Answer · 2020-01-26T00:42:55.567

1) dplyr/tidyr Replace each number with a semicolon, that number and another semicolon and then separate on the semicolons plus optional surrounding whitespace.

library(dplyr)
library(tidyr)

# input
df <- data.frame(V1 = c("25 Edgemont 52 Sioux County", 
                        "57 Burke 88 Papillion-LaVista South"))

df %>%
  mutate(V1 = gsub("(\\d+)", ";\\1;", V1)) %>%
  separate(V1, c(NA, "No1", "Let1", "No2", "Let2"), sep = " *; *")
##   No1       Let1 No2                     Let2
## 1  25  Edgemont   52             Sioux County
## 2  57     Burke   88  Papillion-LaVista South

1a) read.table We can use the same gsub as in (1) but then separate it using read.table. No packages are used.

read.table(text = gsub("(\\d+)", ";\\1;", df$V1), sep = ";", as.is = TRUE,
  strip.white = TRUE, col.names = c(NA, "No1", "Let1", "No2", "Let2"))[-1]
##   No1     Let1 No2                    Let2
## 1  25 Edgemont  52            Sioux County
## 2  57    Burke  88 Papillion-LaVista South

2) strcapture We can use strcapture from base R:

proto <- list(No1 = integer(0), Let1 = character(0),
              No2 = integer(0), Let2 = character(0))
strcapture("(\\d+) (.*) (\\d+) (.*)", df$V1, proto)
##   No1     Let1 No2                    Let2
## 1  25 Edgemont  52            Sioux County
## 2  57    Burke  88 Papillion-LaVista South

2a) read.pattern We can use read.pattern with the same pattern as in (2):

library(gsubfn)

read.pattern(text = format(df$V1), pattern = "(\\d+) (.*) (\\d+) (.*)", 
  col.names = c("No1", "Let1", "No2", "Let2"), as.is = TRUE, strip.white = TRUE)
##   No1     Let1 No2                    Let2
## 1  25 Edgemont  52            Sioux County
## 2  57    Burke  88 Papillion-LaVista South

#1 (dplyr/tidyr/mutate/gsub) worked perfectly - thanks! – Jeff Swanson Jan 26 '20 at 14:42 — Jeff Swanson, Jan 26 '20 at 14:42

akrun · Answer 2 · 2020-01-26T00:37:41.070

1) One option is extract from tidyr where we extract one or more digits ((\\d+)) at the start (^) of the string as a capture group, followed by a space, then one ore more characters that are letters with space, followed by a space, then one or more digits in a capture group, followed by space and rest of the characters as the 4th column

library(stringr)
library(dplyr)
library(tidyr)
df %>% 
  extract(col1, into = str_c('col', 1:4),
           '^(\\d+) ([A-Za-z ]+) (\\d+) (.*)', convert  = TRUE)
#  col1     col2 col3                    col4
#1   25 Edgemont   52            Sioux County
#2   57    Burke   88 Papillion-LaVista South

2) Or with separate from tidyr where we specify a regex lookaround to split at the space

df %>% 
   separate(col1, into = str_c('col', 1:4), sep = '(?<=\\d) | (?=\\d)')
#  col1     col2 col3                    col4
#1   25 Edgemont   52            Sioux County
#2   57    Burke   88 Papillion-LaVista South

3) Or using tstrsplit from data.table

library(data.table)
setDT(df)[, tstrsplit(col1, "(?<=\\d) | (?=\\d)", perl = TRUE)]
#   V1       V2 V3                      V4
#1: 25 Edgemont 52            Sioux County
#2: 57    Burke 88 Papillion-LaVista South

4) Or using read.csv from base R (No packages are used ...)

read.csv(text = gsub("(?<=\\d) | (?=\\d)", ",", df$col1, 
          perl = TRUE), header = FALSE)
#  V1       V2 V3                      V4
#1 25 Edgemont 52            Sioux County
#2 57    Burke 88 Papillion-LaVista South

5) Or with strsplit from base R (No packages are used ...)

type.convert(as.data.frame(do.call(rbind, 
   strsplit(as.character(df$col1), "(?<=\\d) | (?=\\d)",
           perl = TRUE))), as.is = TRUE)
#  V1       V2 V3                      V4
#1 25 Edgemont 52            Sioux County
#2 57    Burke 88 Papillion-LaVista South

data

df <- data.frame(col1 = c("25 Edgemont 52 Sioux County", 
             "57 Burke 88 Papillion-LaVista South"))

Need to separate strings into multiple variables based on numeric versus non-numeric

2 Answers2

data