Splitting strings into number and string (with missings)

Question

I am trying to separate numbers and characters in a column of strings. So far I have been using tidyr::separate for doing this, but am encountering errors for "unusual" cases.

Suppose I have the following data

df <- data.frame(c1 = c("5.5K", "2M", "3.1", "M"))

And I want to obtain a data frame with columns

data.frame(c2 = c("5.5", "2", "3.1", NA),
c3 = c("K", "M", NA, "M))

So far I have been using tidyr::separate

df %>%
separate(c1, into =c("c2", "c3"), sep = "(?<=[0-9])(?=[A-Za-z])")

But this only works for the first three cases. I realize this is because ?<=... and ?=... require the presence of the regex. How would one modify this code to capture the cases where the numbers are missing before the letters? Been trying to use the extract function too, but without success.

Edit: I suppose one solution is to break this up into

df$col2 <- as.numeric(str_extract(df$col1, "[0-9]+"))
df$col3 <- (str_extract(df$col1, "[aA-zZ]+"))

But I was curious whether were other ways to handle it.

VicaYang · Answer 1 · 2019-04-16T04:34:37.420

2

extract(df, c1, into =c("c2", "c3"), "([\\.\\d]*)([a-zA-Z]*)")
#    c2 c3
# 1 5.5  K
# 2   2  M
# 3 3.1   
# 4      M

You can use seperate simply in this way, but there should be a more elegant method..

df %>% separate(c1, into =c("c2", "c3"), sep = "(?=[A-Za-z])")
#    c2   c3
# 1 5.5    K
# 2   2    M
# 3 3.1 <NA>
# 4        M

edited Apr 16 '19 at 04:34

answered Apr 16 '19 at 04:23

VicaYang

544
3
17

Thank you! This was great help! – user11151932 Apr 17 '19 at 20:17

score 1 · Answer 2 · answered Apr 16 '19 at 04:31

1

We can use base R sub to remove characters and numbers respectively to get different columns.

df$c2 <- sub("[A-Za-z]+", "", df$c1)
df$c3 <- sub("\\d*\\.?\\d*", "", df$c1)

df
#    c1  c2 c3
#1 5.5K 5.5  K
#2   2M   2  M
#3  3.1 3.1   
#4    M      M

You can remove c1 column if not needed later by doing df$c1 <- NULL.

answered Apr 16 '19 at 04:31

Ronak Shah

377,200
20
156
213

Thank you! I ended up doing something very similar! – user11151932 Apr 17 '19 at 20:17

score 0 · Answer 3 · answered Apr 16 '19 at 05:36

You can also use regex grouping \1 and \2. This is very similar to and adapted from @Ronak Shah's answer but with regex grouping

# data
df <- data.frame(c1 = c("5.5K", "2M", "3.1", "M"))

# keep only numeric
df$c2 <- sub("(\\d*\\.?\\d*)([A-Za-z]*)", "\\1", df$c1)

# keep only alphabets
df$c3 <- sub("(\\d*\\.?\\d*)([A-Za-z]*)", "\\2", df$c1)
df[df == ""] = NA

df
#>     c1   c2   c3
#> 1 5.5K  5.5    K
#> 2   2M    2    M
#> 3  3.1  3.1 <NA>
#> 4    M <NA>    M

^{Created on 2019-04-16 by the reprex package (v0.2.1)}

akrun · Answer 4 · 2019-04-16T05:48:02.030

0

We can use extract from tidyr

library(tidyr)
extract(df, c1, into = c("c2", "c3"), "^([0-9.]*)([A-Z]*)",
        convert = TRUE, remove = FALSE)
#    c1  c2 c3
#1 5.5K 5.5  K
#2   2M 2.0  M
#3  3.1 3.1   
#4    M  NA  M

Or with read.csv from base R

read.csv(text= sub("^([0-9.]*)", "\\1,", df$c1), 
   header = FALSE, stringsAsFactors = FALSE, col.names = c("c2", "c3"))

edited Apr 16 '19 at 05:48

answered Apr 16 '19 at 05:42

akrun

874,273
37
540
662

score 0 · Answer 5 · answered Oct 08 '19 at 15:22

0

You could use the package unglue :

df <- data.frame(c1 = c("5.5K", "2M", "3.1", "M"))

library(unglue)
unglue_unnest(df, c1, "{c2}{c3=\\D*}", convert = TRUE)
#>    c2 c3
#> 1 5.5  K
#> 2 2.0  M
#> 3 3.1   
#> 4  NA  M

answered Oct 08 '19 at 15:22

moodymudskipper

46,417
11
121
167

Splitting strings into number and string (with missings)

5 Answers5