6

I am trying to separate numbers and characters in a column of strings. So far I have been using tidyr::separate for doing this, but am encountering errors for "unusual" cases.

Suppose I have the following data

df <- data.frame(c1 = c("5.5K", "2M", "3.1", "M"))

And I want to obtain a data frame with columns

data.frame(c2 = c("5.5", "2", "3.1", NA),
c3 = c("K", "M", NA, "M))

So far I have been using tidyr::separate

df %>%
separate(c1, into =c("c2", "c3"), sep = "(?<=[0-9])(?=[A-Za-z])")

But this only works for the first three cases. I realize this is because ?<=... and ?=... require the presence of the regex. How would one modify this code to capture the cases where the numbers are missing before the letters? Been trying to use the extract function too, but without success.

Edit: I suppose one solution is to break this up into

df$col2 <- as.numeric(str_extract(df$col1, "[0-9]+"))
df$col3 <- (str_extract(df$col1, "[aA-zZ]+"))

But I was curious whether were other ways to handle it.

Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
user11151932
  • 223
  • 1
  • 5

5 Answers5

2
extract(df, c1, into =c("c2", "c3"), "([\\.\\d]*)([a-zA-Z]*)")
#    c2 c3
# 1 5.5  K
# 2   2  M
# 3 3.1   
# 4      M

You can use seperate simply in this way, but there should be a more elegant method..

df %>% separate(c1, into =c("c2", "c3"), sep = "(?=[A-Za-z])")
#    c2   c3
# 1 5.5    K
# 2   2    M
# 3 3.1 <NA>
# 4        M
VicaYang
  • 544
  • 3
  • 17
1

We can use base R sub to remove characters and numbers respectively to get different columns.

df$c2 <- sub("[A-Za-z]+", "", df$c1)
df$c3 <- sub("\\d*\\.?\\d*", "", df$c1)

df
#    c1  c2 c3
#1 5.5K 5.5  K
#2   2M   2  M
#3  3.1 3.1   
#4    M      M

You can remove c1 column if not needed later by doing df$c1 <- NULL.

Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
0

You can also use regex grouping \1 and \2. This is very similar to and adapted from @Ronak Shah's answer but with regex grouping

# data
df <- data.frame(c1 = c("5.5K", "2M", "3.1", "M"))

# keep only numeric
df$c2 <- sub("(\\d*\\.?\\d*)([A-Za-z]*)", "\\1", df$c1)

# keep only alphabets
df$c3 <- sub("(\\d*\\.?\\d*)([A-Za-z]*)", "\\2", df$c1)
df[df == ""] = NA

df
#>     c1   c2   c3
#> 1 5.5K  5.5    K
#> 2   2M    2    M
#> 3  3.1  3.1 <NA>
#> 4    M <NA>    M

Created on 2019-04-16 by the reprex package (v0.2.1)

cropgen
  • 1,920
  • 15
  • 24
0

We can use extract from tidyr

library(tidyr)
extract(df, c1, into = c("c2", "c3"), "^([0-9.]*)([A-Z]*)",
        convert = TRUE, remove = FALSE)
#    c1  c2 c3
#1 5.5K 5.5  K
#2   2M 2.0  M
#3  3.1 3.1   
#4    M  NA  M

Or with read.csv from base R

read.csv(text= sub("^([0-9.]*)", "\\1,", df$c1), 
   header = FALSE, stringsAsFactors = FALSE, col.names = c("c2", "c3"))
akrun
  • 874,273
  • 37
  • 540
  • 662
0

You could use the package unglue :

df <- data.frame(c1 = c("5.5K", "2M", "3.1", "M"))

library(unglue)
unglue_unnest(df, c1, "{c2}{c3=\\D*}", convert = TRUE)
#>    c2 c3
#> 1 5.5  K
#> 2 2.0  M
#> 3 3.1   
#> 4  NA  M
moodymudskipper
  • 46,417
  • 11
  • 121
  • 167