Separate string column by row efficiently

Question

I'm trying to separate a string column into two pieces based on chopping up the string. It's best illustrated with example below. rowwise does work, but given the size of the data.frame, I'd like to use a more efficient method. How can I avoid using rowwise?

library(dplyr)
library(stringr)
library(tidyr)

#make data
a <- "(1, 10)"
b <- "(10, 20)"
c <- "(20, 30)"

df <- data.frame(size = c(a,b,c))

# Goal is to separate the 'size' column into 'lower' and 'upper' by
# extracting the value contained in the parens and split by a comma.
# Once the column is split into 'upper' and 'lower' I will perform 
# additional operations.

# DESIRED RESULT
  size     lower upper 
  <fct>    <chr> <chr> 
1 (1, 10)  1     10
2 (10, 20) 10    20
3 (20, 30) 20    30

# WHAT I HAVE TRIED

> #This works... but too inefficient
> df %>%
+   rowwise() %>%
+   mutate(lower = str_split(size, ",") %>% .[[1]] %>% .[1] %>%
+            str_split("\\(") %>% .[[1]] %>% .[2])
  size     lower
  <fct>    <chr>
1 (1, 10)  1    
2 (10, 20) 10   
3 (20, 30) 20   

> # I'm not sure why this doesn't work
> df %>%
+   mutate(lower = str_split(size, ",") %>% .[[1]] %>% .[1] %>%
+            str_split("\\(") %>% .[[1]] %>% .[2])
      size lower
1  (1, 10)     1
2 (10, 20)     1
3 (20, 30)     1

> #Not obivous how to use separate (tidyr)
> df %>%
+   separate(size, sep=",", c("lower", "upper"))
  lower upper
1    (1   10)
2   (10   20)
3   (20   30)

score 1 · Answer 1 · answered Apr 19 '18 at 16:19

1

For rowwise operations, I prefer data.table.

Try this

library(data.table)
library(stringi)

#make data
a <- "(1, 10)"
b <- "(10, 20)"
c <- "(20, 30)"

dt <- data.table(c(a,b,c))
dt[, lower := tstrsplit(V1, ",")[1]]
dt[, lower:= stri_replace_all_regex(lower, '\\(', '')]

dt

answered Apr 19 '18 at 16:19

svaradan

11
2

Nice answer, but you might want to show OP how to use this without data table in case they are unfamiliar with the syntax. (Also you have a typo, "tstrsplit".) – Gregor Thomas Apr 19 '18 at 16:22
I'm not too familiar with `data.table` but I will be stringing together a lot of `dplyr` via the `%>%` so if there is a native `dplyr` way to do it that would work better I think. Your solution does work though so I will look to see if I can make it work. – Lloyd Christmas Apr 19 '18 at 16:45

Gregor Thomas · Accepted Answer · 2018-04-19T16:55:16.443

1

You don't state your goal explicitly, but it seems like you want to extract the first number from a string. This is easy with stringi::str_extract_first_regex

library(stringi)
stri_extract_first_regex(df$size, "[0-9]+")
# [1] "1"  "10" "20"

So in your case,

df %>% mutate(lower = as.numeric(stri_extract_first_regex, size, "[0-9]+"))

You can extract all numbers with stri_extract_all_regex.

Based on your edits:

df$nums = str_extract_all(df$size, "[0-9]+")
df$lower = as.numeric(sapply(df$nums, `[[`, 1))
df$upper = as.numeric(sapply(df$nums, `[[`, 2))
df
#       size   nums lower upper
# 1  (1, 10)  1, 10     1    10
# 2 (10, 20) 10, 20    10    20
# 3 (20, 30) 20, 30    20    30

Another way to go is to get rid of the parens and whitespace and then use separate:

df %>%
    mutate(just_nums = str_replace_all(size, "[^0-9,]", "")) %>%
    separate(just_nums, into = c("lower", "upper"))
#       size lower upper
# 1  (1, 10)     1    10
# 2 (10, 20)    10    20
# 3 (20, 30)    20    30

The regex pattern "[^0-9,]" matches everything except numbers and commas.

edited Apr 19 '18 at 16:55

answered Apr 19 '18 at 16:20

Gregor Thomas

136,190
20
167
294

I updated my question with a more clear definition of what I want as a result. I can't figure out how to make what you propose here to work. – Lloyd Christmas Apr 19 '18 at 16:42
The second solution is exactly what I wanted, and it keeps it all neatly in `dplyr` for subsequent operations. (with the use of the `stringi` package) – Lloyd Christmas Apr 19 '18 at 17:06
This doesn't appear to work with decimals.. For example if the first row was `(0.1, 10)` this code returns 1 and 10, instead of 0.1 and 10. – Lloyd Christmas Apr 19 '18 at 20:00
You can modify the regex to include decimals. Make the replace pattern `"[^0-9,\\.]"`, or the extract pattern `"[0-9\\.]+"`. Alternatively, if your strings are as shown, you could make the replace pattern `"[\\(\\) ]"` to replace only parens and spaces. – Gregor Thomas Apr 19 '18 at 20:20

MKR · Answer 3 · 2018-04-19T18:25:00.977

1

An option is to use tidyr::separate after removing both ( and ) from the data.

library(tidyverse)
df %>% mutate(size = gsub("\\(|)","",size)) %>%  # Both ( and ) has been removed.
  separate(size, c("Min", "Max"), sep = ",")
#   Min Max
# 1   1  10
# 2  10  20
# 3  20  30

edited Apr 19 '18 at 18:25

answered Apr 19 '18 at 18:15

MKR

19,739
4
23
33

PKumar · Answer 4 · 2018-04-19T17:00:49.807

You are almost there. Here is my explanation for two approach, one is similar to yours:

In the first code, I have used unnest_tokens from tidytext package, which can split words on a different rows, since you want to extract the first item before the comma(I have assumed it basis your example, although you should mention it). I have choosen the first row basis this by using filter command.

In the second code, I have used the regex (note you can also use here str_replace also). Here I am using map(since the items returned by str_split is a list) to iterate the returned items and pass each by gsub, which can replace the regex matched with the back referencing items. Also to select only the first item, I have used [[1]] in the end of gsub.

library(tidyverse)
library(stringr)
library(tidytext)
df %>% 
    unnest_tokens(lower,size, token="words",drop=F) %>% 
    filter(row_number()%%2==T)

df %>% 
    mutate(lower = map(str_split(df$size, ","), function(x)gsub("\\((\\w+)","\\1",x)[[1]]))

Output:

   #       size lower
   # 1  (1, 10)     1
   # 2 (10, 20)    10
   # 3 (20, 30)    20

In case you want to extract both the terms before and after the commas, you can use extract function as well.

tidyr::extract(df, size, c("lower", "upper"), regex= "\\((\\w+),\\s+(\\w+)\\)")

Output:

  #   lower upper
   # 1     1    10
   # 2    10    20
   # 3    20    30

Separate string column by row efficiently

4 Answers4