Split columns considering only the first dot in R using separate

Question

This is my dataframe:

df <- tibble(col1 = c("1. word","2. word","3. word","4. word","5. N. word","6. word","7. word","8. word"))

I need to split it into two columns using the separate function, one called Numbers and the other called Words. I've been trying this, but it's not working:

df %>% separate(col = col1 , into = c('Number','Words'), sep = "^. ")

The problem is that the fifth row has two dots, and I don't know how to handle this in the regex.

Any help?

Do you want to keep `N.word` or just `word` for 5? – NelsonGon Jan 01 '22 at 19:45

score 4 · Answer 1 · answered Jan 01 '22 at 20:01

Here is an alternative using readr's parse_number and a regex:

library(dplyr)
library(readr)
df %>% 
  mutate(Numbers = parse_number(col1), .before=1) %>% 
  mutate(col1 = gsub('\\d+\\. ','',col1))

  Numbers col1   
    <dbl> <chr>  
1       1 word   
2       2 word   
3       3 word   
4       4 word   
5       5 N. word
6       6 word   
7       7 word   
8       8 word
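
The gsub call above is unanchored, so it removes every "digits, dot, space" sequence it finds, not only the leading numbering. That makes no difference for the question's data. The sketch below uses a hypothetical second input (not from the question) to show where an anchored sub would behave differently:

```r
# Hypothetical input: the text itself contains another "<digits>. " sequence
x <- c("5. N. word", "9. see item 10. word")

# Unanchored gsub() removes every "<digits>. " it finds
stripped_all <- gsub("\\d+\\. ", "", x)

# Anchored sub() removes only the numbering at the start of the string
stripped_first <- sub("^\\d+\\. ", "", x)

print(stripped_all)
print(stripped_first)
```

If the text after the number can itself contain numbered items, the anchored form is probably the safer choice.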

score 3 · Accepted Answer · answered Jan 01 '22 at 19:43

A tidyverse approach would be to first clean the data then separate.

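library(dplyr)
df %>% 
  # Drop everything from the first space up to "word", e.g. "5. N. word" becomes "5.word"
  mutate(col1 = gsub("\\s.*(?=word)", "", col1, perl = TRUE)) %>% 
  tidyr::separate(col1, into = c("Number", "Words"), sep = "\\.")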

Result:

# A tibble: 8 x 2
  Number Words
  <chr>  <chr>
1 1      word 
2 2      word 
3 3      word 
4 4      word 
5 5      word 
6 6      word 
7 7      word 
8 8      word
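
If the "N." should be kept instead, separate on its own may be enough. This is a minimal sketch, assuming a tidyr version whose separate has the extra argument. The three-row tibble is a shortened stand-in for the question's data:

```r
library(tidyr)
library(tibble)

# Shortened stand-in for the question's data frame
df <- tibble(col1 = c("1. word", "5. N. word", "8. word"))

# sep is a regex: a literal dot followed by a space.
# extra = "merge" splits at most length(into) times, so everything after
# the first separator stays together in the last column.
out <- separate(df, col = col1, into = c("Number", "Words"),
                sep = "\\. ", extra = "merge")
print(out)
```

Both columns come back as character. Adding convert = TRUE should turn Number into an integer column.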

score 3 · Answer 3 · answered Jan 01 '22 at 19:44

I'm assuming that you would like to keep the cumbersome "N." in the result. For that, my advice is to use extract instead of separate:

df %>% 
  extract(
    col = col1 ,
    into = c('Number','Words'), 
    regex = "([0-9]+)\\. (.*)")

The regular expression ([0-9]+)\\. (.*) means that you are looking first for a number, that you want to put in a first column, followed by a dot and a space (\\. ) that should be discarded, and the rest should go in a second column.

The result:

# A tibble: 8 × 2
  Number Words  
  <chr>  <chr>  
1 1      word   
2 2      word   
3 3      word   
4 4      word   
5 5      N. word
6 6      word   
7 7      word   
8 8      word
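
To see what each capture group picks up, the same pattern can be tried in base R. This is only an illustrative sketch on two of the question's values. The variable names are made up for the example:

```r
x <- c("1. word", "5. N. word")
pattern <- "([0-9]+)\\. (.*)"

# regexec() locates the full match and each capture group;
# regmatches() extracts them as character vectors
groups <- regmatches(x, regexec(pattern, x))

# Element 1 is the full match, 2 is the number, 3 is the remainder
number <- sapply(groups, `[`, 2)
words  <- sapply(groups, `[`, 3)
print(data.frame(Number = number, Words = words))
```

Because the dot and space sit outside the parentheses, they are matched but never captured, which is why they disappear from the output.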

score 3 · Answer 4 · answered Jan 01 '22 at 22:38

Try read.table + sub

> read.table(text = sub("\\.", ",", df$col1), sep = ",")
  V1       V2
1  1     word
2  2     word
3  3     word
4  4     word
5  5  N. word
6  6     word
7  7     word
8  8     word
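
Since only the dot is replaced, the space after it stays in the text, so the second column may carry a leading space. A possible variant, assuming the text contains no other commas, trims that space and names the columns while reading:

```r
# Shortened stand-in for df$col1
col1 <- c("1. word", "5. N. word", "8. word")

# Replace only the first dot with a comma, then parse as comma-separated text.
# strip.white = TRUE trims the space that followed the dot;
# col.names gives the columns the names asked for in the question.
out <- read.table(text = sub("\\.", ",", col1), sep = ",",
                  strip.white = TRUE, stringsAsFactors = FALSE,
                  col.names = c("Number", "Words"))
print(out)
```

Here read.table also converts the first column to integer, so no separate conversion step is needed.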

ThomasIsCoding

score 2 · Answer 5 · Dion Groothof · edited Jan 01 '22 at 20:08

I am not sure how to do this with tidyr, but the following should work with base R.

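# Remove the literal "N. " (fixed = TRUE keeps "." from acting as a regex wildcard)
df$col1 <- gsub('N. ', '', df$col1, fixed = TRUE)
# Split on the space: piece 1 is "1.", "2.", ...; as.numeric("1.") is 1
df$Numbers <- as.numeric(sapply(strsplit(df$col1, ' '), '[', 1))
df$Words <- sapply(strsplit(df$col1, ' '), '[', 2)
df$col1 <- NULL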

Result

> head(df)
  Numbers Words
1       1  word
2       2  word
3       3  word
4       4  word
5       5  word
6       6  word

answered Jan 01 '22 at 19:41