5

This is my dataframe:

df <- tibble(col1 = c("1. word","2. word","3. word","4. word","5. N. word","6. word","7. word","8. word"))

I need to split it into two columns with the separate function, one called Numbers and the other called Words. I've tried this, but it's not working:

df %>% separate(col = col1 , into = c('Number','Words'), sep = "^. ")

The problem is that the fifth row has two dots, and I don't know how to handle that in the regex.

Any help?

Laura

5 Answers

4

Here is an alternative using readr's parse_number and a regex:

library(dplyr)
library(readr)
df %>% 
  mutate(Numbers = parse_number(col1), .before=1) %>% 
  mutate(col1 = gsub('\\d+\\. ','',col1))

  Numbers col1   
    <dbl> <chr>  
1       1 word   
2       2 word   
3       3 word   
4       4 word   
5       5 N. word
6       6 word   
7       7 word   
8       8 word   
TarJae
3

A tidyverse approach would be to first clean the data then separate.

df %>% 
  mutate(col1 = gsub("\\s.*(?=word)", "", col1, perl=TRUE)) %>% 
  tidyr::separate(col1, into = c("Number", "Words"), sep="\\.")

Result:

# A tibble: 8 x 2
  Number Words
  <chr>  <chr>
1 1      word 
2 2      word 
3 3      word 
4 4      word 
5 5      word 
6 6      word 
7 7      word 
8 8      word 
NelsonGon
3

I'm assuming that you would like to keep the cumbersome "N." in the result. For that, my advice is to use extract instead of separate:

df %>% 
  extract(
    col = col1 ,
    into = c('Number','Words'), 
    regex = "([0-9]+)\\. (.*)")

The regular expression ([0-9]+)\\. (.*) first matches one or more digits, which go into the first column, then a dot and a space (\\. ), which are discarded. Everything after that goes into the second column.

The result:

# A tibble: 8 × 2
  Number Words  
  <chr>  <chr>  
1 1      word   
2 2      word   
3 3      word   
4 4      word   
5 5      N. word
6 6      word   
7 7      word   
8 8      word 
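
To see what the two capture groups pick up without tidyr, the same pattern can be tried in base R. This is only a sketch: the vector x and the names m and parts are made up for illustration and are not part of the answer above.

```r
# Illustrative input: two of the strings from the question
x <- c("1. word", "5. N. word")

# regexec() returns the full match plus one entry per capture group
m <- regmatches(x, regexec("([0-9]+)\\. (.*)", x))

# Drop the full match (element 1) and keep group 1 and group 2
parts <- do.call(rbind, lapply(m, function(p) p[2:3]))
colnames(parts) <- c("Number", "Words")
parts
```

Because only the first dot-plus-space is consumed between the two groups, the second group keeps "N. word" intact for the fifth row.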
Vincent Guillemot
3

Try read.table + sub

> read.table(text = sub("\\.", ",", df$col1), sep = ",")
  V1       V2
1  1     word
2  2     word
3  3     word
4  4     word
5  5  N. word
6  6     word
7  7     word
8  8     word
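
This works because sub() replaces only the first match, so the second dot in "5. N. word" is left alone. Below is a small sketch of that difference, under two assumptions of mine: that no entry contains a comma of its own, and that the leading space left in V2 should be trimmed (hence strip.white = TRUE, which the call above does not use). The vector x, the name out, and the column names are illustrative.

```r
# Illustrative input: two of the strings from the question
x <- c("1. word", "5. N. word")

sub("\\.", ",", x)   # first dot only: "1, word" "5, N. word"
gsub("\\.", ",", x)  # every dot:      "1, word" "5, N, word"

# strip.white = TRUE trims the space that follows the comma
out <- read.table(text = sub("\\.", ",", x), sep = ",",
                  strip.white = TRUE, col.names = c("Number", "Words"))
out
```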
ThomasIsCoding
2

I am not sure how to do this with tidyr, but the following should work with base R.

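# Drop the literal "N. " (fixed = TRUE so the dot is not treated as a regex wildcard)
df$col1 <- gsub('N. ', '', df$col1, fixed = TRUE)
# First space-separated piece is the number ("1." converts to 1)
df$Numbers <- as.numeric(sapply(strsplit(df$col1, ' '), '[', 1))
# Second piece is the word
df$Words <- sapply(strsplit(df$col1, ' '), '[', 2)
df$col1 <- NULL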

Result

> head(df)
  Numbers Words
1       1  word
2       2  word
3       3  word
4       4  word
5       5  word
6       6  word
Dion Groothof