I have an example data frame like the one below.

ID File
1 11_213.csv
2 13_256.csv
3 11_223.csv
4 12_389.csv
5 14_456.csv
6 12_345.csv

And I want to add another column based on the string between the underscore and the period to get a data frame that looks something like this.

ID File Group
1 11_213.csv 213
2 13_256.csv 256
3 11_223.csv 223
4 12_389.csv 389
5 14_456.csv 456
6 12_345.csv 345

I think I need to use the str_extract function from stringr, but I am not sure what notation to use for my pattern. For example, when I use:

df <- df %>%
mutate("Group" = str_extract(File, "[^_]+"))

I get all the information before the underscore, like this:

ID File Group
1 11_213.csv 11
2 13_256.csv 13
3 11_223.csv 11
4 12_389.csv 12
5 14_456.csv 14
6 12_345.csv 12

But that is not what I want. What should I use instead of "[^_]+" to get just the stuff between the underscore and the period? Thanks!

jay.sf
beanboy

2 Answers

We can use regex lookarounds with str_extract to extract the digits (\\d+) that follow a _ and precede a literal . (the lookbehind (?<=_) and the lookahead (?=\\.) are checked but not included in the match)

library(dplyr)
library(stringr)
df <- df %>%
    mutate(Group = str_extract(File, "(?<=_)(\\d+)(?=\\.)"))
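
As a self-contained check of the lookaround pattern, here is a minimal sketch that rebuilds the example data from the question and applies the same idea. The data frame construction is an assumption based on the table shown in the question; the capture group is dropped because str_extract returns the whole match anyway.

```r
library(dplyr)
library(stringr)

# Rebuild the example data from the question (assumed structure)
df <- data.frame(
  ID = 1:6,
  File = c("11_213.csv", "13_256.csv", "11_223.csv",
           "12_389.csv", "14_456.csv", "12_345.csv"),
  stringsAsFactors = FALSE
)

# (?<=_) requires an underscore just before the match,
# (?=\\.) requires a literal dot just after it;
# neither character becomes part of the extracted text
df <- df %>%
  mutate(Group = str_extract(File, "(?<=_)\\d+(?=\\.)"))

df$Group
# [1] "213" "256" "223" "389" "456" "345"
```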

Another option is to remove the unwanted parts with str_remove_all, i.e. match everything up to and including the _ (.*_) or (|) everything from the literal . onwards (\\..*). An unescaped . matches any character in regex mode, which is the default, so we escape it with \\ for literal matching.

df <- df %>%
        mutate(Group = str_remove_all(File, ".*_|\\..*"))
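
The two stringr approaches agree on the file names in the question, but they may behave differently on names that do not fit the pattern. A small sketch of that difference, where "readme.txt" is a hypothetical input that is not part of the original data:

```r
library(stringr)

# The second file name is a made-up example without an underscore
files <- c("11_213.csv", "readme.txt")

# Extraction yields NA when the pattern is not found
extracted <- str_extract(files, "(?<=_)\\d+(?=\\.)")
extracted
# [1] "213" NA

# Removal only strips what matches, so the rest of the name is kept
stripped <- str_remove_all(files, ".*_|\\..*")
stripped
# [1] "213"    "readme"
```

If unexpected file names are possible, the NA from str_extract is arguably easier to detect than a leftover string.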
akrun

A base R option using gsub

transform(
  df,
  Group = gsub(".*_(\\d+)\\..*", "\\1", File)
)

gives

  ID       File Group
1  1 11_213.csv   213
2  2 13_256.csv   256
3  3 11_223.csv   223
4  4 12_389.csv   389
5  5 14_456.csv   456
6  6 12_345.csv   345
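
Note that gsub returns character strings, so Group is a character column. If a numeric column is wanted, one possible variation (a sketch, again assuming the data frame from the question) is to wrap the result in as.integer:

```r
# Rebuild the example data from the question (assumed structure)
df <- data.frame(
  ID = 1:6,
  File = c("11_213.csv", "13_256.csv", "11_223.csv",
           "12_389.csv", "14_456.csv", "12_345.csv"),
  stringsAsFactors = FALSE
)

# Same capture-group replacement as above, then convert to integer
out <- transform(
  df,
  Group = as.integer(gsub(".*_(\\d+)\\..*", "\\1", File))
)

out$Group
# [1] 213 256 223 389 456 345
```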
ThomasIsCoding