r substring wildcard search to find text

Question

I have a data.frame column with values like those below. From each cell I want to create two columns, num1 and num2, such that num1 is everything before "-" and num2 is everything between "-" and ".".

I am thinking of using the gregexpr function as shown here and writing a for loop to iterate over each row. Is there a faster way to do this?

60-150.PNG
300-12.PNG

employee <- c('60-150.PNG','300-12.PNG')
employ.data <- data.frame(employee)

akrun · Accepted Answer · 2015-04-27T20:25:29.143


Try

library(tidyr)
extract(employ.data, employee, into=c('num1', 'num2'),
                    '([^-]*)-([^.]*)\\..*', convert=TRUE)
#   num1 num2
#1   60  150
#2  300   12

Or

library(data.table)#v1.9.5+
setDT(employ.data)[, tstrsplit(employee, '[-.]', type.convert=TRUE)[-3]]
#    V1  V2
#1:  60 150
#2: 300  12

Or based on @rawr's comment

 read.table(text=gsub('-|\\.PNG', ' ', employ.data$employee),
           col.names=c('num1', 'num2'))
 #   num1 num2
 #1   60  150
 #2  300   12
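A note on the pattern used with gsub here: in a regular expression an unescaped `.` matches any single character, so the dot has to be written as `\\.` to match only a literal ".". A minimal sketch of the difference, where the third value is a made-up edge case that is not part of the question's data:

```r
# Hypothetical input: the third element is an invented edge case
x <- c('60-150.PNG', '300-12.PNG', '7-8XPNG')

# Unescaped dot: "." matches any character, so "XPNG" is replaced too
loose <- gsub('-|.PNG', ' ', x)

# Escaped dot: only a literal ".PNG" is replaced
strict <- gsub('-|\\.PNG', ' ', x)

print(loose)   # "60 150 " "300 12 " "7 8 "
print(strict)  # "60 150 " "300 12 " "7 8XPNG"
```

For the two sample values in the question both patterns give the same result; the escaped form only matters when the data may contain other characters directly before "PNG".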

Update

To keep the original column

extract(employ.data, employee, into=c('num1', 'num2'), remove=FALSE,
        '([^-]*)-([^.]*)\\..*', convert=TRUE)
#    employee num1 num2
#1 60-150.PNG   60  150
#2 300-12.PNG  300   12

Or

 setDT(employ.data)[, paste0('num', 1:2) := tstrsplit(employee, 
             '[-.]', type.convert=TRUE)[-3]]
 #     employee num1 num2
 #1: 60-150.PNG   60  150
 #2: 300-12.PNG  300   12

Or

 cbind(employ.data, read.table(text=gsub('-|\\.PNG', ' ', 
     employ.data$employee), col.names=c('num1', 'num2')))
 #    employee num1 num2
 #1 60-150.PNG   60  150
 #2 300-12.PNG  300   12

edited Apr 27 '15 at 20:25

answered Apr 27 '15 at 17:13


akrun taught me this one `read.table(text = gsub('-|.PNG', ' ', dat$employee))` – rawr Apr 27 '15 at 18:10
how would i assign column names in case of read.table(text = gsub('-|.PNG', ' ', dat$employee))? – user2543622 Apr 27 '15 at 19:50
try `read.table(text=gsub('-|.PNG', ' ', employ.data$employee), col.names=c('num1', 'num2'))` Or use `setNames(read.table(text=gsub('-|.PNG', ' ', employ.data$employee)), paste0('num', 1:2))` – akrun Apr 27 '15 at 19:51
perfect. Please modify your answer and i will accept it – user2543622 Apr 27 '15 at 19:52
any way to add 2 new columns to the original data.frame? – user2543622 Apr 27 '15 at 20:20
@user2543622 Did you mean to keep the original column also in the dataset – akrun Apr 27 '15 at 20:21
yes. the original data.frame will get 2 new columns num1 and num2. Old columns will remain as it is – user2543622 Apr 27 '15 at 20:22
@user2543622 Updated the post, please check if it helps – akrun Apr 27 '15 at 20:26
it works. but i had to change "cbind(employ.data,..." to "cbind(employ$data,.." – user2543622 Apr 27 '15 at 20:30
@user2543622 That is strange. It works with R 3.2.0. though – akrun Apr 27 '15 at 20:32

A5C1D2H2I1M1N2O1R2T1 · Answer 2 · 2015-04-27T17:27:11.307

You can try cSplit from my "splitstackshape" package:

library(splitstackshape)
cSplit(employ.data, "employee", "-|\\.PNG", fixed = FALSE)
#    employee_1 employee_2
# 1:         60        150
# 2:        300         12

Since you mention gregexpr, you can probably try something like:

do.call(rbind, 
        regmatches(as.character(employ.data$employee), 
                   gregexpr("-|\\.PNG", employ.data$employee), 
                   invert = TRUE))[, -3]
#      [,1]  [,2] 
# [1,] "60"  "150"
# [2,] "300" "12"

David Arenburg · Answer 3 · 2015-04-27T19:59:19.837


Another option using stringi

library(stringi)
data.frame(type.convert(stri_split_regex(employee, "[-.]", simplify = TRUE)[, -3]))
#    X1  X2
# 1  60 150
# 2 300  12

edited Apr 27 '15 at 19:59

answered Apr 27 '15 at 18:11


score 2 · Answer 4 · answered Apr 27 '15 at 17:58


Or with the simple gsub.

gsub("-.*", "", employ.data$employee) # substitute everything after - with nothing
gsub(".*-(.*)\\..*", "\\1", employ.data$employee) #keep only anything between - and .
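To get from these two calls to the num1 and num2 columns asked for in the question, the results can be converted to numbers and assigned back. A minimal sketch, assuming every value follows the "number-number.extension" shape of the sample data (sub is used in place of gsub because each pattern matches at most once per string):

```r
employee <- c('60-150.PNG', '300-12.PNG')
employ.data <- data.frame(employee)

# Drop everything from the first "-" onward, then convert to integer
employ.data$num1 <- as.integer(sub("-.*", "", employ.data$employee))

# Keep only the text captured between "-" and ".", then convert to integer
employ.data$num2 <- as.integer(sub(".*-(.*)\\..*", "\\1", employ.data$employee))

print(employ.data)
#     employee num1 num2
# 1 60-150.PNG   60  150
# 2 300-12.PNG  300   12
```

Both calls are vectorized over the whole column, so no for loop is needed.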


dimitris_ps


score 1 · Answer 5 · answered Apr 27 '15 at 17:17


The strsplit function will give you what you're looking for, output to a list.

employee <- c('60-150.PNG','300-12.PNG')
strsplit(employee, "[-]")

##Output:

[[1]]
[1] "60"      "150.PNG"

[[2]]
[1] "300"    "12.PNG"

Note the second argument to strsplit is a regex value, not just a character to split on, so more complicated regexp can be used.
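As an illustration of that point, a character class can split on both delimiters at once, and the pieces can then be picked out of the list. This is only a sketch that assumes each value contains exactly one "-" followed by one ".", as in the sample data:

```r
employee <- c('60-150.PNG', '300-12.PNG')

# "[-.]" is a regex character class: split on either "-" or "."
parts <- strsplit(employee, "[-.]")

# Take the first and second piece of every element and convert to integer
num1 <- as.integer(sapply(parts, `[`, 1))
num2 <- as.integer(sapply(parts, `[`, 2))

result <- data.frame(employee, num1, num2)
print(result)
#     employee num1 num2
# 1 60-150.PNG   60  150
# 2 300-12.PNG  300   12
```

Inside a character class the dot is literal, so it does not need escaping here.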


economy



This should probably be something like `data.frame(lapply(strsplit(sub("\\..*", "", employee), "-"), type.convert))` – David Arenburg Apr 27 '15 at 19:55
