My data frame Expenses is shown below:

date        name           expenditure      type
23MAR2013   KOSH ENTRP     4000             COMPANY
23MAR2013   JOHN DOE       800              INDIVIDUAL
24MAR2013   S KHAN         300              INDIVIDUAL
24MAR2013   JASINT PVT LTD 8000             COMPANY
25MAR2013   KOSH ENTRPRISE 2000             COMPANY
25MAR2013   JOHN S DOE     220              INDIVIDUAL
25MAR2013   S KHAN         300              INDIVIDUAL
26MAR2013   S KHAN         300              INDIVIDUAL

Earlier, I identified the names and patterns that repeat in the name column and stored them in a vector, NameVector, shown below.

KOSH    JOHN DOE    KHAN    JASINT

My question is: how do I match each string in Expenses$name against the patterns in NameVector and add the matched pattern as a category column in the main data frame? The desired output is:

date        name           expenditure      type           category 
23MAR2013   KOSH ENTRP     4000             COMPANY        KOSH
23MAR2013   JOHN DOE       800              INDIVIDUAL     JOHN DOE
24MAR2013   S KHAN         300              INDIVIDUAL     KHAN          
24MAR2013   JASINT PVT LTD 8000             COMPANY        JASINT
25MAR2013   KOSH ENTRPRISE 2000             COMPANY        KOSH
25MAR2013   JOHN S DOE     220              INDIVIDUAL     JOHN DOE
25MAR2013   S KHAN         300              INDIVIDUAL     KHAN
26MAR2013   S KHAN         300              INDIVIDUAL     KHAN

I tried splitting the name column on every possible delimiter (spaces, |, *, commas, etc.) using strsplit() to get the different parts of each name into separate columns, and then matching the patterns using agrep(), but I am not getting the desired output. On closer inspection of the data, I noticed leading whitespace and removed it, but I still do not get the output shown above.


The CSV for the above table:

"Date","name","expenditure","type"
"23MAR2013","KOSH ENTRP",4000,"COMPANY"
"23MAR2013 ","JOHN DOE",800,"INDIVIDUAL"
"24MAR2013","S KHAN",300,"INDIVIDUAL"
"24MAR2013","JASINT PVT LTD",8000,"COMPANY"
"25MAR2013","KOSH ENTRPRISE",2000,"COMPANY"
"25MAR2013","JOHN S DOE",220,"INDIVIDUAL"
"25MAR2013","S KHAN",300,"INDIVIDUAL"
"26MAR2013","S KHAN",300,"INDIVIDUAL"

and the names vector, as identified earlier:

NameVector <- c("KOSH","JOHN DOE","KHAN","JASINT")
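For anyone who wants to reproduce this, here is a minimal sketch that loads the CSV above and trims stray whitespace from the character columns (note the trailing space in the second date). The names csv_text and chr are illustrative only, and trimws() assumes R 3.2.0 or later.

```r
# The CSV from the question, embedded as text so the example is self-contained
csv_text <- '"Date","name","expenditure","type"
"23MAR2013","KOSH ENTRP",4000,"COMPANY"
"23MAR2013 ","JOHN DOE",800,"INDIVIDUAL"
"24MAR2013","S KHAN",300,"INDIVIDUAL"
"24MAR2013","JASINT PVT LTD",8000,"COMPANY"
"25MAR2013","KOSH ENTRPRISE",2000,"COMPANY"
"25MAR2013","JOHN S DOE",220,"INDIVIDUAL"
"25MAR2013","S KHAN",300,"INDIVIDUAL"
"26MAR2013","S KHAN",300,"INDIVIDUAL"'

Expenses <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Trim leading/trailing whitespace in every character column
chr <- vapply(Expenses, is.character, logical(1L))
Expenses[chr] <- lapply(Expenses[chr], trimws)

NameVector <- c("KOSH", "JOHN DOE", "KHAN", "JASINT")
```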

1 Answer

You could try

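library(stringi)
# Split multi-word entries into single words and join them into one regex alternation
pat <- paste(unlist(strsplit(NameVector, ' ')), collapse="|")
# Extract every matching word from each name, then paste the matches back together
Expenses$category <- vapply(stri_extract_all_regex(Expenses$name, pat), 
           paste, collapse=' ', character(1L))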
Expenses
#       date           name expenditure       type category
#1 23MAR2013     KOSH ENTRP        4000    COMPANY     KOSH
#2 23MAR2013       JOHN DOE         800 INDIVIDUAL JOHN DOE
#3 24MAR2013         S KHAN         300 INDIVIDUAL     KHAN
#4 24MAR2013 JASINT PVT LTD        8000    COMPANY   JASINT
#5 25MAR2013 KOSH ENTRPRISE        2000    COMPANY     KOSH
#6 25MAR2013     JOHN S DOE         220 INDIVIDUAL JOHN DOE
#7 25MAR2013         S KHAN         300 INDIVIDUAL     KHAN
#8 26MAR2013         S KHAN         300 INDIVIDUAL     KHAN
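
If installing stringi is not an option, the same idea can be sketched in base R with gregexpr() and regmatches(). This is only an illustration under the assumption that the data look like the sample in the question; the names hits and category are made up for the example. Note that the pattern matches substrings anywhere in the name, so a name containing only "DOE" would be labelled "DOE" rather than "JOHN DOE".

```r
# Sample names copied from the question
Expenses <- data.frame(
  name = c("KOSH ENTRP", "JOHN DOE", "S KHAN", "JASINT PVT LTD",
           "KOSH ENTRPRISE", "JOHN S DOE", "S KHAN", "S KHAN"),
  stringsAsFactors = FALSE
)
NameVector <- c("KOSH", "JOHN DOE", "KHAN", "JASINT")

# Same pattern as above: "KOSH|JOHN|DOE|KHAN|JASINT"
pat <- paste(unlist(strsplit(NameVector, " ")), collapse = "|")

# gregexpr() locates every match; regmatches() extracts them per row
hits <- regmatches(Expenses$name, gregexpr(pat, Expenses$name))

# Join the matched words per row; rows without any match become NA
category <- vapply(
  hits,
  function(x) if (length(x)) paste(x, collapse = " ") else NA_character_,
  character(1L)
)
Expenses$category <- category
```

Returning NA for unmatched rows is a design choice in this sketch; it keeps such rows easy to find with is.na().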
  • It worked and this is the fourth time that you have saved my a.. @akrun BTW, where and how do you learn all the stuff? I guess, your experience with R. Please let me know.. – sunitprasad1 Mar 10 '15 at 05:25
  • @sunitprasad1 Glad to know that it works. I would say invest some time (every day) in practising and solving R questions, and you will find it easier after some time. It may be a bit hard in the beginning, but try practising for 21 days and then it will become a routine – akrun Mar 10 '15 at 05:30