
I have a list of codes, as below:

ccode<-c('S','PD','CH','ML','MD','VA','BVI','DB','KD','KE','PW','COL','AD','MET','VP','SI','VR','GAO','LK','RP','PAD','WAN','PWD','PMP','PBR','VN','PPC','NK','K','AH','I','JP','JU','UDZ','CHM','DDN','LN','CL','CLH','DKM','GK','WD','ED','DDK','DLN','DRN','DFD','GZB','DVV','GUR','GGN','ND','HHN','HAS','HYD','HKP','BWF','BBW','BKM','BSN','BL','BIN','ST','KN')

Now I want to extract, from each string in the sample below, the substring that starts with one of these codes:

consolidated_csv_v2 <- c("pt paid rs-8488/-  remaining amt","Credit Card Sales","ML 2926 VARSHA LAKHANI (AG)","IMRAN KHAN-PW-4798","Deepali Mishra Ah-5564 Tst", "MANJU S-11226 T","SNEHA S-16191","SUMIT SETHI AH-5747 AG","SUJATA VORA AH-5361 AG","Deepali Mishra Ah-5564 Tst")

The data is spread across 477326 rows.

The expected output is the code followed by its number.

str_extract(consolidated_csv_v2, "AH.*$")

[1] NA           NA           NA           NA           NA           NA          
[7] NA           "AH-5747 AG" "AH-5361 AG" NA

This pattern works only for the single hard-coded code "AH". How can I do the same so that it matches any of the codes in ccode?

M--
Shankar Pandala

2 Answers

2

We can try

library(stringr)
# Join the codes into one alternation; (?i) makes the match case-insensitive
pat <- paste0("(?i)\\b(", paste(ccode, collapse="|"),")-.*")
str_extract(v1, pat)
#[1] NA            NA            NA            NA            "Ah-5564 Tst" NA            "AH-2445 AG"  "AH-5747 AG"  "AH-5361 AG"  "Ah-5564 Tst"

data

v1 <- c("Head Office", "(cancelled)", "(cancelled)", "(cancelled)", 
"Deepali Mishra Ah-5564 Tst", "(cancelled)", "SHRUTI BHAGAT AH-2445 AG", 
"SUMIT SETHI AH-5747 AG", "SUJATA VORA AH-5361 AG", "Deepali Mishra Ah-5564 Tst")
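
Since `-.*` runs to the end of the string, the matches above keep the trailing "AG" / "Tst". As a minimal sketch (not part of the answer above), swapping `.*` for `\\d+` should stop the match after the number. The helper `extract_first` is a hypothetical base R stand-in for `str_extract`, and `codes` is a shortened list used only for illustration.

```r
# Hypothetical helper: first regex match per element, NA where nothing matches
# (a base R stand-in for stringr::str_extract).
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m > 0] <- regmatches(x, m)
  out
}

# Shortened code list, for illustration only
codes <- c("AH", "PW", "ML", "S")
v1 <- c("Head Office", "(cancelled)", "Deepali Mishra Ah-5564 Tst",
        "SHRUTI BHAGAT AH-2445 AG")

alt <- paste(codes, collapse = "|")
greedy <- paste0("(?i)\\b(", alt, ")-.*")    # runs to the end of the string
digits <- paste0("(?i)\\b(", alt, ")-\\d+")  # stops after the number

res_greedy <- extract_first(v1, greedy)
res_digits <- extract_first(v1, digits)
print(res_greedy)  # NA NA "Ah-5564 Tst" "AH-2445 AG"
print(res_digits)  # NA NA "Ah-5564"     "AH-2445"
```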
akrun
2

I assume you need to extract the substrings that start with a "code" after a word boundary and are followed by a hyphen.

Then, use

 "\\b(?:S|PD|CH|ML|MD|VA|BVI|DB|KD|KE|PW|COL|AD|MET|VP|SI|VR|GAO|LK|RP|PAD|WAN|PWD|PMP|PBR|VN|PPC|NK|K|AH|I|JP|JU|UDZ|CHM|DDN|LN|CL|CLH|DKM|GK|WD|ED|DDK|DLN|DRN|DFD|GZB|DVV|GUR|GGN|ND|HHN|HAS|HYD|HKP|BWF|BBW|BKM|BSN|BL|BIN|ST|KN)-\\w*"

where \b stands for a word boundary, (?:...) is a non-capturing group holding the code alternatives, and the hyphen (-) is followed by zero or more alphanumeric/underscore characters (\w*).
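
To see what the word boundary buys, here is a small sketch comparing the pattern with and without `\\b`. The strings "ROSS-123" and "SHAH-42" are made up for illustration, `codes` is a shortened list, and `extract_first` is a hypothetical base R stand-in for `str_extract`.

```r
# Hypothetical helper: first regex match per element, NA where nothing matches
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m > 0] <- regmatches(x, m)
  out
}

codes <- c("S", "AH")                  # shortened code list, for illustration
alt <- paste(codes, collapse = "|")

# First and last strings are made up; the middle one is from the question
x <- c("ROSS-123", "MANJU S-11226 T", "SHAH-42")

with_b    <- extract_first(x, paste0("\\b(?:", alt, ")-\\w*"))
without_b <- extract_first(x, paste0("(?:", alt, ")-\\w*"))

print(with_b)     # NA      "S-11226" NA
print(without_b)  # "S-123" "S-11226" "AH-42"
```

Without `\\b`, a code is picked up from the tail of a longer word (the final "S" of "ROSS", the "AH" of "SHAH"), which is the kind of false match the boundary is meant to rule out.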

Here is a demo:

> consolidated_csv_v2 <- c("Head Office","(cancelled)","(cancelled)","(cancelled)","Deepali Mishra Ah-5564 Tst", "(cancelled)","SHRUTI BHAGAT AH-2445 AG","SUMIT SETHI AH-5747 AG","SUJATA VORA AH-5361 AG","Deepali Mishra Ah-5564 Tst")
> ccode<-c('S','PD','CH','ML','MD','VA','BVI','DB','KD','KE','PW','COL','AD','MET','VP','SI','VR','GAO','LK','RP','PAD','WAN','PWD','PMP','PBR','VN','PPC','NK','K','AH','I','JP','JU','UDZ','CHM','DDN','LN','CL','CLH','DKM','GK','WD','ED','DDK','DLN','DRN','DFD','GZB','DVV','GUR','GGN','ND','HHN','HAS','HYD','HKP','BWF','BBW','BKM','BSN','BL','BIN','ST','KN')
> reg <- paste0("\\b(?:", paste(ccode, collapse="|"),")-\\w*")
> str_extract(consolidated_csv_v2, reg)
 [1] NA        NA        NA        NA        NA        NA        "AH-2445"
 [8] "AH-5747" "AH-5361" NA       
> 

UPDATE

Not all the words are followed by '-'; some are followed by a space and some don't have any character in between.

The requirement is rather general, but we can meet it with a lazy dot pattern (.*?) after the alternation group: it matches any zero or more characters other than a newline, as few as possible, up to the first run of digits (\d+) that is followed by a word boundary (\b). Use

reg <- paste0("(?i)\\b(?:", paste(ccode, collapse="|"),").*?\\d+\\b")

See the regex demo

The (?i) in front of the first \b is what makes this pattern case-insensitive.
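
One thing to watch with the lazy `.*?` is single-letter codes such as 'S' and 'I': any word that merely starts with that letter satisfies `\b` plus the code, and `.*?` then stretches to the first number. The sketch below runs the pattern on the question's sample and compares it with a stricter variant, `[ -]?\\d+`, which assumes the code and the number are separated by at most one hyphen or space. That assumption comes from the comment thread and is not a confirmed requirement. `extract_first` is a hypothetical base R stand-in for `str_extract`.

```r
# Hypothetical helper: first regex match per element, NA where nothing matches
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m > 0] <- regmatches(x, m)
  out
}

ccode <- c('S','PD','CH','ML','MD','VA','BVI','DB','KD','KE','PW','COL','AD','MET','VP','SI','VR','GAO','LK','RP','PAD','WAN','PWD','PMP','PBR','VN','PPC','NK','K','AH','I','JP','JU','UDZ','CHM','DDN','LN','CL','CLH','DKM','GK','WD','ED','DDK','DLN','DRN','DFD','GZB','DVV','GUR','GGN','ND','HHN','HAS','HYD','HKP','BWF','BBW','BKM','BSN','BL','BIN','ST','KN')

x <- c("pt paid rs-8488/-  remaining amt", "Credit Card Sales",
       "ML 2926 VARSHA LAKHANI (AG)", "IMRAN KHAN-PW-4798",
       "Deepali Mishra Ah-5564 Tst", "MANJU S-11226 T", "SNEHA S-16191",
       "SUMIT SETHI AH-5747 AG", "SUJATA VORA AH-5361 AG",
       "Deepali Mishra Ah-5564 Tst")

alt <- paste(ccode, collapse = "|")
lazy   <- paste0("(?i)\\b(?:", alt, ").*?\\d+\\b")    # pattern from the update
strict <- paste0("(?i)\\b(?:", alt, ")[ -]?\\d+\\b")  # assumed: at most one separator

res_lazy   <- extract_first(x, lazy)
res_strict <- extract_first(x, strict)

print(res_lazy[c(4, 7, 8)])
# "IMRAN KHAN-PW-4798" "SNEHA S-16191" "SUMIT SETHI AH-5747"
print(res_strict)
# NA NA "ML 2926" "PW-4798" "Ah-5564" "S-11226" "S-16191" "AH-5747" "AH-5361" "Ah-5564"
```

The stricter variant trades generality for precision: it returns only the code and its number, but it would miss rows where something other than a single hyphen or space sits between them.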

Wiktor Stribiżew
  • Thank you so much. Not all the words are followed by '-'; some are followed by a space and some don't have any character in between, but all end with a number. It would be helpful if you could provide a solution for that too. Also, any help on how to deal with single-character codes like 'S'? – Shankar Pandala Apr 15 '16 at 09:29
  • Try `reg <- paste0("\\b(?:", paste(ccode, collapse="|"),").*?\\d+\\b")` – Wiktor Stribiżew Apr 15 '16 at 09:31
  • I do not like `.*?` very much, but your *some are followed by a space and some don't have any character in between* sounds not precise. Please check the actual requirements and update the question so that we could help you with the safest pattern. – Wiktor Stribiżew Apr 15 '16 at 09:34
  • Thank you So much for your efforts again. I've updated the test set that match all my requirements. – Shankar Pandala Apr 15 '16 at 09:51
  • Done! Thank you so much! This is case sensitive. How to make it not case sensitive? – Shankar Pandala Apr 15 '16 at 10:51
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/109244/discussion-between-shankar-pandala-and-wiktor-stribizew). – Shankar Pandala Apr 15 '16 at 12:16