I am looking for an intuitive solution to a problem of mine. I have a huge list of words into which I have to insert a special character based on some criteria: if a two- or three-letter word appears in a cell, I want to add "+" to the right and left of it.

Example

global b2b banking would transform to global +b2b+ banking

how to finance commercial ale estate would transform to +how+ +to+ finance commercial +ale+ estate

Here is sample data set:

sample <- c("commercial funding",
            "global b2b banking",
            "how to finance commercial ale estate",
            "opening a commercial account",
            "international currency account",
            "miami imports banking",
            "hsbc supply chain financing",
            "international business expansion",
            "grow business in Us banking",
            "commercial trade Asia Pacific",
            "business line of credits hsbc",
            "Britain commercial banking",
            "fx settlement hsbc",
            "W Hotels")
data <- data.frame(sample)

Additionally, is it possible to drop a row that contains a one-letter word? Example:

W Hotels

I tried removing all the one-letter words with gsub:

gsub(" *\\b[[:alpha:]]{1,1}\\b *", " ", sample) 

Instead, the whole row should be removed from the data set.

Any help is highly appreciated.

Edit 1

Thanks for the help. I added a few more lines to it:

sample <- c("commercial funding", "global b2b banking", "how to finance commercial ale estate", "opening a commercial account","international currency account","miami imports banking","hsbc supply chain financing","international business expansion","grow business in Us banking", "commercial trade Asia Pacific","business line of credits hsbc","Britain commercial banking","fx settlement hsbc", "W Hotels")
sample <- sample[!grepl("\\b[[:alpha:]]\\b",sample)]
sample <- gsub("\\b([[:alpha:][:digit:]]{2,3})\\b", "+\\1+", sample)
sample <- gsub(" ",",",sample)
sample <- gsub("+,","+",sample)
sample <- gsub(",+","+",sample)
sample <- tolower(sample)
sample <- ifelse(substr(sample, 1, 1) == "+", sub("^.", "", sample), sample)
data <- data.frame(sample)
data




                                          sample
1                             commercial++funding
2                          global+++b2b+++banking
3  how++++to+++finance++commercial+++ale+++estate
4                international++currency++account
5                         miami++imports++banking
6                  hsbc++supply++chain++financing
7              international++business++expansion
8             grow++business+++in++++us+++banking
9                commercial++trade++asia++pacific
10            business++line+++of+++credits++hsbc
11                   britain++commercial++banking
12                          fx+++settlement++hsbc

Somehow I am unable to replace "+," with "+" using gsub. What am I doing wrong? For example, "fx+,settlement,hsbc" should become "fx+settlement,hsbc", but instead every comma is being replaced with additional "+" characters.

PSraj

1 Answer

You need to do that in 2 steps: remove the items with 1-letter whole words, and then add + around 2-3 letter words.

Use

sample <- c("commercial funding", "global b2b banking", "how to finance commercial ale estate", "opening a commercial account","international currency account","miami imports banking","hsbc supply chain financing","international business expansion","grow business in Us banking", "commercial trade Asia Pacific","business line of credits hsbc","Britain commercial banking","fx settlement hsbc", "W Hotels")
sample <- sample[!grepl("\\b[[:alnum:]]\\b",sample)]
sample <- gsub("\\b([[:alnum:]]{2,3})\\b", "+\\1+", sample)
data <- data.frame(sample)
data

See the R demo

The sample[!grepl("\\b[[:alnum:]]\\b",sample)] line removes the items that contain a match for the pattern: a word boundary (\b), a single letter or digit ([[:alnum:]]), and another word boundary, that is, a one-character whole word.

The gsub("\\b([[:alnum:]]{2,3})\\b", "+\\1+", sample) line replaces every whole word of 2 or 3 alphanumeric characters with the same word enclosed in +.
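To see what each of the two calls does on its own, here is a minimal sketch on a hypothetical three-element vector. The names x, has_single, kept and wrapped are introduced only for this illustration and do not appear in the code above.

```r
# Hypothetical input, chosen only to illustrate the two steps.
x <- c("W Hotels", "global b2b banking", "fx settlement hsbc")

# Step 1: TRUE where the string contains a one-character whole word.
has_single <- grepl("\\b[[:alnum:]]\\b", x)
print(has_single)

# Step 2: keep the remaining strings and wrap 2-3 character whole words in "+".
kept <- x[!has_single]
wrapped <- gsub("\\b([[:alnum:]]{2,3})\\b", "+\\1+", kept)
print(wrapped)
```

Only "W Hotels" is flagged in step 1, because "b2b" and "fx" are longer than one character.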

Result:

                                       sample
1                          commercial funding
2                        global +b2b+ banking
3  +how+ +to+ finance commercial +ale+ estate
4              international currency account
5                       miami imports banking
6                 hsbc supply chain financing
7            international business expansion
8             grow business +in+ +Us+ banking
9               commercial trade Asia Pacific
10            business line +of+ credits hsbc
11                 Britain commercial banking
12                       +fx+ settlement hsbc

Note that W Hotels and opening a commercial account got filtered out.
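If the strings already sit in a data frame, the same filter can be applied to its rows instead of to the vector. This is only a sketch: it assumes a character column named sample, as in the question, and the names df and keep are hypothetical.

```r
# Hypothetical data frame with one character column named `sample`.
df <- data.frame(
  sample = c("commercial funding", "opening a commercial account", "W Hotels"),
  stringsAsFactors = FALSE
)

# Rows to keep: those without a one-character whole word.
keep <- !grepl("\\b[[:alnum:]]\\b", df$sample)

# drop = FALSE keeps the result a data frame even with a single column.
df <- df[keep, , drop = FALSE]

# Wrap 2-3 character whole words in "+" in the remaining rows.
df$sample <- gsub("\\b([[:alnum:]]{2,3})\\b", "+\\1+", df$sample)
print(df)
```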

Answer to the EDIT

You added some more replacement operations to the code, but they are meant as literal string replacements, so you just need to pass the fixed=TRUE argument:

sample <- gsub(" ",",",sample, fixed=TRUE)
sample <- gsub("+,","+",sample, fixed=TRUE)
sample <- gsub(",+","+",sample, fixed=TRUE)

Otherwise, the + is treated as a regex quantifier and must be escaped (\\+ in an R string) to match a literal plus symbol.
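As a small illustration of the two options, the sketch below applies both the fixed=TRUE form and the escaped-regex form to the string from the comments. The names x, with_fixed and with_escape are hypothetical.

```r
# The intermediate string described in the question's edit.
x <- "+fx+,settlement,hsbc"

# Option 1: treat the pattern as a literal string.
with_fixed <- gsub("+,", "+", x, fixed = TRUE)

# Option 2: keep regex matching, but escape the plus sign.
with_escape <- gsub("\\+,", "+", x)

print(with_fixed)
print(with_escape)
```

Both calls should give "+fx+settlement,hsbc"; fixed=TRUE is the simpler choice when no regex features are needed.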

Also, if you need to remove all + from the start of the string, use

sample <- sub("^\\++", "", sample)
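
Putting the pieces together, the steps from the question's edit could be run end to end as sketched below. This is one possible arrangement, not the only one; the two-element vector x is a hypothetical input used to keep the example short.

```r
# Hypothetical input: two of the strings from the question.
x <- c("fx settlement hsbc", "global b2b banking")

# Wrap 2-3 character whole words in "+".
x <- gsub("\\b([[:alnum:]]{2,3})\\b", "+\\1+", x)

# Literal replacements: spaces to commas, then merge "+," and ",+" into "+".
x <- gsub(" ", ",", x, fixed = TRUE)
x <- gsub("+,", "+", x, fixed = TRUE)
x <- gsub(",+", "+", x, fixed = TRUE)

# Lower-case, then strip any leading "+" characters.
x <- tolower(x)
x <- sub("^\\++", "", x)
print(x)
```

The first element ends up as "fx+settlement,hsbc", which is the form asked for in the edit.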
Wiktor Stribiżew
  • If `b2b` is to become `+b2b+`, you'll need to include `[:digit:]` in the pattern. – coletl Mar 02 '17 at 09:24
  • I replaced all the `[[:alpha:]]` (just letters) with `[[:alnum:]]` (letters + digits). Let OP decide what to use for filtering and what for `+` wrapping. – Wiktor Stribiżew Mar 02 '17 at 09:29
  • Your solution works great. There is just one final thing I am stuck with: I am unable to gsub `+,` to just `+`. Can you help here? – PSraj Mar 02 '17 at 11:21
  • What do you mean by `to gsub +, to just + ,`? Do you need to insert a space between `+` and `,`? – Wiktor Stribiżew Mar 02 '17 at 11:25
  • I replaced spaces with commas, so now I have `+fx+,settlement,hsbc`, and I want to transform this to `+fx+settlement,hsbc`. So basically all instances of `+,` should become `+`. I hope I have made myself clear now. – PSraj Mar 02 '17 at 11:32