I want to remove words of length less than 3 from a string. For example, my input is

str<- c("hello RP have a nice day")

I want my output to be

str<- c("hello have nice day")

Please help

areddy

4 Answers

Try this:

gsub('\\b\\w{1,2}\\b','',str)
# [1] "hello  have  nice day"

EDIT: \b is a word boundary, so \b\w{1,2}\b matches any standalone run of one or two word characters. To drop the extra space as well, change it to:

gsub('\\b\\w{1,2}\\s','',str)

Or

gsub('(?<=\\s)(\\w{1,2}\\s)','',str,perl=TRUE)
Shenglin Chen
  • Perhaps add a bit of explanation as to what the regexes are doing? – hrbrmstr Oct 20 '15 at 01:40
  • I like the approach of using just base R. But each of the three solutions makes at least one of these three "mistakes": (1) removing substrings of length 1 or 2 when they are connected to a longer substring by a hyphen (as in "co-selection"); (2) not removing substrings of length 1 or 2 at the end of the string; (3) not removing substrings at the beginning of the string. The first solution makes the first mistake, the second solution makes the second mistake, and the third solution makes the second and third mistakes. How can I avoid all of them? – hyco Jan 20 '17 at 11:36
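One possible way to avoid all three cases raised in the comment above, offered as a sketch rather than a definitive answer: use negative lookarounds that treat the hyphen as part of a word, then tidy the leftover whitespace. The helper name `drop_short_words` and the second test string are illustrative assumptions, not part of the original posts.

```r
# Hypothetical helper (the name is illustrative): remove standalone words of 1-2 characters.
drop_short_words <- function(x) {
  # (?<![\\w-]) : not preceded by a word character or a hyphen
  # (?![\\w-])  : not followed by a word character or a hyphen
  out <- gsub("(?<![\\w-])\\w{1,2}(?![\\w-])", "", x, perl = TRUE)
  # Collapse the runs of spaces left behind and trim both ends
  trimws(gsub("\\s+", " ", out))
}

drop_short_words("hello RP have a nice day")
# [1] "hello have nice day"
drop_short_words("a co-selection of it")
# [1] "co-selection"
```

Because the lookarounds do not require a neighbouring space, short words at the start and end of the string are matched too, while "co" in "co-selection" is left alone.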

Or use str_extract_all to extract all words of length >= 3 and paste them back together:

library(stringr)
paste(str_extract_all(str, '\\w{3,}')[[1]], collapse=' ')
#[1] "hello have nice day"
akrun
  • I ran into a problem when I tried this: `SubConsolData$ProductTitle <- paste(str_extract_all(SubConsolData$ProductTitle, '\\w{3,}')[[1]], collapse=' ')`. The result for the first row of the DF (`SubConsolData`) is repeated in all the other rows. – LeMarque Jul 05 '18 at 12:31
  • @I_m_LeMarque It is because we are extracting only the first element with `[[1]]`. In the answer there is only a single string. In your case, you may need to loop over the list elements and `paste` each one. – akrun Jul 05 '18 at 15:12
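For a whole column, the loop mentioned in the comment above might look like the following sketch. The data frame `df` and its contents are made up for illustration; only `str_extract_all` and the pattern come from the answer.

```r
library(stringr)

# Hypothetical data standing in for SubConsolData
df <- data.frame(ProductTitle = c("hello RP have a nice day", "a cup of tea"),
                 stringsAsFactors = FALSE)

# str_extract_all returns one character vector per input string;
# paste each of them back together separately instead of taking only [[1]]
words <- str_extract_all(df$ProductTitle, "\\w{3,}")
df$ProductTitle <- vapply(words, paste, character(1), collapse = " ")

df$ProductTitle
# [1] "hello have nice day" "cup tea"
```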

Here's an approach using the rm_nchar_words function from the qdapRegex package that I coauthored with @hwnd (SO regex guru extraordinaire). Here I show removing 1-2 letter words and then 1-3 letter words:

str<- c("hello RP have a nice day")

library(qdapRegex)

rm_nchar_words(str, "1,2")
## [1] "hello have nice day"

rm_nchar_words(str, "1,3")
## [1] "hello have nice"

As qdapRegex aims to teach, here is the regex behind the scenes, where the S function puts 1,2 into the quantifier curly braces:

S("@rm_nchar_words", "1,2")
##  "(?<![\\w'])(?:'?\\w'?){1,2}(?![\\w'])"
Tyler Rinker
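The pattern printed above can also be tried without the package by passing it to base R's gsub with perl = TRUE. This is only a sketch to show what the lookarounds do, not a description of how rm_nchar_words is implemented internally; the helper name `strip_short`, the whitespace clean-up step, and the second test string are assumptions added here.

```r
# Pattern as printed by S("@rm_nchar_words", "1,2")
pat <- "(?<![\\w'])(?:'?\\w'?){1,2}(?![\\w'])"

# Hypothetical helper: apply the pattern, then tidy the spaces it leaves behind
strip_short <- function(x) {
  out <- gsub(pat, "", x, perl = TRUE)   # lookarounds need the PCRE engine
  trimws(gsub("\\s+", " ", out))         # whitespace clean-up (added assumption)
}

strip_short("hello RP have a nice day")
# [1] "hello have nice day"
strip_short("it's a go")
# [1] "it's"
```

Treating the apostrophe as part of a word is what keeps "it's" from being split into removable pieces.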
x <- "hello RP have a nice day"
z <- unlist(strsplit(x, split=" "))
paste(z[nchar(z)>=3], collapse=" ")
# [1] "hello have nice day"
Ven Yao