How to remove words of a specific length from a string in R?

Question

I want to remove words shorter than 3 characters from a string. For example, my input is

str<- c("hello RP have a nice day")

I want my output to be

str<- c("hello have nice day")

Please help

Better not use str as a variable name. str is a built-in function of R. — Ven Yao, Oct 20 '15 at 01:38

Shenglin Chen · Accepted Answer · 2015-10-20T01:53:17.877

12

Try this:

gsub('\\b\\w{1,2}\\b','',str)
[1] "hello  have  nice day"

EDIT: \b is a word boundary. To drop the extra space as well, change it to:

gsub('\\b\\w{1,2}\\s','',str)

Or

gsub('(?<=\\s)(\\w{1,2}\\s)','',str,perl=TRUE)
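To make the three patterns above easier to compare, here is a small sketch that runs each one on the sample sentence. The variable names (`x`, `r1`, `r2`, `r3`, `r1_clean`) are illustrative and not part of the original answer, and `perl = TRUE` is added to every call only so that all three share one regex engine.

```r
x <- "hello RP have a nice day"

# \b is a word boundary and \w{1,2} is one or two word characters.
# The short words are removed, but the spaces around them stay behind.
r1 <- gsub("\\b\\w{1,2}\\b", "", x, perl = TRUE)
r1
# [1] "hello  have  nice day"

# Ending the pattern with \s instead of \b also consumes the trailing space.
r2 <- gsub("\\b\\w{1,2}\\s", "", x, perl = TRUE)
r2
# [1] "hello have nice day"

# (?<=\s) is a lookbehind: the short word must come right after whitespace.
r3 <- gsub("(?<=\\s)\\w{1,2}\\s", "", x, perl = TRUE)
r3
# [1] "hello have nice day"

# Another way to tidy the first result: collapse runs of spaces, then trim.
r1_clean <- trimws(gsub("\\s+", " ", r1))
r1_clean
# [1] "hello have nice day"
```

Because the second and third patterns require whitespace after the short word, they would leave a short word at the very end of a string untouched.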

edited Oct 20 '15 at 01:53

answered Oct 20 '15 at 01:39


2

perhaps add a bit of explanation as to what the regexes are doing? – hrbrmstr Oct 20 '15 at 01:40
1

I like the approach of using just base R. But all three solutions make at least one of these three "mistakes": (1) removing substrings of length 1 or 2 that are connected to a longer substring by a hyphen (as in "co-selection"); (2) not removing substrings of length 1 or 2 at the end of the string; (3) not removing such substrings at the beginning of the string. The first solution makes the first mistake, the second solution makes the second mistake, and the third solution makes the second and third mistakes. How can I avoid all of these mistakes? – hyco Jan 20 '17 at 11:36
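One possible way to address the three cases raised in this comment, offered as a sketch rather than a tested general solution: treat the hyphen as part of a word by using lookarounds, remove the short words, then tidy the whitespace. The sample sentence and the names `x`, `pat`, `removed` and `result` are made up for illustration.

```r
x <- "a co-selection of genes is ok"

# A run of 1-2 word characters that is neither preceded nor followed
# by a word character or a hyphen. Lookarounds require perl = TRUE.
pat <- "(?<![\\w-])\\w{1,2}(?![\\w-])"
removed <- gsub(pat, "", x, perl = TRUE)

# Collapse the leftover runs of whitespace and trim both ends.
result <- trimws(gsub("\\s+", " ", removed))
result
# [1] "co-selection genes"
```

Under this pattern any piece attached to a hyphen is kept, so a compound such as "x-y" would survive in full; whether that is desirable depends on the data.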

score 3 · Answer 2 · answered Oct 20 '15 at 02:23


Or use str_extract_all to extract all words of length >= 3 and paste them back together:

library(stringr)
paste(str_extract_all(str, '\\w{3,}')[[1]], collapse=' ')
#[1] "hello have nice day"

answered Oct 20 '15 at 02:23

akrun


I get a problem when I try this: `SubConsolData$ProductTitle <- paste(str_extract_all(SubConsolData$ProductTitle, '\\w{3,}')[[1]], collapse=' ')`. The result for the first row of the data frame (`SubConsolData`) is repeated in all the other rows. – LeMarque Jul 05 '18 at 12:31
1

@I_m_LeMarque It is because we are extracting the first element `[[1]]`. In this case, there is only a single string. In your case you may need to loop and then do the `paste` – akrun Jul 05 '18 at 15:12
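The loop suggested in this reply could look roughly like the sketch below. It is written in base R, with `regmatches()` and `gregexpr()` standing in for `str_extract_all()` so that it runs without extra packages; the vector `titles` is a made-up stand-in for a data frame column.

```r
titles <- c("hello RP have a nice day", "an ox is in the barn")

# gregexpr() locates every run of 3+ word characters in each string;
# regmatches() returns one character vector of matches per input string.
words <- regmatches(titles, gregexpr("\\w{3,}", titles))

# Paste each element's words back together: one result per input string.
cleaned <- vapply(words, paste, character(1), collapse = " ")
cleaned
# [1] "hello have nice day" "the barn"
```

With stringr, `str_extract_all()` also returns a list with one element per input string, so the same `vapply()` step should apply to its result.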

score 3 · Answer 3 · answered Oct 21 '15 at 01:32

Here's an approach using the rm_nchar_words function from the qdapRegex package that I coauthored with @hwnd (SO regex guru extraordinaire). Here I show removing 1-2 letter words and then 1-3 letter words:

str<- c("hello RP have a nice day")

library(qdapRegex)

rm_nchar_words(str, "1,2")
## [1] "hello have nice day"

rm_nchar_words(str, "1,3")
## [1] "hello have nice"

As qdapRegex aims to teach, here is the regex behind the scenes; the S function puts 1,2 into the quantifier's curly braces:

S("@rm_nchar_words", "1,2")
##  "(?<![\\w'])(?:'?\\w'?){1,2}(?![\\w'])"
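For readers without the package, the printed pattern can be tried directly in base R. This is only a sketch: the whitespace cleanup at the end is an assumption about the desired output, not a description of what `rm_nchar_words` does internally, and the variable names are illustrative.

```r
x <- "hello RP have a nice day"

# The pattern as printed by S() above, with the quantifier set to {1,2}.
pat <- "(?<![\\w'])(?:'?\\w'?){1,2}(?![\\w'])"

# Lookarounds need the PCRE engine, hence perl = TRUE.
removed <- gsub(pat, "", x, perl = TRUE)
removed
# [1] "hello  have  nice day"

# Assumed cleanup step: collapse runs of whitespace and trim the ends.
result <- trimws(gsub("\\s+", " ", removed))
result
# [1] "hello have nice day"
```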

Ven Yao · Answer 4 · 2015-11-13T07:58:20.550

2

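x <- "hello RP have a nice day"
# Split the string into words on single spaces
z <- unlist(strsplit(x, split=" "))
# Keep the words with 3 or more characters and rejoin them
paste(z[nchar(z)>=3], collapse=" ")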
# [1] "hello have nice day"

edited Nov 13 '15 at 07:58

answered Oct 20 '15 at 01:37

