
I am working on a text analysis in R and have a dataset (text corpus) with various sentences about different fruits, for example "apple", "banana", "orange", "pear", etc.

Since it is not relevant for the analysis whether someone writes about "apples" or "bananas", I want to replace all different fruits with one specific word, for example "allfruits".

I thought about using regex, but I am facing two issues:

1) I want to avoid separate code lines for each kind of fruit. Thus, is there a way to define a list or a vector that I can use so that the function replaces all words in that list (apple, bananas, pear, etc.) with one specific word "allfruits"?

2) I want to avoid replacing words that are NOT a fruit but contain the same string as a fruit (e.g. the word "appletini").

Example: If I have a sentence that says "Apple is my favourite fruit, appletini is my favourite drink. I also like bananas!", I want the result to be: "allfruits is my favourite fruit, appletini is my favourite drink. I also like allfruits!"

I am not sure whether it is possible to write this with a gsub function. Thus, all help is much appreciated.

Thank you!

lole_emily

2 Answers

1

The vector allfruits can be extended with any other words that should be replaced:

allfruits = c("apple", "banana" , "orange", "pear")
replacement = "allfruits"
text = "Apple is my favourite fruit, appletini is my favourite drink. I also like bananas!"

gsub(paste0("\\b(", paste0(allfruits, collapse="|"), ")[s]?\\b"), replacement, text, ignore.case = TRUE)

Returns

[1] "allfruits is my favourite fruit, appletini is my favourite drink. I also like allfruits!"

The regex:

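  • \\b - word boundary (the position where a word starts or ends)
  • (", paste0(allfruits, collapse="|"), ") - builds the group (apple|banana|orange|pear), i.e. all fruit names separated by | (or)
  • [s]? - optional letter 's'
  • \\b - word boundary
  • ignore.case = TRUE - match regardless of upper or lower case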
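To make the pieces listed above easier to see, here is a minimal sketch that prints the assembled pattern and wraps the replacement in a small helper. The names `fruits` and `replace_words` are made up for this illustration and are not part of the answer above; the sketch also assumes the words contain no regex special characters.

```r
# Hypothetical helper for illustration; not part of the original answer.
fruits <- c("apple", "banana", "orange", "pear")

# Build the pattern once and look at it.
pattern <- paste0("\\b(", paste0(fruits, collapse = "|"), ")s?\\b")
cat(pattern, "\n")
# \b(apple|banana|orange|pear)s?\b

# Replace every whole-word match (optionally followed by "s") in x.
# Assumes `words` holds plain words without regex metacharacters.
replace_words <- function(x, words, replacement = "allfruits") {
  pat <- paste0("\\b(", paste0(words, collapse = "|"), ")s?\\b")
  gsub(pat, replacement, x, ignore.case = TRUE)
}

text <- "Apple is my favourite fruit, appletini is my favourite drink. I also like bananas!"
result <- replace_words(text, fruits)
print(result)
```

Because gsub is vectorised over its input, the same helper should also work on a whole character column of a corpus, not just a single string.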
dario
  • Thank you so much! This really helped to solve the issue. I am new to the whole gsub and regex topic, so it is still a bit confusing, especially the part about the word boundaries. Do you have an example of how I would write the regex if I wanted to replace a specific string within a word, for example to replace "apple" within the word "appletini" and make "allfruitstini" out of it? The same goes for the case where the word apple sits between two strings (like: "stringapplestring" becomes "stringallfruitsstring"). – lole_emily Jun 05 '20 at 17:09
  • Removing the word boundary part in the example above would do what you ask in the comment. In the description of the `pattern` argument in `?gsub` we find a link to R's regex documentation, which lists the different expressions we can use. You could then start with a test string and try different patterns. – dario Jun 05 '20 at 17:36
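A rough sketch of the within-word variant asked about in the comments, under the assumption that the optional "s" is dropped together with the word boundaries. If the optional "s" were kept, it would also consume the "s" of a following "string" in "stringapplestring". The names `fruits` and `pattern_inside` are illustrative only.

```r
# Illustrative sketch; variable names are assumptions, not from the thread.
fruits <- c("apple", "banana", "orange", "pear")

# No \\b and no optional "s": match the fruit name anywhere, even inside a word.
pattern_inside <- paste0("(", paste0(fruits, collapse = "|"), ")")

inside <- gsub(pattern_inside, "allfruits",
               c("appletini", "stringapplestring"),
               ignore.case = TRUE)
print(inside)
# [1] "allfruitstini"         "stringallfruitsstring"
```

Note that this variant also rewrites parts of unrelated words, for example the "pear" inside "spear", which is exactly what the word boundaries prevent.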
0
txt <- "Apple is my favourite fruit, appletini is my favourite drink. I also like bananas!"
gsub("(\\bapples?\\b)|(\\bbananas?\\b)", "allfruits", txt, ignore.case = TRUE)
  • \\b means word boundary, that is the start or end of a word (next to punctuation, a space, or the start/end of the string)
  • | means OR
  • () defines a group
  • s? means an optional "s" (matches with or without it)
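Since the word boundary is the part that tends to cause confusion, a small check with grepl may help. This is only a sketch; the test words in `words` are chosen for illustration and do not come from the thread.

```r
# Illustrative test words; not taken from the original question.
words <- c("apple", "apples", "Apple,", "appletini", "pineapple")

# TRUE where "apple" or "apples" occurs as a whole word.
matches <- grepl("\\bapples?\\b", words, ignore.case = TRUE)
print(setNames(matches, words))
```

The first three entries match because "apple" (plus an optional "s") is delimited by the string ends or by punctuation. "appletini" and "pineapple" do not match, because a letter directly follows or precedes "apple", so there is no word boundary at that position.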
Samuel Allain