0

I have a text corpus containing some swear words and tried to censor them, but on closer inspection I realised that the regular expression I used is too broad: ordinary words get censored as well.

x <- c("ass", "badass", "class")
gsub("ass\\b", "a*s", x)

This returns the first two words censored properly, but also "cla*s", and obviously I want to keep "class". What do I need to add to the regex to change that? I tried "\w\." but it didn't work.

Sotos
ZaLa

2 Answers

1

You can make a list with bad words, i.e.

bad.words <- c('ass', 'badass', 'dumbass')
c(x[!x %in% bad.words], gsub("ass\\b", "a*s", x[x %in% bad.words]))
#[1] "class"  "a*s"    "bada*s"
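
Note that the call above moves the uncensored words to the front of the result. As a minimal sketch (not part of the original answer), the same list-based idea can keep the original order with `ifelse`, and can be extended to running text by building a whole-word pattern from the list. The sample sentence `txt` and the names `censored`, `pattern` and `m` are assumptions made up for illustration.

```r
x <- c("ass", "badass", "class")
bad.words <- c("ass", "badass", "dumbass")

# Whole-word vector: censor only the listed words and keep the order of x
censored <- ifelse(x %in% bad.words, sub("ass$", "a*s", x), x)
print(censored)
# [1] "a*s"    "bada*s" "class"

# Running text (hypothetical sample sentence): match listed words only
# when they appear as whole words, then rewrite just those matches
txt <- "a badass class, you ass"
pattern <- paste0("\\b(", paste(bad.words, collapse = "|"), ")\\b")
m <- gregexpr(pattern, txt, perl = TRUE)
regmatches(txt, m) <- lapply(regmatches(txt, m),
                             function(w) sub("ass$", "a*s", w))
print(txt)
# [1] "a bada*s class, you a*s"
```

Because the pattern is generated from `bad.words`, new entries only need to be added to the list; words such as "class" or "associate" are left alone as long as they are not on it.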
Sotos
  • Thanks for the input, but that's not really what I'm looking for. I just need a regular expression that restricts which letters are matched and changed, like the "\\b" at the end. Isn't there something that can be added to exclude the letters before the ones that are supposed to be converted? – ZaLa Jan 22 '19 at 10:29
  • How will it tell the difference between badass and class? It is not possible. What if new words appear, like asshole vs. associate? – Sotos Jan 22 '19 at 10:31
  • 1
    okay, fair enough, I didn't think about that, so I guess I'll try the list thing, thanks! – ZaLa Jan 22 '19 at 10:34
0

It seems your list above is limited to a*s words? If not:

GitHub List of 'Bad words'

One can pull from this list to subset, then replace the 2nd character with * in another column.
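
A rough sketch of that idea, assuming the downloaded list has already been read into a character vector: here `bad.words` is a short hypothetical stand-in for the GitHub list, and `corpus` is a made-up data frame with one word per row.

```r
# Hypothetical stand-in for the word list pulled from GitHub
bad.words <- c("ass", "badass")

# Made-up corpus with one word per row
corpus <- data.frame(word = c("ass", "badass", "class"),
                     stringsAsFactors = FALSE)

# Flag listed words, then write a censored copy into a second column
# with the 2nd character replaced by "*"
is_bad <- corpus$word %in% bad.words
corpus$censored <- ifelse(is_bad,
                          sub("^(.).", "\\1*", corpus$word),
                          corpus$word)
print(corpus)
```

Replacing the second character literally turns "badass" into "b*dass" rather than "bada*s", so the position to mask may need adjusting per word.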

blacktj