I am trying to match rude words in user input. For example, "I Hate You!" or "i.håté.Yoù" should match "hate you" in an array of words parsed from JSON.
So the match needs to be case- and diacritic-insensitive, and whitespace in the rude words should match any non-letter character: the regex metacharacter \P{L} should work for that, or at least \W.
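To make the intended pattern concrete, here is a minimal sketch of how a rude word could be turned into such a regex. It is written in current Swift syntax rather than the Swift 2 used in the rest of this post, and the helper name `separatorTolerantPattern(for:)` is hypothetical. Escaping each part with `NSRegularExpression.escapedPattern(for:)` is an extra precaution I am assuming is wanted, in case a word contains regex metacharacters.

```swift
import Foundation

// Hypothetical helper: turns "hate you" into the pattern hate\P{L}you,
// so that any single non-letter character can stand in for the space.
func separatorTolerantPattern(for word: String) -> String {
    return word
        .components(separatedBy: " ")
        // Escape each part so characters like "." or "*" in a word stay literal.
        .map { NSRegularExpression.escapedPattern(for: $0) }
        .joined(separator: "\\P{L}")
}
```

Note that \P{L} matches exactly one non-letter character; using \P{L}+ as the separator would also accept runs such as "hate...you", if that is desired.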
Now, I know [cd] works with NSPredicate, like this:
    func matches(text: String) -> [String]? {
        if let rudeWords = JSON?["words"] as? [String] {
            return rudeWords.filter {
                let pattern = $0.stringByReplacingOccurrencesOfString(" ", withString: "\\P{L}", options: .CaseInsensitiveSearch)
                return NSPredicate(format: "SELF MATCHES[cd] %@", pattern).evaluateWithObject(text)
            }
        } else {
            log.debug("error fetching rude words")
            return nil
        }
    }
That doesn't work with either metacharacter (I guess they are not parsed by NSPredicate), so I tried using NSRegularExpression like this:
    func matches(text: String) -> [String]? {
        if let rudeWords = JSON?["words"] as? [String] {
            return rudeWords.filter {
                do {
                    let pattern = $0.stringByReplacingOccurrencesOfString(" ", withString: "\\P{L}", options: .CaseInsensitiveSearch)
                    let regex = try NSRegularExpression(pattern: pattern, options: .CaseInsensitive)
                    // NSRange counts UTF-16 code units, so use utf16.count rather than characters.count
                    return regex.matchesInString(text, options: [], range: NSMakeRange(0, text.utf16.count)).count > 0
                }
                catch _ {
                    log.debug("error parsing rude word regex")
                    return false
                }
            }
        } else {
            log.debug("error fetching rude words")
            return nil
        }
    }
This seems to work OK. However, I know of no way to make a regex diacritic-insensitive, so I tried this (and other solutions such as re-encoding):
let text = text.stringByFoldingWithOptions(.DiacriticInsensitiveSearch, locale: NSLocale.currentLocale())
However, this does not work for me: I check the user input every time a character is typed, and every solution I tried for stripping accents made the app extremely slow.
Does anyone know of other solutions, or whether I am using this the wrong way?
Thanks
EDIT
I was actually mistaken: what was making the app slow was matching with \P{L}. I tried the second solution with \W and with the accent-stripping line, and now it works OK, even if it matches fewer strings than I initially wanted.
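For anyone wanting to reproduce this, the combination described in the edit (accent stripping plus \W as the separator) could look roughly like the sketch below. It uses current Swift syntax rather than Swift 2, and the type name `RudeWordMatcher` is hypothetical. Compiling the regexes once up front is my own assumption about what helps when checking on every keystroke; it is not something I measured. The sketch also assumes the rude words themselves contain no accents.

```swift
import Foundation

// Hypothetical helper type: compiles one regex per rude word up front so the
// per-keystroke check only folds the input and runs the precompiled regexes.
struct RudeWordMatcher {
    private let entries: [(word: String, regex: NSRegularExpression)]

    init(words: [String]) {
        var compiled: [(word: String, regex: NSRegularExpression)] = []
        for word in words {
            // Escape each part, then let any single non-word character replace a space.
            let pattern = word
                .components(separatedBy: " ")
                .map { NSRegularExpression.escapedPattern(for: $0) }
                .joined(separator: "\\W")
            // Words whose pattern fails to compile are skipped.
            if let regex = try? NSRegularExpression(pattern: pattern, options: [.caseInsensitive]) {
                compiled.append((word: word, regex: regex))
            }
        }
        entries = compiled
    }

    func matches(in text: String) -> [String] {
        // Strip accents once per input rather than once per rude word.
        let folded = text.folding(options: .diacriticInsensitive, locale: nil)
        // NSRange counts UTF-16 code units, hence utf16.count.
        let range = NSRange(location: 0, length: folded.utf16.count)
        return entries
            .filter { $0.regex.firstMatch(in: folded, options: [], range: range) != nil }
            .map { $0.word }
    }
}
```

Because \W treats digits and underscores as word characters, an input such as "hate_you" is not matched, which is consistent with matching fewer strings than \P{L} would.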
Links
These might help some people dealing with regex and predicates: