
I have been struggling to remove niqqud (the diacritical signs used to represent vowels or to distinguish between alternative pronunciations of letters of the Hebrew alphabet). For instance, I have this variable: sample1 <- "הֻסְמַק"

I cannot find an effective way to remove the signs below the letters.

I tried stringr, with str_replace_all(sample1, "[^[:alnum:]]", ""), and I also tried gsub('[:punct:]','',sample1).

No success... :-( Any ideas?

smci
Dmitry Leykin

1 Answer


You can use the \p{M} Unicode category to match diacritics with Perl-like regex, and gsub all of them in one go like this:

sample1 <- "הֻסְמַק"
gsub("\\p{M}", "", sample1, perl=TRUE)

Result: [1] "הסמק"
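To check what the substitution removes, it can help to count characters before and after. This is a minimal sketch, not part of the original answer: the word is built from code points with intToUtf8 so the example does not depend on file encoding, and the spelling (he, qubuts, samekh, sheva, mem, patah, qof) is my assumed reading of the sample.

```r
# Build a pointed Hebrew word from code points (assumed spelling of the sample):
# he, qubuts, samekh, sheva, mem, patah, qof
pointed <- intToUtf8(c(0x05D4, 0x05BB, 0x05E1, 0x05B0, 0x05DE, 0x05B7, 0x05E7))

# Each niqqud sign is a separate code point, so this counts letters plus signs
n_before <- nchar(pointed, type = "chars")

# \p{M} matches combining marks; perl = TRUE is needed for Unicode properties
stripped <- gsub("\\p{M}", "", pointed, perl = TRUE)
n_after <- nchar(stripped, type = "chars")

# The same word spelled with base letters only, for comparison
letters_only <- intToUtf8(c(0x05D4, 0x05E1, 0x05DE, 0x05E7))
stripped == letters_only
```

Here n_before counts seven code points and n_after four, since only the three combining signs are dropped and the base letters stay in place.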


\p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).

See more at Regular-Expressions.info, "Unicode Categories".
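Because \p{M} covers marks from every script, it also strips things like a combining accent on a Latin letter. If only Hebrew signs should go, one option is an explicit code point class. The following is a sketch under assumptions: strip_hebrew_marks is a hypothetical helper name, the ranges are the combining marks of the Unicode Hebrew block as I understand them, and the input is assumed to be UTF-8 text so that the \x{...} escapes are accepted.

```r
# Hypothetical helper: remove only combining marks from the Hebrew block
# (cantillation and points U+0591-U+05BD, plus U+05BF, U+05C1, U+05C2,
# U+05C4, U+05C5, U+05C7), leaving marks of other scripts untouched.
strip_hebrew_marks <- function(x) {
  hebrew_marks <- "[\\x{0591}-\\x{05BD}\\x{05BF}\\x{05C1}\\x{05C2}\\x{05C4}\\x{05C5}\\x{05C7}]"
  gsub(hebrew_marks, "", x, perl = TRUE)
}

# "e" + combining acute accent, a space, then he + qubuts
mixed <- intToUtf8(c(0x0065, 0x0301, 0x0020, 0x05D4, 0x05BB))

# \p{M} drops both the Latin accent and the Hebrew point
all_marks_removed <- gsub("\\p{M}", "", mixed, perl = TRUE)

# The narrower class drops only the Hebrew point
hebrew_only <- strip_hebrew_marks(mixed)
```

Punctuation such as the maqaf (U+05BE) is deliberately left out of the class, which matches what \p{M} does, since it is not a combining mark.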

Wiktor Stribiżew