removing Hebrew "niqqud" using r

Question

Have been struggling to remove niqqud ( diacritical signs used to represent vowels or distinguish between alternative pronunciations of letters of the Hebrew alphabet). I have for instance this variable: sample1 <- "הֻסְמַק"

And i cannot find effective way to remove the signs below the letters.

tried stringer, with str_replace_all(sample1, "[^[:alnum:]]", "") tried gsub('[:punct:]','',sample1)

no success... :-( any ideas?

Have a look at [my gsub example](http://ideone.com/1IxAeA), does it work for you? — Wiktor Stribiżew, Sep 17 '15 at 19:24

score 3 · Accepted Answer · answered Sep 17 '15 at 19:50

You can use the \p{M} Unicode category to match diacritics with Perl-like regex, and gsub all of them in one go like this:

sample1 <- "הֻסְמַק"
gsub("\\p{M}", "", sample1, perl=T)

Result: [1] "הסמק"

See demo

\p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).

See more at Regular-Expressions.info, "Unicode Categories".

removing Hebrew "niqqud" using r

1 Answers1

Linked