Why does Ruby gsub with a regex change the spelling of user input in regional languages?

Question

I am using a Ruby regex to filter user input so that only digits and letters of any language are allowed. For some words, however, the spelling changes after the regex is applied. For example:

text = 'कंप्यूटर'
regex = /[^(\p{Alpha})]/
filter_text = text.gsub(regex, '') # returns कंपयूटर

You can see that the input and output are different. How can I resolve this?

Wiktor Stribiżew · Accepted Answer · 2022-08-23T16:39:22.883

You can use

regex = /[^\p{L}\p{Nd}\p{M}]+/

It matches one or more consecutive characters that are not Unicode letters, decimal digits, or combining marks, so gsub removes only those.

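\p{Nd} matches all Unicode characters in the 'Number, Decimal Digit' category, \p{L} matches all Unicode letters, and \p{M} matches all combining marks (diacritics, vowel signs, and the virama). The original pattern changed the spelling because it removed the virama ् (U+094D) between प and य: the virama is a nonspacing mark that \p{Alpha} does not match, so the conjunct प्य came out as पय. Note also that the parentheses inside the original character class are literal characters, not a group.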
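To see which character the original pattern drops, it can help to classify each code point of the sample word. The following is only an illustrative sketch; the names report and dropped are made up for this example and are not part of the question or of any library.

```ruby
# Sketch: classify each character of the sample word from the question.
text = 'कंप्यूटर'

report = text.each_char.map do |ch|
  {
    char: ch,
    codepoint: format('U+%04X', ch.ord),
    letter: ch.match?(/\p{L}/),      # Unicode letter
    mark: ch.match?(/\p{M}/),        # combining mark
    alpha: ch.match?(/\p{Alpha}/)    # what the original pattern kept
  }
end

report.each { |row| puts row }

# Code points that the original class [^(\p{Alpha})] would delete,
# i.e. characters not matched by \p{Alpha} (hypothetical helper variable).
dropped = report.reject { |row| row[:alpha] }.map { |row| row[:codepoint] }
puts dropped.inspect
```

Every character in the word is either a letter or a mark, which is why adding \p{M} to the allowed set leaves the word intact.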

See the Ruby demo:

text = 'कंप्यूटर'
regex = /[^\p{L}\p{Nd}\p{M}]+/
filter_text = text.gsub(regex, '')
puts filter_text
# => कंप्यूटर
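
As a further sketch, the same pattern can be wrapped in a small helper and tried on mixed input. The constant KEEP_ONLY and the method keep_letters_digits_marks are hypothetical names chosen for this illustration. Note that whitespace, punctuation, and underscores are all removed, which may or may not be what you want.

```ruby
# Sketch: hypothetical helper around the pattern from the answer.
KEEP_ONLY = /[^\p{L}\p{Nd}\p{M}]+/

def keep_letters_digits_marks(input)
  # Delete every run of characters that are not letters,
  # decimal digits, or combining marks.
  input.gsub(KEEP_ONLY, '')
end

puts keep_letters_digits_marks('कंप्यूटर-123, OK!') # => कंप्यूटर123OK
puts keep_letters_digits_marks('user_name 42')      # => username42
puts keep_letters_digits_marks('१२३ abc')           # => १२३abc
```

If spaces should survive, one option is to add \s to the negated class, assuming that suits your validation rules.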
