JavaScript regex validation for non-Latin characters with a small whitelist of symbols

Question

I'm trying to create validation rules for a username in two steps:

Detect if the string contains any non-Latin characters. All non-alphabetic symbols/numbers/whitespace are allowed.
Detect if the string contains any symbols which are not in the whitelist (' - _ `). All Latin/non-Latin characters/numbers/whitespace are allowed.

I thought it would be easy, but I was wrong...

For the first case I've tried to remove latin characters/numbers/whitespaces from the string:

str.replace(/[A-Za-z0-9\s]/g, '')

With this rule, from "Xxx z 88A ююю 4$??!!" I get "ююю$??!!". But how do I also remove all the symbols (only "ююю" should stay)?

For the second case I've tried to remove Latin characters/numbers/whitespace/symbols from the whitelist (' - _ `) with str.replace(/[A-Za-z0-9-_`\s]/g, ''), but I don't know how to remove non-Latin characters.

Summary: My main problem is to detect non latin characters and separate them from special symbols.

UPDATE: Ok, for my second case I can use:

str.replace(/[\u0250-\ue007]/g, '').replace(/[A-Za-z0-9-_`\s]/g, '')

It works, but looks dirty... Pardon for backticks.

What do you mean by "Latin Characters"? That covers a lot of things. — Matt Ellen, Sep 13 '22 at 07:42
To quote [Wikipedia](https://en.wikipedia.org/wiki/Latin_script_in_Unicode) "As of version 14.0 of the Unicode Standard, 1,475 characters in the following 19 blocks are classified as belonging to the Latin script" — Matt Ellen, Sep 13 '22 at 08:09
@MattEllen agree, a bit confusing. First rule criteria: username should contain only Latin letters (a-zA-Z) and symbols (,.%$^#@$$^ etc) are allowed. — M.N., Sep 13 '22 at 12:49
1st case ... remove any non letter character (sequence) then remove any latin character (sequence) => `"Xxx z 88A ююю 4$??!!".replace(/[^\p{L}]+/gu, '').replace(/[a-zA-Z]+/g, '')` ... 2nd case remove any character (sequence) which is neither letter nor number nor whitespace (nor dash nor underscore) => `"Xxx z 88A ююю 4$??!!".replace(/[^\p{L}\p{N}\p{Z}_-]+/gu, '')` ... read up ... [regex unicode escapes](https://www.regular-expressions.info/unicode.html#category) — Peter Seliger, Sep 13 '22 at 14:04
Anyway, my niggling point about calling a-z The Latin Letters in a unicode context, is that, while they are Latin letters, they're not *all* the Latin letters. For example, what about ỻ? (`'\u1efb'` in javascript.) This is "Latin Small Letter Middle-Welsh Ll". Clearly Latin and clearly a letter, but not within the set a-z. My point is, I think you'd be better served just saying a-z, and not Latin letters. — Matt Ellen, Sep 13 '22 at 16:37

score 2 · Accepted Answer · answered Sep 13 '22 at 13:50


For the first problem, eliminating a-z, 0-9, whitespace, symbols and punctuation, you need to know some Unicode tricks.

You can reference Unicode character categories using \p{...} property escapes. Symbols are S, punctuation is P.
To use this magic, you need to add the u modifier (flag) to the regex.

That gives us:

/([a-z0-9]|\s|\p{S}|\p{P})/giu

(I added the i because then I don't have to write A-Z as well as a-z.)
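As a quick check, here is a minimal sketch that applies the pattern above to the sample string from the question. The variable names are only illustrative. Note that anything outside the listed groups survives the replacement, so non-ASCII digits, for instance, would remain alongside non-ASCII letters.

```javascript
// Pattern from the answer above: ASCII letters and digits, whitespace,
// Unicode symbols (\p{S}) and Unicode punctuation (\p{P}).
const removable = /([a-z0-9]|\s|\p{S}|\p{P})/giu;

// Sample string taken from the question.
const sample = 'Xxx z 88A ююю 4$??!!';

// Whatever is left after stripping was not matched by any of the groups above.
const leftover = sample.replace(removable, '');

console.log(leftover); // "ююю"
console.log(leftover.length > 0 ? 'contains characters outside a-z' : 'a-z only');
```

If the leftover string is non-empty, the input contained something other than a-z, digits, whitespace, symbols or punctuation, which is one way to read the first rule.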

Since you have a solution for your second problem, I'll leave that with you.

Matt Ellen


I've added '-_` characters to exceptions and now it works like a charm. /(?!['|\-|_|`])([a-z0-9]|\s|\p{S}|\p{P})/giu – M.N. Sep 13 '22 at 20:34

Peter Seliger · Answer 2 · 2022-09-14T05:22:21.693

The two cases could be solved as follows ...

The first case boils down to ... "allow just non-Latin (i.e. non-ASCII) letters" ... which could be achieved by ...
- removing any non-letter character sequence ... /[^\p{L}]+/gu
- and then removing any ASCII letter sequence ... /[a-zA-Z]+/g
The second case allows "just any letter, number and whitespace as well as underscore and dash" ... which is best achieved by ...
- removing any character sequence which contains neither letter/\p{L} nor number/\p{N} nor whitespace/\p{Z} nor underscore nor dash ... /[^\p{L}\p{N}\p{Z}_-]+/gu

In addition the OP could read about regex unicode escapes.

const testSample = 'Xxx z_88A-ююю 4$??!!';

console.log(
  '1st case ... allow just non ascii letters ...', {
    testSample,
    result: testSample
      // remove any non letter character sequence ...
      .replace(/[^\p{L}]+/gu, '')
      // ... then remove any ascii letter sequence.
      .replace(/[a-zA-Z]+/g, ''),
  },
);
console.log(
  '2nd case ... allow any letter, number and whitespace as well as underscore and dash ...', {
    testSample,
    result: testSample
      // remove any character sequence which contains neither letter/`\p{L}`
      // nor number/`\p{N}` nor whitespace/`\p{Z}` nor underscore nor dash.
      .replace(/[^\p{L}\p{N}\p{Z}_-]+/gu, ''),
  },
);
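For validation rather than cleanup, the same negated class can be used to list the offending characters instead of removing them. The sketch below is an assumption-laden variant, not the answer's own code: `findDisallowedSymbols` is a hypothetical helper, the apostrophe and backtick from the question's whitelist are added to the class, and `\s` is swapped in for `\p{Z}` so that tabs and line breaks also count as whitespace (`\p{Z}` covers only the Unicode separator categories).

```javascript
// Hypothetical helper (not part of the answer): returns the distinct
// characters that fall outside the allowed set for the second rule.
function findDisallowedSymbols(username) {
  // Allowed: any letter, any number, any whitespace, plus ' _ ` and -
  // The dash stays last in the class so it is read literally, not as a range.
  const disallowed = /[^\p{L}\p{N}\s'_`-]/gu;
  // match() returns null when nothing matches, hence the fallback to [].
  return [...new Set(username.match(disallowed) || [])];
}

const offending = findDisallowedSymbols("Xxx z_88A-ююю 4$??!! o'neil`");
console.log(offending); // ["$", "?", "!"]
```

An empty result would mean the username passes the second rule; a non-empty one can be shown directly in an error message.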


Great answer. Is there any way to add the symbols '-_` to the exceptions in the second case? — M.N., Sep 13 '22 at 17:46
@M.N. ... If the OP looks again closely and also runs the example code the OP would notice that as for the 2nd case underscore and dash are already part of the negated character class, thus underscore and dash will be kept as well as any letter, number and whitespace. — Peter Seliger, Sep 13 '22 at 21:37

score 1 · Answer 3 · answered Sep 13 '22 at 16:23

So instead of matching the "forbidden" characters by specifying them individually or as a range, you could simply invert the match of the allowed characters:

For case one this would be (as I understood it)

[^A-Za-z0-9,.%$^#@$_-]

That little ^ as first character of the character class (inside the []) inverts the rest of the character class, meaning: match anything except those characters.

Just make sure to keep the - as last character inside the character class when you want to match/not match literally that one and don't define a range.

And for case two you could similarly specify only the allowed characters. Unfortunately I did not really understand what you meant by "whitelist" and where you want to remove or keep what.
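A minimal sketch of the inverted approach for case one, using the character class shown above with the g flag so that every forbidden character is reported. The variable names are illustrative, and the allowed set is only this answer's reading of the rule. It does not include whitespace, so spaces are reported as well; adding `\s` inside the class would allow them.

```javascript
// Negated class from the answer above: matches anything that is NOT allowed.
const notAllowed = /[^A-Za-z0-9,.%$^#@_-]/g;

// Sample string taken from the question.
const input = 'Xxx z 88A ююю 4$??!!';

// match() returns null when nothing matches, hence the fallback to [].
const hits = input.match(notAllowed) || [];
const isValid = hits.length === 0;

console.log(isValid); // false
// Distinct forbidden characters, in order of first appearance.
console.log([...new Set(hits)]); // [" ", "ю", "?", "!"]
```

The list of distinct hits can be used to build an error message that names the exact characters that were rejected.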

Agree, forbidden characters not the best option. But it is the only option to detect what characters were used and display error with description. — M.N., Sep 15 '22 at 20:15
But if you use my method, you would still match all the forbidden characters, simply by matching everything _except_ the allowed ones. (It is probably easier to specify just the allowed characters and then invert the match) — cyberbrain, Sep 16 '22 at 06:16
