How can I detect non-roman characters in a string? Mind you, it's not as simple as classing all characters outside of the range A-Z and 0-9 as non-roman. There are lots of variations on roman characters, like the German ä, ö, ü, which are still roman. "中文", on the other hand, is clearly not roman script.
welcome to stackoverflow. We give help to specific problems, and it is common for the asker to present what he tried so far to solve his problem him/herself and get feedback and help based on that. – Winchestro Jun 08 '14 at 16:20
1 Answer
JavaScript strings are Unicode (stored as UTF-16 code units), and the character ranges for various scripts are well documented at http://www.unicode.org/charts/
You will see that there are several blocks that correspond to Latin (Roman) scripts. Beyond basic ASCII, the most common of these is the Latin-1 Supplement block, U+0080–U+00FF (sometimes loosely called "high ASCII"). This includes the German characters you mention.
JavaScript lets us test for Unicode ranges nicely using regular expressions. So you could detect Latin-1 Supplement characters in several strings as per this example:
var en = 'Coffee',
    fr = 'Café',
    el = 'Καφές';
// Replace every Latin-1 Supplement character (U+0080–U+00FF) with '*'
console.log( en.replace( /[\u0080-\u00FF]/g, '*') );
console.log( fr.replace( /[\u0080-\u00FF]/g, '*') );
console.log( el.replace( /[\u0080-\u00FF]/g, '*') );
This will print out:
Coffee
Caf*
Καφές
Because, according to our character ranges, only the accented é matches the Latin-1 Supplement range (hence it is replaced with *).
So to better answer your question, to detect "non-roman" characters you could do:
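var str = 'a ä ö ü 中 文',
    // Anything outside the Latin blocks: Basic Latin through Latin Extended-B,
    // Latin Extended Additional, Latin Extended-C and Latin Extended-D
    reg = /[^\u0000-\u024F\u1E00-\u1EFF\u2C60-\u2C7F\uA720-\uA7FF]/g;
console.log( str.replace( reg, '?') );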
Which would show:
a ä ö ü ? ?
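If you only need a yes/no answer rather than a replaced string, the same character class can be used with `test()`. This is a minimal sketch: `hasNonLatin` is a made-up helper name, and the ranges are simply the ones from the regex above, so anything outside those blocks (including combining accents in the U+0300–U+036F block) counts as "non-roman".

```javascript
// Same ranges as the regex above. No "g" flag here, so test() keeps no
// lastIndex state between calls.
var NON_LATIN = /[^\u0000-\u024F\u1E00-\u1EFF\u2C60-\u2C7F\uA720-\uA7FF]/;

// hasNonLatin is a hypothetical helper, not part of any library.
// Returns true if at least one character falls outside the listed Latin blocks.
function hasNonLatin(str) {
  return NON_LATIN.test(str);
}

console.log(hasNonLatin('a ä ö ü')); // false
console.log(hasNonLatin('中文'));     // true
console.log(hasNonLatin('Καφές'));   // true
```

Because ASCII digits, spaces and punctuation sit in the U+0000–U+007F range, they are treated as "roman" here too.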
You can use these ranges to do whatever it is you specifically need. I put together this crude tool for building regex from unicode blocks, but I'm quite sure there are better resources out there.
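For illustration, building such a regex from a list of block ranges can be sketched in a few lines. This is not the tool mentioned above, just a rough idea of the approach: `latinBlocks`, `toEscape` and `buildNegatedClass` are hypothetical names, and the block boundaries are the ones already used in this answer.

```javascript
// Start/end code points of the Latin blocks used in this answer.
var latinBlocks = [
  [0x0000, 0x024F], // Basic Latin through Latin Extended-B
  [0x1E00, 0x1EFF], // Latin Extended Additional
  [0x2C60, 0x2C7F], // Latin Extended-C
  [0xA720, 0xA7FF]  // Latin Extended-D
];

// Format a BMP code point as a \uXXXX escape for use inside a regex source.
function toEscape(codePoint) {
  return '\\u' + ('0000' + codePoint.toString(16).toUpperCase()).slice(-4);
}

// Build a global regex matching any character NOT in the given blocks.
function buildNegatedClass(blocks) {
  var body = blocks.map(function (block) {
    return toEscape(block[0]) + '-' + toEscape(block[1]);
  }).join('');
  return new RegExp('[^' + body + ']', 'g');
}

var nonLatin = buildNegatedClass(latinBlocks);
console.log('a ä ö ü 中 文'.replace(nonLatin, '?')); // a ä ö ü ? ?
```

This only handles code points up to U+FFFF, which is enough for the blocks listed here.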
