How can I detect non-Roman characters in a string? Mind you, it's not as simple as flagging every character outside A-Z and 0-9. There are lots of variations on Roman characters, like the German ä, ö, ü, which are still Roman. "中文", on the other hand, is clearly not Roman script.

Malte
  • welcome to stackoverflow. We give help to specific problems, and it is common for the asker to present what he tried so far to solve his problem him/herself and get feedback and help based on that. – Winchestro Jun 08 '14 at 16:20

1 Answer

JavaScript strings are Unicode (stored as UTF-16 code units), and the character ranges for the various scripts are well documented at http://www.unicode.org/charts/

You will see that several blocks correspond to Latin (Roman) script. The most common of these beyond plain ASCII is the Latin-1 Supplement block, U+0080–U+00FF (sometimes loosely called "high ASCII"). It includes the German characters you mention.
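To make the block boundaries concrete, here is a minimal sketch that prints each character's code point and whether it falls inside the Latin-1 Supplement block. The helper name inLatin1Supplement is made up for illustration, and the check assumes the accented letters are stored as single precomposed code points.

```javascript
// Hypothetical helper: true if the first code point of ch lies in
// the Latin-1 Supplement block (U+0080 to U+00FF).
function inLatin1Supplement(ch) {
  const cp = ch.codePointAt(0);
  return cp >= 0x0080 && cp <= 0x00FF;
}

// Format a code point the way the Unicode charts write it, e.g. U+00E4.
function toUPlus(ch) {
  return 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0');
}

for (const ch of ['a', '\u00E4', '\u00F6', '\u00FC', '\u4E2D']) {
  console.log(ch, toUPlus(ch), inLatin1Supplement(ch));
}
// a U+0061 false
// ä U+00E4 true
// ö U+00F6 true
// ü U+00FC true
// 中 U+4E2D false
```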

JavaScript lets us test for Unicode ranges nicely using regular expressions. For example, you could detect Latin-1 Supplement characters in several strings like this:

var en = 'Coffee',
    fr = 'Café',
    el = 'Καφές';

console.log( en.replace( /[\u0080-\u00FF]/g, '*') );
console.log( fr.replace( /[\u0080-\u00FF]/g, '*') );
console.log( el.replace( /[\u0080-\u00FF]/g, '*') );

This will print out:

Coffee
Caf*
Καφές

Only the accented é falls within the Latin-1 Supplement range, so it is the only character replaced with *.

So to better answer your question, to detect "non-roman" characters you could do:

var str = 'a ä ö ü 中 文',
    reg = /[^\u0000-\u024F\u1E00-\u1EFF\u2C60-\u2C7F\uA720-\uA7FF]/g;

console.log( str.replace( reg, '?') );

Which would show:

a ä ö ü ? ?
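If you only need a yes/no answer rather than a replaced string, the same character class can be wrapped in a small predicate. This is a sketch, not a definitive implementation: hasNonRoman and NON_ROMAN are hypothetical names, and the NFC normalization step is my own addition. Without it, an accent typed as a separate combining mark (for example "e" followed by U+0301) falls outside the listed ranges and would be reported as non-Roman. Note that digits, spaces and ASCII punctuation sit inside U+0000–U+007F, so they count as "Roman" here.

```javascript
// Same ranges as the character class above: Basic Latin through
// Latin Extended-B, Latin Extended Additional, Latin Extended-C and
// Latin Extended-D. No `g` flag on purpose: a global regex keeps
// state in lastIndex between .test() calls.
const NON_ROMAN = /[^\u0000-\u024F\u1E00-\u1EFF\u2C60-\u2C7F\uA720-\uA7FF]/;

// Hypothetical helper, not a built-in.
function hasNonRoman(str) {
  // NFC folds "e" + combining acute (U+0301) into the single
  // code point é (U+00E9) before the range check runs.
  return NON_ROMAN.test(str.normalize('NFC'));
}

console.log(hasNonRoman('a ä ö ü'));       // false
console.log(hasNonRoman('a ä ö ü 中 文')); // true
console.log(hasNonRoman('Cafe\u0301'));    // false
console.log(hasNonRoman('Καφές'));         // true
```

Combining marks that have no precomposed form survive NFC and would still be flagged, so treat the result as a heuristic.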

You can use these ranges to do whatever you specifically need. I put together this crude tool for building regexes from Unicode blocks, but I'm quite sure there are better resources out there.
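As a rough idea of how a regex can be built from a list of blocks, here is a minimal sketch. It is not the tool mentioned above. LATIN_BLOCKS, toEscape and buildBlockRegex are names invented for this example, the block ranges are copied from the Unicode code charts, and the approach only covers code points up to U+FFFF.

```javascript
// Latin blocks in the Basic Multilingual Plane, per the Unicode charts.
const LATIN_BLOCKS = [
  { name: 'Basic Latin',               from: 0x0000, to: 0x007F },
  { name: 'Latin-1 Supplement',        from: 0x0080, to: 0x00FF },
  { name: 'Latin Extended-A',          from: 0x0100, to: 0x017F },
  { name: 'Latin Extended-B',          from: 0x0180, to: 0x024F },
  { name: 'Latin Extended Additional', from: 0x1E00, to: 0x1EFF },
  { name: 'Latin Extended-C',          from: 0x2C60, to: 0x2C7F },
  { name: 'Latin Extended-D',          from: 0xA720, to: 0xA7FF },
];

// Format a BMP code point as a \uXXXX regex escape.
function toEscape(cp) {
  return '\\u' + cp.toString(16).toUpperCase().padStart(4, '0');
}

// Build a character class from a list of blocks. With negate=true the
// regex matches anything outside the listed blocks.
function buildBlockRegex(blocks, { negate = false, flags = '' } = {}) {
  const body = blocks.map(b => toEscape(b.from) + '-' + toEscape(b.to)).join('');
  return new RegExp('[' + (negate ? '^' : '') + body + ']', flags);
}

const nonRoman = buildBlockRegex(LATIN_BLOCKS, { negate: true, flags: 'g' });
console.log('a ä ö ü 中 文'.replace(nonRoman, '?')); // a ä ö ü ? ?
```

Keeping the blocks in a table makes it easy to add or drop a block without editing a long character class by hand.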

Tim