4

I saw this post, "Regular expression to match non-English characters?", which shows how to filter out foreign characters like so: str = str.replace(/[^\x00-\x7F]+/g, "");

I am trying to allow these characters while filtering out special characters, except for the single quote, hyphen, underscore and empty space (' - _ and " "), which should also be allowed.

Question: how can I combine the two to allow foreign characters in this JavaScript regex?
str = str.replace(/[^a-zA-Z0-9'-_ ]/g, "");

Let's say I want to allow the ü; this does not work: str = str.replace(/[^a-zA-Z0-9'-_ ü]/g, "");

t q
  • [You want to allow `-_ `](https://www.debuggex.com/i/i18JXYOQ8rOvNZA1.png). Use: `[^A-z0-9'".]` and all the other symbols you can possibly have from `\x00` to `\x7F` – hjpotter92 May 01 '14 at 22:47
  • thank you but how does this translate to javascript? i have been unsuccessful with the code – t q May 01 '14 at 22:48
  • are you trying to allow a dash and underscore? – attila May 01 '14 at 22:48
  • So you want to allow letters from foreign alphabets, but what about non-ascii special characters? For example ♥★☐↑ – Andrew Clark May 01 '14 at 22:54
  • @F.J correct, i would like to filter out dingbat and other graphic symbols. only allow alpha numeric + foreign char + hyphen, underscore, single quote, empty space – t q May 01 '14 at 22:56
  • The following resources may be useful: http://inimino.org/~inimino/blog/javascript_cset, https://github.com/paulmillr/unicode-categories. I haven't used either but I don't know if you'll do any better than some crazy long regex that implements the Unicode character ranges. – Andrew Clark May 01 '14 at 23:04

2 Answers

2

This is pretty complex because, no matter how you slice it, you either have a ton of Unicode letters to include or a ton of Unicode special characters to exclude. What you essentially need here is a regex that will only allow characters from the Unicode general categories for letters (Lu, Ll, Lt, Lm, Lo).

In some regex flavors support for Unicode general categories is built in, and your regex would just be something like the following:

[\p{Ll}\p{Lu}\p{Lt}\p{Lm}\p{Lo}'\- _]

Unfortunately, JavaScript regular expressions do not support this natively at the time of writing, but you could do this with the Unicode addon to the XRegExp library. The usage would look something like this (for filtering out all of the characters you do not want):

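// Compile the pattern with XRegExp() first: a plain string passed as the search argument is matched literally.
XRegExp.replace(text, XRegExp("[^\\p{Ll}\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}'\\- _]"), '', 'all');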
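For readers on a newer runtime, here is a minimal sketch of the same filter without a library. It assumes a JavaScript engine that implements ES2018 Unicode property escapes (`\p{...}` together with the `u` flag), which was not an option when this answer was written. The function name `stripDisallowed` is hypothetical and only used for illustration.

```javascript
// Assumption: the engine supports ES2018 Unicode property escapes.
// \p{L} is the general category "Letter", i.e. Lu, Ll, Lt, Lm and Lo combined.
// `stripDisallowed` is a hypothetical helper name, not part of any library.
function stripDisallowed(text) {
  // Keep letters, single quote, hyphen (escaped so it is a literal), space and underscore.
  return text.replace(/[^\p{L}'\- _]/gu, "");
}

// Digits are not in the allowed set here, so the "1" is removed as well.
console.log(stripDisallowed("Müller's ♥ café_★-1!"));
```

If digits should be kept too, adding `\p{Nd}\p{Nl}` inside the character class would mirror the 'Nd' and 'Nl' categories mentioned in the answer.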

Alternatively, if you want to construct a crazy long JavaScript regex that does the job, the CSET JavaScript library can be used. Here is the regex I came up with:

var regex = /[\u0000-\u001f!-&(-,.-@[-^`{-©«-´¶-¹»-¿×÷˂-˅˒-˟˥-˫˭˯-\u036f͵\u0378-\u0379;-΅·\u038b\u038d\u03a2϶҂-\u0489\u0524-\u0530\u0557-\u0558՚-\u0560\u0588-\u05cf\u05eb-\u05ef׳-\u0620\u064b-٭\u0670۔\u06d6-\u06e4\u06e7-\u06ed۰-۹۽-۾܀-\u070f\u0711\u0730-\u074c\u07a6-\u07b0\u07b2-߉\u07eb-\u07f3߶-߹\u07fb-\u0903\u093a-\u093c\u093e-\u094f\u0951-\u0957\u0962-॰\u0973-\u097a\u0980-\u0984\u098d-\u098e\u0991-\u0992\u09a9\u09b1\u09b3-\u09b5\u09ba-\u09bc\u09be-\u09cd\u09cf-\u09db\u09de\u09e2-৯৲-\u0a04\u0a0b-\u0a0e\u0a11-\u0a12\u0a29\u0a31\u0a34\u0a37\u0a3a-\u0a58\u0a5d\u0a5f-\u0a71\u0a75-\u0a84\u0a8e\u0a92\u0aa9\u0ab1\u0ab4\u0aba-\u0abc\u0abe-\u0acf\u0ad1-\u0adf\u0ae2-\u0b04\u0b0d-\u0b0e\u0b11-\u0b12\u0b29\u0b31\u0b34\u0b3a-\u0b3c\u0b3e-\u0b5b\u0b5e\u0b62-୰\u0b72-\u0b82\u0b84\u0b8b-\u0b8d\u0b91\u0b96-\u0b98\u0b9b\u0b9d\u0ba0-\u0ba2\u0ba5-\u0ba7\u0bab-\u0bad\u0bba-\u0bcf\u0bd1-\u0c04\u0c0d\u0c11\u0c29\u0c34\u0c3a-\u0c3c\u0c3e-\u0c57\u0c5a-\u0c5f\u0c62-\u0c84\u0c8d\u0c91\u0ca9\u0cb4\u0cba-\u0cbc\u0cbe-\u0cdd\u0cdf\u0ce2-\u0d04\u0d0d\u0d11\u0d29\u0d3a-\u0d3c\u0d3e-\u0d5f\u0d62-൹\u0d80-\u0d84\u0d97-\u0d99\u0db2\u0dbc\u0dbe-\u0dbf\u0dc7-\u0e00\u0e31\u0e34-฿\u0e47-\u0e80\u0e83\u0e85-\u0e86\u0e89\u0e8b-\u0e8c\u0e8e-\u0e93\u0e98\u0ea0\u0ea4\u0ea6\u0ea8-\u0ea9\u0eac\u0eb1\u0eb4-\u0ebc\u0ebe-\u0ebf\u0ec5\u0ec7-\u0edb\u0ede-\u0eff༁-\u0f3f\u0f48\u0f6d-\u0f87\u0f8c-\u0fff\u102b-\u103e၀-၏\u1056-\u1059\u105e-\u1060\u1062-\u1064\u1067-\u106d\u1071-\u1074\u1082-\u108d\u108f-႟\u10c6-\u10cf჻\u10fd-\u10ff\u115a-\u115e\u11a3-\u11a7\u11fa-\u11ff\u1249\u124e-\u124f\u1257\u1259\u125e-\u125f\u1289\u128e-\u128f\u12b1\u12b6-\u12b7\u12bf\u12c1\u12c6-\u12c7\u12d7\u1311\u1316-\u1317\u135b-\u137f᎐-\u139f\u13f5-\u1400᙭-᙮\u1677-\u1680᚛-\u169f᛫-\u16ff\u170d\u1712-\u171f\u1732-\u173f\u1752-\u175f\u176d\u1771-\u177f\u17b4-៖៘-៛\u17dd-\u181f\u1878-\u187f\u18a9\u18ab-\u18ff\u191d-᥏\u196e-\u196f\u1975-\u197f\u19aa-\u19c0\u19c8-᧿\u1a17-\u1b04\u1b34-\u1b44\u1b4c-\u1b82\u1ba1-\u1bad᮰-\u1bff\u1c24-\u1c4c᱐-᱙᱾-\u1cff\u1dc0
-\u1dff\u1f16-\u1f17\u1f1e-\u1f1f\u1f46-\u1f47\u1f4e-\u1f4f\u1f58\u1f5a\u1f5c\u1f5e\u1f7e-\u1f7f\u1fb5᾽᾿-῁\u1fc5῍-῏\u1fd4-\u1fd5\u1fdc-῟῭-\u1ff1\u1ff5´-⁰\u2072-⁾₀-\u208f\u2095-℁℃-℆℈-℉℔№-℘℞-℣℥℧℩℮℺-℻⅀-⅄⅊-⅍⅏-\u2182\u2185-\u2bff\u2c2f\u2c5f\u2c70\u2c7e-\u2c7f⳥-⳿\u2d26-\u2d2f\u2d66-\u2d6e\u2d70-\u2d7f\u2d97-\u2d9f\u2da7\u2daf\u2db7\u2dbf\u2dc7\u2dcf\u2dd7\u2ddf-⸮⸰-〄\u3007-〰〶-\u303a〽-\u3040\u3097-゜゠・\u3100-\u3104\u312e-\u3130\u318f-㆟\u31b8-\u31ef㈀-㏿\u4db6-䷿\u9fc4-\u9fff\ua48d-\ua4ff꘍-꘏꘠-꘩\ua62c-\ua63f\ua660-\ua661\ua66f-꙾\ua698-꜖꜠-꜡꞉-꞊\ua78d-\ua7fa\ua802\ua806\ua80b\ua823-\ua83f꡴-\ua881\ua8b4-꤉\ua926-꤯\ua947-\ua9ff\uaa29-\uaa3f\uaa43\uaa4c-\uabff\ud7a4-\ud7ff\ud840-\ud868\udc00-\uf8ff\ufa2e-\ufa2f\ufa6b-\ufa6f\ufada-\ufaff\ufb07-\ufb12\ufb18-\ufb1c\ufb1e﬩\ufb37\ufb3d\ufb3f\ufb42\ufb45\ufbb2-\ufbd2﴾-\ufd4f\ufd90-\ufd91\ufdc8-\ufdef﷼-\ufe6f\ufe75\ufefd-@[-`{-・\uffbf-\uffc1\uffc8-\uffc9\uffd0-\uffd1\uffd8-\uffd9\uffdd-\uffff]|[\ud803-\ud807\ud809-\ud834\ud836-\ud83f\ud86a-\ud87d\ud87f-\udbff][\udc00-\udfff]|\ud800[\udc0c\udc27\udc3b\udc3e\udc4e-\udc4f\udc5e-\udc7f\udcfb-\ude7f\ude9d-\ude9f\uded1-\udeff\udf1f-\udf2f\udf41\udf4a-\udf7f\udf9e-\udf9f\udfc4-\udfc7\udfd0-\udfff]|\ud801[\udc9e-\udfff]|\ud802[\udc06-\udc07\udc09\udc36\udc39-\udc3b\udc3d-\udc3e\udc40-\udcff\udd16-\udd1f\udd3a-\uddff\ude01-\ude0f\ude14\ude18\ude34-\udfff]|\ud808[\udf6f-\udfff]|\ud835[\udc55\udc9d\udca0-\udca1\udca3-\udca4\udca7-\udca8\udcad\udcba\udcbc\udcc4\udd06\udd0b-\udd0c\udd15\udd1d\udd3a\udd3f\udd45\udd47-\udd49\udd51\udea6-\udea7\udec1\udedb\udefb\udf15\udf35\udf4f\udf6f\udf89\udfa9\udfc3\udfcc-\udfff]|\ud869[\uded7-\udfff]|\ud87e[\ude1e-\udfff]|[\ud800-\ud83f\ud869-\udbff]/g;

And the steps to get there (after including the CSET source):

CSET.import();
var allUnicodeLetters = ['Lu', 'Ll', 'Lt', 'Lm', 'Lo'].map(fromUnicodeGeneralCategory).reduce(union);
var allAllowedCharacters = union(allUnicodeLetters, fromString("'- _"));
var regex = new RegExp(toRegex(complement(allAllowedCharacters)), 'g');

Then you could use str = str.replace(regex, '') and it would remove all special characters, including symbols like dingbats, except for the ones you want to allow.

Edit: Just realized you may also want to allow numbers. If so, you could use the following, which was obtained by adding 'Nd' and 'Nl' in the method above:

var regex = /[\u0000-\u001f!-&(-,.-/:-@[-^`{-©«-´¶-¹»-¿×÷˂-˅˒-˟˥-˫˭˯-\u036f͵\u0378-\u0379;-΅·\u038b\u038d\u03a2϶҂-\u0489\u0524-\u0530\u0557-\u0558՚-\u0560\u0588-\u05cf\u05eb-\u05ef׳-\u0620\u064b-\u065f٪-٭\u0670۔\u06d6-\u06e4\u06e7-\u06ed۽-۾܀-\u070f\u0711\u0730-\u074c\u07a6-\u07b0\u07b2-\u07bf\u07eb-\u07f3߶-߹\u07fb-\u0903\u093a-\u093c\u093e-\u094f\u0951-\u0957\u0962-॥॰\u0973-\u097a\u0980-\u0984\u098d-\u098e\u0991-\u0992\u09a9\u09b1\u09b3-\u09b5\u09ba-\u09bc\u09be-\u09cd\u09cf-\u09db\u09de\u09e2-\u09e5৲-\u0a04\u0a0b-\u0a0e\u0a11-\u0a12\u0a29\u0a31\u0a34\u0a37\u0a3a-\u0a58\u0a5d\u0a5f-\u0a65\u0a70-\u0a71\u0a75-\u0a84\u0a8e\u0a92\u0aa9\u0ab1\u0ab4\u0aba-\u0abc\u0abe-\u0acf\u0ad1-\u0adf\u0ae2-\u0ae5\u0af0-\u0b04\u0b0d-\u0b0e\u0b11-\u0b12\u0b29\u0b31\u0b34\u0b3a-\u0b3c\u0b3e-\u0b5b\u0b5e\u0b62-\u0b65୰\u0b72-\u0b82\u0b84\u0b8b-\u0b8d\u0b91\u0b96-\u0b98\u0b9b\u0b9d\u0ba0-\u0ba2\u0ba5-\u0ba7\u0bab-\u0bad\u0bba-\u0bcf\u0bd1-\u0be5௰-\u0c04\u0c0d\u0c11\u0c29\u0c34\u0c3a-\u0c3c\u0c3e-\u0c57\u0c5a-\u0c5f\u0c62-\u0c65\u0c70-\u0c84\u0c8d\u0c91\u0ca9\u0cb4\u0cba-\u0cbc\u0cbe-\u0cdd\u0cdf\u0ce2-\u0ce5\u0cf0-\u0d04\u0d0d\u0d11\u0d29\u0d3a-\u0d3c\u0d3e-\u0d5f\u0d62-\u0d65൰-൹\u0d80-\u0d84\u0d97-\u0d99\u0db2\u0dbc\u0dbe-\u0dbf\u0dc7-\u0e00\u0e31\u0e34-฿\u0e47-๏๚-\u0e80\u0e83\u0e85-\u0e86\u0e89\u0e8b-\u0e8c\u0e8e-\u0e93\u0e98\u0ea0\u0ea4\u0ea6\u0ea8-\u0ea9\u0eac\u0eb1\u0eb4-\u0ebc\u0ebe-\u0ebf\u0ec5\u0ec7-\u0ecf\u0eda-\u0edb\u0ede-\u0eff༁-༟༪-\u0f3f\u0f48\u0f6d-\u0f87\u0f8c-\u0fff\u102b-\u103e၊-၏\u1056-\u1059\u105e-\u1060\u1062-\u1064\u1067-\u106d\u1071-\u1074\u1082-\u108d\u108f\u109a-႟\u10c6-\u10cf჻\u10fd-\u10ff\u115a-\u115e\u11a3-\u11a7\u11fa-\u11ff\u1249\u124e-\u124f\u1257\u1259\u125e-\u125f\u1289\u128e-\u128f\u12b1\u12b6-\u12b7\u12bf\u12c1\u12c6-\u12c7\u12d7\u1311\u1316-\u1317\u135b-\u137f᎐-\u139f\u13f5-\u1400᙭-᙮\u1677-\u1680᚛-\u169f᛫-᛭\u16f1-\u16ff\u170d\u1712-\u171f\u1732-\u173f\u1752-\u175f\u176d\u1771-\u177f\u17b4-៖៘-៛\u17dd-\u17df\u17ea-\u180f\u181a-\u181f\u1878-\u187f\u18a9\u18ab
-\u18ff\u191d-᥅\u196e-\u196f\u1975-\u197f\u19aa-\u19c0\u19c8-\u19cf\u19da-᧿\u1a17-\u1b04\u1b34-\u1b44\u1b4c-\u1b4f᭚-\u1b82\u1ba1-\u1bad\u1bba-\u1bff\u1c24-᰿\u1c4a-\u1c4c᱾-\u1cff\u1dc0-\u1dff\u1f16-\u1f17\u1f1e-\u1f1f\u1f46-\u1f47\u1f4e-\u1f4f\u1f58\u1f5a\u1f5c\u1f5e\u1f7e-\u1f7f\u1fb5᾽᾿-῁\u1fc5῍-῏\u1fd4-\u1fd5\u1fdc-῟῭-\u1ff1\u1ff5´-⁰\u2072-⁾₀-\u208f\u2095-℁℃-℆℈-℉℔№-℘℞-℣℥℧℩℮℺-℻⅀-⅄⅊-⅍⅏-⅟\u2189-\u2bff\u2c2f\u2c5f\u2c70\u2c7e-\u2c7f⳥-⳿\u2d26-\u2d2f\u2d66-\u2d6e\u2d70-\u2d7f\u2d97-\u2d9f\u2da7\u2daf\u2db7\u2dbf\u2dc7\u2dcf\u2dd7\u2ddf-⸮⸰-〄〈-〠\u302a-〰〶-〷〽-\u3040\u3097-゜゠・\u3100-\u3104\u312e-\u3130\u318f-㆟\u31b8-\u31ef㈀-㏿\u4db6-䷿\u9fc4-\u9fff\ua48d-\ua4ff꘍-꘏\ua62c-\ua63f\ua660-\ua661\ua66f-꙾\ua698-꜖꜠-꜡꞉-꞊\ua78d-\ua7fa\ua802\ua806\ua80b\ua823-\ua83f꡴-\ua881\ua8b4-꣏\ua8da-\ua8ff\ua926-꤯\ua947-\ua9ff\uaa29-\uaa3f\uaa43\uaa4c-\uaa4f\uaa5a-\uabff\ud7a4-\ud7ff\ud840-\ud868\udc00-\uf8ff\ufa2e-\ufa2f\ufa6b-\ufa6f\ufada-\ufaff\ufb07-\ufb12\ufb18-\ufb1c\ufb1e﬩\ufb37\ufb3d\ufb3f\ufb42\ufb45\ufbb2-\ufbd2﴾-\ufd4f\ufd90-\ufd91\ufdc8-\ufdef﷼-\ufe6f\ufe75\ufefd-/:-@[-`{-・\uffbf-\uffc1\uffc8-\uffc9\uffd0-\uffd1\uffd8-\uffd9\uffdd-\uffff]|[\ud803-\ud807\ud80a-\ud834\ud836-\ud83f\ud86a-\ud87d\ud87f-\udbff][\udc00-\udfff]|\ud800[\udc0c\udc27\udc3b\udc3e\udc4e-\udc4f\udc5e-\udc7f\udcfb-\udd3f\udd75-\ude7f\ude9d-\ude9f\uded1-\udeff\udf1f-\udf2f\udf4b-\udf7f\udf9e-\udf9f\udfc4-\udfc7\udfd0\udfd6-\udfff]|\ud801[\udc9e-\udc9f\udcaa-\udfff]|\ud802[\udc06-\udc07\udc09\udc36\udc39-\udc3b\udc3d-\udc3e\udc40-\udcff\udd16-\udd1f\udd3a-\uddff\ude01-\ude0f\ude14\ude18\ude34-\udfff]|\ud808[\udf6f-\udfff]|\ud809[\udc63-\udfff]|\ud835[\udc55\udc9d\udca0-\udca1\udca3-\udca4\udca7-\udca8\udcad\udcba\udcbc\udcc4\udd06\udd0b-\udd0c\udd15\udd1d\udd3a\udd3f\udd45\udd47-\udd49\udd51\udea6-\udea7\udec1\udedb\udefb\udf15\udf35\udf4f\udf6f\udf89\udfa9\udfc3\udfcc-\udfcd]|\ud869[\uded7-\udfff]|\ud87e[\ude1e-\udfff]|[\ud800-\ud83f\ud869-\udbff]/g;
Andrew Clark
  • thank you, in case like this `str = str.replace(/[^a-zA-Z0-9'-_ ü]/g, "");` how would i include that `ü`? – t q May 01 '14 at 23:28
  • See my edit, I just added a regex that you can use the way you want to. As for how to just add the `ü`, what you have should actually work just fine (except that the `-` after the `'` should be escaped). Where you will run into issues is when you try to use a character like that as a part of a range in a character class. – Andrew Clark May 01 '14 at 23:49
  • @tq Glad I could help, it was a pretty fun problem. If this answer solved your problem you can accept it by clicking the outline of the check mark next to it. – Andrew Clark May 02 '14 at 00:05
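The escaping point raised in the comments can be checked with a short sketch. Inside a character class, `'-_` is read as a range from `'` (U+0027) to `_` (U+005F), which also covers characters such as `(`, `)`, `/`, `<`, `>` and `?`. The sample string and variable names below are made up for illustration.

```javascript
// Unescaped hyphen: '-_ is a range from ' (U+0027) to _ (U+005F),
// so punctuation inside that range is treated as allowed and survives.
const unescaped = /[^a-zA-Z0-9'-_ ü]/g;

// Escaped hyphen: only the literal characters ' - _ space and ü are allowed
// in addition to ASCII letters and digits.
const escaped = /[^a-zA-Z0-9'\-_ ü]/g;

// Hypothetical sample input.
const sample = "für <b>it's</b> a_b-c (ok?)";

const withRange = sample.replace(unescaped, "");
const withLiteral = sample.replace(escaped, "");

console.log(withRange);   // the angle brackets, slash, parentheses and ? are still there
console.log(withLiteral); // only letters, ', -, _, space and ü remain
```

Placing the hyphen first or last in the class (for example `[^a-zA-Z0-9'_ ü-]`) is another common way to make it literal.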
1

There's no Unicode character class for regular expressions in JavaScript, but you can either list all the characters to include or exclude yourself, by doing something like this:

str = str.replace(/[!@#\$%\^&\*\(\)\{\}\?<>\+:;",\.\\]/g, "");

Or use a library like XRegExp.
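As a rough illustration of the trade-off with the exclusion list above, the sketch below runs that same pattern on a made-up sample. Anything not explicitly listed, such as a dingbat, passes through untouched. The name `stripListed` is hypothetical.

```javascript
// The exclusion list from this answer, unchanged.
const listed = /[!@#\$%\^&\*\(\)\{\}\?<>\+:;",\.\\]/g;

// `stripListed` is a hypothetical helper name used only for this sketch.
function stripListed(text) {
  return text.replace(listed, "");
}

// Accented letters survive because they are not listed, but so does the heart symbol.
console.log(stripListed("Zoë: 100% ♥ (really!)"));
```

This is why an exclusion list only works when the set of unwanted characters is small and known in advance.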

fiction