20

I've come across an error in my web app that I'm not sure how to fix.

Text boxes are sending me the long dash as part of their content (the special long dash that MS Word sometimes inserts automatically). However, I can't find a way to replace it: if I copy that character into a JavaScript str.replace statement, it doesn't render correctly and it breaks the script.

How can I fix this?

The specific character that's killing it is —.

Also, if it helps, I'm passing the value as a GET parameter, and then encoding it in XML and sending it to a server.

Wiktor Stribiżew
cd6

5 Answers

45

This code might help:

text = text.replace(/\u2013|\u2014/g, "-");

It replaces all en dashes (–) and em dashes (—) with simple hyphens (-).

DEMO: http://jsfiddle.net/F953H/
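The same replacement can be wrapped in a small helper so it is applied consistently before the value is sent on. This is only a sketch: the name `normalizeDashes` and the sample string are hypothetical and not part of the answer above.

```javascript
// Hypothetical helper: replace en dashes (U+2013) and em dashes (U+2014)
// with an ASCII hyphen-minus. Escapes are used instead of literal
// characters so the script does not depend on the file's encoding.
function normalizeDashes(text) {
  return text.replace(/[\u2013\u2014]/g, "-");
}

const sample = "pages 10\u201320 \u2014 see appendix";
console.log(normalizeDashes(sample)); // pages 10-20 - see appendix
```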

VisioN
3

That character is called an em dash. You can replace it like so:

str.replace('\u2014', '');

Here is an example Fiddle: http://jsfiddle.net/x67Ph/

The \u2014 is called a Unicode escape sequence. These allow you to specify a Unicode character by its hexadecimal code. 2014 happens to be the code for the em dash.
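A short sketch of how the escape behaves, with made-up variable names. It also shows that passing a string as the first argument to `replace` only replaces the first match, while a regex with the `g` flag replaces every occurrence.

```javascript
// The escape sequence produces a single-character string.
const emDash = "\u2014";
console.log(emDash.length);                     // 1
console.log(emDash.charCodeAt(0).toString(16)); // 2014

const input = "a\u2014b\u2014c";

// String pattern: only the first em dash is removed.
const firstOnly = input.replace("\u2014", "");
console.log(firstOnly === "ab\u2014c"); // true

// Regex with the global flag: every em dash is removed.
const all = input.replace(/\u2014/g, "");
console.log(all); // abc
```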

vcsjones
2

There are several long-ish Unicode dashes you need to worry about (figure dash U+2012, en dash U+2013, em dash U+2014, and horizontal bar U+2015): http://en.wikipedia.org/wiki/Dash

You can replace Unicode characters directly by using the Unicode escape:

'—my string'.replace( /[\u2012\u2013\u2014\u2015]/g, '' )
Trevor Norris
  • This code would replace only the first occurrence. To replace all occurrences, you need a regex with the global flag: `/regex/g` – Gras Double May 03 '12 at 17:43
  • Gave that a shot, but to no effect - the — still came through, and the javascript didn't catch it. – cd6 May 03 '12 at 17:45
  • 1
    Any particular reason why you weren't using ranges i.e. `/[\u2012-\u2015]/g`? Is there some compatibility problem with some browsers? – phk Apr 16 '16 at 15:31
2

There may be more characters behaving like this, and you may want to reuse them in HTML later. A more generic way to deal with it could be to replace all 'extended characters' with their HTML-encoded equivalent. You could do that like this:

[yourstring].replace(/[\u0080-\uC350]/g, 
                      function(a) {
                        return '&#'+a.charCodeAt(0)+';';
                      }
);
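As a rough usage sketch, the replacement above can be wrapped in a function (the name `encodeExtended` is hypothetical) to see what it produces for an em dash:

```javascript
// Hypothetical wrapper around the replacement shown above: every character
// in the range U+0080..U+C350 becomes a decimal HTML character reference.
function encodeExtended(str) {
  return str.replace(/[\u0080-\uC350]/g, function (ch) {
    return "&#" + ch.charCodeAt(0) + ";";
  });
}

console.log(encodeExtended("a \u2014 b")); // a &#8212; b
```

Note that characters above U+C350 fall outside the range in this regex and are passed through unchanged.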
KooiInc
2

As of the ECMAScript 2018 standard, JavaScript RegExp supports Unicode property escapes when the u flag is set. One of them, \p{Dash}, matches any Unicode code point that has the Dash property:

/\p{Dash}/gu

In ES5, the equivalent expression is:

/[-\u058A\u05BE\u1400\u1806\u2010-\u2015\u2053\u207B\u208B\u2212\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u2E5D\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]|\uD803\uDEAD/g

See the Unicode Utilities reference.

Here are some JavaScript examples:

const text = "Dashes: \uFF0D\uFE63\u058A\u1400\u1806\u2010-\u2013\uFE32\u2014\uFE58\uFE31\u2015\u2E3A\u2E3B\u2053\u2E17\u2E40\u2E5D\u301C\u30A0\u2E1A\u05BE\u2212\u207B\u208B\u3030";
const es5_dash_regex = /[-\u058A\u05BE\u1400\u1806\u2010-\u2015\u2053\u207B\u208B\u2212\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u2E5D\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]|\uD803\uDEAD/g;
console.log(text.replace(es5_dash_regex, '-')); // Normalize each dash to ASCII hyphen
// => Dashes: ----------------------------

To match one or more dashes and replace with a single char (or remove in one go):

/\p{Dash}+/gu
/(?:[-\u058A\u05BE\u1400\u1806\u2010-\u2015\u2053\u207B\u208B\u2212\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u2E5D\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]|\uD803\uDEAD)+/g
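A minimal sketch of the "one or more dashes" form in use, assuming an engine with ES2018 property-escape support. The sample string is invented for illustration.

```javascript
// Collapse each run of dash characters into a single ASCII hyphen.
const dashRun = /\p{Dash}+/gu;

// En dash + em dash in a row, then two minus signs (U+2212) in a row.
const messy = "2010\u2013\u20142020 and a\u2212\u2212b";

const cleaned = messy.replace(dashRun, "-");
console.log(cleaned); // 2010-2020 and a-b
```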
Wiktor Stribiżew
  • 1
    Hours of screwing with this and \p{Dash} was the answer. Thank you. Was pasting a dash from a PDF, and it was screwing up HTML inputs... – geilt May 27 '22 at 11:24