I need to match a word with a French character (dérange) with a regular expression. So far I have this:

    var text = "An inconvenient (qui dérange) truth";
    var splitText = text.trim().match(/\w+|\s+|[^\s\w]+/g);
    
    console.log(splitText);

However, it treats the é as a separate letter. Why?

I need a regex for the match() method so that splitText contains the word dérange as a single element, not the three pieces d, é and range as it does now.

robinCTS
  • `\w` on MDN: "Matches any alphanumeric character from the basic Latin alphabet, including the underscore. Equivalent to [A-Za-z0-9_]." –  Nov 15 '17 at 19:06
  • See: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp#Regular_expression_and_Unicode_characters –  Nov 15 '17 at 19:07
  • Make your own character class `text.trim().match(/[a-zàâçéèêëîïôûùüÿæœ]+|\s+|[^a-zàâçéèêëîïôûùüÿæœ\s]+/gi);` – revo Nov 15 '17 at 20:00
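As a complement to the comments above, here is a minimal sketch of an alternative to listing the accented letters by hand. It assumes an engine that supports Unicode property escapes (`\p{…}` with the `u` flag, added in ES2018), which nobody in this thread mentions. The variable names `asciiTokens` and `unicodeTokens` are only illustrative.

```javascript
var text = "An inconvenient (qui dérange) truth";

// \w only covers [A-Za-z0-9_], so "é" falls into the [^\s\w]+ branch
// and the word is cut into pieces.
var asciiTokens = text.trim().match(/\w+|\s+|[^\s\w]+/g);

// \p{L} = any letter, \p{M} = combining marks, \p{N} = any number.
// The u flag is required for \p{...} to be recognised.
var wordChars = /[\p{L}\p{M}\p{N}_]+|\s+|[^\s\p{L}\p{M}\p{N}_]+/gu;
var unicodeTokens = text.trim().match(wordChars);

console.log(asciiTokens);
console.log(unicodeTokens);
```

With this pattern the brackets and the spaces are still returned as separate elements, and the accented word stays in one piece.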

2 Answers

You can use the split method with a regex to get all the words in your text. Here is a working example:

    var text = "An inconvenient (qui dérange) truth";

    var splitText = text.trim().split(/\s+/);

    console.log(splitText);

YouneL
  • Thanks YouneL. Your solution does work but I need to keep every element, even the brackets. – user1627930 Nov 15 '17 at 20:15
  • You are welcome, but if you want to keep elements as they are, it's much better and faster to use `var splitText = text.trim().split(/\s+/);`. Look at this benchmark link: [match vs split](https://jsperf.com/performance-of-match-vs-split) – YouneL Nov 15 '17 at 22:13
  • Do you know, however, how to make the regex recognize the brackets even when there is no space between the two (like in my example: "(qui")? I would like it to become: "(" + "qui". Thank you YouneL – user1627930 Nov 18 '17 at 19:11
  • And I need to keep spaces too (I forgot to add this to the previous message). – user1627930 Nov 18 '17 at 19:12
  • `text.trim().match(/\(|[^\s\)]+|\)|\s+/g);` – YouneL Nov 18 '17 at 19:56
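To make the last comment easier to check, here is a short sketch that runs that pattern on the sample sentence from the question. The name `tokens` is only illustrative.

```javascript
var text = "An inconvenient (qui dérange) truth";

// Alternatives are tried left to right at each position:
// a lone "(", then a run of characters that are neither whitespace nor ")",
// then a lone ")", then a run of whitespace.
var tokens = text.trim().match(/\(|[^\s\)]+|\)|\s+/g);

console.log(tokens);
```

Because the second alternative only excludes whitespace and `)`, an opening bracket in the middle of a run (for example `f(x`) is not split off. The pattern separates `(` only when it starts a token, as it does in the sample sentence.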
It seems you want to separate whitespace from non-whitespace. However, your pattern has two expressions for non-whitespace: `\w+` (matching `[a-zA-Z_0-9]+`) and `[^\s\w]+` (matching everything else except whitespace), and it is the second one that matches the é on its own. Just combine the two into `[^\s]+` or, more simply, `\S+`:

    var text = "An inconvenient (qui dérange) truth";
    var splitText = text.trim().match(/\S+|\s+/g);
    console.log(splitText);

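The following sketch shows what the combined pattern yields for the sample sentence. It is an illustration rather than part of the original answer, and the names `tokens` and `rebuilt` are hypothetical.

```javascript
var text = "An inconvenient (qui dérange) truth";

// Every run of non-whitespace is one token; every run of whitespace is another.
var tokens = text.trim().match(/\S+|\s+/g);

// Nothing is dropped, so joining the tokens gives back the trimmed text.
var rebuilt = tokens.join("");

console.log(tokens);
console.log(rebuilt === text.trim());
```

The accented word stays whole, but brackets are non-whitespace too, so they remain attached to the neighbouring word (`(qui`). If the brackets must be separate elements, as requested in the comments on the other answer, this pattern alone is not enough.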
Bergi