Split a paragraph containing words in different languages

Question

Given input

let sentence = `browser's
emoji 
rød
continuïteit a-b c+d
D-er går en
المسجد الحرام 
٠١٢٣٤٥٦٧٨٩
তার মধ্যে আশ্চর্য`;

Needed output

I want every word and every space wrapped in a <span> that indicates whether it is a word or a non-word.

Each <span> has a type attribute with one of these values:

w for word
t for space or non-word

Examples

<span type="w">D</span><span type="t">-</span>
<span type="w">er</span><span type="t"> </span>
<span type="w">går</span>
<span type="t"> </span><span type="w">en</span>

<span type="w">المسجد</span>
<span type="t"> </span><span type="w">الحرام</span>
<span type="t"> </span>

<span type="w">তার</span><span type="t"> </span>
<span type="w">মধ্যে</span><span type="t"> </span>
<span type="w">আশ্চর্য</span>

Ideas investigated

Searching Stack Exchange

The question "Unicode string with diacritics split by chars" led me to an answer that uses the Unicode properties Grapheme_Base and Grapheme_Extend.

Using `split(/\w/)` and `split(/\W/)` as word boundaries

These only split correctly on ASCII text, as MDN reports for the RegExp classes \w and \W:

\w and \W only matches ASCII based characters; for example, a to z, A to Z, 0 to 9, and _.
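A small sketch of that limitation, using two words from the input. The second pattern is an assumption about what counts as a word (letters, combining marks and digits), not something MDN prescribes.

```javascript
// \w is ASCII-only, so non-ASCII letters such as "å" and "ø" act as separators.
const asciiSplit = "går rød".split(/\W+/);
console.log(asciiSplit); // [ 'g', 'r', 'r', 'd' ]

// Unicode-aware alternative (assumption: letters, marks and digits form a word).
// The u flag is required for \p{...} property escapes.
const unicodeSplit = "går rød".split(/[^\p{L}\p{M}\p{N}]+/u);
console.log(unicodeSplit); // [ 'går', 'rød' ]
```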

Using `split("")`

sentence.split("") splits the emoji into its UTF-16 code units (the two surrogate halves), so the emoji is broken apart.
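To illustrate, here is a minimal check with a single emoji, written as an escape so the code point is explicit. Iterating the string (Array.from or spread) goes by code point instead of by code unit.

```javascript
const emoji = "\u{1F600}"; // grinning face, one code point outside the BMP

console.log(emoji.split("").length);   // 2: two UTF-16 surrogate code units
console.log(Array.from(emoji).length); // 1: string iteration is by code point
console.log([...emoji][0] === emoji);  // true
```

Code points are still not the same as user-perceived characters: a base letter followed by a combining mark is two code points.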

Unicode codepoint properties Grapheme_Base and Grapheme_Extend

// A base character followed by any number of extending marks.
const matchGrapheme = /\p{Grapheme_Base}\p{Grapheme_Extend}*/gu;

let result = sentence.match(matchGrapheme);
console.log("Grapheme_Base (+Grapheme_Extend)", result);

This splits the text into graphemes, so each word is broken into characters, and the spaces and punctuation are still part of the result.
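When matching on Grapheme_Base and Grapheme_Extend, the quantifier on the extending marks matters. The sketch below compares two hypothetical patterns on a letter that carries two combining marks. With at most one mark allowed, the second mark matches nothing and silently disappears from the result.

```javascript
// "e" followed by two combining marks: acute accent (U+0301) and dot below (U+0323).
const text = "e\u0301\u0323x";

// Base plus at most one extending mark.
const oneMark = /\p{Grapheme_Base}\p{Grapheme_Extend}|\p{Grapheme_Base}/gu;
// Base plus any number of extending marks.
const anyMarks = /\p{Grapheme_Base}\p{Grapheme_Extend}*/gu;

const a = text.match(oneMark);
const b = text.match(anyMarks);

console.log(a.length, a.join("") === text); // 2 false: U+0323 was skipped
console.log(b.length, b.join("") === text); // 2 true
```

Characters that are neither a base nor an extending mark, such as format controls, are skipped by both patterns.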

Unicode properties Punctuation and White_Space

// Inside [...] a "|" would be a literal pipe, so the two properties are simply listed.
const matchPunctuation = /[\p{Punctuation}\p{White_Space}]+/gu;

let punctuationAndWhiteSpace = sentence.match(matchPunctuation);
console.log("Punctuation/White_Space", punctuationAndWhiteSpace);

This seems to fetch the non-words.
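One thing to be aware of with the Punctuation property: in the input, the "-" of a-b is punctuation, but the "+" of c+d belongs to the Symbol category, so a Punctuation-based pattern does not see it. The nonWord pattern below is a sketch under the assumption that symbols should count as non-words too.

```javascript
// "-" is in General_Category Punctuation (Pd); "+" is a math symbol (Sm).
console.log(/\p{Punctuation}/u.test("-")); // true
console.log(/\p{Punctuation}/u.test("+")); // false
console.log(/\p{Symbol}/u.test("+"));      // true

// Assumption: punctuation, symbols and white space are all non-words.
const nonWord = /[\p{Punctuation}\p{Symbol}\p{White_Space}]+/gu;
const found = "a-b c+d".match(nonWord);
console.log(found); // [ '-', ' ', '+' ]
```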

Testing my own answer: 'a-b' gets split, 'c+d' is not split. — Clemens Tolboom, Feb 28 '22 at 15:53

Clemens Tolboom · Accepted Answer · 2022-03-01T09:25:01.450

By combining the Grapheme_Base/Grapheme_Extend result with the Punctuation/White_Space result, we can loop over the list of graphemes and consume the list of punctuation and white space as we go:

let sentence = `browser's
emoji 
rød
continuïteit a-b c+d
D-er går en
المسجد الحرام 
٠١٢٣٤٥٦٧٨٩
তার মধ্যে আশ্চর্য`;

// A grapheme here is a base character plus any number of extending marks.
const matchGrapheme = /\p{Grapheme_Base}\p{Grapheme_Extend}*/gu;
const matchPunctuation = /\p{Punctuation}|\p{White_Space}/gu;

sentence.split(/\r?\n/).forEach((line, i) => {
  console.log(`Line ${i} ${line}`);
  const graphs = line.match(matchGrapheme) || []; // match() returns null when nothing matches
  const puncts = line.match(matchPunctuation) || [];
  console.log(graphs, puncts);

  const words = [];
  let word = "";
  const items = [];

  // Store the word collected so far; empty words are skipped.
  const flushWord = () => {
    if (word) {
      words.push(word);
      items.push({ type: "w", value: word });
      word = "";
    }
  };

  graphs.forEach((char) => {
    if (puncts.length > 0 && char === puncts[0]) {
      // The grapheme is the next punctuation or white space: close the word first.
      flushWord();
      items.push({ type: "t", value: char });
      puncts.shift();
    } else {
      word += char;
    }
  });
  flushWord();
  console.log("Words", words.join(" || "));
  console.log("Items", items[0]);

  // Rejoin wrapped in '<span>'
  const l = items
    .map((v) => `<span type="${v.type}">${v.value}</span>`)
    .join("");
  console.log(l);
});
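The same word/non-word split can also be sketched in a single pass with one regular expression. This is an alternative under stated assumptions, not the method above: a word is taken to be a run of letters, combining marks and digits, and everything else is a non-word. The names tokenPattern, escapeHtml and wrapTokens are hypothetical.

```javascript
// Group 1 captures a word run; the second alternative matches any other run.
const tokenPattern = /([\p{L}\p{M}\p{N}]+)|[^\p{L}\p{M}\p{N}]+/gu;

// Escape characters that are special in HTML text content.
const htmlEscapes = { "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;" };
const escapeHtml = (s) => s.replace(/[&<>"]/g, (c) => htmlEscapes[c]);

// Returns the <span> markup for one piece of text.
function wrapTokens(text) {
  return Array.from(text.matchAll(tokenPattern), (m) => {
    const type = m[1] !== undefined ? "w" : "t"; // group 1 is set only for words
    return `<span type="${type}">${escapeHtml(m[0])}</span>`;
  }).join("");
}

console.log(wrapTokens("D-er går en"));
```

Because every code point falls into either the class or its negation, no content is dropped. Unlike the examples in the question, a run such as two spaces ends up in one span. The apostrophe in browser's is punctuation, so that word is split in three here as well. Where it is available, Intl.Segmenter with granularity "word" is another route; its segments carry an isWordLike flag.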

score 0 · Answer 2 · answered Feb 28 '22 at 20:12

You could also use a combination of replace(), split() and join().

const sentence = `browser's
emoji 
rød
continuïteit a-b c+d
D-er går en
المسجد الحرام 
٠١٢٣٤٥٦٧٨٩
তার মধ্যে আশ্চর্য`;

const splitP = (sentence) => {
  const oneLine = sentence.replace(/[\r\n]/g, " "); // replace all \r\ns by spaces
  const splitted = oneLine.split(" ").filter(x => x); // split & filter out falsy values
  return `<span>${splitted.join("</span><span>")}</span>`; // join with span tags
}

console.log(splitP(sentence));

If you prefer a one-line solution:

const sentence = `browser's
emoji 
rød
continuïteit a-b c+d
D-er går en
المسجد الحرام 
٠١٢٣٤٥٦٧٨٩
তার মধ্যে আশ্চর্য`;

const result = `<span>${sentence.replace(/[\r\n]/g, " ").split(" ").filter(x => x).join("</span><span>")}</span>`;

console.log(result);

Your answer does not produce the output I want (examples are in the question). I forgot to mention that I need a distinction between words and non-words, as I was so busy writing. I'll edit my question to emphasize this distinction. — Clemens Tolboom, Mar 01 '22 at 08:45
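One way to keep that distinction while staying close to the split()/join() idea is to split on a capturing group, which makes split() return the separators too. This is a sketch that assumes only white space separates words, so punctuation such as the hyphen in D-er stays attached to its word, and the text is inserted without HTML escaping.

```javascript
// A capturing group in the separator keeps the matched white space in the result.
const parts = "går  en\nrød".split(/(\s+)/).filter((p) => p !== "");
console.log(parts); // [ 'går', '  ', 'en', '\n', 'rød' ]

// Tag each part as word ("w") or non-word ("t").
const html = parts
  .map((p) => `<span type="${/^\s+$/.test(p) ? "t" : "w"}">${p}</span>`)
  .join("");
console.log(html);
```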