regex to get strings between content generated by ckeditor, for server-end

Question

I am trying to write a regex that extracts the text between the tags in this output:

<p>save</p>
<p>11<br />\nabc<br />\nabc<br />\nhello</p>\n\n<p>dfcs dcsd</p>\n\n<p>sdcsd<br />\nsdcsdc<br />\nsdcd</p>\n
<p>1</p>\n\n<p>11<br />\n111</p>\n\n<p>1111<br />\n11111</p>\n\n<p>1</p>\n\n<p>&nbsp;</p>\n

expected output:

1) save
2) 11
3) abc
4) hello
5) dfcs dcsd
6) sdcsd
7) sdcsdc
8) 1
9) 11
10) 111
11) 1111
12) 11111
13) 1

EDIT: This HTML string is generated by CKEditor on the frontend and sent to a Node backend, where I need to extract the text properly.

Do you need the numbers? I mean should `1) ` , `2) ` and so on be in the desired output? Your output seems to be the HTML view of the HTML source itself (where `\n` represents a real new line and not the text) and where you've got rid of the blank characters and lines. Then you added the line number prefixes. — Patrick Janser, May 16 '22 at 10:25
@AnkurSingh ... Regarding the so far provided answers / approaches are there any questions left? — Peter Seliger, Sep 09 '22 at 10:58

Patrick Janser · Answer 1 · 2022-05-16T12:38:32.817

Your question needs more details, but I would take the rendered HTML and split it with this regex: /(?:\s*\r?\n\s*)+/

This gives you an array of lines and strips the whitespace around each line break.

Then remove the empty lines and loop over the rest to build your numbered lines.

The code below logs the result to the console. It runs in a browser against the HTML that follows it:

let body = document.querySelector('body');
let renderedHtml = body.innerText;
let lines = renderedHtml.split(/(?:\s*\r?\n\s*)+/);

// Get rid of empty lines.
lines = lines.filter((line) => {
  return !line.match(/^\s*$/);
});

console.log(lines);

let output = '';

lines.forEach((line, i) => {
  output += (i + 1) + ') ' + line + "\n";
});

console.log(output);

<p>save</p>
<p>11<br />
abc<br />
abc<br />
hello</p>

<p>dfcs dcsd</p>

<p>sdcsd<br />
sdcsdc<br />
sdcd</p>

<p>1</p>

<p>11<br />
111</p>

<p>1111<br />
11111</p>

<p>1</p>

<p>&nbsp;</p>
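
Since the question targets a Node backend, where `document` and `innerText` are not available, the same split-and-filter idea can be sketched without a DOM. This is only a rough sketch under assumptions: the helper names `extractLines` and `numberLines` are hypothetical, the markup is assumed to contain only simple `<p>` and `<br />` tags as in the sample, `&nbsp;` is the only entity handled, and literal backslash-n sequences are assumed to stand for line breaks. For arbitrary HTML a real parser package would be the safer choice.

```javascript
// Hypothetical helper (not part of the answer above): extracts text lines
// from simple CKEditor-style markup without a DOM.
function extractLines(html) {
  const text = html
    // Assumption: a literal backslash-n (two characters) stands for a line break.
    .replace(/\\n/g, '\n')
    // Tags that end a visual line become real newlines.
    .replace(/<br\s*\/?>|<\/p>/gi, '\n')
    // Drop every remaining tag.
    .replace(/<[^>]+>/g, '')
    // Decode the only entity that occurs in the sample markup.
    .replace(/&nbsp;/g, ' ');

  return text
    // Same splitting regex as in the answer above.
    .split(/(?:\s*\r?\n\s*)+/)
    // Get rid of empty or whitespace-only lines.
    .filter((line) => line.trim() !== '');
}

// Hypothetical helper: prefixes each line with "1) ", "2) ", and so on.
function numberLines(lines) {
  return lines.map((line, i) => `${i + 1}) ${line}`).join('\n');
}

const sample = '<p>save</p>\n<p>11<br />\nabc<br />\nhello</p>\n\n<p>&nbsp;</p>\n';
console.log(numberLines(extractLines(sample)));
```

For the shortened `sample` string this logs the four lines `1) save`, `2) 11`, `3) abc` and `4) hello`; the `&nbsp;` paragraph is dropped by the filter.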

Peter Seliger · Answer 2 · 2022-05-16T14:03:04.220

Note: since the OP mentioned "for server-end", the OP most probably needs to find a package that comes close to a browser's native DOMParser Web API.

One approach is to combine the innerText string value of the parsed markup (via DOMParser.parseFromString) with a multiline capturing regex such as /^(?:\\n|\s)*(?<content>.*)/gm, together with matchAll and an additional map / filter step.

const htmlMarkup =
`<p>save</p>
<p>11<br />\nabc<br />\nabc<br />\nhello</p>\n\n<p>dfcs dcsd</p>\n\n<p>sdcsd<br />\nsdcsdc<br />\nsdcd</p>\n
<p>1</p>\n\n<p>11<br />\n111</p>\n\n<p>1111<br />\n11111</p>\n\n<p>1</p>\n\n<p>&nbsp;</p>\n`;

// see ... [https://regex101.com/r/4bzz4m/1]
const regXLineContent = /^(?:\\n|\s)*(?<content>.*)/gm;

const doc = (new DOMParser)
  .parseFromString(htmlMarkup, "text/html");

console.log(
  '... inner text ...',
  doc
    .body
    .innerText
);
console.log(
  '... list of pure content ...',
  Array
    .from(
      doc
        .body
        .innerText
        .matchAll(regXLineContent)
    )
    .map(match => match.groups.content)
    .filter(content => content !== '')
);
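
To see what the capturing regex does on its own, here is a minimal sketch that runs in plain Node without any DOM. The `innerText` string is a hand-written, hypothetical stand-in for `doc.body.innerText` (it includes a non-breaking space, as a decoded `&nbsp;` would produce); it is not the actual rendered text of the sample markup.

```javascript
// Same pattern as above: skip leading line breaks / whitespace,
// then capture the rest of the line in the named group `content`.
const regXLineContent = /^(?:\\n|\s)*(?<content>.*)/gm;

// Hypothetical stand-in for `doc.body.innerText`.
const innerText = 'save\n11\n\nabc\n\u00A0\n';

const contents = Array
  .from(innerText.matchAll(regXLineContent))
  .map(match => match.groups.content);

// Whitespace-only lines and the end of the text yield empty captures,
// which is why the filter step is still required.
const pureContents = contents.filter(content => content !== '');

console.log(contents);
console.log(pureContents); // [ 'save', '11', 'abc' ]
```

Note that `\s` in JavaScript also matches the non-breaking space, so a line holding only a decoded `&nbsp;` ends up as an empty capture and is removed by the filter.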


Another, preferred approach is to use a content-splitting regex such as /\n(?:\\n|\s)*/g, which directly yields an array of (valid) line contents.

Yet for the provided markup sample, as with the former approach, one still needs to run the sanitizing filter step.

const htmlMarkup =
`<p>save</p>
<p>11<br />\nabc<br />\nabc<br />\nhello</p>\n\n<p>dfcs dcsd</p>\n\n<p>sdcsd<br />\nsdcsdc<br />\nsdcd</p>\n
<p>1</p>\n\n<p>11<br />\n111</p>\n\n<p>1111<br />\n11111</p>\n\n<p>1</p>\n\n<p>&nbsp;</p>\n`;

// see ... [https://regex101.com/r/4bzz4m/2]
const regXLineSeparators = /\n(?:\\n|\s)*/g;

const doc = (new DOMParser)
  .parseFromString(htmlMarkup, "text/html");

console.log(
  '... inner text ...',
  doc
    .body
    .innerText
);
console.log(
  '... list of split content ...',
  doc
    .body
    .innerText
    .split(regXLineSeparators)
);
console.log(
  '... list of pure content ...',
  doc
    .body
    .innerText
    .split(regXLineSeparators)
    .filter(content => content !== '')
);
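
The role of the `\\n` alternative and of the final filter can be checked in isolation. The sketch below runs in plain Node; the `text` value is a hypothetical example, not the real `innerText` of the sample markup. It mixes real newlines with one literal backslash-n sequence (two characters) that directly follows a real newline, which is the only position where this separator regex absorbs it.

```javascript
// Same separator pattern as above: a real newline, followed by any run of
// literal backslash-n sequences and/or whitespace.
const regXLineSeparators = /\n(?:\\n|\s)*/g;

// Hypothetical text; String.raw keeps `\n` as the two characters "\" and "n".
const text = 'save\n11\n\nabc\n' + String.raw`\n` + 'hello\n';

const parts = text.split(regXLineSeparators);

// The trailing separator leaves an empty string at the end of the array,
// hence the sanitizing filter.
const lines = parts.filter(content => content !== '');

console.log(parts); // [ 'save', '11', 'abc', 'hello', '' ]
console.log(lines); // [ 'save', '11', 'abc', 'hello' ]
```

A payload that contains only literal backslash-n sequences and no real newlines would not be split by this pattern at all, so it may be worth normalizing such sequences first if the backend receives them in that form.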

