Problem

I have a piece of text. It can contain any character from ASCII 32 (space) to ASCII 126 (tilde), plus ASCII 9 (horizontal tab).

The text may contain sentences. Every sentence ends with a dot, question mark, or exclamation mark, directly followed by a space.

The text may contain basic Markdown styling, that is: bold text (**, also __), italic text (*, also _), and strikethrough (~~). Markdown may occur inside sentences (e.g. **this** is a sentence.) or around them (e.g. **this is a sentence!**). Markdown may not straddle a sentence boundary, that is, there may not be a situation like this: **sentence. sente** nce.. Markdown may span more than one whole sentence, that is, there may be a situation like this: **sentence. sentence.**.

It can also contain two sequences of characters: <!-- and -->. Everything between these sequences is treated as a comment (like in HTML). Comments can occur at any position in the text, but cannot contain newline characters (I hope that on Linux it is just ASCII 10).

I want to detect all sentences in JavaScript and, after each one, insert its length in a comment, like this: sentence.<!-- 9 -->. I do not much care whether the length includes the Markdown tags or not, but it would be nice if it did not.

What have I done so far?

So far, with the help of this answer, I have prepared the following regex for detecting sentences. It mostly fits my needs, except that it includes comments.

const basicSentence = /(?:^|\n| )(?:[^.!?]|[.!?][^ *_~\n])+[.!?]/gi;

I have also prepared the following regex for detecting comments. It also works as expected, at least in my own tests.

const comment = /<!--.*?-->/gi;
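
A minimal sketch of why the lazy `.*?` matters in that pattern, assuming two comments sit on the same line. The sample string and the `greedyComment` variant are made up for illustration and are not part of the original code:

```javascript
// Hypothetical sample: two comments on one line.
const sample = 'a<!-- x -->b<!-- y -->c';

// Lazy quantifier, as in the pattern above: each comment is matched separately.
const lazyComment = /<!--.*?-->/g;
// Greedy variant, for comparison only: one match runs from the first <!-- to the last -->.
const greedyComment = /<!--.*-->/g;

const lazyResult = sample.replace(lazyComment, '');
const greedyResult = sample.replace(greedyComment, '');
console.log(lazyResult);   // abc
console.log(greedyResult); // ac
```

With the greedy variant, the text between the two comments (`b`) is swallowed as well, which would distort any length computed afterwards.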

Example

To better see what I want to achieve, let us have an example. Say, I have the following piece of text:

foo0 
b<!-- comment -->ar.
foo1 bar?
<!-- comment -->

foo2bar!

(There is also a newline at the end of it, but I do not know how to add an empty line in Stack Overflow markdown.)

And the expected result is:

foo0 
b<!-- comment -->ar.<!-- 10 -->
foo1 bar?<!-- 9 -->
<!-- comment -->

foo2bar!<!-- 12 -->

(This time, there is also no newline at the end.)


UPDATE: Sorry, I have corrected the expected result in the example.

Silv

1 Answer

Pass a callback to .replace that removes all comments from each matched sentence, then appends the length of the trimmed result as a new comment:

const input = `foo0 
b<!-- comment -->ar.
foo1 bar?
<!-- comment -->

foo2bar!
`;
const output = input.replace(
  /(?:^|\n| )(?:[^.!?]|[.!?][^ *_~\n])+[.!?]/g,
  (match) => {
    // Drop comments and surrounding whitespace so neither counts toward the length
    const matchWithoutComments = match.replace(/<!--.*?-->/g, '').trim();
    return `${match}<!-- ${matchWithoutComments.length} -->`;
  }
);
console.log(output);

Of course, you can use a similar pattern to replace markdown notation with the inner text content as well, if you wish:

.replace(/([*_]{1,2}|~~)((.|\n)*?)\1/g, '$2')

(Because of nested and possibly unbalanced tags, which regex handles poorly, you may have to repeat that replacement until no further replacements are found.)
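
A minimal sketch of that repetition, assuming the pattern above is used unchanged. The helper name `stripMarkdown` is hypothetical and only serves the illustration:

```javascript
// Pattern from above: an opening marker, lazily matched content, then the same marker again.
const markdownPair = /([*_]{1,2}|~~)((.|\n)*?)\1/g;

// Hypothetical helper: reapply the replacement until the text stops changing.
function stripMarkdown(text) {
  let previous;
  do {
    previous = text;
    text = text.replace(markdownPair, '$2');
  } while (text !== previous);
  return text;
}

// The first pass unwraps the outer ** pair and the ~~ pair; the second pass unwraps _b_.
console.log(stripMarkdown('**a _b_ c** and ~~d~~')); // a b c and d
```

Each replacement removes at least two characters, so the loop always terminates. Note that this sketch also treats lone markers as Markdown, so the underscores in something like snake_case_name would be removed too.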

Also, per the comment, your current regular expression expects every sentence to end in ., !, or ?, so the ! in a comment's <!-- is treated as the end of a (short) sentence. One option is to add a lookahead at the very end of the regex that requires whitespace (a space or a newline), the end of the input, or a Markdown character (*, _, or ~) to follow:

const input = `foo0 
b<!-- comment -->ar.
foo1 bar?
<!-- comment -->

foo2bar!
<!-- comment -->`;
const output = input.replace(
  /(?:^|\n| )(?:[^.!?]|[.!?][^ *_~\n])+[.!?](?=\s|$|[*_~])/g,
  (match) => {
    // Drop comments and surrounding whitespace so neither counts toward the length
    const matchWithoutComments = match.replace(/<!--.*?-->/g, '').trim();
    return `${match}<!-- ${matchWithoutComments.length} -->`;
  }
);
console.log(output);

https://regex101.com/r/RaTIOi/1
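
To see what the lookahead changes, here is a small comparison sketch of the two patterns on an input that ends with a comment. The variable names are made up for illustration:

```javascript
// Input that ends with a comment and has no trailing newline.
const text = 'foo2bar!\n<!-- comment -->';

// Pattern from the first snippet, without the lookahead.
const withoutLookahead = /(?:^|\n| )(?:[^.!?]|[.!?][^ *_~\n])+[.!?]/g;
// Pattern from the second snippet, with the lookahead.
const withLookahead = /(?:^|\n| )(?:[^.!?]|[.!?][^ *_~\n])+[.!?](?=\s|$|[*_~])/g;

const before = text.match(withoutLookahead);
const after = text.match(withLookahead);
console.log(JSON.stringify(before)); // ["foo2bar!","\n<!"]
console.log(JSON.stringify(after));  // ["foo2bar!"]
```

Without the lookahead, the `!` of `<!--` closes a bogus two-character sentence. With it, that `!` is rejected because a `-` follows.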

CertainPerformance
  • Thanks, @CertainPerformance, for such a quick answer! I have been thinking about such a solution, but for some reasons I decided that it would not work... I must think why your answer works before I can accept it. – Silv Oct 27 '18 at 00:29
  • You have decided that it doesn't work and haven't tried it yet? – K.Dᴀᴠɪs Oct 27 '18 at 00:35
  • note the non-greedy match `.*?` for comments https://javascript.info/regexp-greedy-and-lazy – erik258 Oct 27 '18 at 00:36
  • @DanFarrell I intentionally made the comment replacer regex repeater lazy so as not to match multiple comments in a single matched string, did I make a mistake with it somewhere? – CertainPerformance Oct 27 '18 at 00:37
  • Aha, I did not notice the second `replace`. That clears things much, I did not think about such a solution. But… it does not seem to work when there is a comment in the last line – it treats `\n<!` as a sentence. @K.Dᴀᴠɪs, yes, mainly because most of the time I want to avoid unnecessary typing if there is high probability that something would not work. @DanFarrell, thanks. – Silv Oct 27 '18 at 00:51
  • @silv See edit, one option is to use lookahead for a space to ensure you're *really* at the end of a sentence and not at a `!` of a ` – CertainPerformance Oct 27 '18 at 00:59
  • Hm, it probably should be `(?=[$\s])` to handle also end of the string. Sorry, I did not tell about it in the question, but the last sentence cannot end with space (although it may end with comment). – Silv Oct 27 '18 at 01:06
  • @silv Good idea, though `(?=\s|$)`, because `$` in a character set matches `$` literally. (I think only regex delimiters, backslashes, and dashes have to be escaped inside character sets, everything else matches their literal character) – CertainPerformance Oct 27 '18 at 01:08
  • Like I wanted to say. So, maybe `(?=$|\s)` would be sufficient? – Silv Oct 27 '18 at 01:09
  • @silv Yep, that should do it – CertainPerformance Oct 27 '18 at 01:10
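
The point in the last comments about `$` inside a character class can be checked with a short sketch. The two patterns below are reduced, hypothetical versions of the lookaheads under discussion, not the full sentence regex:

```javascript
// Inside a character class, $ is a literal dollar sign, not an end-of-input anchor.
const classLookahead = /!(?=[$\s])/;
// In an alternation outside the class, $ keeps its anchor meaning.
const anchorLookahead = /!(?=\s|$)/;

const results = [
  classLookahead.test('end!'),    // false: nothing follows the !
  classLookahead.test('end!$'),   // true: a literal $ follows
  anchorLookahead.test('end!'),   // true: end of input follows
  anchorLookahead.test('end! x')  // true: whitespace follows
];
console.log(results);
```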