
More than once in the past I've wondered about the problem of formatting blocks of text so that all runs of whitespace are 'collapsed' into a single space, except that paragraphs should be conserved - meaning that all runs of blank lines are collapsed into single blank lines, but not collapsed into just a space.

A blank line is, of course, two end-of-line sequences (typically a carriage return, a linefeed, or both) with no intervening non-whitespace characters. (There may be other whitespace between them, though, such as spaces or tabs.)
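To make that definition concrete, here is a minimal JavaScript sketch of a blank-line test. The function name `hasBlankLine` is only illustrative, and the pattern assumes the three usual line-ending conventions (`\r\n`, a lone `\r`, a lone `\n`):

```javascript
// Hypothetical helper: true if the string contains two line endings with
// nothing but spaces, tabs, or other non-newline whitespace between them.
function hasBlankLine(s) {
  // \r(?!\n) keeps a CRLF pair from being counted as two separate endings.
  const eol = '(?:\\r\\n|\\r(?!\\n)|\\n)';
  // [^\S\r\n] means "whitespace other than CR or LF".
  const blankLine = new RegExp(eol + '[^\\S\\r\\n]*' + eol);
  return blankLine.test(s);
}

console.log(hasBlankLine('one\n \t\ntwo'));  // blank line containing a space and a tab
console.log(hasBlankLine('one\r\ntwo'));     // a single CRLF is not a blank line
```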

This is surely a pretty common problem and though not difficult to solve I'm always dissatisfied with my solutions, which lack elegance or leave open loopholes. Surely there is an elegant expressive way to do this.

I'm leaving this open to all regex flavours since I've wanted to do it in at least Perl, Vim, and JavaScript. Here's my most recent lazy attempt to do it in node.js; the loophole is obviously the magic word. It's probably pretty typical of the unsatisfactory solutions I've used:

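// Protect paragraph breaks with a placeholder (this is the loophole).
text = text.replace(/\r?\n(?:\s*\r?\n)+/g, '_SomeMagicWord_');
// Collapse every remaining whitespace run, including a lone \n, to one space.
text = text.replace(/\s+/g, ' ');
// Restore the paragraph breaks.
text = text.replace(/_SomeMagicWord_/g, '\r\n\r\n');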
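To show why the magic word is a loophole, here is a small sketch that wraps the same placeholder idea in a function (the name `collapseWithPlaceholder` is just for illustration) and feeds it text that happens to contain the placeholder itself:

```javascript
// Hypothetical wrapper around the three-step placeholder approach.
function collapseWithPlaceholder(text) {
  return text
    .replace(/\r?\n(?:\s*\r?\n)+/g, '_SomeMagicWord_') // mark paragraph breaks
    .replace(/\s+/g, ' ')                              // collapse everything else
    .replace(/_SomeMagicWord_/g, '\r\n\r\n');          // restore paragraph breaks
}

// The input has no line breaks at all, yet the output gains a paragraph
// break, because the literal placeholder in the text is "restored" too.
const result = collapseWithPlaceholder('see _SomeMagicWord_ here');
console.log(JSON.stringify(result)); // "see \r\n\r\n here"
```

Any fixed placeholder has this problem for some input, which is why a single-pass solution is more attractive.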

In case my explanation is not clear it should transform from this:

foo bar baz
fred barney wilma


one two three

to this:

foo bar baz fred barney wilma

one two three

(Watch out for trailing whitespace at the ends of lines too!)

hippietrail

2 Answers


sed:

sed -n 'H;$g;$s/\([^\n]\)\n\([^\n]\)/\1 \2/g;$s/\n\n\n*/\n\n/g;$s/  */ /g;$s/^\n//;$p' FILENAME

Perl:

perl -ne '$a.=$_;END{$_=$a;s/  */ /g;s/(?<=[^\n])\n(?=[^\n])/ /g;s/\n\n\n*/\n\n/g;print}' FILENAME
protist
  • At the moment I'm on Windows and don't have access to sed, also I don't know sed so I can't parse that on my own. I'll try the Perl one though if the one-liner is Windows friendly... – hippietrail Feb 08 '13 at 13:04
  • On Windows the Perl needs `"` instead of `'` and even with that change I'm losing all paragraph formatting (double blank lines). – hippietrail Feb 08 '13 at 13:08
  • Ah. In your example you changed a quadruple newline into a single newline. Can you produce an example input and output that is more descriptive for me? (with paragraph formats as well) When I tested, both of my programs worked with the example posted in the original post. – protist Feb 08 '13 at 13:12
  • Both of my programs should turn two or more consecutive newlines into a single newline. – protist Feb 08 '13 at 13:16
  • Oh sorry! I think maybe the markdown or other part of the Stack Exchange rendering changed it after I got it right. Let me investigate ... Fixed now. It should turn single newlines into a space and two or more consecutive newlines into exactly two newlines. – hippietrail Feb 08 '13 at 13:24

I just came up against this problem yet again. This time I'm using node.js and I feel I came up with a pretty expressive solution:

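txt = txt.replace(/\s+/g, function (ws) {
  // Two linebreaks anywhere in the run mean a paragraph break.
  // (\s* rather than .* between them, since . does not match \r.)
  return /\n\s*\n/.test(ws) ? '\n\n' : ' ';
});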

txt = txt.replace(/(^( |\n\n)|( |\n\n)$)/g, '');

The first part considers each run of whitespace in the text and checks if there are at least two linebreaks within it. If so it collapses to a paragraph break (two consecutive linebreaks and nothing else). Otherwise it collapses to a single space.

The second part trims any remaining whitespace at the beginning and end of the text, each of which could only possibly be a single space or a pair of linebreaks by this point.

(The only limitations I see are those imposed by JavaScript's \s, which doesn't match all Unicode whitespace codepoints; and optionally outputting MS-style linebreaks, \r\n instead of \n.)
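As a rough sketch of how the two steps and the optional MS-style output could be packaged together, here is a variation under a few assumptions: the function name `collapseWhitespace` and its `eol` parameter are invented for illustration, the trimming is done up front with `String.prototype.trim` instead of the second regex, and `\s*` is used between the two linebreaks so that `\r\n` input is recognised.

```javascript
// Hypothetical helper, not part of any library.
// eol: the line ending to emit for paragraph breaks ('\n' or '\r\n').
function collapseWhitespace(text, eol = '\n') {
  return text
    .trim() // drop leading and trailing whitespace first
    .replace(/\s+/g, (ws) =>
      // A run containing at least two line feeds is a paragraph break;
      // any other run becomes a single space.
      /\n\s*\n/.test(ws) ? eol + eol : ' '
    );
}

const sample = 'foo bar baz  \r\nfred\tbarney wilma \n \n\n one two three\n';
console.log(JSON.stringify(collapseWhitespace(sample)));
// "foo bar baz fred barney wilma\n\none two three"
console.log(JSON.stringify(collapseWhitespace(sample, '\r\n')));
// "foo bar baz fred barney wilma\r\n\r\none two three"
```

Trimming first means the replacement never has to deal with a leading or trailing run, at the cost of relying on whatever `trim` considers whitespace.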

hippietrail