
I know there are many questions on this site about matching multiline regexes with Perl, but I'm still having trouble figuring out how to do the following. Any help or links to relevant questions would be highly appreciated.

I have a text file input.txt that is structured with a field-label (identified by a backslash) and field-contents, like this:

\x text
\y text text
text text
\z text

Field-contents can contain line breaks, but for further processing I need to make sure that all field contents are on one line. The following apparently matches correctly across multiple lines; however, it doesn't delete the line break but instead seems to reinsert it.

#!/usr/bin/perl

$/ =undef; 

{
open(my $in, "<", "input.txt") or die "impossible: $!";
open(my $out, ">", "output.txt") or die "Can't open output.txt: $!"; 

while (<$in>) {
    s/\n([^\\])/ \1/g; # delete all line breaks unless followed by backslash and replace by a single space
    print $out $_ ; 
    }       
}

It adds the space to the front (so I know it correctly finds it) but nonetheless keeps the newline character. Output looks like this:

\x text
\y text text
 text text
\z text

Whereas I was hoping to get this:

\x text
\y text text text text
\z text
jan
  • Or `s{\n(?!\\.)}{}g;` with a suitably adjusted ~linefeed as in brian's answer. The `(?!...)` is a negative lookahead. It doesn't consume what it matches, so you don't have to re-enter it. – zdim Aug 26 '18 at 20:16
  • The `$/ = undef` does two things: (1) It changes `$/` for the whole unit; better to put it inside the block and use `local $/;`. (2) Since `$/` is `undef`, the following `<$in>` reads ("slurps") the whole file; I presume that is your intent. But then the `while` is misleading; why not `my $text = <$in>`? An idiom of sorts is `my $text = do { local $/; open ... ; <$fh> };`, after which you process `$text`. There are also modules that do this in one line, for instance `Path::Tiny`. – zdim Aug 27 '18 at 01:34
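
The slurp idiom described in the comment above could look roughly like this. It is only a sketch: the sub name `slurp` is made up for illustration, and the point is that `local $/` confines the change to the enclosing block instead of the whole script.

```perl
use strict;
use warnings;

# Hypothetical helper: read an entire file into a single string.
sub slurp {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Can't open $path: $!";
    local $/;            # $/ is undef only until this sub returns
    my $text = <$fh>;    # with $/ undefined, one read returns the whole file
    close $fh;
    return $text;
}
```

With this in place, `my $text = slurp("input.txt");` replaces the `while (<$in>)` loop, and the substitution is then applied to `$text` once.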

1 Answer


I think your input has a carriage return-linefeed pair. You're only replacing the newline but the carriage return is still there.

You can match \v for vertical whitespace (a bit more than line endings), \R for a generalized Unicode line ending, [\r\n]+ to get either (singly or together), or \r\n if you're sure they will both be there. The trick is to choose one that works for you if the line ending changes.

And, the \1 on the replacement side is better written as a $1.
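
A minimal sketch of how these suggestions might be combined, assuming Perl 5.10 or later (needed for `\R`). The sub name `join_fields` and the sample string are made up for illustration. It uses a negative lookahead rather than a capture, so no `$1` is needed, and it leaves the file's final line ending alone.

```perl
use strict;
use warnings;

# Hypothetical helper: fold continuation lines into the preceding field.
sub join_fields {
    my ($text) = @_;
    # \R matches \r\n, \n or \r as a single unit.
    # (?!\\|\z) skips line endings that precede a field label or end the text.
    $text =~ s/\R(?!\\|\z)/ /g;
    return $text;
}

# Sample input with CRLF line endings, as suspected in the question.
my $sample = "\\x text\r\n\\y text text\r\ntext text\r\n\\z text\r\n";
print join_fields($sample);
```

The line endings that remain are kept as they were (CRLF here), so convert them separately if the later processing expects plain `\n`.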

brian d foy
  • Great, that was it, thanks much! Just one thing: Is there any way to find out if an end of line is `\r` or `\n`? – jan Aug 26 '18 at 20:11
  • I either use hexdump or see what my editor tells me if I'm curious about the line ending. Many editors can convert them for you too. – brian d foy Aug 26 '18 at 20:26
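
If neither hexdump nor an editor is at hand, the same check can be sketched in Perl by counting each kind of line ending. The sub name `count_line_endings` is made up for illustration. When applying it to a real file, it is safest to open the file with the `:raw` layer so that no line endings are translated on the way in.

```perl
use strict;
use warnings;

# Hypothetical helper: count CRLF, LF and CR line endings in a string.
sub count_line_endings {
    my ($text) = @_;
    my %count = (CRLF => 0, LF => 0, CR => 0);
    # \r\n is tried first so a CRLF pair is not counted as a CR plus an LF.
    while ($text =~ /(\r\n|\n|\r)/g) {
        my $kind = $1 eq "\r\n" ? 'CRLF' : $1 eq "\n" ? 'LF' : 'CR';
        $count{$kind}++;
    }
    return %count;
}

my %seen = count_line_endings("a\r\nb\nc\r\n");
printf "%s: %d\n", $_, $seen{$_} for sort keys %seen;
```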