3

I'm working on text files with Windows line terminators (\r\n), on Linux with Perl v5.30.

What I don't understand is why, with these text files, a capturing group appears to capture nothing even though the regular expression matches.

Example:

$ echo $'Line1\r\nLine2\n' | perl -ne 'print /(.*)/'
Line2

$ echo $'Line1\r\nLine2\n' | perl -ne '/(.*)/ && print "match\n"'
match
match
match

Nothing from the first line appears to be captured, yet all three lines match.

Why is it so?

choroba
Marcus
  • Don't trust the terminal, it can lie to you. Always use something to dump data in an unambiguous form. `B::perlstring()` is a core module function that does this nicely. `perl -MB -nE 'say B::perlstring( /(.*)/ )'` – lordadmira Mar 13 '21 at 22:51
  • You should almost never invent questions that show problems in a different way than you are experiencing them. You say you are having problems reading files with different line endings, and there are no files involved in this question or this code. And you have not gotten any advice on how to fix such problems. This particular case is about what happens when you print strings with special characters. Your solution will be how to fix or remove bad line endings. – TLP Mar 14 '21 at 09:46
  • @TLP And you should never add comments that add no technical value, just for the sake of argument - the code is functionally identical to `cat`ing to a file, and using the file as Perl input. Actually, since the above logic avoids people to read redundant `cat` command, you should thank me! – Marcus Mar 15 '21 at 10:52
  • @Marcus It is a common mistake that people asking questions make, making up a question about something they think they need to know to solve another problem that they have. It is called the [XY-problem](https://meta.stackexchange.com/q/66377/162416). My statement is factual, not argumentative. And frankly, I don't even know what you mean by "read redundant cat command", but I assume you are talking about the common newbie mistake of doing `cat foo.txt | perl -ne'....'`, when you can just do `perl -ne'...' foo.txt`. – TLP Mar 15 '21 at 16:25

4 Answers

6

Use cat -v or xxd to see what the output really contains.

$ echo $'Line1\r\nLine2\n' | perl -ne 'print /(.*)/' | cat -v
Line1^MLine2

^M is how cat -v displays \r (carriage return). On a terminal, \r moves the cursor back to the beginning of the line, so the second match overwrites the first one.

This explains two matches, but where's the third one? Add something to separate the matches:

$ echo $'Line1\r\nLine2\n' | perl -ne 'print /(.*)/, "|"' | cat -v
Line1^M|Line2||

echo adds a newline to its output, so the last line is empty, but it still matches .*.

choroba
3

But it is captured:

$ echo $'Line1\r\nLine2\n' | perl -ne 'print /(.*)/' | od -c
0000000   L   i   n   e   1  \r   L   i   n   e   2
0000013

The problem is that your terminal is homing the cursor when it receives a CR, so Line2 ends up overwriting Line1.

ikegami
2

Others have already shown you why the output hid what you expected to see. But, for the original problem, I'd see about taking care of those line endings so you don't have to think about them. It seems that you have a mix of line endings, so my first thought would be to find the offending program and fix its output :)

Exclude the vertical whitespace (\v) from the group if you don't want it, and choose your own output line ending (-l here):

$ echo $'Line1\r\nLine2\n' | perl -nle 'print /([^\v]+)/'
Line1
Line2

Or modify the input string to get what you want:

$ echo $'Line1\r\nLine2\n' | perl -nle 'print s/\R//r'
Line1
Line2

Perhaps preprocess the line:

$ echo $'Line1\r\nLine2\n' | perl -nle 's/\R// and print /(.*)/'
Line1
Line2

Or maybe something else so there's nothing to work around.

brian d foy
0

I can't quite make out if your question has been answered or not, but it's worth noting that, on input, perl translates \r\n to \n, and then, if the output is to Windows, it does the reverse on output.

Bottom line, if you try to match \r\n, you may well fail - and in addition if you read, e.g., 10 bytes which includes \r\n, and then check the length of the input in perl, it will be only 9 bytes, since the \r will be gone.

This, essentially, allows scripts to work across multiple platforms without needing to update references to \n to \r\n and vice versa, if you see what I mean.

For example, on Windows, the following script will return 6, 5:

while (<DATA>) {
  # length($_) includes the line terminator
  print length($_) . "\n";
}

__DATA__
hello
world

However, if I add "binmode DATA;", I'll get 7, 5.

Note that this is, if I recall correctly, platform specific. For example, if you transfer a Windows text file to Linux in binary mode, then when reading the file on Linux, "\r\n" won't be translated to "\n".

Tom Melly
  • `printf 'a\r\n' | perl -lnwe 'print /\r/'` show 1, are you sure *on input, perl translates /r/n to /n*? – choroba Mar 15 '21 at 10:04
  • Well, apart from me using '/' instead of '\' (now fixed), fairly sure. while(){ print length . "\n"; } __DATA__ hello world – Tom Melly Mar 15 '21 at 10:16
  • If I put `a\nb\r\nc\n` into DATA, I'm getting 2 3 2 as the output. So on Linux, Perl doesn't translate anything (unless told to). – choroba Mar 15 '21 at 10:22
  • On Linux, it wouldn't have to - the line-ending are already just "\n". – Tom Melly Mar 15 '21 at 10:23
  • Re "*it's worth noting that, on input, perl translates \r\n to \n*", Only a Windows build of Perl, which is not relevant here. – ikegami Mar 15 '21 at 11:37