Perl's unpack("A4/A*") length+bytes syntax in regular expression form

Question

As detailed in perlpacktut, you can use an unpack template of the form X/Y* to first read the length of a byte stream and then read exactly that many bytes. However, I'm struggling to find an equivalent within a regular expression for, say, plain ASCII numbers and strings. For example, a Bencoded string has the form:

[length]:[bytes]
4:spam
4:spam10:green eggs

I remember once being able to pull this off, but only with the use of (??{ }), and I don't have the code handy right now. Can this be done without (??{ }) (which is super experimental), using one of the newer 5.10 captures or backreferences?

The obvious expressions don't work:

/(\d+)\:(.{\1})/g
/(\d+)\:(.{\g-1})/g

This might be a situation where just writing a little function instead of using a regex is the most efficient way to go about it. — huon, Mar 16 '12 at 02:24
I'd use a regex to find the length, then `@+` and `substr` to extract the text, and assign to `pos` if I wanted to continue the search. — cjm, Mar 16 '12 at 03:18

brian d foy · Accepted Answer · 2012-03-17T07:07:59.183

Do it with a regular expression that uses the /g flag and the \G anchor, but in scalar context. A scalar /g match remembers the position in the string right after the last successful match (or the beginning, for the first one), and \G anchors the next match at that position, so you can walk along the string this way. Get the length, skip over the colon, and then use substr to pick up the right number of characters. You can assign to pos, so advance it past the characters you just extracted. Use redo to repeat that until there are no more matches:

use v5.10.1;

LINE: while( my $line = <DATA> ) {
    chomp( $line );
    {
        say $line;
        next LINE unless $line =~ m/\G(\d+):/g;  # scalar /g!
        say "\t1. pos is ", pos($line);
        my( $length, $string ) = ( $1, substr $line, pos($line), $1 );
        pos($line) += $length;
        say "\t2. pos is ", pos($line);
        print "\tFound length $length with [$string]\n";
        redo;
    }
}

__END__
4:spam6:Roscoe
6:Buster10:green eggs
4:abcd5:123:44:Mimi

Notice the edge case in the last input line. That 3: is part of the string, not a new record. My output is:

4:spam6:Roscoe
    1. pos is 2
    2. pos is 6
    Found length 4 with [spam]
4:spam6:Roscoe
    1. pos is 8
    2. pos is 14
    Found length 6 with [Roscoe]
4:spam6:Roscoe
6:Buster10:green eggs
    1. pos is 2
    2. pos is 8
    Found length 6 with [Buster]
6:Buster10:green eggs
    1. pos is 11
    2. pos is 21
    Found length 10 with [green eggs]
6:Buster10:green eggs
4:abcd5:123:44:Mimi
    1. pos is 2
    2. pos is 6
    Found length 4 with [abcd]
4:abcd5:123:44:Mimi
    1. pos is 8
    2. pos is 13
    Found length 5 with [123:4]
4:abcd5:123:44:Mimi
    1. pos is 15
    2. pos is 19
    Found length 4 with [Mimi]
4:abcd5:123:44:Mimi
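
If you want the same walk as something reusable, here is a minimal sketch, not a replacement for a real Bencode parser. The function name split_length_prefixed is hypothetical, and the two die checks are my own assumptions about how malformed input should be handled; the loop above simply stops at the first non-match instead.

```perl
use strict;
use warnings;

# Hypothetical helper: split a run of length-prefixed strings such as
# "4:spam6:Roscoe" into a list. Dies on malformed or truncated input
# (that policy is an assumption, not part of the loop shown earlier).
sub split_length_prefixed {
    my ($data) = @_;
    my @strings;
    pos($data) = 0;
    while ( pos($data) < length $data ) {
        # Scalar-context /g with \G; /c keeps pos() unchanged if the match fails.
        $data =~ m/\G(\d+):/gc
            or die "Expected <length>: at offset " . pos($data) . "\n";
        my $length = $1;
        my $start  = pos($data);
        die "Truncated string at offset $start\n"
            if $start + $length > length $data;
        push @strings, substr $data, $start, $length;
        pos($data) = $start + $length;    # skip over the payload
    }
    return @strings;
}

print join( '|', split_length_prefixed('4:abcd5:123:44:Mimi') ), "\n";
# prints abcd|123:4|Mimi
```

The /c modifier is only there so that pos is still defined when the error message is built; without it, a failed scalar /g match resets pos.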

I figured there might be a module for this, and there is: Bencode. It does what I did. That means I did a lot of work for nothing. Always look at CPAN first. Even if you don't use the module, you can look at their solution :)

Oh, I know, but I'm trying to create an Encode::Bencode module based on code from Convert::Bencode, adapting it for the Encode::Encoding base. However, due to the nature of the output (an object, not stream data), I'm not sure whether that sort of thing is adaptable. — SineSwiper, Mar 25 '12 at 14:49

score 1 · Answer 2 · answered Mar 16 '12 at 03:22

No, I don't think that it's possible without the use of (??{ ... }), which would be:

/(\d++):((??{".{$^N}"}))/sg
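
As a rough illustration of that pattern in use, the sketch below runs it in a loop over one of the sample strings from the question thread. The \G anchor and the variable names are my additions (assumptions, not part of the pattern above); \G just keeps each match glued to the end of the previous one.

```perl
use strict;
use warnings;

my $data = '4:abcd5:123:44:Mimi';
my @strings;

# \d++ is possessive, so the length cannot be backtracked to a shorter number.
# $^N is the text of the most recently closed group (here, the length), and
# (??{ ... }) compiles ".{N}" as a sub-pattern at match time.
# /s lets . match newlines inside the payload.
while ( $data =~ /\G(\d++):((??{ ".{$^N}" }))/sg ) {
    push @strings, $2;
}

print join( '|', @strings ), "\n";
# prints abcd|123:4|Mimi
```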

Qtax
